Adversarial Watermarking Transformer: Towards Tracing Text Provenance with Data Hiding
Sahar Abdelnabi, Mario Fritz
CISPA Helmholtz Center for Information Security
arXiv (submitted on 7 Sep 2020)
Slides by Honai Ueoka
Summary
• The paper proposes a Transformer-based watermarking model
• Adversarial training against a discriminator improves the watermarking system
• Fine-tuning with multiple language losses improves the output text quality
Related Work
• Language watermarking
• Linguistic steganography
• Sequence-to-sequence models
• Model watermarking
• Neural text detection
What is Watermarking?
Visible (recognizable) watermarking (physical & digital)
https://www.boj.or.jp/note_tfjgs/note/security/miwake.pdf
https://helpx.adobe.com/jp/acrobat/kb/3242.html
What is Watermarking?
Invisible (unrecognizable) watermarking (physical & digital)
https://www.hitachi-sis.co.jp/service/security/eshimon/MATAG
https://www.imatag.com/
Difference from Cryptography and Steganography
• Goal
  - Watermarking: hide some data in a medium; the data is related to the medium
  - Steganography: hide the existence of data inside another medium (the data is not always related to the medium)
  - Cryptography: hide the content of the data
• Required decoding accuracy
  - Watermarking: depends on the use case (trade-off with robustness or media quality)
  - Steganography / Cryptography: 100%
• Robustness against attacks
  - Watermarking: required (an attacker may modify the medium/data to remove the watermark)
  - Steganography / Cryptography: usually not required
References: [Chang, Clark 2014], [Ziegler et al. 2019]
Language Watermarking
Edit text with some rule to embed information:
Input message 1010 → Encoding → watermarked text → Decoding → Decoded message 1010
The scheme should also be robust to modifications of the text.
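As a toy illustration of "edit text with some rule to embed information" (this is only a sketch of the rule-based idea, not AWT itself; the word pairs below are made up for the example), one bit can be encoded at each occurrence of a word from a synonym pair by choosing which member of the pair to emit:

```python
# Toy rule-based language watermarking (illustration only, not AWT):
# each occurrence of a word from a synonym pair encodes one bit by
# choosing which member of the pair to emit.
PAIRS = [("big", "large"), ("quick", "fast")]  # hypothetical synonym pairs
# word -> (pair index, bit value it represents)
LOOKUP = {w: (i, bit) for i, (a, b) in enumerate(PAIRS)
          for bit, w in enumerate((a, b))}

def encode(words, bits):
    """Embed `bits` by rewriting carrier words according to the message."""
    out, i = [], 0
    for w in words:
        if w in LOOKUP and i < len(bits):
            pair_idx, _ = LOOKUP[w]
            out.append(PAIRS[pair_idx][bits[i]])  # pick the synonym for this bit
            i += 1
        else:
            out.append(w)
    return out

def decode(words):
    """Read the message back from the carrier words."""
    return [LOOKUP[w][1] for w in words if w in LOOKUP]

text = "a big dog runs quick".split()
watermarked = encode(text, [1, 0])   # "big" -> "large" (bit 1), "quick" stays (bit 0)
assert decode(watermarked) == [1, 0]
```

Such fixed substitution rules are exactly what the later slides criticize: they are easy to detect and to strip, which motivates the learned, adversarially trained approach of AWT.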
Motivation
• Recent advances in natural language generation
  - Powerful language models with high-quality output text (like GPT-*)
• Concern about using such models for malicious purposes
  - Spreading neural-generated fake news / misinformation
• Language watermarking as a better way to mark and trace the provenance of text
Usage Scenario
[Diagram] A (possibly malicious) user runs a language tool (text generation, translation, …) and publishes the tool's output on the Internet. Readers cannot tell: is it fake news? Is it machine-generated?
Usage Scenario
[Diagram] The tool owner wraps the model (e.g., GPT-3) with a watermark encoder that embeds a message into the model output; the encoder is a black box for users. A watermark decoder later recovers the decoded message from the published (watermarked) text, revealing whether it is machine-generated.
Usage Scenario
[Diagram] News platforms can cooperate with the tool owner to detect machine-generated articles: the platform owner runs the watermark decoder on submitted text. The watermark can also be used for denial [Zhang et al. 2020].
Existing Approaches
• Rule-based language watermarking
  - e.g., synonym substitution
  - The authors evaluate a synonym substitution method as a baseline
• Data hiding with neural models
  - There are some works on image classification models
  - No previous work with language models
• Neural text detection
  - Train a classifier to detect machine-generated text
  - Easily defeated by future progress in language models, like an arms race (cat-and-mouse game)
AWT – Similar Architectures [Shetty et al. 2018], [Zhu et al. 2018]
R. Shetty, B. Schiele, and M. Fritz, "A4NT: Author attribute anonymity by adversarial training of neural machine translation," in 27th USENIX Security Symposium (USENIX Security 18), 2018.
J. Zhu, R. Kaplan, J. Johnson, and L. Fei-Fei, "HiDDeN: Hiding data with deep networks," in European Conference on Computer Vision (ECCV), 2018.
AWT – Input / Output Flow
• Data hiding network: takes the input sentence and the input message (e.g., 1010) and produces the output (watermarked) sentence
• Data revealing network: takes the watermarked sentence and outputs the decoded message (e.g., 1010)
• Discriminator: binary classification of a sentence as watermarked / not watermarked
• Fine-tuning loss: computed with InferSent and AWD-LSTM
AWT – 1. Discriminator
• Classifies whether a sentence is watermarked or not
• Trained with a binary cross-entropy loss
  - A: discriminator, S: input (non-watermarked) sentence, S': output (watermarked) sentence
• The adversarial loss L_A is used for training the data hiding network (to fool the discriminator)
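A minimal sketch of the two binary cross-entropy objectives described above, using scalar probabilities for clarity (the label convention and exact formulation are assumptions; the paper defines these over the discriminator outputs A(S) and A(S')):

```python
import math

def bce(p, y):
    """Binary cross-entropy for a predicted probability p and a 0/1 label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def discriminator_loss(p_s, p_s_wm):
    """Train the discriminator: label the non-watermarked sentence S as 0
    and the watermarked sentence S' as 1 (label assignment is an assumption)."""
    return bce(p_s, 0.0) + bce(p_s_wm, 1.0)

def adversarial_loss(p_s_wm):
    """Adversarial loss L_A for the data hiding network: push the
    discriminator to classify the watermarked sentence as non-watermarked."""
    return bce(p_s_wm, 0.0)
```

The opposing directions are visible directly: a confident discriminator (low p for S, high p for S') minimizes `discriminator_loss`, while the data hiding network minimizes `adversarial_loss` by making S' look non-watermarked.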
AWT – 1. Discriminator – Training
[Diagram] The data hiding network produces the watermarked sentence from the input sentence and the input message (1010); the discriminator classifies sentences as watermarked / not watermarked. In this stage only the binary cross-entropy loss is used; the fine-tuning loss is not used.
AWT – 2. Data Revealing Network
• Output dimension: q (= message length)
• Similar to a Transformer-based multi-class classifier
• Message reconstruction loss L_m: binary cross-entropy loss over all bits
AWT – 2. Data Revealing Network – Training
[Diagram] The data hiding network embeds the input message (1010) into the input sentence, and the data revealing network decodes a message (e.g., 1011) from the watermarked sentence. In this stage only the message reconstruction loss is used; the fine-tuning loss and the discriminator are not used.
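The message reconstruction loss L_m above can be sketched as bit-wise binary cross-entropy over the q per-bit probabilities produced by the revealing network, with a hard threshold used at test time to read the message out (a simplified sketch; the paper's network operates on sequences, not raw probability lists):

```python
import math

def message_loss(probs, bits):
    """L_m: binary cross-entropy averaged over the q message bits.
    probs: decoder's per-bit probabilities; bits: ground-truth 0/1 message."""
    return -sum(b * math.log(p) + (1 - b) * math.log(1 - p)
                for p, b in zip(probs, bits)) / len(bits)

def decode_bits(probs, threshold=0.5):
    """Hard per-bit decisions used at test time to read out the message."""
    return [int(p > threshold) for p in probs]
```

For example, probabilities [0.9, 0.2, 0.8, 0.6] decode to the message 1011, and the loss shrinks as each probability moves toward its true bit.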
AWT – 3. Data Hiding Network
A) Add the input message to the encoded embeddings
B) Transformer autoencoder (the decoder takes the shifted input sentence)
C) Gumbel-softmax to train jointly with the other components
• Text reconstruction loss L_rec: cross-entropy loss between the input and output sequences
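Step C uses the Gumbel-softmax so that the discrete word choice stays differentiable and the three components can be trained jointly. A minimal forward-pass sketch of Gumbel-softmax sampling (no gradients here; in practice a framework such as PyTorch provides this with autodiff support):

```python
import math
import random

def gumbel_softmax(logits, tau=1.0):
    """Sample a relaxed one-hot vector over the vocabulary:
    add Gumbel(0, 1) noise to the logits, then apply a softmax with
    temperature tau. As tau -> 0 the sample approaches a hard one-hot
    word choice; larger tau gives a smoother, easier-to-train relaxation."""
    g = [-math.log(-math.log(random.random())) for _ in logits]  # Gumbel noise
    z = [(l + gi) / tau for l, gi in zip(logits, g)]
    m = max(z)                                  # subtract max for stability
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]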
AWT – 3. Data Hiding Network – Training
[Diagram] The full pipeline is used: data hiding network → data revealing network, plus the discriminator. The network is trained with a weighted sum of the losses:
L = w_rec · L_rec + w_m · L_m + w_A · L_A
where w_* is the weight for each loss. The network is trained to 1) reconstruct the input sentence, 2) reconstruct the message, and 3) fool the adversary. These losses are competing.
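The combined objective is just a weighted sum of the three competing terms; a trivial sketch (the actual weight values used in the paper are not reproduced here):

```python
def awt_loss(l_rec, l_m, l_a, w_rec=1.0, w_m=1.0, w_a=1.0):
    """L = w_rec * L_rec + w_m * L_m + w_a * L_A.
    The three terms compete: perfect text reconstruction leaves no room to
    encode the message, and an easily decodable message is easy for the
    discriminator to spot, so the weights w_* balance the trade-off."""
    return w_rec * l_rec + w_m * l_m + w_a * l_a
```

Setting a weight to zero recovers the ablations on the results slide (e.g., `w_a=0` corresponds to training without the adversary).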
AWT – 4. Fine-tuning Loss
A) Preserving semantics: a sentence-similarity loss between the watermarked sentence S' and the non-watermarked sentence S
  - S: input (non-watermarked) sentence, S': output (watermarked) sentence
B) Preserving sentence correctness: a language-model loss from an ASGD Weight-Dropped LSTM (AWD-LSTM), independently trained on the dataset used as input (non-watermarked) texts
  - W_i: the i-th word in the watermarked sentence
Experiment Setup
• Dataset
  - WikiText-2 (Wikipedia)
  - 2 million words in the training set
• Implementation
  - Dimension size = 512
  - Transformer blocks: 3 identical layers, 4 attention heads
Evaluation Methods
1. Effectiveness evaluation: by evaluating text utility & message bit accuracy
2. Secrecy evaluation: by training a watermark classifier
3. Robustness evaluation: by performing 3 attacks (random word replacement, random word removal, denoising autoencoder)
4. Human evaluation
1. Effectiveness Evaluation
• Text utility
  - Watermarking should not change the text semantics
  - Meteor (higher is better)
  - SBERT distance (lower is better)
• Bit accuracy
  - Bitwise message accuracy averaged across the whole test dataset
  - Random chance: 50%
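The bit-accuracy metric above is simply the fraction of message bits recovered correctly, averaged over the test set; a one-function sketch:

```python
def bit_accuracy(decoded, truth):
    """Fraction of message bits recovered correctly for one example.
    In the paper this is averaged across all test-set sentences; guessing
    each bit uniformly at random gives ~50% in expectation."""
    assert len(decoded) == len(truth)
    return sum(d == t for d, t in zip(decoded, truth)) / len(truth)
```

For instance, decoding 1011 when the embedded message was 1010 gives 3/4 = 75% bit accuracy, matching the mismatch shown in the training-flow slide.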
1. Effectiveness Evaluation – Result

Model                                      Bit accuracy   Meteor   SBERT distance
Base + Discriminator + Fine-tuning (AWT)   97%            0.96     1.25
Base + Discriminator                       96%            0.94     1.73
Base                                       95%            0.94     2.28

A) Fine-tuning improved both metrics → helps to preserve the text semantics
B) The discriminator decreases the SBERT distance → the discriminator helps to improve the output's quality, in addition to its secrecy advantages
1. Effectiveness Evaluation – Contribution of the Discriminator
[Figure: top-word counts for the original dataset, the output of AWT (Base + Disc + FT), and the output of the base model only]
The model without the discriminator shows systematic fixed changes that insert less likely tokens.
3. Robustness Evaluation
• Random changes
  - Replace / remove words randomly in a watermarked sentence
• Training counter-models
  - Train a Transformer-based denoising autoencoder (DAE)
  - Apply 2 types of noise to the input (watermarked) sentence: embedding dropout and random word replacement
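The two "random changes" attacks can be sketched in a few lines (an illustrative sketch; the attack rates and the attacker's replacement vocabulary are parameters not specified here):

```python
import random

def remove_attack(words, rate, rng=random):
    """Randomly drop a fraction `rate` of the words of a watermarked sentence."""
    return [w for w in words if rng.random() >= rate]

def replace_attack(words, rate, vocab, rng=random):
    """Randomly replace a fraction `rate` of the words with words drawn
    from `vocab` (a stand-in for the attacker's vocabulary)."""
    return [rng.choice(vocab) if rng.random() < rate else w for w in words]
```

An attack succeeds only if it drives bit accuracy toward chance while barely changing the text; the results slide shows the opposite for AWT, so these attacks fail.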
3. Robustness Evaluation – Result
[Graph: bit accuracy vs. SBERT distance of the attacked watermarked text]
The goal of an attack is to remove the watermark with minimal changes to the text. Under the attacks, bit accuracy is decreased only a little while the SBERT distance increases significantly → AWT is robust to the attacks.
4. Human Evaluation
Asked 6 judges to rate sentences randomly selected from the non-watermarked text, the AWT output, and the synonym-substitution baseline output.
Conclusion
• A new framework for language watermarking as a solution towards marking and tracing the provenance of machine-generated text
• The first end-to-end data hiding solution for natural text
• A discriminator as an adversary improved the watermarking system
• Fine-tuning with additional language losses improved the output text quality