Adversarial Watermarking Transformer

  1. Adversarial Watermarking Transformer: Towards Tracing Text Provenance with Data Hiding. Sahar Abdelnabi, Mario Fritz, CISPA Helmholtz Center for Information Security. arXiv (submitted 7 Sep 2020). Slides by Honai Ueoka
  2. Summary • This paper proposes a Transformer-based watermarking model • A discriminator used for adversarial training improves the watermarking system • Fine-tuning with multiple language losses improves the output text quality
  3. Related Work • Language Watermarking • Linguistic Steganography • Sequence-to-sequence Models • Model Watermarking • Neural Text Detection
  4. Contents • About Watermarking • Motivation • Proposed Method • Evaluation • Conclusion
  5. What is Watermarking? Visible (recognizable) watermarking (physical & digital) https://www.boj.or.jp/note_tfjgs/note/security/miwake.pdf https://helpx.adobe.com/jp/acrobat/kb/3242.html
  6. What is Watermarking? Invisible (unrecognizable) watermarking (physical & digital) https://www.hitachi-sis.co.jp/service/security/eshimon/MATAG https://www.imatag.com/
  7. Difference from Cryptography and Steganography
  Goal • Watermarking: hiding some data in a media; the data is related to the media • Steganography: hiding the existence of the data over other media (the data is not always related to the media) • Cryptography: hiding the content of the data
  Required decoding accuracy • Watermarking: depends on the case (trade-off with robustness or media quality) • Steganography / Cryptography: 100%
  Robustness against attacks • Watermarking: required (assume attackers modify the media / data to remove the watermark) • Steganography / Cryptography: usually not required
  References: [Chang, Clark 2014], [Ziegler et al. 2019]
  8. Language Watermarking Edit text with some rule to embed information (a toy sketch follows below). [Diagram: input message 1010 → Encoding → watermarked text → Decoding → decoded message 1010] The watermark should also be robust to edits that attempt to remove it.
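To make the encode/decode loop concrete, here is a minimal sketch of rule-based watermarking via synonym substitution (baseline-style, not AWT itself); the synonym pairs are invented for illustration:

```python
# Toy rule-based watermarking: each watermark-capable word encodes one bit
# by which synonym variant appears. Synonym pairs are illustrative only.
SYNONYMS = {"big": "large", "quick": "fast", "begin": "start"}
VARIANTS = {w: (w, s) for w, s in SYNONYMS.items()}
LOOKUP = {v: (w, i) for w, pair in VARIANTS.items() for i, v in enumerate(pair)}

def encode(words, bits):
    out, i = [], 0
    for w in words:
        if w in LOOKUP and i < len(bits):
            base, _ = LOOKUP[w]
            out.append(VARIANTS[base][bits[i]])  # variant index = message bit
            i += 1
        else:
            out.append(w)
    return out

def decode(words):
    return [LOOKUP[w][1] for w in words if w in LOOKUP]

text = "we begin with a big model and a quick test".split()
wm = encode(text, [1, 0, 1])
print(" ".join(wm))  # we start with a big model and a fast test
print(decode(wm))    # [1, 0, 1]
```

Such fixed rules are easy to reverse or destroy, which motivates the learned approach that follows.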
  9. Contents • About Watermarking • Motivation • Proposed Method • Evaluation • Conclusion
  10. Motivation • Recent advances in natural language generation • Powerful language models with high-quality output text (like GPT-*) • Concern about using these models for malicious purposes • Spreading neural-generated fake news / misinformation • Language watermarking as a better way to mark and trace the provenance of text
  11. Usage Scenario [Diagram: a language tool (text generation, translation, …) produces output; malicious tool users spread it on the Internet, raising the question "Fake news? Machine-generated?"]
  12. Usage Scenario [Diagram: the tool owner runs a model (e.g., GPT-3) together with a watermark encoder and decoder, black-box for users. The model output is watermarked with a message before release; decoding the message from text found on the Internet reveals "this text is generated by our model", answering "Fake news? Machine-generated?"]
  13. Usage Scenario News platforms can cooperate with the tool owner to detect machine-generated articles. The watermark can also be used for denial [Zhang et al. 2020]. [Diagram: the news platform owner runs the watermark decoder on articles from the Internet before publishing them.]
  14. Existing Approaches • Rule-based language watermarking • e.g., synonym substitution • The authors evaluate a synonym-substitution method as a baseline • Data hiding with neural models • There are some works on image classification models • No previous work with language models • Neural text detection • Train a classifier to detect machine-generated text • Easily defeated by future progress in language models, like an arms race (a cat-and-mouse game)
  15. Contents • About Watermarking • Motivation • Proposed Method • Evaluation • Conclusion
  16. AWT: Adversarial Watermarking Transformer
  17. AWT: Adversarial Watermarking Transformer • Data Hiding Network • Data Revealing Network • Discriminator • Fine-tuning Loss
  18. AWT – Similar Architectures [Shetty et al. 2018], [Zhu et al. 2018] R. Shetty, B. Schiele, and M. Fritz, "A4NT: Author attribute anonymity by adversarial training of neural machine translation," in 27th USENIX Security Symposium (USENIX Security 18), 2018. J. Zhu, R. Kaplan, J. Johnson, and L. Fei-Fei, "HiDDeN: Hiding data with deep networks," in European Conference on Computer Vision (ECCV), 2018.
  19. AWT – Input / Output Flow [Diagram: input sentence + input message 1010 → Data Hiding Network → watermarked output sentence → Data Revealing Network → decoded message 1010. The Discriminator performs binary classification (watermarked / not watermarked); InferSent and AWD-LSTM provide the fine-tuning loss.]
  20. AWT – 1. Discriminator • Classifies whether a sentence is watermarked or not watermarked • Trained with a binary cross-entropy loss • Notation: A: discriminator; S: input (not watermarked) sentence; S_w: output (watermarked) sentence • The adversarial loss L_A is used for training the data hiding network
  21. AWT – 1. Discriminator – Training [Diagram: the discriminator takes the input sentence (not watermarked) and the output sentence (watermarked) and performs binary classification; it is trained with the binary cross-entropy loss. The fine-tuning loss is not used in this step.]
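A minimal PyTorch-style sketch of this training step, assuming sentence representations have already been pooled into fixed-size vectors; the MLP discriminator, dimensions, learning rate, and label convention are illustrative simplifications (the paper's discriminator is Transformer-based):

```python
import torch
import torch.nn as nn

# Discriminator training step (illustrative). Label convention here:
# 1 = watermarked, 0 = not watermarked.
disc = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(disc.parameters(), lr=1e-4)

def disc_step(real_repr, wm_repr):
    # real_repr / wm_repr: (batch, 512) pooled representations of the
    # not-watermarked input and watermarked output sentences.
    logits_real = disc(real_repr).squeeze(1)
    logits_wm = disc(wm_repr.detach()).squeeze(1)  # hiding network frozen here
    loss = (bce(logits_real, torch.zeros_like(logits_real))
            + bce(logits_wm, torch.ones_like(logits_wm)))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(disc_step(torch.randn(8, 512), torch.randn(8, 512)))
```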
  22. AWT – 2. Data Revealing Network • Output dimension: q (= message length) • Similar to a Transformer-based multi-class classifier • Message reconstruction loss L_m: binary cross-entropy loss over all bits
  23. AWT – 2. Data Revealing Network – Training [Diagram: the revealing network decodes the watermarked output sentence back into the message (input message 1010, decoded message 1011) and is trained with the message reconstruction loss. The discriminator and the fine-tuning loss are not used in this step.]
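A minimal sketch of the message reconstruction loss L_m, assuming the revealing network emits one logit per message bit; the shapes and random tensors are placeholders:

```python
import torch
import torch.nn.functional as F

# Message reconstruction loss L_m: binary cross-entropy over all q bits.
q = 4                                          # message length in bits
bit_logits = torch.randn(8, q)                 # revealing-network output, (batch, q)
message = torch.randint(0, 2, (8, q)).float()  # ground-truth bits

L_m = F.binary_cross_entropy_with_logits(bit_logits, message)

# At decode time, threshold each logit at 0 (i.e., sigmoid at 0.5):
decoded = (bit_logits > 0).long()
print(L_m.item(), decoded[0].tolist())
```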
  24. AWT – 3. Data Hiding Network A) Add the input message to the encoded embeddings B) Transformer autoencoder (the decoder takes the shifted input sentence) C) Gumbel-softmax to train jointly with the other components • Text reconstruction loss L_rec: cross-entropy loss between the input and output sequences
  25. AWT – 3. Data Hiding Network – Training [Diagram: the full pipeline of hiding network, revealing network, and discriminator.] L_1 = w_rec * L_rec + w_m * L_m + w_A * L_A, where w_* is the weight for each loss. The network is trained to 1) reconstruct the input sentence, 2) reconstruct the message, and 3) fool the adversary. These losses are competing.
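A sketch of the two pieces this slide names: the Gumbel-softmax relaxation that keeps the discrete token choice differentiable, and the combined loss L_1. The shapes, weights, and stand-in loss values are assumptions, not the paper's settings:

```python
import torch
import torch.nn.functional as F

# Gumbel-softmax over the vocabulary: hard=True returns one-hot samples in
# the forward pass while gradients flow through the soft distribution
# (straight-through estimator), so the whole pipeline stays trainable.
vocab, batch, seq = 10000, 8, 20
logits = torch.randn(batch, seq, vocab)        # decoder output scores
onehot = F.gumbel_softmax(logits, tau=0.5, hard=True, dim=-1)

# Combined hiding-network objective: L_1 = w_rec*L_rec + w_m*L_m + w_A*L_A.
# The weights and loss values below are placeholders for the real terms.
w_rec, w_m, w_A = 1.0, 1.0, 0.1
L_rec = torch.tensor(2.3)   # text reconstruction (cross-entropy)
L_m = torch.tensor(0.7)     # message reconstruction (BCE over bits)
L_A = torch.tensor(0.5)     # adversarial loss (fool the discriminator)
L_1 = w_rec * L_rec + w_m * L_m + w_A * L_A
print(L_1.item())
```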
  26. AWT – 4. Fine-tuning Loss A) Preserving semantics: a loss between the watermarked and not watermarked sentences (S: input, not watermarked; S_w: output, watermarked), computed over sentence embeddings (InferSent, see slide 27) B) Preserving sentence correctness: a language-model loss from an ASGD Weight-Dropped LSTM (AWD-LSTM), independently trained on the dataset used as input texts (not watermarked texts), applied to the watermarked sentence; W_i: the i-th word in the watermarked sentence
  27. AWT – Fine-tuning [Diagram: the full pipeline with InferSent and AWD-LSTM attached.] L_2 = L_1 + w_sem * L_sem + w_LM * L_LM. The model is fine-tuned to: 1) reconstruct the input sentence 2) reconstruct the message 3) fool the adversary 4) preserve semantics 5) preserve grammar and structure
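A sketch of the fine-tuning objective L_2. The MSE on sentence embeddings and the token-level cross-entropy are plausible stand-ins for the InferSent-based semantic loss and the AWD-LSTM loss; the exact loss forms and weights in the paper may differ:

```python
import torch
import torch.nn.functional as F

# Fine-tuning objective: L_2 = L_1 + w_sem*L_sem + w_LM*L_LM.
w_sem, w_LM = 1.0, 0.5                      # illustrative weights

# L_sem: penalize drift between sentence embeddings of the input and the
# watermarked output (InferSent embeddings are 4096-dimensional).
emb_in = torch.randn(8, 4096)
emb_out = torch.randn(8, 4096)
L_sem = F.mse_loss(emb_out, emb_in)

# L_LM: cross-entropy of the watermarked tokens under a fixed, separately
# trained language model (AWD-LSTM in the paper), rewarding fluent output.
lm_logits = torch.randn(8 * 20, 10000)      # (batch*seq, vocab) stand-in
targets = torch.randint(0, 10000, (8 * 20,))
L_LM = F.cross_entropy(lm_logits, targets)

L_1 = torch.tensor(1.2)                     # placeholder for the earlier losses
L_2 = L_1 + w_sem * L_sem + w_LM * L_LM
print(L_2.item())
```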
  28. Contents • About Watermarking • Motivation • Proposed Method • Evaluation 1. Effectiveness 2. Secrecy 3. Robustness 4. Human • Conclusion
  29. Experiment Setup • Dataset • WikiText-2 (Wikipedia) • 2 million words in the training set • Implementation • Dimension size = 512 • Transformer blocks: 3 identical layers, 4 attention heads
  30. Evaluation Methods 1. Effectiveness: evaluated via text utility & message bit accuracy 2. Secrecy: evaluated by training a watermark classifier 3. Robustness: evaluated by performing 3 attacks (random word replacement, random word removal, denoising autoencoder) 4. Human evaluation
  31. 1. Effectiveness Evaluation • Text Utility • Watermarking should not change the text's semantics • Meteor (higher is better) • SBERT distance (lower is better) • Bit Accuracy • Bitwise message accuracy averaged across the test dataset (see the sketch below) • Random chance: 50%
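The bit-accuracy metric itself is simple; a small sketch with random placeholder messages, showing why random guessing lands near 50%:

```python
import torch

# Bitwise message accuracy averaged across the test set; random guessing
# gives ~50%. The messages below are random placeholders.
decoded = torch.randint(0, 2, (1000, 4))   # (examples, message bits)
truth = torch.randint(0, 2, (1000, 4))
bit_acc = (decoded == truth).float().mean().item()
print(f"bit accuracy: {bit_acc:.1%}")      # ≈ 50% here
```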
  32. 1. Effectiveness Evaluation – Result
  Model / Bit accuracy / Meteor / SBERT distance
  Base + Discriminator + Fine-tuning (AWT): 97% / 0.96 / 1.25
  Base + Discriminator: 96% / 0.94 / 1.73
  Base: 95% / 0.94 / 2.28
  A) Fine-tuning improved both metrics → it helps to preserve the text semantics B) The discriminator decreases the SBERT distance → it helps to improve the output's quality, in addition to its secrecy advantages
  33. 1. Effectiveness Evaluation – vs. Baseline • Baseline by [Topkara et al. 2006]: watermarking texts with synonym substitution using WordNet
  34. 1. Effectiveness Evaluation – Contribution of the Discriminator [Figure: top-word counts for the original dataset, the output of AWT (Base + Disc + FT), and the output of the base-only model. Without the discriminator, the model makes systematic fixed changes that insert less likely tokens.]
  35. 1. Effectiveness Evaluation – Contribution of Fine-tuning
  36. 1. Effectiveness Evaluation – Visualization [Figure: words are not fixedly replaced.]
  37. 2. Secrecy Evaluation [Figure: classifier F1 scores for Base only, Base + Discriminator, and Base + Discriminator + Fine-tuning.] • The discriminator decreases the F1 score (improves secrecy)
  38. 2. Secrecy Evaluation – vs. Baseline • Baseline by [Topkara et al. 2006]: watermarking texts with synonym substitution using WordNet
  39. 3. Robustness Evaluation • Random changes • Replace / remove words randomly in a watermarked sentence (a minimal sketch follows below) • Training counter-models • Trained a Transformer-based denoising autoencoder (DAE) • Applies 2 types of noise to the input (watermarked) sentence: embedding dropout and random word replacement
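A minimal sketch of the two random attacks; the replacement vocabulary and attack rates are invented for illustration:

```python
import random

# Random word replacement / removal attacks on a watermarked sentence.
VOCAB = ["the", "a", "model", "text", "data"]

def replace_attack(words, p=0.1, rng=random):
    """Replace each word with a random vocabulary word with probability p."""
    return [rng.choice(VOCAB) if rng.random() < p else w for w in words]

def remove_attack(words, p=0.1, rng=random):
    """Drop each word independently with probability p."""
    return [w for w in words if rng.random() >= p]

sent = "this watermarked sentence will be attacked".split()
print(" ".join(replace_attack(sent, p=0.2)))
print(" ".join(remove_attack(sent, p=0.2)))
```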
  40. 3. Robustness Evaluation – Result [Graph: bit accuracy and SBERT distance under attack; at one end the text is still watermarked, at the other the watermark is lost.] The goal of an attack is to remove the watermark with minimal changes to the text. Bit accuracy decreases only a little while the SBERT distance increases significantly → AWT is robust to these attacks.
  41. 3. Robustness Evaluation – vs. Baseline AWT keeps a higher bit accuracy after the remove / replace attacks than the synonym substitution baseline.
  42. 4. Human Evaluation Asked 6 judges to rate sentences randomly selected from non-watermarked text, AWT output, and synonym-baseline output.
  43. 4. Human Evaluation – Result • AWT output texts are rated more highly than baseline texts.
  44. Contents • About Watermarking • Motivation • Proposed Method • Evaluation • Conclusion
  45. Conclusion • A new framework for language watermarking as a solution towards marking and tracing the provenance of machine-generated text • The first end-to-end data hiding solution for natural text • A discriminator as an adversary improved the watermarking system • Fine-tuning with additional language losses improved the output text quality