Finding Memo: Extractive Memorization in Constrained Sequence Generation Tasks

Teaser Image for the Paper

Code for the Paper "Finding Memo: Extractive Memorization in Constrained Sequence Generation Tasks" by Vikas Raunak and Arul Menezes.

This repo provides:

  • Data, models, and code to replicate the results in the paper
  • Scripts to train and run the experiments on your own dataset
  • Pointers for modifying the underlying algorithms

If you find our code or paper useful, please cite the paper:

@inproceedings{raunak-etal-finding-memo,
    title = "Finding Memo: Extractive Memorization in Constrained Sequence Generation Tasks",
    author = "Raunak, Vikas and Menezes, Arul",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    publisher = "Association for Computational Linguistics",
    year = "2022"
}

Replicate the Results in the Paper

The code is tested in a conda environment with python=3.8.

pip install -r requirements.txt
wget https://www.dropbox.com/s/dxnlwziqzr9iz7i/data.zip
unzip data.zip
bash run_experiment.sh

Train and Run the Experiments on your own Model

Directory and file paths can be adjusted to match your own setup.

bash train_spm.sh
bash train.sh

Modify the Underlying Algorithms for Further Experiments

Change the threshold for Memorization Extraction (Algorithm 1)

This threshold can be set in src/compute_dist.py.
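As a rough illustration of what such a threshold governs, the sketch below flags a sample as extractively memorized when truncating the source barely changes the generated output, measured by normalized edit distance. All names and the default threshold here are illustrative assumptions, not taken from src/compute_dist.py.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via a single-row dynamic program."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]


def is_memorized(full_output, prefix_output, threshold=0.1):
    """Flag a pair when the output for a truncated source stays near-identical
    to the output for the full source (hypothetical thresholding step)."""
    dist = edit_distance(full_output, prefix_output)
    return dist / max(len(full_output), 1) <= threshold
```

Lowering the threshold makes the extraction stricter (outputs must match almost exactly); raising it admits looser matches.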

Apply Memorization Extraction (Algorithm 1) on CJKT languages

The CJKT flag can be set to true in src/parse_memorized.py to work with CJKT languages on the source side.
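The practical difference such a flag makes can be sketched as follows: CJKT text has no whitespace word boundaries, so source prefixes are taken over characters rather than space-delimited tokens. The function name below is illustrative, not from src/parse_memorized.py.

```python
def source_prefixes(sentence, cjkt=False):
    """Enumerate proper prefixes of a source sentence.

    With cjkt=True, prefixes are built per character (no whitespace
    segmentation); otherwise per space-delimited token.
    """
    units = list(sentence) if cjkt else sentence.split()
    joiner = "" if cjkt else " "
    return [joiner.join(units[:i]) for i in range(1, len(units))]
```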

Change the Masked Language Model to Estimate Neighborhood Effect (Algorithm 2)

Different masked language model pipelines (BERT, RoBERTa, Multilingual BERT) are defined in src/substitutions.py and consumed by src/get_substitutions.py.
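Swapping the MLM backend can be pictured as the sketch below: any callable that maps a masked sentence to candidate fillers can be dropped in to generate neighboring sentences. All names here are illustrative assumptions, not the repo's actual interface; a real setup would wire in a BERT-style fill-mask pipeline instead of the toy stand-in.

```python
def neighborhood(sentence, index, fill_mask, mask_token="[MASK]", top_k=3):
    """Build neighboring sentences by replacing the token at `index`
    with the MLM's top candidate fillers (hypothetical interface)."""
    tokens = sentence.split()
    tokens[index] = mask_token
    masked = " ".join(tokens)
    candidates = fill_mask(masked)[:top_k]
    return [masked.replace(mask_token, c) for c in candidates]


def toy_fill_mask(masked_sentence):
    """Toy stand-in for an MLM: returns fixed candidates for any input."""
    return ["cat", "dog", "bird"]
```

Because only the `fill_mask` callable changes, substituting BERT for RoBERTa (or a multilingual model) leaves the neighborhood-construction logic untouched.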

Obtain the Data for Model Finetuning through Memorization Mitigation (Algorithm 3)

The finetuning data can be obtained by running scripts/augment.sh.

Change the Recovery Symbol for Memorization Mitigation (Algorithm 3)

The recovery symbol is set in scripts/augment.sh as the symbol variable.
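A minimal sketch of the mitigation-data construction, under the assumption that the recovery symbol is prepended to the source side of each flagged pair so the finetuned model learns a non-memorized path. The variable and function names below are illustrative, not taken from scripts/augment.sh.

```python
SYMBOL = "<rec>"  # hypothetical recovery symbol (augment.sh calls it `symbol`)


def augment(pairs, memorized_sources):
    """Prepend the recovery symbol to memorized sources; keep targets as-is."""
    out = []
    for src, tgt in pairs:
        if src in memorized_sources:
            out.append((f"{SYMBOL} {src}", tgt))
        else:
            out.append((src, tgt))
    return out
```

Changing the `symbol` variable in scripts/augment.sh corresponds to changing `SYMBOL` here.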

Transferable Memorization Attack Examples

A couple of examples on Microsoft Bing Translator and Google Translate are presented at this link.

Please leave issues for any questions about the paper or the code.

The format of this code-release README is borrowed from https://github.com/Alrope123/rethinking-demonstrations.
