exploiting-bias-to-debias

Code and data for the paper "Exploiting Biased Models to De-bias Text: A Gender-Fair Rewriting Model"

Installation

First, set up a new virtual environment and install all necessary dependencies:

bash utils/create_venv.sh
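
Assuming the script creates the environment as venv in the repository root (check utils/create_venv.sh if your setup differs), activate it with:

source venv/bin/activate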

Reproduce Automatic Evaluation Results

We release the source and target files for our automatic evaluations, as well as all model outputs, in the automatic_evaluation/data folder. The files are named as in Tables 1 and 2 of the paper.

To compute the word error rate (WER), call the following script with the corresponding paths to the HYPOTHESIS and REFERENCE files:

python automatic_evaluation/evaluate.py -r REFERENCE -h1 HYPOTHESIS

To compare two models and compute statistical significance, pass both hypothesis files:

python automatic_evaluation/evaluate.py -r REFERENCE -h1 HYPOTHESIS_1 -h2 HYPOTHESIS_2
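
For example, to test whether one system output differs significantly from another against a shared reference (the file names below are placeholders; substitute the actual names from automatic_evaluation/data):

python automatic_evaluation/evaluate.py -r automatic_evaluation/data/test.ref -h1 automatic_evaluation/data/model_a.out -h2 automatic_evaluation/data/model_b.out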

Filter Data for Gender-fair Forms

The scripts/extract.sh script uses grep filters to identify lines that contain gender-fair forms in all files in a given folder. Note that if you extract from OSCAR, a single line is actually a document rather than an individual sentence; sentence splitting and more fine-grained filtering of gender-fair forms happen in the next step.

Variables you need to configure (see the sketch after the list):

  • DATA_FOLDER: where the original data is stored
  • OUTPUT_FOLDER: where you want the extracted lines to be stored
  • LANG: the language of interest, choices: de, en and en-fw (for reproducing Forward Augmentation for English)
  • FILETYPE: whether the files are txt or gz files (for OSCAR)
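
As a sketch, a configuration for extracting from a folder of German OSCAR dumps could look like this (the paths are placeholders; the variables are set at the top of scripts/extract.sh):

DATA_FOLDER=/path/to/oscar/de
OUTPUT_FOLDER=/path/to/extracted/de
LANG=de
FILETYPE=gz

Then run bash scripts/extract.sh.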

Create Pseudo Data

Next, you will use scripts/prepare.sh to create biased pseudo source segments for the gender-fair target segments you extracted in the step above. The script outputs parallel files of segments that contain gender-fair forms (ending in gf.src and gf.trg) and parallel files of segments that contain only non-gendered forms, in which source and target are copies of each other (ending in ngf.src and ngf.trg).

Variables you need to configure:

  • PATH_TO_VIRTUAL_ENVIRONMENT: path to your venv
  • DATA_FOLDER: where the extracted data is stored
  • LANG: the language of interest, choices: de, en and en-fw (for reproducing Forward Augmentation for English)
  • CREATION_TYPE: how you want to create sources, either rule-based or round-trip
  • FILETYPE: whether the files are txt, json (for LM output) or jsonl files (for OSCAR)

The script automatically creates concatenated training files from all files in the folder, called train.gf.src, train.gf.trg, train.ngf.src and train.ngf.trg, and saves them in the repository directory.
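
The variables are again set at the top of the script; a sketch for German data with rule-based source creation (paths are placeholders):

PATH_TO_VIRTUAL_ENVIRONMENT=/path/to/venv
DATA_FOLDER=/path/to/extracted/de
LANG=de
CREATION_TYPE=rule-based
FILETYPE=txt

Then run bash scripts/prepare.sh; use CREATION_TYPE=round-trip to create the sources via round-trip translation instead.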

Filtering Parallel Data

The scripts/filter.sh script will deduplicate and filter the parallel data so that we can train our gender-fair rewriting models on cleaner data.

Variables you need to configure:

  • PATH_TO_VIRTUAL_ENVIRONMENT: path to your venv
  • LANG: the language of interest, choices: de, en (also use en for en-fw)

You can now use train.filtered.gf.src and train.filtered.gf.trg as the training data for your gender-fair rewriting model. You may want to add some data without gendered forms from train.filtered.ngf.src and train.filtered.ngf.trg before you start training; all experiments in the paper used a 70:30 ratio of gendered to non-gendered data. A sketch of this mixing step follows.
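
A minimal sketch, assuming a 70:30 target ratio and standard coreutils (the train.mixed.* output names are hypothetical):

# sample roughly 3 non-gendered pairs for every 7 gendered ones, keeping src/trg aligned
N_GF=$(wc -l < train.filtered.gf.src)
N_NGF=$((N_GF * 3 / 7))
paste train.filtered.ngf.src train.filtered.ngf.trg | shuf | head -n $N_NGF > ngf.sample
cat train.filtered.gf.src <(cut -f1 ngf.sample) > train.mixed.src
cat train.filtered.gf.trg <(cut -f2 ngf.sample) > train.mixed.trg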

Train Rule-Based Rewriter

We provide our Sockeye training configuration in scripts/train.sh and our decoding script in scripts/decode.sh.
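
If you want to decode outside the provided script, a plain Sockeye call looks like this (the model directory and file names are placeholders):

sockeye-translate --models model_dir --input test.src --output test.hyp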

Generating More Data Using LM

If you want to create additional training data, you can prompt large language models (as we did for singular forms in our paper). To reproduce the results in our paper, download GerPT2-large from Hugging Face:

git clone https://huggingface.co/benjamin/gerpt2-large

Now you can generate more content based on the German seed nouns in data/de_seeds.json:

SEED=0 # change this to get different results each time you run the generations
python3 scripts/generate_with_lm.py -d data/de_seeds.json -o lm_generations_$SEED.json -s $SEED

You can then run prepare.sh as usual on the new data, but don't forget to set FILETYPE to json.
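
Each run with a new seed produces different generations, so you can accumulate more training data by looping over seeds:

for SEED in 0 1 2 3 4; do
  python3 scripts/generate_with_lm.py -d data/de_seeds.json -o lm_generations_$SEED.json -s $SEED
done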

Finetuning MT Model On Pair Forms

If you want to fine-tune the EN-DE machine translation model on gender-tagged segments based on pair forms, first download the training data from Hugging Face:

git clone https://huggingface.co/datasets/wmt19

Now, create a tagged version of the data using:

python scripts/tag_dataset.py

Finally, you can fine-tune the previously downloaded checkpoint of the wmt19-en-de model on the tagged data in wmt19-tagged using the script scripts/finetune.sh. In the paper, we fine-tune for 50k steps.

Variables you need to configure (see the sketch after the list):

  • PATH_TO_TRANSFORMERS: Path to your local installation of the Transformers library
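
Assuming the variable is set at the top of the script like in the other scripts, a run could then look like:

PATH_TO_TRANSFORMERS=/path/to/transformers
bash scripts/finetune.sh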

Original Licensing Information

Data:

Code:

Models:

Citation

If you use this code or data, please cite our paper:

@inproceedings{amrhein-etal-2023-exploiting,
    title = "Exploiting Biased Models to De-bias Text: A Gender-Fair Rewriting Model",
    author = {Amrhein, Chantal and
      Schottmann, Florian and
      Sennrich, Rico and
      L{\"a}ubli, Samuel},
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.246",
    doi = "10.18653/v1/2023.acl-long.246",
    pages = "4486--4506",
}
