Factorized question generation for controlled MRC training data generation.
This is an unofficial, self-contained version of the code for the paper "Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus". Please cite that paper if you find this code useful for your projects.
This code does not come with any warranties. Changes to the original code were kept to a minimum and mainly directed at making the code self-contained and compatible with the corresponding Dockerfile and Polyaxon. Other changes deemed notable are listed in the corresponding subsection of this README.
The following script takes care of downloading all necessary files and setting up the required directory structure, so you don't need to worry about a thing. Simply create a directory `<data_directory>` in which all ACS-QG-related data will be stored and run

`./setup.sh <data_directory>`
This will also build a docker image and create a script `start_docker_container.sh`, which will start a docker container with the appropriate parameters.
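For example, a first-time setup might look like the following sketch. The path `/data/acs-qg` is only a placeholder, and the exact invocation of the generated container script may differ on your system.

```bash
# Illustrative setup flow; /data/acs-qg is an arbitrary example path.
mkdir -p /data/acs-qg              # create the data directory
./setup.sh /data/acs-qg            # download data, set up directories, build the docker image
./start_docker_container.sh        # start the container created by setup.sh
```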
Inside the docker container, you can then run the experiments as given in the original code. The following experiments are available:
- **Check your environment**

  Run `./experiments_0_debug.sh`. It should run through without errors; if it does, your environment is set up correctly.
- **Train models**

  Run the following scripts in parallel on different GPUs to save time (see the pipeline sketch after this list for one way to assign GPUs):

  - `./experiments_1_ET_train.sh` (runs `run_glue.py`)
  - `./experiments_1_QG_train_gpt2.sh` (runs `QG_gpt2_train.py`)
  - `./experiments_1_QG_train_seq2seq.sh` (runs `QG_main.py`)
- **Perform data augmentation**

  Sequentially run:

  - `./experiments_2-DA_file2sents.sh` (runs `DA_main.py`)
  - `./experiments_3_DA_sents2augsents.sh` (runs `DA_main.py`)
- **Generate questions**

  Generate questions using the GPT2 model or the seq2seq model. The two experiments can be run in parallel. Note: this step requires a GPU to process the data efficiently.

  - `./experiments_4_QG_generate_gpt2.sh` (runs `QG_gpt2_generate.py`)
  - `./experiments_4_QG_generate_seq2seq.sh` (runs `QG_augment_main.py`)
- **Remove duplicate data**

  To remove duplicate data, run `./experiments_5_uniq_seq2seq.sh`.
- **Post-process data**

  Post-process the seq2seq results to handle the repetition problem by running `./experiments_6_postprocess_seq2seq.sh`. This step is not required if you use GPT2.
- **Append entailment scores**

  Use the trained entailment model to append an entailment score column by running `./experiments_7_ET.sh`. In the original code, this step is included in `experiments_3_repeat_da_de.sh`.
- **Perform data evaluation**

  Perform data evaluation to filter out low-quality data samples and to tag data samples with quality metrics (language model, entailment model, language complexity) by running `./experiments_8_DE.sh`. In the original code, this step is included in `experiments_3_repeat_da_de.sh`.
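For reference, the steps above could be chained in a single wrapper script along the following lines. The GPU assignments via `CUDA_VISIBLE_DEVICES` and the use of background jobs with `wait` are assumptions about how to parallelize the independent scripts; adapt them to your hardware.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the experiment scripts (run inside the docker container).
# GPU indices and the parallelization strategy are assumptions; adjust as needed.
set -e

./experiments_0_debug.sh                              # sanity check of the environment

# Step 1: train the three models in parallel, one GPU each
CUDA_VISIBLE_DEVICES=0 ./experiments_1_ET_train.sh &
CUDA_VISIBLE_DEVICES=1 ./experiments_1_QG_train_gpt2.sh &
CUDA_VISIBLE_DEVICES=2 ./experiments_1_QG_train_seq2seq.sh &
wait

# Steps 2-3: data augmentation (sequential)
./experiments_2-DA_file2sents.sh
./experiments_3_DA_sents2augsents.sh

# Step 4: question generation (can run in parallel, needs GPUs)
CUDA_VISIBLE_DEVICES=0 ./experiments_4_QG_generate_gpt2.sh &
CUDA_VISIBLE_DEVICES=1 ./experiments_4_QG_generate_seq2seq.sh &
wait

# Steps 5-8: deduplication, post-processing, entailment scoring, data evaluation
./experiments_5_uniq_seq2seq.sh
./experiments_6_postprocess_seq2seq.sh   # only needed for the seq2seq output
./experiments_7_ET.sh
./experiments_8_DE.sh
```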
If you want to use Polyaxon to run your experiments, you can do so using the scripts and commands described below. After an experiment has completed, download the output files and move them to the appropriate directory; refer to the Input & output section to find out which directory that is. You might also need to change `DATA_PATH` in `common/constants.py` so that it points to a valid Polyaxon data directory.
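For example, assuming `DATA_PATH` is assigned at module level in `common/constants.py` and your Polyaxon data directory is mounted at `/data/acs-qg` (both are assumptions), it could be adjusted like this:

```bash
# Point DATA_PATH at the Polyaxon data mount.
# Assumes a top-level assignment of the form DATA_PATH = "..."; the path is only an example.
sed -i 's|^DATA_PATH = .*|DATA_PATH = "/data/acs-qg/"|' common/constants.py
```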
- **Train models**

  To train the models, add the following commands to `polyaxonfile.yaml`, where `<data_directory>` is the directory in which all your ACS-QG-related data is stored:

  - `python3 experiments_1_ET_train_poly.py <data_directory>`
  - `python3 experiments_1_QG_train_gpt2_poly.py <data_directory>`
  - `python3 experiments_1_QG_train_seq2seq_poly.py`
- **Generate questions**

  To generate questions using the trained models, add the following commands to `polyaxonfile.yaml`:

  - `python3 experiments_4_QG_generate_seq2seq_poly.py <data_directory>`
  - `python3 experiments_4_QG_generate_gpt2_poly.py <data_directory>`
TODO: add more Polyaxon commands
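Once `polyaxonfile.yaml` contains the desired command, the experiment can be submitted with the Polyaxon CLI. A minimal sketch, assuming a Polyaxon v0.x cluster and a project named `acs-qg` (both placeholders):

```bash
# Sketch of submitting an experiment with the Polyaxon v0.x CLI; names are placeholders.
polyaxon project create --name acs-qg   # one-time project creation
polyaxon init acs-qg                    # link this repository to the project
polyaxon run -f polyaxonfile.yaml -u    # upload the code and start the experiment
```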
This section lists, for each experiment, the input files that will be read and the output files that will be created. The lists might not be exhaustive.
**`experiments_1_ET_train.sh`**

Reads:
<data_directory>/glue_data/MRPC/ # --data_dir
Writes:
<data_directory>/output/ET/xlnet-base-cased/added_tokens.json # --output_dir
<data_directory>/output/ET/xlnet-base-cased/config.json # --output_dir
<data_directory>/output/ET/xlnet-base-cased/eval_results.txt # --output_dir
<data_directory>/output/ET/xlnet-base-cased/pytorch_model.bin # --output_dir
<data_directory>/output/ET/xlnet-base-cased/special_tokens_map.json # --output_dir
<data_directory>/output/ET/xlnet-base-cased/spiece.model # --output_dir
<data_directory>/output/ET/xlnet-base-cased/test_results.txt # --output_dir
<data_directory>/output/ET/xlnet-base-cased/training_args.bin # --output_dir
<data_directory>/output/ET/xlnet-base-cased/checkpoint-100/config.json # --output_dir
<data_directory>/output/ET/xlnet-base-cased/checkpoint-100/pytorch_model.bin # --output_dir
<data_directory>/output/ET/xlnet-base-cased/checkpoint-100/training_args.bin # --output_dir
<data_directory>/output/ET/xlnet-base-cased/checkpoint-200/[same as for checkpoint-100] # --output_dir
<data_directory>/output/ET/xlnet-base-cased/checkpoint-300/[same as for checkpoint-100] # --output_dir
<data_directory>/output/ET/xlnet-base-cased/checkpoint-400/[same as for checkpoint-100] # --output_dir
<data_directory>/output/ET/xlnet-base-cased/checkpoint-500/[same as for checkpoint-100] # --output_dir
**`experiments_1_QG_train_gpt2.sh`**

Reads:
<data_directory>/original/SQuAD1.1-Zhou/train.txt # --train_dataset_path
<data_directory>/original/SQuAD1.1-Zhou/dev.txt # --dev_dataset_path
Writes:
<data_directory>/output/QG/gpt2_question_generation/[...] # --output_dir
**`experiments_1_QG_train_seq2seq.sh`**

Reads:
<data_directory>/original/SQuAD1.1-Zhou/train.txt # --train_file
<data_directory>/original/SQuAD1.1-Zhou/dev.txt # --dev_file
<data_directory>/original/SQuAD1.1-Zhou/test.txt # --test_file
?
Writes:
<data_directory>/output/checkpoint/FQG_squad_hard-oov_1_128_False_False_True_True_False_False_True_20/FQG_checkpoint_epoch<x>....pth.tar # --checkpoint_dir
<data_directory>/output/checkpoint/FQG_squad_hard-oov_1_128_False_False_True_True_False_False_True_20/model_best.pth.tar # --checkpoint_dir
<data_directory>/processed/SQuAD1.1-Zhou/counters.pkl # --counters_file
<data_directory>/processed/SQuAD1.1-Zhou/dev-eval.pkl # --dev_eval_file
<data_directory>/processed/SQuAD1.1-Zhou/dev-examples.pkl # --dev_examples_file
<data_directory>/processed/SQuAD1.1-Zhou/dev-meta.pkl # --dev_meta_file
<data_directory>/processed/SQuAD1.1-Zhou/emb_dicts.pkl # --emb_dicts_file
<data_directory>/processed/SQuAD1.1-Zhou/emb_mats.pkl # --emb_mats_file
<data_directory>/processed/SQuAD1.1-Zhou/related_words_dict.pkl # --related_words_dict_file
<data_directory>/processed/SQuAD1.1-Zhou/related_words_ids_mat.pkl # --related_words_ids_mat_file
<data_directory>/processed/SQuAD1.1-Zhou/test-eval.pkl # --test_eval_file
<data_directory>/processed/SQuAD1.1-Zhou/test-examples.pkl # --test_examples_file
<data_directory>/processed/SQuAD1.1-Zhou/test-meta.pkl # --test_meta_file
<data_directory>/processed/SQuAD1.1-Zhou/train-eval.pkl # --train_eval_file
<data_directory>/processed/SQuAD1.1-Zhou/train-examples.pkl # --train_examples_file
<data_directory>/processed/SQuAD1.1-Zhou/train-meta.pkl # --train_meta_file
**`experiments_2-DA_file2sents.sh`**

Reads:
<data_directory>/original/Wiki10000/SQuAD2.0/train-v2.0.json # --da_input_file
<data_directory>/original/Wiki10000/wiki10000.json # --da_input_file
Writes:
<data_directory>/processed/SQuAD2.0/train.sentences.txt # --da_sentences_file
<data_directory>/processed/SQuAD2.0/train.paragraphs.txt # --da_paragraphs_file
<data_directory>/processed/Wiki10000/wiki10000.sentences.txt # --da_sentences_file
<data_directory>/processed/Wiki10000/wiki10000.paragraphs.txt # --da_paragraphs_file
<data_directory>/processed/SQuAD1.1-Zhou/squad_ans_clue_style_info.pkl
<data_directory>/processed/SQuAD1.1-Zhou/squad_sample_probs.pkl
**`experiments_3_DA_sents2augsents.sh`**

Reads:
<data_directory>/processed/SQuAD2.0/train.sentences.txt # --da_sentences_file
<data_directory>/processed/Wiki10000/wiki10000.sentences.txt # --da_sentences_file
<data_directory>/processed/SQuAD1.1-Zhou/squad_sample_probs.pkl
Writes:
<data_directory>/processed/SQuAD2.0/train.sentences.augmented.<x>_<y>.pkl # --da_augmented_sentences_file
<data_directory>/processed/Wiki10000/wiki10000.sentences.augmented.<x>_<y>.pkl # --da_augmented_sentences_file
**`experiments_4_QG_generate_gpt2.sh`**

Reads:
<data_directory>/output/QG/gpt2_question_generation/4epochs/2batchsize/[...] # --model_name_or_path
<data_directory>/processed/SQuAD2.0/train.sentences.augmented.<x>_<y>.pkl # --filename
<data_directory>/processed/Wiki10000/wiki10000.sentences.augmented.<x>_<y>.pkl # --filename
Writes:
<data_directory>/processed/SQuAD2.0/train.sentences.augmented.<x>_<y>.cache.qg.gpt2.pth # --filecache
<data_directory>/processed/SQuAD2.0/train.qa.<x>_<y>.qg.generated.gpt2.json # --output_file
<data_directory>/processed/Wiki10000/wiki10000.sentences.augmented.<x>_<y>.cache.qg.gpt2.pth # --filecache
<data_directory>/processed/Wiki10000/wiki10000.qa.<x>_<y>.qg.generated.gpt2.json # --output_file
**`experiments_4_QG_generate_seq2seq.sh`**

Reads:
<data_directory>/output/checkpoint/ # --checkpoint_dir
<data_directory>/processed/SQuAD2.0/train.paragraphs.txt # --da_paragraphs_file
<data_directory>/processed/SQuAD2.0/train.sentences.augmented.<x>_<y>.pkl # --da_augmented_sentences_file
<data_directory>/processed/Wiki10000/wiki10000.paragraphs.txt # --da_paragraphs_file
<data_directory>/processed/Wiki10000/wiki10000.sentences.augmented.<x>_<y>.pkl # --da_augmented_sentences_file
Writes:
<data_directory>/processed/SQuAD2.0/train.sentences.augmented.<x>_<y>.processed.pkl # --qg_augmented_sentences_file
<data_directory>/processed/SQuAD2.0/train.sentences.augmented.<x>_<y>.output.txt # --qg_result_file
<data_directory>/processed/SQuAD2.0/train.qa.<x>_<y>.txt # --qa_data_file
<data_directory>/processed/Wiki10000/wiki10000.sentences.augmented.<x>_<y>.processed.pkl # --qg_augmented_sentences_file
<data_directory>/processed/Wiki10000/wiki10000.sentences.augmented.<x>_<y>.output.txt # --qg_result_file
<data_directory>/processed/Wiki10000/wiki10000.qa.<x>_<y>.txt # --qa_data_file
**`experiments_5_uniq_seq2seq.sh`**

Reads:
<data_directory>/processed/SQuAD2.0/train.qa.<x>_<y>.txt
<data_directory>/processed/Wiki10000/wiki10000.qa.<x>_<y>.txt
Writes:
<data_directory>/processed/SQuAD2.0/train.qa.<x>_<y>.uniq.txt
<data_directory>/processed/Wiki10000/wiki10000.qa.<x>_<y>.uniq.txt
**`experiments_6_postprocess_seq2seq.sh`**

Reads:
<data_directory>/processed/SQuAD2.0/train.qa.<x>_<y>.uniq.txt # --input_file
<data_directory>/processed/Wiki10000/wiki10000.qa.<x>_<y>.uniq.txt # --input_file
Writes:
<data_directory>/processed/SQuAD2.0/train.qa.<x>_<y>.uniq.postprocessed.txt # --output_file
<data_directory>/processed/Wiki10000/wiki10000.qa.<x>_<y>.uniq.postprocessed.txt # --output_file
**`experiments_7_ET.sh`**

Reads:
? <data_directory>/glue_data/MRPC/[...] # --data_dir
? <data_directory>/output/ET/xlnet-base-cased/[...] # --output_dir
<data_directory>/processed/SQuAD2.0/train.qa.<x>_<y>.uniq.postprocessed.txt # --context_question_answer_file
<data_directory>/processed/Wiki10000/wiki10000.qa.<x>_<y>.uniq.postprocessed.txt # --context_question_answer_file
Writes:
<data_directory>/processed/SQuAD2.0/train.qa.<x>_<y>.entail.txt # --context_question_answer_score_file
<data_directory>/processed/Wiki10000/wiki10000.qa.<x>_<y>.entail.txt # --context_question_answer_score_file
**`experiments_8_DE.sh`**

Reads:
<data_directory>/processed/SQuAD2.0/train.qa.<x>_<y>.entail.txt # --input_file
<data_directory>/processed/SQuAD2.0/train.sentences.augmented.<x>_<y>.processed.pkl # --input_augmented_pkl_file
<data_directory>/processed/Wiki10000/wiki10000.qa.<x>_<y>.entail.txt # --input_file
<data_directory>/processed/Wiki10000/wiki10000.sentences.augmented.<x>_<y>.processed.pkl # --input_augmented_pkl_file
Writes:
<data_directory>/processed/SQuAD2.0/train.qa.<x>_<y>.entail.de.txt # --output_file
<data_directory>/processed/Wiki10000/wiki10000.qa.<x>_<y>.entail.de.txt # --output_file
- I couldn't find `dev.tsv` for the MRPC dataset anywhere, so instead I split the test set into two parts, `test.tsv` and `dev.tsv`. This is most likely not the original split.
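Purely as an illustration of the kind of split described above (the 50/50 proportions and the commands are assumptions, not the actual procedure used), such a split could be produced like this:

```bash
# Illustrative only: split MRPC test.tsv into a new test.tsv and dev.tsv.
# The 50/50 proportions are an assumption; the split shipped with this code may differ.
cd <data_directory>/glue_data/MRPC
total=$(($(wc -l < test.tsv) - 1))                      # number of data rows (excluding the header)
half=$((total / 2))
head -n 1 test.tsv > dev.tsv                            # copy the header line to dev.tsv
tail -n "$half" test.tsv >> dev.tsv                     # last half of the rows becomes dev.tsv
head -n $((1 + total - half)) test.tsv > test.tmp.tsv   # header + first half stays as test
mv test.tmp.tsv test.tsv
```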