This repository maintains a machine reading comprehension baseline based on BERT. The implementations follow the baseline system descriptions in the following two papers.
- Improving Question Answering with External Knowledge
- Probing Prior Knowledge Needed in Challenging Chinese Machine Reading Comprehension
If you find this code useful, please consider citing the following papers.
@article{sun2019probing,
title={Probing Prior Knowledge Needed in Challenging Chinese Machine Reading Comprehension},
author={Sun, Kai and Yu, Dian and Yu, Dong and Cardie, Claire},
journal={CoRR},
volume={cs.CL/1904.09679v2},
url={https://arxiv.org/abs/1904.09679v2},
year={2019}
}
@article{pan2019improving,
title={Improving Question Answering with External Knowledge},
author={Pan, Xiaoman and Sun, Kai and Yu, Dian and Ji, Heng and Yu, Dong},
journal={CoRR},
volume={cs.CL/1902.00993v1},
url={https://arxiv.org/abs/1902.00993v1},
year={2019}
}
Here, we show the usage of this baseline using a demo designed for DREAM, a dialogue-based three-choice machine reading comprehension task.
- Download and unzip the pre-trained language model from https://github.com/google-research/bert and set up the environment variable for BERT by `export BERT_BASE_DIR=/PATH/TO/BERT/DIR`.
- Copy the data folder `data` from the DREAM repo to `bert/`.
- In `bert`, execute `python convert_tf_checkpoint_to_pytorch.py --tf_checkpoint_path=$BERT_BASE_DIR/bert_model.ckpt --bert_config_file=$BERT_BASE_DIR/bert_config.json --pytorch_dump_path=$BERT_BASE_DIR/pytorch_model.bin`
- Execute `python run_classifier.py --task_name dream --do_train --do_eval --data_dir . --vocab_file $BERT_BASE_DIR/vocab.txt --bert_config_file $BERT_BASE_DIR/bert_config.json --init_checkpoint $BERT_BASE_DIR/pytorch_model.bin --max_seq_length 512 --train_batch_size 24 --learning_rate 2e-5 --num_train_epochs 8.0 --output_dir dream_finetuned --gradient_accumulation_steps 3`
- The resulting fine-tuned model, predictions, and evaluation results are stored in `bert/dream_finetuned`.
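Before running the conversion, it can help to confirm that `$BERT_BASE_DIR` contains the files the commands above refer to. The following is a minimal sketch and not part of this repository: the helper names `missing_bert_files` and `conversion_command` are hypothetical, and the `.index` suffix on the checkpoint is an assumption about how the downloaded TensorFlow checkpoint is stored on disk.

```python
from pathlib import Path

# Files the commands above read from $BERT_BASE_DIR. The ".index" suffix is an
# assumption about the on-disk layout of the TensorFlow checkpoint.
REQUIRED_FILES = ("bert_config.json", "vocab.txt", "bert_model.ckpt.index")


def missing_bert_files(bert_dir):
    """Return the names in REQUIRED_FILES that are absent from bert_dir."""
    root = Path(bert_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def conversion_command(bert_dir):
    """Build the checkpoint conversion command from the steps above as an argument list."""
    root = Path(bert_dir)
    return [
        "python",
        "convert_tf_checkpoint_to_pytorch.py",
        f"--tf_checkpoint_path={root / 'bert_model.ckpt'}",
        f"--bert_config_file={root / 'bert_config.json'}",
        f"--pytorch_dump_path={root / 'pytorch_model.bin'}",
    ]
```

If `missing_bert_files(os.environ["BERT_BASE_DIR"])` returns an empty list, the argument list from `conversion_command` can be passed to `subprocess.run` from inside `bert`.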
Results on DREAM:
We run each experiment five times with different random seeds and report the best development set accuracy (%) and the corresponding test set accuracy (%).
Method/Language Model | Batch Size | Learning Rate | Epochs | Dev | Test |
---|---|---|---|---|---|
BERT-Base, Uncased | 24 | 2e-5 | 8 | 63.4 | 63.2 |
BERT-Large, Uncased | 24 | 2e-5 | 16 | 66.0 | 66.8 |
Human Performance | | | | 93.9 | 95.5 |
Ceiling Performance | | | | 98.7 | 98.6 |
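The reporting protocol for the model rows (several seeds, pick the run with the best development score, report that run's test score) can be sketched as below. This is an illustrative sketch only: the `Run` type and `select_by_dev` helper are hypothetical names, and the numbers are placeholders, not results from this repository.

```python
from typing import NamedTuple, Sequence


class Run(NamedTuple):
    seed: int
    dev_score: float
    test_score: float


def select_by_dev(runs: Sequence[Run]) -> Run:
    """Return the run with the highest dev score; ties go to the earliest run."""
    if not runs:
        raise ValueError("need at least one run")
    return max(runs, key=lambda r: r.dev_score)


# Placeholder values for illustration only.
runs = [
    Run(seed=1, dev_score=50.0, test_score=51.0),
    Run(seed=2, dev_score=52.0, test_score=49.0),
    Run(seed=3, dev_score=51.0, test_score=53.0),
]
best = select_by_dev(runs)
print(best.seed, best.dev_score, best.test_score)
```

Note that the selected run is the one with the best development score, even when another seed happens to score higher on the test set. Selecting on the test score directly would make the reported number optimistic.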
Environment: The code has been tested with Python 3.6 and PyTorch 1.0.