Exploring Contrast Consistency of Open-Domain Question Answering Systems on Minimally Edited Questions

This repository contains the data and code for the paper Exploring Contrast Consistency of Open-Domain Question Answering Systems on Minimally Edited Questions, published in TACL 2023. In this study, we explore the problem of contrast consistency in open-domain question answering by collecting Minimally Edited Questions (MEQs) as challenging contrast sets for the popular Natural Questions (NQ) benchmark, complementing its standard test set. Our experiments show that the widely used dense passage retrieval (DPR) model performs poorly at distinguishing training questions from their minimally edited contrast-set questions. We then improve the contrast consistency of the DPR model through data augmentation and a query-side contrastive learning objective.

Example

Data

Data can be found at this Google Drive link. Data includes:

  • dataset/json: the MEQ contrast sets in text form. Q1 is the original question from the NQ training set. Q2 is the corresponding MEQ, either retrieved from AmbigQA or generated by InstructGPT. Their answers are A1 and A2, respectively.
  • dataset/retrieval: the same data as the JSON files, but in the DPR retrieval input format. Questions are listed in pairs: the odd lines are original NQ training questions and the even lines are the corresponding MEQs.
  • dataset/ranking: data used for ranking evaluation. We provide both the original training questions and their corresponding MEQ contrast sets to test the contrast consistency of the retrieval model (i.e., compare the performance difference between the original question and the MEQ). ambigqa-ranking.json contains 623 examples from MEQ-AmbigQA with their gold evidence passages. gpt-ranking.json contains 1229 examples from MEQ-GPT with their gold evidence passages. Files with name nq-train are the corresponding training questions from NQ.
  • dataset/train: the training data and the per-batch data indices for the model. nq-contrastive-augment-train-dpr.jsonl is the data used to train the model, including the original NQ data and augmented MEQs from PAQ. contrastive-augment-33k-train-batches64_idx.jsonl contains the pre-computed data indices for each batch during training. These indices schedule the positive and negative questions used in the query-side contrastive loss. This file can be used for training on 1, 2, or 4 GPUs. If you are using 8 GPUs, use contrastive-augment-33k-train-batches64_idx-8gpu.jsonl instead (a re-arranged version of the same indices).
  • dataset/dev: the dev set data used when training the model.
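The odd/even layout of the dataset/retrieval files can be consumed by pairing consecutive lines. The following is a minimal sketch, not code from this repository: the helper name `pair_questions` and the sample lines are hypothetical, and it assumes only that each original question line is immediately followed by its MEQ line, as described above. It does not parse the fields inside each line.

```python
from typing import Iterable, Iterator, Tuple


def pair_questions(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (original, meq) pairs from lines laid out as original, MEQ, original, MEQ, ...

    Blank lines are skipped. Raises ValueError if the last original has no MEQ.
    """
    pending = None  # holds an original question until its MEQ line arrives
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if pending is None:
            pending = line
        else:
            yield pending, line
            pending = None
    if pending is not None:
        raise ValueError("odd number of non-blank lines; last question has no MEQ")


# Hypothetical sample lines, invented for illustration only.
sample = [
    "original question one\n",
    "edited question one\n",
    "original question two\n",
    "edited question two\n",
]
pairs = list(pair_questions(sample))
```

With a real file, an open text file handle can be passed in place of `sample`, since the function accepts any iterable of lines.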

Model

Besides the data, this repo also contains the code for training the improved DPR model described in the paper, which is trained with additional augmented data from PAQ and a query-side contrastive learning objective. The full pipeline (training the model, generating Wikipedia passage embeddings, retrieving passages from Wikipedia, and evaluating the retrieval results) is in scripts/train_dpr.sh.
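To give a rough idea of what a query-side contrastive objective looks like, here is a generic, textbook-style InfoNCE loss over question embeddings, written in plain Python. It is an illustrative sketch only and is not taken from this repository: the function names, the dot-product similarity, and the `temperature` parameter are assumptions, and the actual training code may differ in all of these details.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 1.0,  # assumed hyperparameter, not from the README
) -> float:
    """Generic InfoNCE: -log softmax score of the positive among all candidates."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Toy 2-d embeddings, invented for illustration.
question = [1.0, 0.0]
similar_question = [1.0, 0.0]   # plays the role of a positive question
edited_question = [0.0, 1.0]    # plays the role of a negative question (e.g. an MEQ)

low_loss = info_nce(question, similar_question, [edited_question])
high_loss = info_nce(question, edited_question, [similar_question])
```

The loss is small when the anchor question is closer to its positive than to its negatives, and grows when the roles are swapped, which is the behaviour a contrastive objective over questions relies on.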

Environment

The Python environment mainly follows the one used by the original DPR repo.

  1. Install PyTorch:
pip install torch==1.8.2 torchvision==0.9.2 torchaudio==0.8.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu111
  2. Install the other dependencies:
pip install -r requirements.txt

Checkpoints

Checkpoints of the models used in the paper can be found at this Box link. It contains the following checkpoints:

  • DPR_base_contrastive_33k_batch64_lr1e-5_epoch40_loss0.5_start5: the best DPR-base checkpoint in the ranking evaluation. This model is trained with the InfoNCE loss with weight 0.5, and the contrastive loss starts at epoch 5. This is equivalent to setting QUESTION_LOSS=contrastive HINGE_MARGIN=0 CONTRAST_LOSS=0.5 CONTRAST_START_EPOCH=5 in scripts/train_dpr.sh.
  • DPR_base_dot_33k_batch64_lr1e-5_epoch40_loss0.03: the best DPR-base checkpoint in the retrieval and QA evaluation. This model is trained with the dot product loss with weight 0.03. This is equivalent to setting QUESTION_LOSS=dot HINGE_MARGIN=0 CONTRAST_LOSS=0.03 CONTRAST_START_EPOCH=0 in scripts/train_dpr.sh.
  • DPR_large_PAQ_dot_33k_batch32_lr1e-5_epoch40_loss0.003: the best DPR-large checkpoint in the ranking evaluation. This model is trained with the dot product loss with weight 0.003. This is equivalent to setting QUESTION_LOSS=dot HINGE_MARGIN=0 CONTRAST_LOSS=0.003 CONTRAST_START_EPOCH=0 in scripts/train_dpr.sh. Add the option encoder.pretrained_model_cfg=bert-large-uncased to switch to the BERT-large encoder.
  • DPR_large_PAQ_dot_33k_batch32_lr1e-5_epoch40_loss0.03: the best DPR-large checkpoint in the retrieval and QA evaluation. This model is trained with the dot product loss with weight 0.03. This is equivalent to setting QUESTION_LOSS=dot HINGE_MARGIN=0 CONTRAST_LOSS=0.03 CONTRAST_START_EPOCH=0 in scripts/train_dpr.sh. Add the option encoder.pretrained_model_cfg=bert-large-uncased to switch to the BERT-large encoder.
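The settings listed above can be collected in one place, together with a sketch of how a loss weight and a start epoch might interact. The setting values below are copied from the checkpoint descriptions; everything else is hypothetical. In particular, `total_loss` assumes that the weight scales an auxiliary question-side term added to the standard retrieval loss, and that the term is switched on once the epoch number reaches CONTRAST_START_EPOCH. Neither detail is confirmed here, so check scripts/train_dpr.sh and the training code for the actual behaviour.

```python
# Values copied from the checkpoint list above. The two DPR-large checkpoints
# additionally need the option encoder.pretrained_model_cfg=bert-large-uncased.
CHECKPOINT_SETTINGS = {
    "DPR_base_contrastive_33k_batch64_lr1e-5_epoch40_loss0.5_start5": {
        "QUESTION_LOSS": "contrastive", "HINGE_MARGIN": 0,
        "CONTRAST_LOSS": 0.5, "CONTRAST_START_EPOCH": 5,
    },
    "DPR_base_dot_33k_batch64_lr1e-5_epoch40_loss0.03": {
        "QUESTION_LOSS": "dot", "HINGE_MARGIN": 0,
        "CONTRAST_LOSS": 0.03, "CONTRAST_START_EPOCH": 0,
    },
    "DPR_large_PAQ_dot_33k_batch32_lr1e-5_epoch40_loss0.003": {
        "QUESTION_LOSS": "dot", "HINGE_MARGIN": 0,
        "CONTRAST_LOSS": 0.003, "CONTRAST_START_EPOCH": 0,
    },
    "DPR_large_PAQ_dot_33k_batch32_lr1e-5_epoch40_loss0.03": {
        "QUESTION_LOSS": "dot", "HINGE_MARGIN": 0,
        "CONTRAST_LOSS": 0.03, "CONTRAST_START_EPOCH": 0,
    },
}


def as_assignments(settings: dict) -> str:
    """Render settings as space-separated KEY=VALUE pairs, in insertion order."""
    return " ".join(f"{key}={value}" for key, value in settings.items())


def total_loss(retrieval_loss: float, question_loss: float, epoch: int, settings: dict) -> float:
    """Hypothetical combination: add the weighted question-side loss from the start epoch on."""
    if epoch < settings["CONTRAST_START_EPOCH"]:
        return retrieval_loss
    return retrieval_loss + settings["CONTRAST_LOSS"] * question_loss


base = CHECKPOINT_SETTINGS["DPR_base_contrastive_33k_batch64_lr1e-5_epoch40_loss0.5_start5"]
assignments = as_assignments(base)
before_start = total_loss(1.0, 2.0, epoch=4, settings=base)
after_start = total_loss(1.0, 2.0, epoch=5, settings=base)
```

Under these assumptions, the question-side term contributes nothing before the start epoch and is scaled by the configured weight afterwards.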

Citation

If you use our data or code, please cite our paper:

@article{zhang2023exploring,
  author={Zhihan Zhang and Wenhao Yu and Zheng Ning and Mingxuan Ju and Meng Jiang},
  title={Exploring Contrast Consistency of Open-Domain Question Answering Systems on Minimally Edited Questions},
  journal={Transactions of the Association for Computational Linguistics},
  volume={11},
  year={2023},
  publisher={MIT Press}
}