This is the official repository for the paper "LIQUID: A Framework for List Question Answering Dataset Generation" (presented at AAAI 2023). This repository provides the implementation of the LIQUID model and guidelines on how to run it to synthesize list QA data. You can also download the generated datasets instead of creating them from scratch (see the download links below).
LIQUID is an automated framework for generating list QA datasets from unlabeled corpora. Datasets generated by LIQUID can be used to improve list QA performance by supplementing insufficient human-labeled data. By training a list QA model on the generated data and then fine-tuning it on the target training data, we achieved new state-of-the-art performance on MultiSpanQA and outperformed baselines on several benchmarks, including Quoref and BioASQ.
LIQUID comprises the following four stages (please refer to our paper for details).
- (1) Answer extraction: the named entities belonging to the same entity type (e.g., organization type) in a summary are extracted by an NER model and used as candidate answers.
- (2) Question generation: the candidate answers and the original passage are fed into a QG model to generate list questions.
- (3) Iterative filtering: incorrect answers (e.g., Hanszen) are iteratively filtered based on the confidence score assigned by a QA model.
- (4) Answer expansion: correct but omitted answers (e.g., Yale) are identified by the QA model.
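The filtering stage (3) can be illustrated with a small sketch. This is not the repository's implementation: the function name `iterative_filter`, the `score_fn` callable standing in for a QA model, the 0.5 threshold, and the toy scores are all hypothetical, chosen only to show how candidates might be re-scored and pruned until the set stops changing.

```python
from typing import Callable, Dict, List


def iterative_filter(
    candidates: List[str],
    score_fn: Callable[[List[str]], Dict[str, float]],
    threshold: float = 0.5,  # hypothetical cutoff, not a value from the paper
    max_rounds: int = 10,
) -> List[str]:
    """Repeatedly drop candidates whose confidence is below the threshold.

    score_fn stands in for a QA model: it receives the current candidates
    and returns a confidence score for each of them.
    """
    current = list(candidates)
    for _ in range(max_rounds):
        if not current:
            break
        scores = score_fn(current)
        kept = [c for c in current if scores.get(c, 0.0) >= threshold]
        if len(kept) == len(current):
            break  # nothing was removed, so the set is stable
        current = kept
    return current


# Toy scorer with made-up confidences, used only for illustration.
def toy_scores(cands: List[str]) -> Dict[str, float]:
    fixed = {"Entity A": 0.9, "Entity B": 0.8, "Hanszen": 0.1}
    return {c: fixed.get(c, 0.0) for c in cands}


filtered = iterative_filter(["Entity A", "Entity B", "Hanszen"], toy_scores)
print(filtered)
```

Passing the whole candidate list to `score_fn` on every round leaves room for a scorer whose confidences depend on the remaining candidates, which is what makes repeated rounds meaningful.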
Use the links below to download the synthetic datasets without having to create a dataset from scratch. ✶ indicates they are the same data used in our experiments. Our data format follows that of SQuAD-v1.1.
Name | Corpus | Size | Link |
---|---|---|---|
liquid-wiki-140k (✶) | Wikipedia | 140k | http://nlp.dmis.korea.edu/projects/liquid-lee-et-al-2023/liquid-wiki-140k.json |
liquid-pubmed-140k (✶) | PubMed | 140k | http://nlp.dmis.korea.edu/projects/liquid-lee-et-al-2023/liquid-pubmed-140k.json |
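Since the data follows the SQuAD-v1.1 layout, a downloaded file can be read with plain JSON tooling. The sketch below walks the standard `data` / `paragraphs` / `qas` / `answers` nesting. It assumes that the multiple answers of a list question are stored as separate entries in the `answers` array; the helper name `iter_list_questions` and the tiny example record are made up for illustration.

```python
from typing import Dict, Iterator, List, Tuple


def iter_list_questions(squad: Dict) -> Iterator[Tuple[str, List[str]]]:
    """Yield (question, answer texts) pairs from SQuAD-v1.1-style data."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield qa["question"], [a["text"] for a in qa["answers"]]


# Tiny in-memory record with made-up content, shaped like SQuAD v1.1.
example = {
    "version": "1.1",
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "Alpha and Beta are letters.",
                    "qas": [
                        {
                            "id": "example-0",
                            "question": "Which letters are mentioned?",
                            "answers": [
                                {"text": "Alpha", "answer_start": 0},
                                {"text": "Beta", "answer_start": 10},
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}

# For a downloaded file, load it first (adjust the path as needed):
#   import json
#   with open("liquid-wiki-140k.json", encoding="utf-8") as f:
#       example = json.load(f)

pairs = list(iter_list_questions(example))
print(pairs)
```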
Download this repository and set up an environment as follows.
# Clone the repository
git clone https://github.com/sylee0520/LIQUID.git
cd LIQUID
# Create a conda virtual environment
conda create -n liquid python=3.8
conda activate liquid
# Install all requirements
pip install -r requirements.txt
Download an unlabeled source corpus to be annotated and extract/unpack it to the correct directory. Choose either Wikipedia or PubMed depending on your target domain. ✶ indicates they are the same data used in our experiments.
Description | Directory | Link |
---|---|---|
2018-12-20 version of Wikipedia (✶) | ./data/unlabeled/wiki/ | http://nlp.dmis.korea.edu/projects/liquid-lee-et-al-2023/wiki181220.zip |
2019-01-02 version of PubMed (✶) | ./data/unlabeled/pubmed/ | http://nlp.dmis.korea.edu/projects/liquid-lee-et-al-2023/pubmed190102.zip |
Note that passages in each file have not been shuffled. You will have to randomly sample passages from the entire corpus files (e.g., "0000.json" to "5620.json" for Wikipedia) if you want to use sampled passages.
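One simple way to start is to draw a random subset of corpus files and then sample passages from within them. The sketch below only covers the file-level step, because the internal layout of each file is not described here. The helper name `sample_corpus_files`, the sample size, and the seed are illustrative assumptions; the four-digit zero-padded names follow the "0000.json" to "5620.json" pattern mentioned above.

```python
import random
from typing import List


def sample_corpus_files(first: int, last: int, k: int, seed: int = 0) -> List[str]:
    """Pick k distinct file names from the zero-padded range first..last."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    indices = rng.sample(range(first, last + 1), k)
    return [f"{i:04d}.json" for i in sorted(indices)]


# Wikipedia files run from 0000.json to 5620.json.
files = sample_corpus_files(0, 5620, k=5)
print(files)
```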
In LIQUID, two types of NER models are used to extract candidate answers for the general and biomedical domains, respectively. Please refer to the instructions below to install the NER models.
- For the general domain, run the following command to install the spaCy NER system.
python -m spacy download en_core_web_sm
- For the biomedical domain, install BERN2 from the official GitHub repository. After installation is complete, refer to the instructions below and run the model in the background. Note that you need to create a new conda environment for BERN2 instead of reusing the environment for LIQUID.
# Run BERN2 model
export CUDA_VISIBLE_DEVICES=0
conda activate BERN2
cd BERN2/scripts
# For Linux and MacOS
bash run_bern2.sh
# For Windows
bash run_bern2_windows.sh
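Whichever NER model is used, the answer extraction stage keeps entities that share an entity type as candidate answers. The sketch below shows that grouping step on plain (text, label) pairs. It is an illustration rather than the repository's code: the helper name `group_candidates`, the `min_size` cutoff of two entities per type, and the toy input with its labels are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_candidates(
    entities: Iterable[Tuple[str, str]], min_size: int = 2
) -> Dict[str, List[str]]:
    """Group (text, label) pairs by label and keep labels with enough distinct entities."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text, label in entities:
        if text not in groups[label]:  # skip repeated mentions of the same entity
            groups[label].append(text)
    # min_size=2 is an assumed cutoff: a list question needs several answers.
    return {label: texts for label, texts in groups.items() if len(texts) >= min_size}


# Toy input; the labels here are assigned by hand for illustration.
ents = [("Yale", "ORG"), ("Hanszen", "ORG"), ("Yale", "ORG"), ("Texas", "GPE")]
candidates = group_candidates(ents)
print(candidates)
```

With spaCy, such pairs can be produced from a parsed document as `[(ent.text, ent.label_) for ent in nlp(text).ents]`, where `nlp = spacy.load("en_core_web_sm")`.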
Once you have installed all the requirements, you are ready to create your list QA datasets. Please see the example script below.
export CUDA_VISIBLE_DEVICES=0
export DATA_FILE=./data/unlabeled/wiki/0000.json
export OUTPUT_FILE=./data/synthetic/wiki/0000.json
python generate.py \
--data_file ${DATA_FILE} \
--output_file ${OUTPUT_FILE} \
--batch_size 8 \
--summary_min_length 64 \
--summary_max_length 128 \
--summary_model_name_or_path facebook/bart-large-cnn \
--qg_min_length 64 \
--qg_max_length 128 \
--qg_model_name_or_path mrm8488/t5-base-finetuned-question-generation-ap \
--qa_model_name_or_path thatdramebaazguy/roberta-base-squad \
--do_summary \
--device 0
- `batch_size`: Number of passages to process simultaneously in one batch.
- `summary_min_length`, `summary_max_length`, `qg_min_length`, and `qg_max_length`: Minimum and maximum lengths of the output summary and question, respectively.
- `summary_model_name_or_path`, `qg_model_name_or_path`, and `qa_model_name_or_path`: Model paths for loading the summarization, question-generation, and question-answering models, respectively. For the biomedical domain, you can use `dmis-lab/biobert-base-cased-v1.1-squad` as the QA model.
- `is_biomedical`: Use this option when the target domain is biomedicine.
- `do_summary`: (Recommended) Use this option if you want to summarize input passages and extract candidate answers from the summaries.
- `device`: Set to `0` if you want to use our framework on GPU; otherwise, set to `-1`.
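To process several corpus files, the example command can be assembled programmatically. The sketch below rebuilds the argument list from the example script using the same option values. The helper name `build_generate_command` is hypothetical, and the `--is_biomedical` spelling is assumed by analogy with `--do_summary`; check `generate.py` before relying on it.

```python
import shlex
from typing import List


def build_generate_command(
    data_file: str, output_file: str, biomedical: bool = False, device: int = 0
) -> List[str]:
    """Build the generate.py invocation from the example script as an argument list."""
    qa_model = (
        "dmis-lab/biobert-base-cased-v1.1-squad"
        if biomedical
        else "thatdramebaazguy/roberta-base-squad"
    )
    cmd = [
        "python", "generate.py",
        "--data_file", data_file,
        "--output_file", output_file,
        "--batch_size", "8",
        "--summary_min_length", "64",
        "--summary_max_length", "128",
        "--summary_model_name_or_path", "facebook/bart-large-cnn",
        "--qg_min_length", "64",
        "--qg_max_length", "128",
        "--qg_model_name_or_path", "mrm8488/t5-base-finetuned-question-generation-ap",
        "--qa_model_name_or_path", qa_model,
        "--do_summary",
        "--device", str(device),
    ]
    if biomedical:
        cmd.append("--is_biomedical")  # assumed flag spelling
    return cmd


for name in ["0000.json", "0001.json"]:
    cmd = build_generate_command(
        f"./data/unlabeled/wiki/{name}", f"./data/synthetic/wiki/{name}"
    )
    print(shlex.join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell-quoting issues, and `shlex.join` is used only to print a copy-pastable form.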
To be updated soon.
Please cite our paper if it is helpful or relevant to your work.
@article{lee2023liquid,
title={LIQUID: A Framework for List Question Answering Dataset Generation},
author={Lee, Seongyun and Kim, Hyunjae and Kang, Jaewoo},
journal={arXiv preprint arXiv:2302.01691},
year={2023}
}
Feel free to email us (sy-lee@korea.ac.kr and hyunjae-kim@korea.ac.kr) if you have any questions.