jjeongah/Open_Domain_Question_Answering

Open-Domain Question Answering

1️⃣ Introduction

This project handles the Open-Domain Question Answering (ODQA) task, which aims to answer questions based on world knowledge when no specific passage is given.
The model consists of two stages: the "retriever" stage, which finds relevant documents related to the question, and the "reader" stage, which reads and identifies appropriate answers within the retrieved documents.

- Evaluation metrics: Exact Match (EM), F1 Score
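
The two metrics can be sketched as below. This is a minimal illustration, not the project's evaluation script: the `normalize` helper and whitespace tokenization are assumptions, and the official scorer may normalize text differently (for example, character-level F1 for Korean).

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only a fully correct answer string, while F1 gives partial credit when the predicted span overlaps the reference.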

Project Tree
.
├─ README.md
├─ assets
│  ├─ mrc.png
│  └─ odqa.png
├─ inference.py # for testing
├─ install
│  └─ install_requirements.sh
├─ mlm.py
├─ model
│  └─ Retrieval
│     ├─ BertEncoder.py
│     └─ RobertaEncoder.py
├─ mrc.py
├─ notebooks
│  ├─ EDA.ipynb
│  ├─ EM_compare.ipynb
│  ├─ compare_json.ipynb
│  ├─ ensemble_inference.ipynb
│  ├─ korquad_json_to_datset.ipynb
│  ├─ nbest_analyze.ipynb
│  └─ readme.md
├─ postprocess.py
├─ retrieval.py # for comparing retrievers' performance
├─ train.py # for training the reader
├─ train_dpr.py # for training the dense retriever
├─ trainer
│  └─ DenseRetrievalTrainer.py
└─ utils
   ├─ arguments.py
   ├─ run_mlm.py
   ├─ trainer_qa.py
   └─ utils_qa.py

2️⃣ Team Introduction

김별희 이원재 이정아 임성근 정준녕

3️⃣ Data

(Figure: ODQA data overview)

The MRC (Machine Reading Comprehension) data can be loaded with the datasets library provided by HuggingFace. Given the directory where the dataset is stored (dataset_name), you can load it as follows:

# To load the train_dataset
from datasets import load_from_disk
dataset = load_from_disk("./data/train_dataset/")
print(dataset)

The corpus used for retrieval, which is the set of documents, is stored in ./data/wikipedia_documents.json. It consists of approximately 57,000 unique documents. The dataset is stored in the pyarrow format for convenience using the datasets library. The following is the directory structure of ./data:

# Entire dataset
./data/
    # Dataset used for training. Consists of train and validation sets.
    ./train_dataset/
    # Dataset used for submission. Consists of the validation set.
    ./test_dataset/
    # Wikipedia document corpus used for retrieval.
    ./wikipedia_documents.json
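
A possible way to read the retrieval corpus and keep only unique document texts is sketched below. The JSON layout (a mapping from an id to a record with a "text" field) is an assumption and is not confirmed by this README; `unique_texts` and `load_unique_texts` are hypothetical helper names.

```python
import json
from pathlib import Path


def unique_texts(wiki: dict) -> list[str]:
    """Return document texts with duplicates removed, keeping first-seen order."""
    # ASSUMPTION: each value is a record that carries its body under "text".
    return list(dict.fromkeys(doc["text"] for doc in wiki.values()))


def load_unique_texts(path: str) -> list[str]:
    """Read the corpus JSON from disk and deduplicate its texts."""
    with Path(path).open(encoding="utf-8") as f:
        wiki = json.load(f)
    return unique_texts(wiki)


# Usage (path taken from the directory structure above):
# contexts = load_unique_texts("./data/wikipedia_documents.json")
```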

Data Example

(Figure: example data record)

Label Description
- id: Unique id of the question
- question: The question
- answers: Information about the answer. Each question has only one answer.
  - answer_start: The starting position of the answer
  - text: The text of the answer
- context: The document containing the answer
- title: The title of the document
- document_id: Unique id of the document
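
To make the relationship between the fields concrete, the sketch below builds a made-up record and checks that answer_start really points at text inside context. It assumes answer_start is a character offset and that answers holds parallel lists, as in SQuAD-style datasets; neither is confirmed by this README, and `answer_matches_context` is a hypothetical helper.

```python
def answer_matches_context(example: dict) -> bool:
    """Check that every answer text is found in the context at its start offset."""
    context = example["context"]
    answers = example["answers"]
    return all(
        context[start:start + len(text)] == text
        for start, text in zip(answers["answer_start"], answers["text"])
    )


# Illustrative record only; the values are invented for this example.
example = {
    "id": "example-0",
    "question": "What is the capital of France?",
    "answers": {"answer_start": [0], "text": ["Paris"]},
    "context": "Paris is the capital of France.",
    "title": "France",
    "document_id": 0,
}
```

A check like this is a cheap way to catch offset errors after preprocessing steps that modify the context.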

4️⃣ Model Description

Reader

The Reader identifies sub-strings of the given context that can serve as answers to the query.
It uses the ModelForQuestionAnswering structure from the transformers library to compute, for each token in the context, the probability that it is the start or the end point of the answer.
The maximum length of the answer can be specified using config.utils.max_answer_length.
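
The span selection described above can be sketched as follows. This is a simplified illustration, not the project's post-processing code: it scores a span as the sum of its start and end logits and uses plain lists in place of model outputs. `best_span` is a hypothetical name.

```python
def best_span(start_logits, end_logits, max_answer_length):
    """Return (score, start, end) of the highest-scoring valid answer span.

    A span is valid when start <= end and it covers at most
    max_answer_length tokens.
    """
    best = (float("-inf"), 0, 0)
    for start, start_score in enumerate(start_logits):
        # The last allowed end index keeps the span within max_answer_length tokens.
        stop = min(start + max_answer_length, len(end_logits))
        for end in range(start, stop):
            score = start_score + end_logits[end]
            if score > best[0]:
                best = (score, start, end)
    return best


start_logits = [0.1, 2.0, 0.3, 0.2]
end_logits = [0.0, 0.5, 0.1, 3.0]
score, start, end = best_span(start_logits, end_logits, max_answer_length=2)
```

With a larger max_answer_length the search may pick a longer span that pairs the strongest start token with the strongest end token.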

Retriever

The Retriever retrieves relevant documents from the database for the given query sentence.
The number of documents retrieved can be specified using config.retriever.topk.

1. Sparse

To use Sparse Embedding, specify sparse as config.retriever.type.

(1) TF-IDF

Embeds the query sentence and context documents using Scikit-learn's TfidfVectorizer. The maximum size of the tf-idf vector can be specified using config.retriever.sparse.tfidf_num_features. After fitting, the tf-idf vectorized context documents and the TfidfVectorizer object will be stored in the config.path.context folder where the context sentences are stored.
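
A minimal sketch of TF-IDF retrieval with Scikit-learn's TfidfVectorizer is shown below. The function names, the default tokenizer, and the toy documents are assumptions for illustration; `num_features` stands in for config.retriever.sparse.tfidf_num_features and `topk` for config.retriever.topk.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def build_index(contexts, num_features=None):
    """Fit a TF-IDF vectorizer on the contexts and return it with the doc matrix."""
    vectorizer = TfidfVectorizer(max_features=num_features)
    matrix = vectorizer.fit_transform(contexts)
    return vectorizer, matrix


def retrieve(query, vectorizer, matrix, topk=1):
    """Return (doc_index, score) pairs for the topk most similar contexts."""
    query_vec = vectorizer.transform([query])
    # Rows are L2-normalized by default, so the dot product is cosine similarity.
    scores = (query_vec @ matrix.T).toarray().ravel()
    top = np.argsort(-scores)[:topk]
    return [(int(i), float(scores[i])) for i in top]


contexts = [
    "Seoul is the capital of South Korea",
    "Paris is the capital of France",
    "The Han River flows through Seoul",
]
vectorizer, matrix = build_index(contexts, num_features=1000)
results = retrieve("capital of France", vectorizer, matrix, topk=2)
```

Persisting the fitted vectorizer and matrix, as described above, avoids refitting on every run.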

(2) BM25

2. Dense

To use Dense Embedding, specify dense as config.retriever.type.
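
Dense retrieval ranks passages by the similarity between a query embedding and passage embeddings. The sketch below assumes dot-product similarity and uses toy vectors in place of encoder outputs; `dense_topk` is a hypothetical helper, not part of this repository.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def dense_topk(query_emb, passage_embs, topk):
    """Return indices of the topk passages with the highest dot-product score."""
    scores = [dot(query_emb, p) for p in passage_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:topk]


# Toy 2-dimensional embeddings standing in for encoder outputs.
query_emb = [1.0, 0.0]
passage_embs = [[0.2, 0.9], [0.9, 0.1], [0.5, 0.5]]
top_indices = dense_topk(query_emb, passage_embs, topk=2)
```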

Faiss

You can decide whether to use Faiss for retrieval by setting config.retriever.faiss.use_faiss to True. You can adjust the number of clusters created by IndexIVFScalarQuantizer using config.retriever.faiss.num_clusters, and the quantizer method used for indexing and distance calculation can be set using config.retriever.faiss.metric.

5️⃣ How to Run

Environment Setup

$ bash install/install_requirements.sh

Config

In this template, all training and inference settings are controlled through a YAML config file. Select the config file with --config or -c on the command line (default: custom_config.yaml).
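
The command-line option could be parsed roughly as follows. This is a sketch under assumptions: the scripts' actual argument handling is not shown in this README, and passing the name without the .yaml extension is inferred from the `-c base_config` commands used in this document.

```python
import argparse


def parse_args(argv=None):
    """Parse the config selection flag; argv=None reads from sys.argv."""
    parser = argparse.ArgumentParser(description="ODQA training and inference")
    parser.add_argument(
        "--config",
        "-c",
        type=str,
        default="custom_config",  # ASSUMPTION: name given without the extension
        help="name of the YAML config file to use",
    )
    return parser.parse_args(argv)


args = parse_args(["-c", "base_config"])
```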

Training

Retriever

Dense

python train_dpr.py -c base_config

Reader

python train.py -c base_config

In train.py, the MRC (Machine Reading Comprehension) reader is trained and validated (refer to mrc.py for MRC related details).
The pretrained model to be used as the reader can be specified using config.model.name_or_path.
config.model.name_or_path should contain the name of the model registered in the HuggingFace hub (e.g., nlpotato/roberta-base-e5) or the checkpoint path of a locally saved pretrained model (e.g., saved_models/nlpotato/roberta-base-e5/LWJ_12-23-22-11/checkpoint-9500).

The arguments required for the Trainer can be set in config.train, and optimizer-related settings can be adjusted using config.optimizer.
For detailed explanations of trainer settings, refer to the HuggingFace official documentation.
Tokenizer-related settings can be adjusted using config.tokenizer, and the tokenizer model is the same as the one specified in config.model.name_or_path.

The trained language model and tokenizer files will be saved in the path specified by config.train.output_dir.
If output_dir is not specified, each training run creates an output folder at "saved_models/model_name/run_id", where model_name is the pretrained model used and run_id is a unique id based on the training start time.
To resume training, set config.path.resume to the folder where the trainer checkpoint is stored.
To upload the trained model and tokenizer to the HuggingFace Hub, set config.hf_hub.push_to_hub to True and specify the model name to be registered in config.hf_hub.save_name.
To share it on the Hub, run huggingface-cli login in the terminal to register your HuggingFace account information.

Testing

python inference.py -c base_config
