This project tackles the Open-Domain Question Answering (ODQA) task, which aims to answer questions from world knowledge when no supporting passage is given.
The system consists of two stages: a "retriever" stage, which finds documents relevant to the question, and a "reader" stage, which reads the retrieved documents and extracts the answer from them.
Project Tree

```
.
├─ README.md
├─ assets
│  ├─ mrc.png
│  └─ odqa.png
├─ inference.py              # for testing
├─ install
│  └─ install_requirements.sh
├─ mlm.py
├─ model
│  └─ Retrieval
│     ├─ BertEncoder.py
│     └─ RobertaEncoder.py
├─ mrc.py
├─ notebooks
│  ├─ EDA.ipynb
│  ├─ EM_compare.ipynb
│  ├─ compare_json.ipynb
│  ├─ ensemble_inference.ipynb
│  ├─ korquad_json_to_datset.ipynb
│  ├─ nbest_analyze.ipynb
│  └─ readme.md
├─ postprocess.py
├─ retrieval.py              # for comparing retrievers' performance
├─ train.py                  # for training the reader
├─ train_dpr.py              # for training the dense retriever
├─ trainer
│  └─ DenseRetrievalTrainer.py
└─ utils
   ├─ arguments.py
   ├─ run_mlm.py
   ├─ trainer_qa.py
   └─ utils_qa.py
```
| 김별희 | 이원재 | 이정아 | 임성근 | 정준녕 |
|---|---|---|---|---|
| Github | Github | Github | Github | Github |
The MRC (Machine Reading Comprehension) data can be accessed with HuggingFace's `datasets` library. After setting the dataset directory as `dataset_name`, you can load it as follows:
```python
# To load the train_dataset
from datasets import load_from_disk

dataset = load_from_disk("./data/train_dataset/")
print(dataset)
```
The corpus used for retrieval (the set of documents) is stored in `./data/wikipedia_documents.json`. It consists of approximately 57,000 unique documents.
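Since it is a plain JSON file, the corpus can be inspected directly with the standard `json` module. A minimal sketch (the `text` field name is an assumption about the file's schema):

```python
import json

# Load the Wikipedia corpus used for retrieval.
with open("./data/wikipedia_documents.json", encoding="utf-8") as f:
    wiki = json.load(f)

# The file maps ids to document records; "text" is assumed to be
# the field holding the article body.
texts = [doc["text"] for doc in wiki.values()]
print(f"{len(texts)} documents, {len(set(texts))} unique")
```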
For convenient use with the `datasets` library, the dataset is stored in pyarrow format. The directory structure of `./data` is as follows:
```
# Entire dataset
./data/
    # Dataset used for training. Consists of train and validation sets.
    ./train_dataset/
    # Dataset used for submission. Consists of the validation set.
    ./test_dataset/
    # Wikipedia document corpus used for retrieval.
    ./wikipedia_documents.json
```
Label Description

- `id`: Unique id of the question
- `question`: The question
- `answers`: Information about the answer. Each question has exactly one answer.
  - `answer_start`: The starting position of the answer in the context
  - `text`: The text of the answer
- `context`: The document containing the answer
- `title`: The title of the document
- `document_id`: Unique id of the document
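For instance, an individual training record exposes these fields directly; `answers` typically follows the SQuAD-style dict-of-lists convention:

```python
from datasets import load_from_disk

dataset = load_from_disk("./data/train_dataset/")
example = dataset["train"][0]

print(example["id"], "-", example["question"])
print(example["answers"])        # e.g. {"answer_start": [...], "text": [...]}
print(example["title"], example["document_id"])
print(example["context"][:200])  # first part of the source document
```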
The Reader identifies the sub-strings within the given context that can serve as answers to the query.
It uses the `ModelForQuestionAnswering` architecture from the `transformers` library to compute, for each token in the context, the probability of that token being the start or end point of the answer.
The maximum length of the answer can be specified with `config.utils.max_answer_length`.
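As a rough sketch of what happens inside the reader (the model name, question, and context below are placeholders, and a base model's QA head is untrained, so its output is only illustrative):

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Placeholder checkpoint; the project takes this from config.model.name_or_path.
model_name = "klue/roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "When was the novel written?"
context = "The novel was written by the author in 1999."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)  # one start logit and one end logit per token

# Naive decoding: independent argmax over start and end logits.
# Real decoding searches (start, end) pairs with end >= start and
# end - start + 1 <= max_answer_length.
start = outputs.start_logits.argmax(dim=-1).item()
end = outputs.end_logits.argmax(dim=-1).item()
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```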
The Retriever retrieves documents relevant to the given query from the database.
The number of documents retrieved can be specified with `config.retriever.topk`.
To use Sparse Embedding, set `config.retriever.type` to `sparse`.
The query and the context documents are embedded with Scikit-learn's `TfidfVectorizer`. The maximum size of the tf-idf vector can be specified with `config.retriever.sparse.tfidf_num_features`. After fitting, the tf-idf vectorized context documents and the `TfidfVectorizer` object are stored in the `config.path.context` folder, where the context documents are kept.
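A minimal sketch of this fit-and-cache step (the file names and the `max_features` wiring are assumptions, not the project's exact code):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["first wiki document ...", "second wiki document ..."]  # context documents

# max_features plays the role of config.retriever.sparse.tfidf_num_features.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50000)
context_vectors = vectorizer.fit_transform(corpus)  # sparse (n_docs, n_features) matrix

# Cache the fitted vectorizer and document vectors for later retrieval.
with open("tfidf_vectorizer.bin", "wb") as f:
    pickle.dump(vectorizer, f)
with open("context_vectors.bin", "wb") as f:
    pickle.dump(context_vectors, f)

# At query time: embed the query and rank documents by similarity.
query_vec = vectorizer.transform(["sample question"])
scores = (query_vec @ context_vectors.T).toarray().ravel()
topk = scores.argsort()[::-1][:2]  # indices of the best-matching documents
print(topk)
```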
To use Dense Embedding, set `config.retriever.type` to `dense`.
You can decide whether to use Faiss for retrieval by setting `config.retriever.faiss.use_faiss` to `True`. The number of clusters created by `IndexIVFScalarQuantizer` can be adjusted with `config.retriever.faiss.num_clusters`, and the quantizer method used for indexing and distance calculation can be set with `config.retriever.faiss.metric`.
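A rough sketch of what such a Faiss index looks like (the dimensions, cluster count, and 8-bit quantizer type here are illustrative; the project drives them from the `config.retriever.faiss.*` settings above):

```python
import faiss
import numpy as np

d = 768            # embedding dimension (e.g., BERT hidden size)
num_clusters = 64  # corresponds to config.retriever.faiss.num_clusters

embeddings = np.random.rand(10000, d).astype("float32")  # dense passage embeddings

# IVF index with scalar quantization: a flat L2 quantizer assigns vectors
# to clusters, and each stored vector is quantized to 8 bits per component.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFScalarQuantizer(
    quantizer, d, num_clusters, faiss.ScalarQuantizer.QT_8bit, faiss.METRIC_L2
)
index.train(embeddings)  # learn cluster centroids and quantization ranges
index.add(embeddings)

query = np.random.rand(1, d).astype("float32")
distances, doc_ids = index.search(query, 5)  # top-5 nearest documents
print(doc_ids)
```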
```bash
$ bash install/install_requirements.sh
```
In this template, all training and inference settings are adjusted through the `config.yaml` file. You can specify which config file to use with `--config` or `-c` on the command line (default: `custom_config.yaml`).
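As an illustration of how a `-c`/`--config` flag can map to such a YAML file, here is a minimal sketch using `argparse` and OmegaConf; the template's actual argument handling (see `utils/arguments.py`) may differ, and the `./config/` directory is an assumption:

```python
import argparse
from omegaconf import OmegaConf

parser = argparse.ArgumentParser()
parser.add_argument("-c", "--config", default="custom_config", help="config file name")
args = parser.parse_args()

# Dotted keys such as config.retriever.topk come from the nested YAML structure.
config = OmegaConf.load(f"./config/{args.config}.yaml")
print(OmegaConf.to_yaml(config))
```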
Dense

```bash
python train_dpr.py -c base_config
```

```bash
python train.py -c base_config
```
In `train.py`, the MRC (Machine Reading Comprehension) reader is trained and validated (refer to `mrc.py` for MRC-related details).
The pretrained model to use as the reader is specified with `config.model.name_or_path`, which should contain either the name of a model registered on the HuggingFace Hub (e.g., `nlpotato/roberta-base-e5`) or the checkpoint path of a locally saved pretrained model (e.g., `saved_models/nlpotato/roberta-base-e5/LWJ_12-23-22-11/checkpoint-9500`).
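Both forms resolve through the same `transformers` loading call; a small sketch using the two examples above:

```python
from transformers import AutoModelForQuestionAnswering

# A HuggingFace Hub model id ...
model = AutoModelForQuestionAnswering.from_pretrained("nlpotato/roberta-base-e5")

# ... or a local checkpoint directory works interchangeably.
model = AutoModelForQuestionAnswering.from_pretrained(
    "saved_models/nlpotato/roberta-base-e5/LWJ_12-23-22-11/checkpoint-9500"
)
```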
The arguments required by the `Trainer` can be set in `config.train`, and optimizer-related settings are adjusted through `config.optimizer`.
For detailed explanations of the trainer settings, refer to the official HuggingFace documentation.
Tokenizer-related settings are adjusted through `config.tokenizer`; the tokenizer model is the same as the one specified in `config.model.name_or_path`.
The trained language model and tokenizer files are saved to the path specified by `config.train.output_dir`.
If `output_dir` is not specified, an output folder `saved_models/model_name/run_id` is created for each training run, where `run_id` is a unique id derived from the training start time.
To resume training, set `config.path.resume` to the folder containing the saved trainer checkpoint. To upload the trained model and tokenizer to the HuggingFace Hub, set `config.hf_hub.push_to_hub` to `True` and specify the model name to register in `config.hf_hub.save_name`.
To share on the Hub, first run `huggingface-cli login` in the terminal to register your HuggingFace account credentials.
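For reference, a push can also be done with plain `transformers` calls (a sketch; the template's own upload logic is driven by the `config.hf_hub.*` settings above, and `"my-odqa-reader"` stands in for `config.hf_hub.save_name`):

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

ckpt = "saved_models/nlpotato/roberta-base-e5/LWJ_12-23-22-11/checkpoint-9500"
model = AutoModelForQuestionAnswering.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# Uploads to <username>/my-odqa-reader after `huggingface-cli login`.
model.push_to_hub("my-odqa-reader")
tokenizer.push_to_hub("my-odqa-reader")
```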
```bash
python inference.py -c base_config
```