# mcQA: Multiple Choice Question Answering


Answering multiple choice questions with Language Models.

## News 📢

- 🚧 This project is currently under development. Stay tuned! 🤩

### Jun 6th, 2020

- Refactored the data subpackage; the library now supports the RACE, Synonym, SWAG, and ARC datasets.
- Upgraded to transformers==2.10.0.

## Installation

### With pip

```sh
pip install mcqa
```

### From source

```sh
git clone https://github.com/mcqa-suite/mcqa.git
cd mcqa
pip install -e .
```

## Getting started

### Data preparation

To train a mcQA model, you need a CSV file with n+2 columns, where n is the number of choices per question: the first column holds the context sentence, the next n columns hold the choices for that question, and the last column holds the correct answer.

Below is an example of a 3-choice question (taken from the CoS-E dataset):

| Context sentence | Choice 1 | Choice 2 | Choice 3 | Label |
|---|---|---|---|---|
| People do what during their time off from work? | take trips | brow shorter | become hysterical | take trips |

If you have a trained mcQA model and want to run inference on a dataset, that dataset should have the same format as the training data, but without the label column.
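
As an illustration, here is a minimal sketch of writing such CSV files with pandas. The file names are placeholders, and whether mcqa expects a header row may depend on the library version, so treat this as a sketch of the column layout only:

```python
import pandas as pd

# Training data: context sentence, the n choices, then the correct answer.
train_rows = [
    ("People do what during their time off from work?",
     "take trips", "brow shorter", "become hysterical",
     "take trips"),
]
pd.DataFrame(train_rows).to_csv("train.csv", index=False, header=False)

# Inference data: same layout, but without the final label column.
test_rows = [
    ("People do what during their time off from work?",
     "take trips", "brow shorter", "become hysterical"),
]
pd.DataFrame(test_rows).to_csv("test.csv", index=False, header=False)
```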

Once the CSV files are ready, load them with MCQAData:

```python
from mcqa.data import MCQAData

mcqa_data = MCQAData(bert_model="bert-base-uncased",
                     lower_case=True,
                     max_seq_length=256)

train_dataset = mcqa_data.read(data_file='swagaf/data/train.csv', is_training=True)
test_dataset = mcqa_data.read(data_file='swagaf/data/test.csv', is_training=False)
```

### Model training

```python
from mcqa.models import Model

mdl = Model(bert_model="bert-base-uncased",
            device="cuda")

mdl.fit(train_dataset, train_batch_size=32, num_train_epochs=20)
```

### Prediction

```python
preds = mdl.predict(test_dataset, eval_batch_size=32)
```

### Evaluation

```python
from sklearn.metrics import accuracy_score
from mcqa.data import get_labels

# accuracy_score expects (y_true, y_pred). This assumes the evaluated
# file included a label column so get_labels can recover the answers.
print(accuracy_score(get_labels(test_dataset), preds))
```
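
If the inference file has no label column (as described above), one option is to hold out part of the labeled training CSV as a validation set and score on that instead. Below is a minimal sketch, assuming a header-less CSV as in the layout above and reusing `mcqa_data` and `mdl` from the earlier snippets; the split file names are hypothetical:

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from mcqa.data import get_labels

# Hold out 10% of the labeled training CSV as a validation set
# (assumes a header-less CSV, as in the layout described above).
full = pd.read_csv('swagaf/data/train.csv', header=None)
valid = full.sample(frac=0.1, random_state=0)
train = full.drop(valid.index)
train.to_csv('train_split.csv', index=False, header=False)
valid.to_csv('valid_split.csv', index=False, header=False)

# Read the validation split with is_training=True so the labels are kept,
# then score predictions against them (mcqa_data and mdl come from above).
valid_dataset = mcqa_data.read(data_file='valid_split.csv', is_training=True)
valid_preds = mdl.predict(valid_dataset, eval_batch_size=32)
print(accuracy_score(get_labels(valid_dataset), valid_preds))
```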

## References

| Type | Title | Author | Year |
|------|-------|--------|------|
| 📰 Paper | Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets | Mor Geva, Yoav Goldberg, Jonathan Berant | 2019 |
| 📰 Paper | Explain Yourself! Leveraging Language Models for Commonsense Reasoning | Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, Richard Socher | 2019 |
| 📰 Paper | SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference | Rowan Zellers, Yonatan Bisk, Roy Schwartz, Yejin Choi | 2018 |
| 📰 Paper | Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering | Todor Mihaylov, Peter Clark, Tushar Khot, Ashish Sabharwal | 2018 |
| 📰 Paper | CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | Alon Talmor, Jonathan Herzig, Nicholas Lourie, Jonathan Berant | 2018 |
| 📰 Paper | RACE: Large-scale ReAding Comprehension Dataset From Examinations | Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, Eduard Hovy | 2017 |
| 💻 Framework | Scikit-learn: Machine Learning in Python | Pedregosa et al. | 2011 |
| 💻 Framework | PyTorch | Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan | 2016 |
| 💻 Framework | Transformers: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch | Hugging Face | 2018 |
| 📹 Video | Stanford CS224N: NLP with Deep Learning – Lecture 10: Question Answering | Christopher Manning | 2019 |

## LICENSE

Apache-2.0

## Contributing

Read our Contributing Guidelines.

## Citation

```bibtex
@misc{Taycir2019,
  author = {mcQA-suite},
  title = {mcQA},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/mcQA-suite/mcQA/}}
}
```