GitHub - jhyuklee/sparc: ACL'2020: Contextualized Sparse Representations for Real-Time Open-Domain Question Answering

Sparc

Contextualized Sparse Representations for Real-Time Open-Domain Question Answering

This repository provides author's implementation of Contextualized Sparse Representation for Real-Time Open-Domain Question Answering. You can train and evaluate DenSPI+Sparc described in our paper and make your own Sparc vector.

Environment

Please install the Conda environment as follows:

$ conda env create -f environment.yml
$ conda activate sparc

Note that this repository is mostly based on DenSPI and DrQA.

Resources

We use SQuAD v1.1 for training DenSPI+Sparc. Please download them in $DATA_DIR.

$ mkdir $DATA_DIR
$ wget https://raw.githubusercontent.com/rajpurkar/SQuAD-explorer/master/dataset/train-v1.1.json -O $DATA_DIR/train-v1.1.json
$ wget https://raw.githubusercontent.com/rajpurkar/SQuAD-explorer/master/dataset/dev-v1.1.json -O $DATA_DIR/dev-v1.1.json

DenSPI is based on BERT. Please download pre-trained weights of BERT under $BERT_DIR.

$ mkdir $BERT_DIR
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin -O $BERT_DIR/pytorch_model_base_uncased.bin
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json -O $BERT_DIR/bert_config_base_uncased.json
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-pytorch_model.bin -O $BERT_DIR/pytorch_model_large_uncased.bin
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-config.json -O $BERT_DIR/bert_config_large_uncased.json
# Vocabulary is the same for BERT-base and BERT-large.
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt -O $BERT_DIR/vocab.txt

Model

To train DenSPI+Sparc on SQuAD, use train.py. Trained models will be saved in $OUT_DIR1.

$ mkdir $OUT_DIR1
# Train with BERT-base
$ python train.py --data_dir $DATA_DIR --metadata_dir $BERT_DIR --output_dir $OUT_DIR1 --bert_model_option 'base_uncased' --train_file train-v1.1.json --predict_file dev-v1.1.json --do_train --do_predict --do_eval
# Train with BERT-large (use smaller train_batch_size for 12GB GPUs)
$ python train.py --data_dir $DATA_DIR --metadata_dir $BERT_DIR --output_dir $OUT_DIR1 --bert_model_option 'large_uncased' --parallel --train_file train-v1.1.json --predict_file dev-v1.1.json --do_train --do_predict --do_eval --train_batch_size 6

The result will look like (in case of BERT-base):

04/28/2020 06:32:59 - INFO - post -   num vecs=45059736, num_words=1783576, nvpw=25.2637
04/28/2020 06:33:01 - INFO - __main__ -   [Validation] loss: 8.700, b'{"exact_match": 75.10879848628193, "f1": 83.42143097917004}\n'

To use DenSPI+Sparc in an open-domain setting, you have to additionally train it with negative samples. In case of DenSPI+Sparc with BERT-base (same for BERT-large except --bert_model_option and --parallel arguments), commands for training on negative samples are:

$ mkdir $OUT_DIR2
$ python train.py --data_dir $DATA_DIR --metadata_dir $BERT_DIR --output_dir $OUT_DIR --bert_model_option 'base_uncased' --train_file train-v1.1.json --predict_file dev-v1.1.json --do_train_neg --do_predict --do_eval --do_load --load_dir $OUT_DIR1 --load_epoch 3

Finally, train the phrase classifer as:

$ mkdir $OUT_DIR3
# Train only 1 epoch for phrase classifier
$ python train.py --data_dir $DATA_DIR --metadata_dir $BERT_DIR --output_dir $OUT_DIR --bert_model_option 'base_uncased' --train_file train-v1.1.json --predict_file dev-v1.1.json --num_train_epochs 1 --do_train_filter --do_predict --do_eval --do_load --load_dir $OUT_DIR2 --load_epoch 3

We also provide a pretrained DenSPI+Sparc as follows:

DenSPI+Sparc pre-trained on SQuAD - link

Sparc Embedding

Given the pre-trained DenSPI+Sparc, you can get Sparc embedding with following commands. Example below assumes using our pre-trained weight (denspi_sparc.zip unzipped in denspi_sparc folder). If you want to use your own model, please modify MODEL_DIR accordingly.

For any type of text you want to embed, put them in each line of input_examples.txt. Then run:

$ export DATA_DIR=.
$ export MODEL_DIR=denspi_sparc
$ python train.py --data_dir $DATA_DIR --metadata_dir $BERT_DIR --output_dir $OUT_DIR --predict_file input_examples.txt --parallel --bert_model_option 'large_uncased' --do_load --load_dir $MODEL_DIR --load_epoch 1 --do_embed --dump_file output.json

The result file $OUT_DIR/output.json will show Sparc embedding of the input text ([CLS] representation, sorted by scores). For instance:

{
    "out": [
        {
            "text": "They defeated the Arizona Cardinals 49-15 in the NFC Championship Game and advanced to their second Super Bowl appearance since the franchise was founded in 1995.",
            "sparc": {
                "uni": {
                    "1995": {
                        "score": 1.6841894388198853,
                        "vocab": "2786"
                    },
                    "second": {
                        "score": 1.6321970224380493,
                        "vocab": "2117"
                    },
                    "49": {
                        "score": 1.6075607538223267,
                        "vocab": "4749"
                    },
                    "arizona": {
                        "score": 1.1734912395477295,
                        "vocab": "5334"
                    },
                },
                "bi": {
                    "arizona cardinals": {
                        "score": 1.3190401792526245,
                        "vocab": "5334, 9310"
                    },
                    "nfc championship": {
                        "score": 1.1005975008010864,
                        "vocab": "22309, 2528"
                    },
                    "49 -": {
                        "score": 1.0863999128341675,
                        "vocab": "4749, 1011"
                    },
                    "the arizona": {
                        "score": 0.9722453951835632,
                        "vocab": "1996, 5334"
                    },
                }
            }
        }
    ]
}

Note that each text is segmented by the BERT tokenizer ("vocab" denotes the BERT vocab index).

To see how Sparc changes for each phrase, set start_index in here to the target token position. For instance, setting start_index = 17 to embed Sparc of 415,000 of the following text gives you (some n-grams are omitted):

            "text": "Between 1991 and 2000, the total area of forest lost in the Amazon rose from 415,000 to 587,000 square kilometres.",
            "sparc": {
                "uni": {
                    "1991": {
                        "score": 1.182684063911438,
                        "vocab": "2889"
                    },
                    "2000": {
                        "score": 0.41507360339164734,
                        "vocab": "2456"
                    },

whereas setting start_index = 21 to embed Sparc of 587,000 gives you:

            "text": "Between 1991 and 2000, the total area of forest lost in the Amazon rose from 415,000 to 587,000 square kilometres.",
            "sparc": {
                "uni": {
                    "2000": {
                        "score": 1.1923936605453491,
                        "vocab": "2456"
                    },
                    "1991": {
                        "score": 0.7090237140655518,
                        "vocab": "2889"
                    },

Phrase Index

For now, please see the original DenSPI repository or the recent application of DenSPI in COVID-19 domain for building phrase index using DenSPI+Sparc. The main changes in phrase indexing are in post.py and mips_phrase.py where Sparc is used for the open-domain QA inference (See here).

Reference

@inproceedings{lee2020contextualized,
  title={Contextualized Sparse Representations for Real-Time Open-Domain Question Answering},
  author={Lee, Jinhyuk and Seo, Minjoon and Hajishirzi, Hannaneh and Kang, Jaewoo},
  booktitle={ACL},
  year={2020}
}

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
static		static
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
build_tfidf.py		build_tfidf.py
environment.yml		environment.yml
eval_utils.py		eval_utils.py
evaluate-v1.1.py		evaluate-v1.1.py
file_utils.py		file_utils.py
input_examples.txt		input_examples.txt
local_dump.py		local_dump.py
mips_phrase.py		mips_phrase.py
modeling.py		modeling.py
optimization.py		optimization.py
post.py		post.py
pre.py		pre.py
run_index.py		run_index.py
run_server.py		run_server.py
simple_tokenizer.py		simple_tokenizer.py
tfidf_doc_ranker.py		tfidf_doc_ranker.py
tfidf_util.py		tfidf_util.py
tokenization.py		tokenization.py
tokenizer_util.py		tokenizer_util.py
train.py		train.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sparc

Environment

Resources

Model

Sparc Embedding

Phrase Index

Reference

About

Releases

Packages

Languages

License

jhyuklee/sparc

Folders and files

Latest commit

History

Repository files navigation

Sparc

Environment

Resources

Model

Sparc Embedding

Phrase Index

Reference

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages