Code to create pre-training data for a span selection pre-training task. The task is inspired by reading comprehension and by an effort to avoid encoding general knowledge in the transformer network itself.
Pre-trained models are available through Hugging Face as:
- michaelrglass/bert-base-uncased-sspt
- michaelrglass/bert-large-uncased-sspt
Load with: AutoConfig.from_pretrained, AutoTokenizer.from_pretrained, AutoModelForQuestionAnswering.from_pretrained. See run_qa.py for example code.
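As a rough illustration of the loading calls named above, here is a minimal sketch. The model names come from the list above; the `SSPT_MODELS` constant and the `load_sspt` helper are hypothetical names used only for this example and are not part of this repo. It assumes the `transformers` package is installed and that the models can be downloaded.

```python
# Model names are taken from the list above; SSPT_MODELS and load_sspt
# are hypothetical helpers introduced only for this sketch.
SSPT_MODELS = (
    "michaelrglass/bert-base-uncased-sspt",
    "michaelrglass/bert-large-uncased-sspt",
)


def load_sspt(model_name=SSPT_MODELS[0]):
    """Load the config, tokenizer and question-answering model by name."""
    # Imported lazily so that defining this helper does not require
    # transformers to be installed.
    from transformers import (
        AutoConfig,
        AutoModelForQuestionAnswering,
        AutoTokenizer,
    )

    config = AutoConfig.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name, config=config)
    return config, tokenizer, model
```

Calling `load_sspt()` should return the three objects for the base model; pass `SSPT_MODELS[1]` for the large one. run_qa.py remains the reference for how the model is actually used.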
- python setup.py
- build irsimple.jar (or use pre-built com.ibm.research.ai.irsimple/irsimple.jar)
- cd com.ibm.research.ai.irsimple/
- mvn clean compile assembly:single
- (install maven if necessary from https://maven.apache.org/install.html)
- Download a Wikipedia dump and WikiExtractor
- IBM is not granting a license to any third-party data set. You are responsible for complying with all third-party licenses, if any exist.
python WikiExtractor.py --json --filter_disambig_pages --processes 32 --output wikiextracteddir enwiki-20190801-pages-articles-multistream.xml.bz2
- Run create_passages.py (this just splits into passages by double newline)
python create_passages.py --wikiextracted wikiextracteddir --output wikipassagesdir
- Run Lucene indexing
java -cp irsimple.jar com.ibm.research.ai.irsimple.MakeIndex wikipassagesdir wikipassagesindex
- Run sspt_gen.sh
nohup bash sspt_gen.sh ssptGen wikipassagesdir > querygen.log 2>&1 &
- And AsyncWriter
nohup java -cp irsimple.jar com.ibm.research.ai.irsimple.AsyncWriter \
ssptGen \
wikipassagesindex > instgen.log 2>&1 &
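The passage-splitting step (create_passages.py) is described above as splitting on double newlines. The following is a minimal sketch of that idea, not the script itself. It assumes each line of the WikiExtractor `--json` output is a JSON object with `id`, `title` and `text` fields; the function names and the output record layout (including the `<id>-<n>` passage id) are hypothetical.

```python
import json


def split_passages(text):
    """Split article text on double newlines and drop empty pieces."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def passages_from_json_line(line):
    """Turn one JSON line (assumed fields: id, title, text) into passage records."""
    doc = json.loads(line)
    return [
        # The record layout and the "<id>-<n>" passage id are assumptions
        # made for this sketch only.
        {"id": f"{doc['id']}-{i}", "title": doc["title"], "text": passage}
        for i, passage in enumerate(split_passages(doc["text"]))
    ]
```

For the actual output format expected by MakeIndex, refer to create_passages.py.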
FIXME: rc_data and span_selection_pretraining require a modified version of pytorch-transformers. The adaptations needed are in the process of being worked into this repo and into a pull request for pytorch-transformers. Hopefully it is relatively clear how it should work.
python span_selection_pretraining.py \
--bert_model bert-base-uncased \
--train_dir ssptGen \
--num_instances 1000000 \
--save_model rc_1M_base.bin