BatteryBERT: A Pre-trained Language Model for Battery Database Enhancement
- BatteryBERT Pre-training: Pre-training from Scratch or Further Training
- BatteryBERT Fine-tuning: Sequence Classification + Question Answering
- BatteryBERT Usage: Document Classifier, Device Data Extractor, General Q&A Agent
- Large-scale Device Data Extraction and Database Enhancement
Run the following commands to clone the repository and install batterybert:
git clone https://github.com/ShuHuang/batterybert.git
cd batterybert; pip install -r requirements.txt; python setup.py develop
Pre-training from scratch or further training using a masked language modeling (MLM) loss. See python run_pretraining.py --help
for a full list of arguments and their defaults.
python run_pretrain.py \
--train_root $TRAIN_ROOT \
--eval_root $EVAL_ROOT \
--output_dir $OUTPUT_DIR \
--tokenizer_root $TOEKNIZER_ROOT \
--checkpoint=$CHECKPOINT_DIR
Train a WordPiece tokenizer from scratch using the dataset from $TRAIN_ROOT
. See python run_tokenizer.py --help
for a full list of arguments and their defaults.
python run_tokenizer.py \
--train_root $TRAIN_DIR \
--save_root $SAVE_DIR \
--save_name $TOKENIZER_NAME
Fine-tune a BERT model on a question answering dataset (e.g. SQuAD). See python run_finetune_qa.py --help
for a full list of arguments and their defaults.
python run_finetune_qa.py
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $OUTPUT_DIR
Fine-tune a BERT model on a sequence classification dataset (e.g. paper corpus). See python run_finetune_doc_classify.py --help
for a full list of arguments and their defaults.
$ python run_finetune_doc_classify.py
--model_name_or_path $MODEL_NAME_OR_PATH \
--output_dir $OUTPUT_DIR \
--train_root $TRAIN_ROOT \
--eval_root $EVAL_ROOT
>>> from batterybert.apps import DocClassifier
# Model name to be changed after published
# Create a binary classifier
>>> model_name = "batterydata/test4"
>>> sample_text = "sample text"
>>> classifier = DocClassifier(model_name)
# 0 for non-battery text, 1 for battery text
>>> category = classifier.classify(sample_text)
>>> print(category)
0
>>> from batterybert.apps import DeviceDataExtractor
# Model name to be changed after published
# Create a device data extractor
>>> model_name = "batterydata/test1"
>>> sample_text = "The anode of this Li-ion battery is graphite."
>>> extractor = DeviceDataExtractor(model_name)
# Set the confidence score threshold
>>> result = extractor.extract(sample_text, threshold=0.1)
>>> print(result)
[{'type': 'anode', 'answer': 'graphite', 'score': 0.9736555218696594, 'context': 'The anode of this battery is graphite.'}]
>>> from batterybert.apps import QAAgent
# Model name to be changed after published
# Create a QA agent
>>> model_name = "batterydata/test1"
>>> context = "The University of Cambridge is a collegiate research university in Cambridge, United Kingdom. Founded in 1209 and granted a royal charter by Henry III in 1231, Cambridge is the second-oldest university in the English-speaking world and the world's fourth-oldest surviving university."
>>> question = "When was University of Cambridge founded?"
>>> qa_agent = QAAgent(model_name)
# Set the confidence score threshold
>>> result = qa_agent.answer(question=question, context=context, threshold=0.1)
>>> print(result)
{'score': 0.9867061972618103, 'start': 105, 'end': 109, 'answer': '1209'}
This project was financially supported by the Science and Technology Facilities Council (STFC), the Royal Academy of Engineering (RCSRF1819\7\10) and Christ's College, Cambridge. The Argonne Leadership Computing Facility, which is a DOE Office of Science Facility, is also acknowledged for use of its research resources, under contract No. DEAC02-06CH11357.
@article{huang2022batterybert,
title={BatteryBERT: A Pretrained Language Model for Battery Database Enhancement},
author={Huang, Shu and Cole, Jacqueline M},
journal={J. Chem. Inf. Model.},
year={2022},
doi={10.1021/acs.jcim.2c00035},
url={DOI:10.1021/acs.jcim.2c00035},
pages={DOI: 10.1021/acs.jcim.2c00035},
publisher={ACS Publications}
}