Large-scale semantic indexing of Spanish biomedical literature using contrastive transfer learning
Install the requirements of BERTDeCS:
```shell
git clone https://github.com/yourh/BERTDeCS.git
cd BERTDeCS
conda create -n BERTDeCS python=3.12
conda activate BERTDeCS
pip install -r requirements.txt
mkdir models
cd models
wget https://zenodo.org/records/14190447/files/BERTDeCS_A-DeCS_ES.pt
cd ..
```
Preprocess the citations with journal names, titles, and abstracts:
```shell
python preprocess.py tokenize \
  -j data/test_st1_journal.txt \
  -t data/test_st1_title.txt \
  -a data/test_st1_abstract.txt \
  -o data/test_st1
```
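Based on the flags above, the three input files appear to hold parallel, line-aligned citation fields (this is an assumption, not a statement about the repository's `preprocess.py`). A minimal sketch of the merging idea, with hypothetical field values:

```python
# Sketch only: assumes the journal, title, and abstract files are parallel,
# one citation per line, and that the fields are joined into one text per
# citation before tokenization.
def merge_citation_fields(journals, titles, abstracts, sep=" "):
    """Combine parallel lists of journal names, titles, and abstracts."""
    assert len(journals) == len(titles) == len(abstracts)
    return [sep.join((j.strip(), t.strip(), a.strip()))
            for j, t, a in zip(journals, titles, abstracts)]

texts = merge_citation_fields(
    ["Rev Esp Salud Publica"],
    ["Indización semántica a gran escala"],
    ["Presentamos un método de aprendizaje por transferencia..."])
print(texts[0])
```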
Predict DeCS terms with BERTDeCS:
```shell
python main.py \
  configures/data.yaml \
  configures/BERTDeCS-A.yaml \
  --valid-name dev_st1 \
  --labels decs \
  --eval "test_st1" \
  -b 25 \
  -a
```
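The predictor writes its scores to an `.npz` file under `results/`, which the evaluation step below consumes. As a sketch, assuming that file holds a dense citation-by-label score matrix, the top-ranked DeCS terms per citation can be read off with NumPy; the matrix and label identifiers here are toy stand-ins:

```python
import numpy as np

# Toy stand-in for the (n_citations x n_labels) score matrix that the
# predictor is assumed to store in the results .npz file.
scores = np.array([[0.1, 0.9, 0.3, 0.7],
                   [0.8, 0.2, 0.6, 0.4]])
labels = np.array(["D001", "D002", "D003", "D004"])  # hypothetical label IDs

top_k = 2
order = np.argsort(-scores, axis=1)[:, :top_k]  # column indices, best first
top_terms = labels[order]
print(top_terms.tolist())  # [['D002', 'D004'], ['D001', 'D003']]
```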
Evaluate the prediction performance:
```shell
python evaluation.py \
  -t data/test_st1_decs.txt \
  -r results/BERTDeCS_A-DeCS_ES-test_st1.npz \
  -n 10
```
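The `-n 10` flag suggests the metrics are computed at a top-10 cutoff. As an illustration only (the metrics actually reported by `evaluation.py` may differ), precision@k over gold DeCS annotations looks like this:

```python
def precision_at_k(true_labels, ranked_predictions, k=10):
    """Mean precision@k: the fraction of the top-k predicted terms that
    appear in the gold DeCS annotation of each citation, averaged over
    all citations."""
    total = 0.0
    for gold, ranked in zip(true_labels, ranked_predictions):
        hits = sum(1 for term in ranked[:k] if term in gold)
        total += hits / k
    return total / len(true_labels)

# Hypothetical gold annotations and ranked predictions for one citation.
gold = [{"D001", "D003"}]
ranked = [["D003", "D002", "D001", "D004"]]
print(precision_at_k(gold, ranked, k=2))  # 0.5: one of the top 2 is correct
```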
We trained BERTDeCS on 4 × RTX 4090 GPUs using the following steps:
- Preprocess the pre-training and training data:

  ```shell
  python preprocess.py tokenize \
    -j data/{journal} \
    -t data/{title} \
    -a data/{abstract} \
    -o data/{data_name}
  ```
- Run contrastive learning:

  ```shell
  torchrun --nproc-per-node 4 main.py \
    configures/data.yaml \
    configures/BERTDeCS-CL.yaml \
    --train-name train_cl \
    --train \
    --dist -a
  ```
- Run pre-training:

  ```shell
  torchrun --nproc-per-node 4 main.py \
    configures/data.yaml \
    configures/BERTDeCS-Af.yaml \
    --train-name train_pubmed \
    --valid-name dev_st1 \
    --labels mesh_decs \
    --train \
    -p models/BERTDeCS_CL-DeCS_CL.pt \
    -b 25 \
    --dist -a
  ```
- Run fine-tuning:

  ```shell
  torchrun --nproc-per-node 4 main.py \
    configures/data.yaml \
    configures/BERTDeCS-A.yaml \
    --train-name train_es \
    --valid-name dev_st1 \
    --labels decs \
    --train --eval "dev_st1,test_st1" \
    -p models/BERTDeCS_Af-DeCS_PM300W.pt \
    -b 25 \
    --dist -a
  ```
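The contrastive-learning stage above (`configures/BERTDeCS-CL.yaml`) presumably pulls paired representations of the same citation together while pushing apart the other citations in the batch. A minimal NumPy sketch of an InfoNCE-style objective, one common choice for such a stage (the repository's actual loss may differ):

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss over a batch of paired embeddings: z1[i] and z2[i]
    are views of the same citation (e.g., two languages); every other
    pair in the batch acts as a negative."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature            # (batch, batch) similarities
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal: each row's own pair
    return -np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
loss = info_nce(z, z)  # identical views -> loss close to zero
print(loss)
```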
BERTDeCS is free for non-commercial use. For commercial use, please contact Dr. Ronghui You and Prof. Shanfeng Zhu (zhusf@fudan.edu.cn).