Guide to reproduce all the work from scratch.
All supplementary data are available on Zenodo.
To download the ready-made EvaNIL dataset, access the link or run:
wget 'https://zenodo.org/record/6561410/files/evanil.tar.gz?download=1'
tar -xvf 'evanil.tar.gz?download=1'
rm 'evanil.tar.gz?download=1'
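The download, extract, and clean-up steps above repeat for every Zenodo archive in this guide. As a minimal sketch, they could be wrapped in a small shell helper that saves the archive under a clean name (without the '?download=1' suffix). The function names archive_name and fetch_and_extract are hypothetical and not part of the repository; the sketch assumes wget and tar are installed.

```shell
# Hypothetical helper: derive a clean archive name from a download URL.
archive_name() {
  name=${1##*/}                    # drop everything up to the last '/'
  printf '%s\n' "${name%%[?]*}"    # drop the query string, e.g. '?download=1'
}

# Hypothetical helper: download an archive, extract it, then delete it.
# Each step runs only if the previous one succeeded.
fetch_and_extract() {
  archive=$(archive_name "$1")
  wget -O "$archive" "$1" && tar -xvf "$archive" && rm -- "$archive"
}
```

With these definitions loaded in the current shell, the three commands above reduce to: fetch_and_extract 'https://zenodo.org/record/6561410/files/evanil.tar.gz?download=1'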
To generate the EvaNIL dataset yourself from scratch, first get the necessary data:
./get_EvaNIL_preparation_data.sh
Then run:
python src/evanil/dataset.py -partition <partition>
Arg:
- partition: "medic" (MEDIC), "ctd_chem" (CTD-Chemicals), "ctd_anat" (CTD-Anatomy), "chebi" (ChEBI), "go_bp" (Gene Ontology - Biological Process), "hp" (Human Phenotype Ontology)
Files will be in the directory 'data/evanil/'.
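To generate several partitions in one go, the command above can be wrapped in a loop. The sketch below is only an illustration: is_valid_partition and generate_partitions are hypothetical helpers, and it assumes a POSIX shell (sh or bash) run from the repository root.

```shell
# Partition names accepted by the -partition argument, as listed above.
PARTITIONS="medic ctd_chem ctd_anat chebi go_bp hp"

# Hypothetical helper: return 0 if $1 is a known partition name, 1 otherwise.
is_valid_partition() {
  for p in $PARTITIONS; do
    [ "$p" = "$1" ] && return 0
  done
  return 1
}

# Hypothetical helper: generate each requested partition in turn,
# stopping at the first unknown name or failed run.
generate_partitions() {
  for name in "$@"; do
    is_valid_partition "$name" || { echo "unknown partition: $name" >&2; return 1; }
    python src/evanil/dataset.py -partition "$name" || return 1
  done
}
```

For example, generate_partitions medic chebi would build the MEDIC and ChEBI partitions one after the other.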
git clone https://github.com/pedroruas18/BioSyn.git
cd BioSyn/
- CONDA: First, it is necessary to install conda:
wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
bash Anaconda3-2021.11-Linux-x86_64.sh
Type 'yes' when prompted.
Run:
source ~/.bashrc
- Install the necessary requirements:
conda create -n BioSyn python=3.7
conda activate BioSyn
conda install numpy tqdm scikit-learn
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install transformers==4.18.0
conda install pandas
conda install gdown
Note: modify the cudatoolkit version according to your setup.
- Setup the correct path:
export PYTHONPATH="${PYTHONPATH}:"
- Create directory 'datasets'
mkdir datasets
To train the BioSyn models on the EvaNIL datasets, or to use them for inference on the NEL datasets, it is necessary to download the preprocessed EvaNIL data:
wget 'https://zenodo.org/record/6561477/files/evanil_preprocessed_biosyn.tar.gz?download=1'
tar -xvf 'evanil_preprocessed_biosyn.tar.gz?download=1'
rm 'evanil_preprocessed_biosyn.tar.gz?download=1'
Retrieve the already trained models:
mkdir tmp/
cd tmp/
wget 'https://zenodo.org/record/6561477/files/trained_biosyn_models.tar.gz?download=1'
tar -xvf 'trained_biosyn_models.tar.gz?download=1'
rm 'trained_biosyn_models.tar.gz?download=1'
cd ../
Or instead, if you want to train the BioSyn models from scratch on the EvaNIL dataset:
MODEL_NAME_OR_PATH=dmis-lab/biobert-base-cased-v1.1
OUTPUT_DIR=./tmp/biosyn-biobert-<partition>/
DATA_DIR=./datasets/evanil/<partition>/
CUDA_VISIBLE_DEVICES=0 python train.py \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --train_dictionary_path ${DATA_DIR}/preprocessed/train_dictionary.txt \
    --train_dir ${DATA_DIR}/preprocessed/processed_train \
    --output_dir ${OUTPUT_DIR} \
    --use_cuda --topk 1 --epoch 2 \
    --train_batch_size 16 --learning_rate 1e-5 --max_length 25
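Before launching a training run, it can help to confirm that the inputs referenced by the command above are in place. The following is a minimal sketch, assuming that train_dictionary.txt is a regular file and processed_train is a directory (suggested by the --train_dir flag, but not confirmed here); check_training_inputs is a hypothetical helper, not part of BioSyn.

```shell
# Hypothetical helper: verify that the preprocessed training inputs exist
# under the given data directory. Returns 0 if both are present, 1 otherwise.
check_training_inputs() {
  data_dir=$1
  # Assumption: the dictionary is a regular file.
  if [ ! -f "$data_dir/preprocessed/train_dictionary.txt" ]; then
    echo "missing: $data_dir/preprocessed/train_dictionary.txt" >&2
    return 1
  fi
  # Assumption: the processed training set is a directory.
  if [ ! -d "$data_dir/preprocessed/processed_train" ]; then
    echo "missing: $data_dir/preprocessed/processed_train" >&2
    return 1
  fi
  return 0
}
```

One way to use it is to run check_training_inputs "$DATA_DIR" || exit 1 right before the train.py command, so that a missing download is reported before any GPU time is spent.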
It is necessary to download the BC5CDR-Disease, BC5CDR-Chemical and NCBI-Disease datasets:
cd datasets/
# BC5CDR-Disease
gdown 'https://drive.google.com/uc?id=1moAqukbrdpAPseJc3UELEY6NLcNk22AA'
tar -xvf bc5cdr-disease.tar.gz
# BC5CDR-Chemical
gdown 'https://drive.google.com/uc?id=1mgQhjAjpqWLCkoxIreLnNBYcvjdsSoGi'
tar -xvf bc5cdr-chemical.tar.gz
# NCBI-Disease
gdown 'https://drive.google.com/uc?id=1mmV7p33E1iF32RzAET3MsLHPz1PiF9vc'
tar -xvf ncbi-disease.tar.gz
cd ../
You can download the Word-Concept dictionaries, the embeddings and the annotations files used in the experiments.
However, if you want to generate the files associated with a given partition of the EvaNIL dataset yourself, run:
./get_NILINKER_preparation_data.sh
./prepare_NILINKER.sh <partition>
Arg:
- partition: 'medic', 'ctd_anat', 'ctd_chem', 'chebi', 'go_bp' or 'hp'
At this stage, NILINKER is ready for training or hyperparameter optimization.
To download the annotations files used in the experiments, run:
cd data/
wget 'https://zenodo.org/record/6561477/files/annotations.tar.gz?download=1'
tar -xvf 'annotations.tar.gz?download=1'
rm 'annotations.tar.gz?download=1'
cd ../
Run experiments to find the best combination of hyperparameters:
python src/NILINKER/hyperparameter_optimization.py -partition <partition>
Args:
- partition: 'medic', 'ctd_anat', 'ctd_chem', 'chebi', 'go_bp' or 'hp'
To train the final version of the NILINKER model on a given EvaNIL partition, use the same script but change the value of the arg 'mode' to 'final'.
Example:
python src/NILINKER/train_nilinker.py -mode final -partition chebi
The file associated with the trained model, 'best.h5', will be in the directory 'data/nilinker_files/chebi/final/'.
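After training, a quick check can confirm that the model file was written. The sketch below assumes that every partition follows the same layout as the 'chebi' example ('data/nilinker_files/<partition>/final/best.h5'), which is an extrapolation from the single path given above; final_model_path and has_final_model are hypothetical helpers, and paths are relative to the repository root.

```shell
# Hypothetical helper: print the expected location of the final model
# for a partition. Assumption: all partitions mirror the 'chebi' layout.
final_model_path() {
  printf 'data/nilinker_files/%s/final/best.h5\n' "$1"
}

# Hypothetical helper: return 0 if the final model file exists, 1 otherwise.
has_final_model() {
  [ -f "$(final_model_path "$1")" ]
}
```

For example, has_final_model chebi || echo "no trained model found for chebi" reports a missing file without stopping the shell session.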
Datasets (with target Knowledge Bases in parentheses):
- BC5CDR-Disease (MEDIC vocabulary)
- BC5CDR-Chemical (CTD-Chemical vocabulary)
- NCBI Disease corpus (MEDIC vocabulary)
- GSC+ (Human Phenotype Ontology)
- CHR (ChEBI ontology)
- PHAEDRA (CTD-Chemical vocabulary)
Get the modified version of the repository "Fair Evaluation in Concept Normalization: a Large-scale Comparative Analysis for BERT-based Models":
git clone https://github.com/pedroruas18/Fair-Evaluation-BERT.git
Install the requirements:
pip install -r requirements.txt
Inside the repository, execute the script 'prepare.sh':
cd Fair-Evaluation-BERT
chmod +x prepare.sh
./prepare.sh
cd ../
This will prepare the provided datasets (BC5CDR-Disease, BC5CDR-Chemical, NCBI Disease) to be used with REEL-based models.
First get the data:
chmod +x get_NEL_evaluation_data.sh
./get_NEL_evaluation_data.sh
Then preprocess it for use with REEL-based models:
cd src/nel_evaluation/
python process_nel_corpora.py gsc+
python process_nel_corpora.py phaedra
python process_nel_corpora.py chr
cd ../../
To download the already preprocessed NEL datasets, run:
wget 'https://zenodo.org/record/6561477/files/preprocessed_nel_datasets.tar.gz?download=1'
tar -xvf 'preprocessed_nel_datasets.tar.gz?download=1'
rm 'preprocessed_nel_datasets.tar.gz?download=1'