Preparation

A guide to reproducing all the work from scratch.

All the supplementary data are available at Zenodo.


Summary

1. EvaNIL
2. Prepare BioSyn
3. NILINKER
4. Named Entity Linking Evaluation datasets

1. EvaNIL

1.1. Download the EvaNIL dataset

To download the ready-made EvaNIL dataset, access the link or run:

wget https://zenodo.org/record/6561410/files/evanil.tar.gz?download=1
tar -xvf 'evanil.tar.gz?download=1'
rm 'evanil.tar.gz?download=1'

1.2. Generate EvaNIL dataset from source corpora (optional)

If you want to generate the EvaNIL dataset from scratch yourself, first get the necessary data and then run:

./get_EvaNIL_preparation_data.sh
python src/evanil/dataset.py -partition <partition>

Arg:

  • partition: "medic" (MEDIC), "ctd_chem" (CTD-Chemicals), "ctd_anat" (CTD-Anatomy), "chebi" (ChEBI), "go_bp" (Gene Ontology - Biological Process), "hp" (Human Phenotype Ontology)

The output files will be written to the directory 'data/evanil/'.
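
For example, to generate the MEDIC partition after fetching the preparation data:

python src/evanil/dataset.py -partition medic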


2. Prepare BioSyn

2.1. Get the modified version of the repository

git clone https://github.com/pedroruas18/BioSyn.git
cd BioSyn/

2.2. Setup

  • CONDA: First, it is necessary to install conda:
wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
bash Anaconda3-2021.11-Linux-x86_64.sh

Type 'yes' when prompted.

Then run:

source ~/.bashrc
  • Install the necessary requirements:
conda create -n BioSyn python=3.7
conda activate BioSyn
conda install numpy tqdm scikit-learn
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
conda install transformers==4.18.0
conda install pandas
conda install gdown

*Note: modify the cudatoolkit version according to your setup.*

  • Setup the correct path:
export PYTHONPATH="${PYTHONPATH}:"
  • Create the directory 'datasets':
mkdir datasets

2.3. Retrieve the already preprocessed EvaNIL files

To train the BioSyn models on the EvaNIL dataset, or to use them for inference on the NEL datasets, it is first necessary to download the preprocessed files:

wget https://zenodo.org/record/6561477/files/evanil_preprocessed_biosyn.tar.gz?download=1
tar -xvf 'evanil_preprocessed_biosyn.tar.gz?download=1'
rm 'evanil_preprocessed_biosyn.tar.gz?download=1'
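
As a quick sanity check, you can list the extracted files. The layout below is an assumption based on the DATA_DIR paths used in the training step (section 2.5):

ls datasets/evanil/
ls datasets/evanil/medic/preprocessed/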

2.4. Get the already trained models

Download and extract them:

mkdir tmp/
cd tmp/
wget https://zenodo.org/record/6561477/files/trained_biosyn_models.tar.gz?download=1
tar -xvf 'trained_biosyn_models.tar.gz?download=1'
rm 'trained_biosyn_models.tar.gz?download=1'
cd ../ 
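
The extracted directories are expected to follow the naming pattern used in section 2.5 ('./tmp/biosyn-biobert-<partition>/'); you can verify with:

ls tmp/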

2.5. Train BioSyn models on the EvaNIL dataset (optional)

Alternatively, if you want to train the BioSyn models from scratch on the EvaNIL dataset, run:

MODEL_NAME_OR_PATH=dmis-lab/biobert-base-cased-v1.1
OUTPUT_DIR=./tmp/biosyn-biobert-<partition>/
DATA_DIR=./datasets/evanil/<partition>/

CUDA_VISIBLE_DEVICES=0 python train.py \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --train_dictionary_path ${DATA_DIR}/preprocessed/train_dictionary.txt \
    --train_dir ${DATA_DIR}/preprocessed/processed_train \
    --output_dir ${OUTPUT_DIR} \
    --use_cuda \
    --topk 1 \
    --epoch 2 \
    --train_batch_size 16 \
    --learning_rate 1e-5 \
    --max_length 25
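
For example, to train on the "medic" partition, the three variables above would be set as:

MODEL_NAME_OR_PATH=dmis-lab/biobert-base-cased-v1.1
OUTPUT_DIR=./tmp/biosyn-biobert-medic/
DATA_DIR=./datasets/evanil/medic/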

2.6. Get the Named Entity Linking datasets

It is necessary to download the BC5CDR-Disease, BC5CDR-Chemical, and NCBI-Disease datasets:

cd datasets/

# BC5CDR-Disease
gdown https://drive.google.com/uc?id=1moAqukbrdpAPseJc3UELEY6NLcNk22AA
tar -xvf bc5cdr-disease.tar.gz

# BC5CDR-Chemical
gdown https://drive.google.com/uc?id=1mgQhjAjpqWLCkoxIreLnNBYcvjdsSoGi
tar -xvf bc5cdr-chemical.tar.gz

# NCBI-Disease
gdown https://drive.google.com/uc?id=1mmV7p33E1iF32RzAET3MsLHPz1PiF9vc
tar -xvf ncbi-disease.tar.gz

cd ../
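
With the NEL datasets in place, a trained BioSyn model can be used for inference on them. The sketch below assumes the evaluation interface of the upstream BioSyn repository (the 'eval.py' script, its flags, and the 'test_dictionary.txt'/'processed_test' layout); the modified fork may differ:

MODEL_NAME_OR_PATH=./tmp/biosyn-biobert-medic/
DATA_DIR=./datasets/ncbi-disease/
OUTPUT_DIR=./tmp/biosyn-biobert-medic/

CUDA_VISIBLE_DEVICES=0 python eval.py \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --dictionary_path ${DATA_DIR}/test_dictionary.txt \
    --data_dir ${DATA_DIR}/processed_test \
    --output_dir ${OUTPUT_DIR} \
    --use_cuda \
    --topk 20 \
    --max_length 25 \
    --save_predictions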

3. NILINKER

3.1. Preparing NILINKER (optional)

You can download the Word-Concept dictionaries, the embeddings, and the annotations files used in the experiments.

However, if you want to generate those files yourself for a given partition of the EvaNIL dataset, run:

./get_NILINKER_preparation_data.sh
./prepare_NILINKER.sh <partition>

Arg:

  • partition: 'medic', 'ctd_anat', 'ctd_chem', 'chebi', 'go_bp' or 'hp'
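
For example, to prepare the files for the ChEBI partition:

./prepare_NILINKER.sh chebi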

At this stage, NILINKER is ready for training or hyperparameter optimization.

3.2. Train NILINKER models (optional)

3.2.1. Get the preprocessed EvaNIL dataset in the annotations format required by NILINKER

Run:

cd data/
wget https://zenodo.org/record/6561477/files/annotations.tar.gz?download=1
tar -xvf 'annotations.tar.gz?download=1'
rm 'annotations.tar.gz?download=1'
cd ../

3.2.2. Hyperparameter optimization

Run experiments to find the best combination of hyperparameters:

python src/NILINKER/hyperparameter_optimization.py -partition <partition>

Arg:

  • partition: 'medic', 'ctd_anat', 'ctd_chem', 'chebi', 'go_bp' or 'hp'
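
For example, for the Human Phenotype Ontology partition:

python src/NILINKER/hyperparameter_optimization.py -partition hp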

3.2.3. Final training

To train the final version of the NILINKER model on a given EvaNIL partition, run the training script with the arg 'mode' set to 'final'.

Example:

python src/NILINKER/train_nilinker.py -mode final -partition chebi

The trained model file 'best.h5' will be in the directory 'data/nilinker_files/chebi/final/'.


4. Named Entity Linking Evaluation datasets

Datasets (with target Knowledge Bases in parentheses):

  • BC5CDR-Disease (MEDIC vocabulary)
  • BC5CDR-Chemical (CTD-Chemical vocabulary)
  • NCBI Disease corpus (MEDIC vocabulary)
  • GSC+ (Human Phenotype Ontology)
  • CHR (ChEBI ontology)
  • PHAEDRA (CTD-Chemical vocabulary)

4.1. Preprocess BC5CDR-Disease, BC5CDR-Chemical, NCBI Disease datasets (optional)

Get the modified version of the repository "Fair Evaluation in Concept Normalization: a Large-scale Comparative Analysis for BERT-based Models":

git clone https://github.com/pedroruas18/Fair-Evaluation-BERT.git

Inside the repository, install the requirements and execute the script 'prepare.sh':

cd Fair-Evaluation-BERT
pip install -r requirements.txt
chmod +x prepare.sh
./prepare.sh
cd ../

This will prepare the provided datasets (BC5CDR-Disease, BC5CDR-Chemical, NCBI Disease) to be used with REEL-based models.

4.2. Preprocess the CHR, GSC+ and PHAEDRA datasets (optional)

First get the data:

chmod +x get_NEL_evaluation_data.sh
./get_NEL_evaluation_data.sh

Then preprocess it for use with REEL-based models:

cd src/nel_evaluation/
python process_nel_corpora.py gsc+
python process_nel_corpora.py phaedra
python process_nel_corpora.py chr
cd ../../

4.3. Download the already preprocessed datasets

Run:

wget https://zenodo.org/record/6561477/files/preprocessed_nel_datasets.tar.gz?download=1
tar -xvf 'preprocessed_nel_datasets.tar.gz?download=1'
rm 'preprocessed_nel_datasets.tar.gz?download=1'