This repository contains the implementation for the Multi-Head-CRF model as described in:
Multi-head CRF classifier for biomedical multi-class Named Entity Recognition on Spanish clinical notes
Create a Python environment and install the dependencies:

```shell
python -m venv venv
PIP=venv/bin/pip
$PIP install --upgrade pip
$PIP install -r requirements.txt
```
The dataset used in this work merges four separate datasets, all licensed under CC 4.0.
To set up the dataset, a script is provided (`dataset/download_dataset.sh`) that downloads these datasets, prepares them in the correct format, and merges them into a unified dataset.
Alternatively, the dataset is available on:
This step is required if you wish to run the Named Entity Linking or Evaluation.
Go to the `src` directory:

```shell
cd src
```
To train a model, use the following command:

```shell
python hf_trainer.py lcampillos/roberta-es-clinical-trials-ner --augmentation random --number_of_layer_per_head 3 --context 32 --epochs 60 --batch 16 --percentage_tags 0.25 --aug_prob 0.5 --classes SYMPTOM PROCEDURE DISEASE PROTEIN CHEMICAL
```
- `lcampillos/roberta-es-clinical-trials-ner`: Model checkpoint.
- `--number_of_layer_per_head`: Number of hidden layers to use in each CRF head (good options: 1-3).
- `--context`: Context size for splitting documents exceeding the 512-token limit (good options: 2 or 32).
- `--epochs`: Number of epochs to train.
- `--batch`: Batch size.
- `--augmentation`: Augmentation strategy (`None`, `random`, or `unk`).
- `--aug_prob`: Probability of applying augmentation to a sample.
- `--percentage_tags`: Percentage of tokens to change.
- `--classes`: Classes to train; must be a combination of: SYMPTOM PROCEDURE DISEASE PROTEIN CHEMICAL.
- `--val`: Whether to use a validation dataset; otherwise, the test dataset is used.
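The `--context` option controls how much surrounding text each window keeps when a long document is split to fit the 512-token limit. The following is a minimal illustrative sketch of overlapping window splitting, not the repository's actual implementation (the function name and logic here are assumptions):

```python
def split_with_context(tokens, max_len=512, context=32):
    """Split a token sequence into windows of at most max_len tokens.

    Each window after the first repeats `context` tokens from the end
    of the previous window, so entities near a split point still have
    left-hand context available to the model.
    """
    if len(tokens) <= max_len:
        return [tokens]
    windows = []
    step = max_len - context  # number of genuinely new tokens per window
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += step
    return windows

# A 1000-token document becomes three windows; consecutive windows
# share a 32-token overlap.
windows = split_with_context(list(range(1000)), max_len=512, context=32)
```

A larger `context` gives each window more shared tokens at the cost of more windows per document.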
To run inference for the model, we provide an inference file, which will conduct inference over the test dataset by default:

```shell
python inference.py MODEL_CHECKPOINT
```
We also provide several of our best-performing models on Hugging Face:
- IEETA/RobertaMultiHeadCRF-C32-0
- IEETA/RobertaMultiHeadCRF-C32-1
- IEETA/RobertaMultiHeadCRF-C32-2
- IEETA/RobertaMultiHeadCRF-C32-3
Example:

```shell
python inference.py IEETA/RobertaMultiHeadCRF-C32-0
```
To use the SNOMED CT terminology, you must create a UMLS account and download the file; the folder is expected to be extracted into the `embeddings` directory. Although we do not supply the original resource, we do supply all the embeddings used for SNOMED CT and the various gazetteers, which are available here, with a download script in `embeddings/download_embeddings.sh`.
To build the embeddings, first run the `embeddings/prepare_jsonl_for_embedding.py` script, which creates JSONL files from the various gazetteers. Then run `embeddings/build_embeddings_index.py`:

```shell
python build_embeddings_index.py snomedCT.jsonl
```
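For readers unfamiliar with JSONL, the pipeline above boils down to one JSON object per line, indexed by concept identifier. A minimal sketch of reading such a file (the field names `id` and `embedding` are illustrative assumptions, not necessarily the repository's actual schema):

```python
import json

def load_jsonl_index(path):
    """Read one JSON object per line and index embeddings by concept id."""
    index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            index[record["id"]] = record["embedding"]
    return index

# Tiny demonstration with a hypothetical two-record file.
with open("gazetteer_sample.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"id": "C0011849", "embedding": [0.1, 0.2]}) + "\n")
    f.write(json.dumps({"id": "C0020538", "embedding": [0.3, 0.4]}) + "\n")

index = load_jsonl_index("gazetteer_sample.jsonl")
```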
With these embeddings we can conduct normalization (in `src`):

```shell
python normalize.py INPUT_RUN --t 0.6 --use_gazetteer False --output_folder runs
```
where `--t` is the acceptance threshold and `--use_gazetteer` controls whether the gazetteers are used for normalization.
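The acceptance threshold works as in standard embedding-based entity linking: a mention is mapped to its most similar candidate concept only if the similarity clears `--t`. A hedged sketch of this idea (not the repository's actual `normalize.py` logic; names here are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalize_mention(mention_vec, candidates, threshold=0.6):
    """Return the id of the most similar candidate embedding,
    or None if even the best match falls below the threshold."""
    best_id, best_sim = None, -1.0
    for concept_id, vec in candidates.items():
        sim = cosine(mention_vec, vec)
        if sim > best_sim:
            best_id, best_sim = concept_id, sim
    return best_id if best_sim >= threshold else None
```

With a higher threshold, fewer mentions are linked, trading recall for precision.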
The evaluation (NER and entity linking) can be run in the `evaluation/` directory as follows:

```shell
python3 evaluation.py train/test PREDICTIONS_FILE.tsv
```
This project is licensed under the MIT License - see the LICENSE file for details.
Authors:
- Richard A A Jonker (ORCID: 0000-0002-3806-6940)
- Tiago Almeida (ORCID: 0000-0002-4258-3350)
- Rui Antunes (ORCID: 0000-0003-3533-8872)
- João R Almeida (ORCID: 0000-0003-0729-2264)
- Sérgio Matos (ORCID: 0000-0003-1941-3983)