UMLS-MEDLINE Biomedical Distant RE for Bag-level Multiple Instance Learning

Code for the BioNLP 2020 paper A Data-driven Approach for Noise Reduction in Distantly Supervised Biomedical Relation Extraction.

Requirements

pip install -r requirements.txt

Data

To run the code, please obtain the data as follows:

UMLS

Install the UMLS tools by following the steps here. Once installed, you can find MRREL.RRF and MRCONSO.RRF under INSTALLED_DIR/2019AB/META. Copy both files and place them under data/UMLS.
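For orientation, both files are pipe-delimited text with one record per line. The sketch below shows one way to read relation rows from MRREL.RRF. It is not the repository's data_utils.process_umls implementation. The column positions are assumed from the UMLS reference manual layout for MRREL (CUI1 first, REL fourth, CUI2 fifth, RELA eighth), and the function name iter_mrrel_rows and the sample rows are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# Column positions in MRREL.RRF, assumed from the UMLS reference manual layout.
CUI1, REL, CUI2, RELA = 0, 3, 4, 7


def iter_mrrel_rows(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (CUI1, relation label, CUI2) in file column order.

    Uses RELA (the more specific label) when it is non-empty, else REL.
    The direction of the relation is not interpreted here; check the UMLS
    documentation before treating the tuple as a directed triple.
    """
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= RELA:
            continue  # skip empty or malformed rows
        yield fields[CUI1], fields[RELA] or fields[REL], fields[CUI2]


# Hypothetical rows for illustration only (not real UMLS data).
sample = [
    "C0000001|A0000001|AUI|RO|C0000002|A0000002|AUI|may_treat|R0000001||SRC|SRC|||N||\n",
    "C0000003|A0000003|AUI|RB|C0000004|A0000004|AUI||R0000002||SRC|SRC|||N||\n",
]
rows = list(iter_mrrel_rows(sample))
```

On the real file, the same function could be fed an open file handle, for example open("data/UMLS/MRREL.RRF", encoding="utf-8"), so that rows are streamed rather than loaded into memory at once.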

MEDLINE

Download the MEDLINE abstracts file medline_abs.txt (~24.5 GB) and place it under data/MEDLINE. UPDATE: Please follow the discussion here: #2

Data Creation

From the project base directory, process UMLS with python -m data_utils.process_umls. This creates the object data/umls_vocab.pkl.
Next, run python -m data_utils.extract_unique_sentences_medline. This might take a while and creates the file data/MEDLINE/medline_unique_sentences.txt.
Finally, link the entities with the texts: python -m data_utils.link_entities (see config.py to adjust linking settings).
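The three data creation steps depend on each other's outputs, so they have to run in this order. As a convenience, they could be chained from a small driver like the sketch below. The module names are the ones listed in this section. The helper names build_commands and run_pipeline are hypothetical and not part of the repository.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

# Module names taken from the data creation steps, in the order they must run.
DATA_CREATION_MODULES = [
    "data_utils.process_umls",
    "data_utils.extract_unique_sentences_medline",
    "data_utils.link_entities",
]


def build_commands(modules: Sequence[str], python: str = sys.executable) -> List[List[str]]:
    """Return one `python -m <module>` command per module."""
    return [[python, "-m", module] for module in modules]


def run_pipeline(commands: Sequence[List[str]], runner: Callable = subprocess.run) -> None:
    """Run the commands in order; check=True stops at the first failing step."""
    for command in commands:
        print("Running:", " ".join(command))
        runner(command, check=True)
```

Calling run_pipeline(build_commands(DATA_CREATION_MODULES)) from the project base directory would run the three steps back to back. The runner parameter exists only so the sequence can be dry-run without launching any process.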

Data Splits

To reproduce the data splits reported in the paper for the k-tag setting, run with default options: python -m data_utils.create_split. The first run will take a while because it generates the one-time file data/MEDLINE/linked_sentences_to_groups.jsonl. Later runs use the cached version. For s-tag, set the flag k_tag=False in config.py. For s-tag+exprels, additionally set the flag expand_rels=True.
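The three settings differ only in two flags, which can be summarized as a lookup table. This is a sketch for reference, not code from the repository. The flag names k_tag and expand_rels come from config.py as described here, while the values marked as inferred (k_tag=True and expand_rels=False as defaults) are assumptions drawn from the wording and should be checked against config.py. The names SPLIT_SETTINGS and flags_for are hypothetical.

```python
from typing import Dict

# Flag values per split setting. Entries marked "inferred" are not stated
# explicitly in the README and are assumptions about the defaults.
SPLIT_SETTINGS: Dict[str, Dict[str, bool]] = {
    "k-tag": {"k_tag": True, "expand_rels": False},   # inferred defaults
    "s-tag": {"k_tag": False, "expand_rels": False},  # expand_rels inferred
    "s-tag+exprels": {"k_tag": False, "expand_rels": True},
}


def flags_for(setting: str) -> Dict[str, bool]:
    """Return the config.py flag values to set for a named split setting."""
    try:
        return dict(SPLIT_SETTINGS[setting])
    except KeyError:
        known = ", ".join(sorted(SPLIT_SETTINGS))
        raise ValueError(f"unknown setting {setting!r}; expected one of: {known}")
```

After editing the flags in config.py to match the chosen row, python -m data_utils.create_split is run as before.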

Features

Run python -m data_utils.features. Running the job with multi-processing will be significantly faster.
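To illustrate why multi-processing helps: feature extraction is independent per example, so the work can be split across processes. The sketch below shows the general pattern with the standard library. It does not reflect how data_utils.features is implemented or configured, and featurize is a hypothetical stand-in for the real per-example work.

```python
from multiprocessing import Pool
from typing import Dict, List


def featurize(sentence: str) -> Dict[str, int]:
    """Hypothetical stand-in for per-example feature extraction."""
    return {"n_tokens": len(sentence.split()), "n_chars": len(sentence)}


def featurize_all(sentences: List[str], workers: int = 1) -> List[Dict[str, int]]:
    """Featurize sentences, using a process pool when workers > 1.

    Results are returned in input order in both code paths.
    """
    if workers <= 1:
        return [featurize(s) for s in sentences]
    with Pool(processes=workers) as pool:
        # chunksize batches items per task to reduce inter-process overhead.
        return pool.map(featurize, sentences, chunksize=256)


features = featurize_all(["a b c", "hello"])
```

With workers greater than 1, the call should sit under an if __name__ == "__main__": guard on platforms that start worker processes by spawning a fresh interpreter.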

Train

Run python train.py.

Checkpoint

Download the best model checkpoint here.

Citation

If you use this code for your research, please consider citing:

@inproceedings{amin-etal-2020-data,
    title = "A Data-driven Approach for Noise Reduction in Distantly Supervised Biomedical Relation Extraction",
    author = "Amin, Saadullah and Dunfield, Katherine Ann and Vechkaeva, Anna and Neumann, G{\"u}nter",
    booktitle = "Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.bionlp-1.20",
    doi = "10.18653/v1/2020.bionlp-1.20",
    pages = "187--194"
}

Also, check out our follow-up work, MedDistant19, which introduces a new benchmark using PubMed abstracts and the SNOMED CT knowledge base:

@inproceedings{amin-etal-2022-meddistant19,
    title = "{M}ed{D}istant19: Towards an Accurate Benchmark for Broad-Coverage Biomedical Relation Extraction",
    author = "Amin, Saadullah and Minervini, Pasquale and Chang, David and Stenetorp, Pontus and Neumann, G{\"u}nter",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.198",
    pages = "2259--2277",
}

Acknowledgements

We thank Qin Dai (daiqin@ecei.tohoku.ac.jp) for guiding us on steps to obtain the relevant triples data from the UMLS in private communication.

