GGPONC - The German Clinical Guideline Corpus for Oncology

This repository collects resources related to GGPONC.

It covers:

- (Nested) Clinical Named Entity Recognition
- UMLS Entity Linking with xMEN
- Resolution of Coordination Ellipses
- Molecular Named Entities (Genes / Proteins, Variants)

Preparation

Get access to GGPONC by following the instructions on the project homepage, then place the contents of the 2.0 release in the data folder:
- https://zenodo.org/records/12518458 as v2.0_2022_03_24
- https://zenodo.org/records/12530242 as v2.0_agreement

Install the Python dependencies with `pip install -r requirements.txt`.

Clinical Named Entity Recognition

Data Loading

A BigBIO-compatible data loader for the latest gold-standard annotations (GGPONC 2.0), suitable for training NER models, is available through the Hugging Face Hub: https://huggingface.co/datasets/bigbio/ggponc2

from datasets import load_dataset
dataset = load_dataset('bigbio/ggponc2', data_dir='data/v2.0_2022_03_24', name='ggponc2_fine_long_bigbio_kb')
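
Once loaded, each record follows the BigBIO knowledge-base schema, in which every entity carries parallel `text` and `offsets` lists. The following is a minimal sketch of reading entity mentions from such a record. The helper name `iter_entity_mentions`, the label `ExampleType`, and the toy record are hypothetical and only serve as illustration; they are not taken from GGPONC.

```python
from typing import Dict, Iterator, List, Tuple


def iter_entity_mentions(example: Dict) -> Iterator[Tuple[str, str, List[Tuple[int, int]]]]:
    """Yield (type, text, offsets) for each entity of one BigBIO-kb record."""
    for entity in example["entities"]:
        # "text" and "offsets" are parallel lists, so a discontinuous
        # mention can be represented as several fragments.
        text = " ".join(entity["text"])
        offsets = [tuple(span) for span in entity["offsets"]]
        yield entity["type"], text, offsets


# Hand-written record for illustration only (hypothetical label and content).
toy_example = {
    "id": "0",
    "document_id": "toy_doc",
    "passages": [],
    "entities": [
        {
            "id": "e1",
            "type": "ExampleType",
            "text": ["Chemotherapie"],
            "offsets": [[0, 13]],
            "normalized": [],
        },
    ],
}

mentions = list(iter_entity_mentions(toy_example))
```

With the real dataset, the same helper could be applied to each record of a split, for example `dataset["train"]`, assuming a split of that name is present.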

Nested NER with spaCy Spancat

A trained spaCy model for nested NER is available on Hugging Face: https://huggingface.co/phlobo/de_ggponc_medbertde

huggingface-cli download phlobo/de_ggponc_medbertde de_ggponc_medbertde-any-py3-none-any.whl --local-dir .
pip install -q de_ggponc_medbertde-any-py3-none-any.whl

See: 01_GGPONC_Nested_NER
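
In nested NER, one annotated span may lie entirely inside another, which a flat tagging scheme cannot represent. As a small illustration, independent of spaCy and of the trained model, the sketch below finds such containment pairs in a list of spans. The `Span` type, the function `nested_pairs`, and the labels are hypothetical names chosen for this example.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int  # inclusive offset
    end: int    # exclusive offset
    label: str


def nested_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return (outer, inner) pairs where inner lies strictly inside outer."""
    pairs = []
    for outer in spans:
        for inner in spans:
            contained = outer.start <= inner.start and inner.end <= outer.end
            # Require the inner span to be shorter, so that a span is never
            # paired with itself or with a span sharing both boundaries.
            shorter = (inner.end - inner.start) < (outer.end - outer.start)
            if contained and shorter:
                pairs.append((outer, inner))
    return pairs


# Character offsets into the toy string "adjuvante Chemotherapie".
spans = [
    Span(0, 23, "OUTER_LABEL"),   # "adjuvante Chemotherapie"
    Span(10, 23, "INNER_LABEL"),  # "Chemotherapie"
    Span(30, 35, "OTHER_LABEL"),  # unrelated span elsewhere in a longer text
]

pairs = nested_pairs(spans)
```

spaCy's span categorizer stores its predictions as span groups in `doc.spans` under a configurable key, so the spans of a processed document could be converted to this form. Which key the published model uses is not stated here and would need to be checked in the notebook.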

Flat NER

Training and evaluation of the (flat) NER models described in Borchert et al. (2022) are covered in the GGPONC 2.0 repository.

UMLS Entity Linking with xMEN

We use the xMEN toolkit with a pre-trained re-ranker to normalize identified entity mention spans to UMLS codes.

See: 02_GGPONC_UMLS_Linking

Resolution of Coordination Ellipses

This notebook applies our encoder-decoder model for resolving elliptical coordinated compound noun phrases (ECCNPs), e.g. Chemo- und Strahlentherapie -> Chemotherapie und Strahlentherapie.

See: 03_ECCNP_Analysis.ipynb
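
To make the phenomenon concrete, the sketch below uses a simple regular expression to spot candidate ECCNPs: a token ending in a hyphen, followed by a conjunction and a full token. This is a rough surface heuristic for illustration only. It is not the encoder-decoder model used in the notebook, it only detects candidates without resolving them, and the conjunction list is an assumption.

```python
import re
from typing import List

# A hyphen-terminated token, a coordinating conjunction, then a full token.
# The conjunction list is an illustrative assumption, not exhaustive.
ECCNP_CANDIDATE = re.compile(r"\b\w+-\s+(?:und|oder|sowie)\s+\w+")


def find_eccnp_candidates(text: str) -> List[str]:
    """Return surface strings that look like elliptical coordinations."""
    return ECCNP_CANDIDATE.findall(text)


sentence = "Die Chemo- und Strahlentherapie wurde fortgesetzt."
candidates = find_eccnp_candidates(sentence)
```

A heuristic like this cannot decide which part of the second compound has to be copied to the first conjunct, which is the actual resolution step.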

Molecular Named Entities

Training and evaluation of a nested NER model for gene / protein and variant mentions. The dataset (molecular_2024_04_03) is not yet published but is available upon request. Place the release in the data folder to run the notebook.

See: 04_Molecular.ipynb


Repository	Description
ggponc_annotation	GGPONC 2.0 Results and Gold Standard Annotations
ggponc_preprocessing	Pre-Processing Pipeline (Tokenization, POS Tagging) and GGPONC 1.0 Results
ggponc_ellipses	Resolving Elliptical Compounds in German Medical Text
ggponc_molecular	GGTWEAK - Gene Tagging with Weak Supervision for German Clinical Text
