This repository collects resources related to GGPONC.
It covers:
- (Nested) Clinical Named Entity Recognition
- UMLS Entity Linking with xMEN
- Resolution of Coordination Ellipses
- Molecular Named Entities (Genes / Proteins, Variants)
See also:
Repository | Description |
---|---|
ggponc_annotation | GGPONC 2.0 Results and Gold Standard Annotations |
ggponc_preprocessing | Pre-Processing Pipeline (Tokenization, POS Tagging) and GGPONC 1.0 Results |
ggponc_ellipses | Resolving Elliptical Compounds in German Medical Text |
ggponc_molecular | GGTWEAK - Gene Tagging with Weak Supervision for German Clinical Text |
- Get access to GGPONC following the instructions on the project homepage and place the contents of the 2.0 release in the `data` folder:
    - https://zenodo.org/records/12518458 as `v2.0_2022_03_24`
    - https://zenodo.org/records/12530242 as `v2.0_agreement`
- Install Python dependencies: `pip install -r requirements.txt`
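
Before running the notebooks, it can help to confirm that both release folders sit where the setup steps expect them. The following is a minimal sketch: the helper name `missing_releases` is hypothetical and not part of this repository, and it assumes only the folder names listed in the setup steps.

```python
from pathlib import Path

# Folder names taken from the setup instructions in this README.
EXPECTED_RELEASES = ("v2.0_2022_03_24", "v2.0_agreement")


def missing_releases(data_dir="data", expected=EXPECTED_RELEASES):
    """Return the expected release folders that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_releases()
    if missing:
        print("Missing release folders in data/:", ", ".join(missing))
    else:
        print("All expected release folders found.")
```

The check only looks at folder names; it does not validate the contents of a release.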
A BigBIO-compatible data loader for the latest gold-standard annotations (GGPONC 2.0), suitable for training NER models, is available through the Hugging Face Hub: https://huggingface.co/datasets/bigbio/ggponc2
from datasets import load_dataset
dataset = load_dataset('bigbio/ggponc2', data_dir='data/v2.0_2022_03_24', name='ggponc2_fine_long_bigbio_kb')
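
Examples loaded with a `_bigbio_kb` configuration follow the BigBIO knowledge-base schema, in which every entity carries parallel `text` and `offsets` lists. As a rough sketch of how such an example could be flattened into character spans, the helper below (`entity_spans`, a hypothetical name) runs on a hand-written toy record. The toy text and the label `ExampleType` are placeholders, not GGPONC data or GGPONC entity classes.

```python
def entity_spans(example):
    """Flatten the entities of one BigBIO-KB example into sorted tuples.

    Each tuple is (start, end, type, text). An entity with several
    fragments contributes one tuple per fragment.
    """
    spans = []
    for entity in example["entities"]:
        for (start, end), text in zip(entity["offsets"], entity["text"]):
            spans.append((start, end, entity["type"], text))
    # Start ascending, end descending: outer spans come before nested ones.
    return sorted(spans, key=lambda span: (span[0], -span[1]))


# Hand-written toy record in the BigBIO KB layout (placeholder values only).
toy_example = {
    "document_id": "toy",
    "passages": [
        {"id": "p0", "type": "sentence",
         "text": ["Adjuvante Chemotherapie"], "offsets": [[0, 23]]},
    ],
    "entities": [
        {"id": "e1", "type": "ExampleType",
         "text": ["Chemotherapie"], "offsets": [[10, 23]], "normalized": []},
        {"id": "e0", "type": "ExampleType",
         "text": ["Adjuvante Chemotherapie"], "offsets": [[0, 23]], "normalized": []},
    ],
}

for span in entity_spans(toy_example):
    print(span)
```

With the real data, the same function could be applied to a single record, for instance `entity_spans(dataset["train"][0])`, assuming the loaded dataset exposes a `train` split.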
A trained spaCy model for nested NER is available on Hugging Face: https://huggingface.co/phlobo/de_ggponc_medbertde
huggingface-cli download phlobo/de_ggponc_medbertde de_ggponc_medbertde-any-py3-none-any.whl --local-dir .
pip install -q de_ggponc_medbertde-any-py3-none-any.whl
See: 01_GGPONC_Nested_NER
Training and evaluation of the (flat) NER models described in Borchert et al. (2022) are covered in the GGPONC 2.0 repository.
We use the xMEN toolkit with a pre-trained re-ranker to normalize identified entity mention spans to UMLS codes.
Application of our encoder-decoder model for resolving elliptical coordinated compound noun phrases (ECCNPs), e.g. `Chemo- und Strahlentherapie` -> `Chemotherapie und Strahlentherapie`
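
To make the phenomenon concrete, the sketch below flags candidate ECCNPs with a plain regular expression. This is not the encoder-decoder model used in this repository: it is a simple heuristic that only detects the surface pattern (hyphenated fragment, conjunction, full word) and does not produce the resolved form. The function name `find_candidates` and the conjunction list are illustrative assumptions.

```python
import re

# A word fragment ending in a hyphen, a coordinating conjunction, then a full word.
# The conjunction list is an illustrative assumption and not exhaustive.
ECCNP_CANDIDATE = re.compile(r"\b\w+-\s+(?:und|oder|sowie|bzw\.)\s+\w+")


def find_candidates(text):
    """Return substrings that look like elliptical coordinated compounds."""
    return ECCNP_CANDIDATE.findall(text)


print(find_candidates("Empfohlen wird eine Chemo- und Strahlentherapie."))
```

A pattern like this misses other ellipsis types and cannot tell where the shared part of the compound begins, which is the part that a learned model has to handle.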
Training and evaluation of a nested NER model for gene / protein and variant mentions. The dataset (`molecular_2024_04_03`) is not yet published, but available upon request. Place the release in `data` to run the notebook.
See: 04_Molecular.ipynb