Important

⚠️ This repository is contains the code to reproduce results of the original GGPONC 2.0 paper.

➡️ More recent tooling for data loading, NER model training and entity linking is now available here: https://github.com/hpi-dhc/ggponc.

GGPONC 2.0 — Annotation and Experiments

This repository contains the code to reproduce the results from the paper:

Florian Borchert, Christina Lohr, Luise Modersohn, Jonas Witt, Thomas Langer, Markus Follmann, Matthias Gietzelt, Bert Arnrich, Udo Hahn, and Matthieu-P. Schapranow. 2022. GGPONC 2.0 - The German Clinical Guideline Corpus for Oncology: Curation Workflow, Annotation Policy, Baseline NER Taggers. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3650–3660, Marseille, France. European Language Resources Association.

Preparation

Get access to GGPONC following the instructions on the project homepage and place the the contents of the 2.0 release (v2.0_2022_03_24) in the data folder
Install Python dependencies pip install -r requirements.txt

Preprocessing (optional)

The prepocessed, i.e., sentence-splitted and tokenized text portions, are already included in the release (data/plain_text)

If you want to do the preprocessing yourself, build the jar file from https://github.com/hpi-dhc/ggponc_preprocessing and use it with data/xml/cpg-corpus-cms.xml

Notebooks

In notebooks, we provide the following Jupyter Notebooks to reproduce the results from the paper:

01_Statistics.ipynb
- Corpus Statistics and IAA Calculcation with INCEpTALYTICS
02_NER_Baselines.ipynb
- NER Baselines using BERT / HuggingFace Transformers
03_NER_Analysis.ipynb
- Analysis of NER errors and coordination ellipses
04_spaCy_Spancat.ipynb
- Alternative NER implementation using spaCy's SpanCategorizer feature, which can handle overlapping and nested mentions
05_Ellipses.ipynb
- Extraction of elliptical coordinated compound noun phrases from GGPONC 2.0 annotations

Running NER Experiments with HuggingFace and Hydra

In experiments, we provide Hydra configurations for the different NER experiments with the best hyperparameters found through grid search. To run such an experiment, do:

cd experiments
python run_experiment.py -cn <experiment>.yaml cuda=<cuda devices>
- for instance: python run_experiment.py -cn 01_ggponc_coarse_short.yaml cuda=0

If you have installed and configured Weights & Biases, it will automatically sync your runs.

To run a hyperparameter sweep, pass the optiom -m to Hydra, e.g.:

python run_experiment.py -m -cn=01_ggponc_coarse_short.yaml cuda=0 learning_rate=1e-6,5e-6,1e-5,5e-5,1e-4 label_smoothing_factor=0.0,0.05,0.1,0.2 weight_decay=0.0,0.05,0.1

Annotation Guide

Please refer to the annotation guide for a detailed description of the entity classes and rules.

Citing GGPONC

According to the terms of use of GGPONC, contributions which are based on the corpus must cite the following publication:

Florian Borchert, Christina Lohr, Luise Modersohn, Jonas Witt, Thomas Langer, Markus Follmann, Matthias Gietzelt, Bert Arnrich, Udo Hahn, and Matthieu-P. Schapranow. 2022. GGPONC 2.0 - The German Clinical Guideline Corpus for Oncology: Curation Workflow, Annotation Policy, Baseline NER Taggers. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3650–3660, Marseille, France. European Language Resources Association.

BibTeX:

@inproceedings{borchert-etal-2022-ggponc,
    title = "{GGPONC} 2.0 - The {G}erman Clinical Guideline Corpus for Oncology: Curation Workflow, Annotation Policy, Baseline {NER} Taggers",
    author = "Borchert, Florian  and
      Lohr, Christina  and
      Modersohn, Luise  and
      Witt, Jonas  and
      Langer, Thomas  and
      Follmann, Markus  and
      Gietzelt, Matthias  and
      Arnrich, Bert  and
      Hahn, Udo  and
      Schapranow, Matthieu-P.",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    pages = "3650--3660"
}

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
annotation_guide		annotation_guide
data		data
experiments		experiments
inception		inception
notebooks		notebooks
spacy		spacy
test		test
util		util
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE.txt		LICENSE.txt
README.md		README.md
conftest.py		conftest.py
gamma_agreement.py		gamma_agreement.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Important

GGPONC 2.0 — Annotation and Experiments

Preparation

Preprocessing (optional)

Notebooks

Running NER Experiments with HuggingFace and Hydra

Annotation Guide

Citing GGPONC

About

Releases 1

Packages

Languages

License

hpi-dhc/ggponc_annotation

Folders and files

Latest commit

History

Repository files navigation

Important

GGPONC 2.0 — Annotation and Experiments

Preparation

Preprocessing (optional)

Notebooks

Running NER Experiments with HuggingFace and Hydra

Annotation Guide

Citing GGPONC

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages