Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fill in dataset card for NCBI disease dataset #3115

Merged
merged 2 commits into from
Oct 22, 2021
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
144 changes: 100 additions & 44 deletions datasets/ncbi_disease/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,59 +18,73 @@ task_categories:
task_ids:
- named-entity-recognition
paperswithcode_id: ncbi-disease-1
pretty_name: NCBI Disease
---

# Dataset Card for [Dataset Name]
# Dataset Card for NCBI Disease

## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
- [Dataset Card for NCBI Disease](#dataset-card-for-ncbi-disease)
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
- [Who are the source language producers?](#who-are-the-source-language-producers)
- [Annotations](#annotations)
- [Annotation process](#annotation-process)
- [Who are the annotators?](#who-are-the-annotators)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Dataset Description

- **Homepage:** [NCBI](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3951655/)
- **Homepage:** [NCBI](https://www.ncbi.nlm.nih.gov/research/bionlp/Data/disease)
- **Repository:** [Github](https://github.com/spyysalo/ncbi-disease)
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
- **Paper:** [NCBI disease corpus: A resource for disease name recognition and concept normalization](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3951655)
- **Leaderboard:** [Named Entity Recognition on NCBI-disease](https://paperswithcode.com/sota/named-entity-recognition-ner-on-ncbi-disease)
- **Point of Contact:** zhiyong.lu@nih.gov

### Dataset Summary

[More Information Needed]
This dataset contains the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community.

### Supported Tasks and Leaderboards

[More Information Needed]
Named Entity Recognition: [Leaderboard](https://paperswithcode.com/sota/named-entity-recognition-ner-on-ncbi-disease)

### Languages

[More Information Needed]
The text in the dataset is in English. The associated BCP-47 code is `en`.

## Dataset Structure

### Data Instances

[More Information Needed]
Instances of the dataset contain an array of `tokens`, `ner_tags` and an `id`. An example of an instance of the dataset:
```
{
'tokens': ['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.'],
'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0],
'id': '0'
}
```

### Data Fields

Expand All @@ -80,65 +94,107 @@ paperswithcode_id: ncbi-disease-1

### Data Splits

[More Information Needed]
The data is split into a train (5433 instances), validation (924 instances) and test set (941 instances).

## Dataset Creation

### Curation Rationale

[More Information Needed]
The goal of the dataset consists on improving the state-of-the-art in disease name recognition and normalization research, by providing a high-quality gold standard thus enabling the development of machine-learning based approaches for such tasks.

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]
The dataset consists on abstracts extracted from PubMed.

#### Who are the source language producers?

[More Information Needed]
The source language producers are the authors of publication abstracts hosted in PubMed.

### Annotations

#### Annotation process

[More Information Needed]
Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency.

#### Who are the annotators?

[More Information Needed]
The annotator group consisted of 14 people with backgrounds in biomedical informatics research and experience in biomedical text corpus annotation.

### Personal and Sensitive Information

[More Information Needed]
[N/A]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]
Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information. This dataset provides an annotated corpora that can be used to develop highly effective tools to automatically detect central biomedical concepts such as diseases.

### Discussion of Biases

[More Information Needed]
To avoid annotator bias, pairs of annotators were chosen randomly for each set, so that each pair of annotators overlapped for at most two sets.

### Other Known Limitations

[More Information Needed]
A handful of disease concepts were discovered that were not included in MEDIC. For those, we decided to include the appropriate OMIM identifiers.

In addition, certain disease mentions were found to not be easily represented using the standard categorizations.

Also, each PMID document was pre-annotated using the Inference Method developed for disease name normalization, which properly handles abbreviation recognition, robust string matching, etc. As such, human annotators were given the pre-annotated documents as a starting point and allowed to see each pre-annotation with a computed confidence.

## Additional Information

### Dataset Curators

[More Information Needed]
Rezarta Islamaj Doğan, Robert Leaman, Zhiyong Lu

### Licensing Information

[More Information Needed]
```
PUBLIC DOMAIN NOTICE

This work is a "United States Government Work" under the terms of the
United States Copyright Act. It was written as part of the authors'
official duties as a United States Government employee and thus cannot
be copyrighted within the United States. The data is freely available
to the public for use. The National Library of Medicine and the
U.S. Government have not placed any restriction on its use or
reproduction.

Although all reasonable efforts have been taken to ensure the accuracy
and reliability of the data and its source code, the NLM and the
U.S. Government do not and cannot warrant the performance or results
that may be obtained by using it. The NLM and the U.S. Government
disclaim all warranties, express or implied, including warranties of
performance, merchantability or fitness for any particular purpose.

Please cite the authors in any work or product based on this material:

An improved corpus of disease mentions in PubMed citations
http://aclweb.org/anthology-new/W/W12/W12-2411.pdf

NCBI Disease Corpus: A Resource for Disease Name Recognition and
Normalization http://www.ncbi.nlm.nih.gov/pubmed/24393765

Disease Name Normalization with Pairwise Learning to Rank
http://www.ncbi.nlm.nih.gov/pubmed/23969135
```

### Citation Information

[More Information Needed]
```
@article{dougan2014ncbi,
title={NCBI disease corpus: a resource for disease name recognition and concept normalization},
author={Do{\u{g}}an, Rezarta Islamaj and Leaman, Robert and Lu, Zhiyong},
journal={Journal of biomedical informatics},
volume={47},
pages={1--10},
year={2014},
publisher={Elsevier}
}
```

### Contributions

Expand Down