Fix fine classes in trec dataset #4801

Merged · 5 commits · Aug 22, 2022

111 changes: 94 additions & 17 deletions datasets/trec/README.md
@@ -1,8 +1,24 @@
---
annotations_creators:
- expert-generated
language:
- en
paperswithcode_id: trecqa
language_creators:
- expert-generated
license:
- unknown
multilinguality:
- monolingual
pretty_name: Text Retrieval Conference Question Answering
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
paperswithcode_id: trecqa
---

# Dataset Card for "trec"
@@ -43,51 +59,113 @@ pretty_name: Text Retrieval Conference Question Answering

### Dataset Summary

The Text REtrieval Conference (TREC) Question Classification dataset contains 5500 labeled questions in training set and another 500 for test set. The dataset has 6 labels, 47 level-2 labels. Average length of each sentence is 10, vocabulary size of 8700.
The Text REtrieval Conference (TREC) Question Classification dataset contains 5500 labeled questions in training set and another 500 for test set.

Data are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and also 500 questions from TREC 10 which serves as the test set.
The dataset has 6 coarse class labels and 50 fine class labels. Average length of each sentence is 10, vocabulary size of 8700.

Data are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and also 500 questions from TREC 10 which serves as the test set. These questions were manually labeled.
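
As a quick check that the updated script loads end to end, the dataset can be pulled with `load_dataset` (a minimal sketch against version 2.0.0 of this script; the printed example matches the instance shown below):

```
from datasets import load_dataset

trec = load_dataset("trec")
print(trec["train"][0])
# {'text': 'How did serfdom develop in and then leave Russia ?',
#  'coarse_label': 2, 'fine_label': 26}
```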

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
The language in this dataset is English (`en`).

## Dataset Structure

### Data Instances

#### default

- **Size of downloaded dataset files:** 0.34 MB
- **Size of the generated dataset:** 0.39 MB
- **Total amount of disk used:** 0.74 MB

An example of 'train' looks as follows.
```
{
"label-coarse": 1,
"label-fine": 2,
"text": "What fowl grabs the spotlight after the Chinese Year of the Monkey ?"
'text': 'How did serfdom develop in and then leave Russia ?',
'coarse_label': 2,
'fine_label': 26
}
```

### Data Fields

The data fields are the same among all splits.

#### default
- `label-coarse`: a classification label, with possible values including `DESC` (0), `ENTY` (1), `ABBR` (2), `HUM` (3), `NUM` (4).
- `label-fine`: a classification label, with possible values including `manner` (0), `cremat` (1), `animal` (2), `exp` (3), `ind` (4).
- `text`: a `string` feature.
- `text` (`str`): Text of the question.
- `coarse_label` (`ClassLabel`): Coarse class label. Possible values are:
- 'ABBR' (0): Abbreviation.
- 'ENTY' (1): Entity.
- 'DESC' (2): Description and abstract concept.
- 'HUM' (3): Human being.
- 'LOC' (4): Location.
- 'NUM' (5): Numeric value.
- `fine_label` (`ClassLabel`): Fine class label. Possible values are:
- ABBREVIATION:
- 'ABBR:abb' (0): Abbreviation.
- 'ABBR:exp' (1): Expression abbreviated.
- ENTITY:
- 'ENTY:animal' (2): Animal.
- 'ENTY:body' (3): Organ of body.
- 'ENTY:color' (4): Color.
- 'ENTY:cremat' (5): Invention, book and other creative piece.
- 'ENTY:currency' (6): Currency name.
- 'ENTY:dismed' (7): Disease and medicine.
- 'ENTY:event' (8): Event.
- 'ENTY:food' (9): Food.
- 'ENTY:instru' (10): Musical instrument.
- 'ENTY:lang' (11): Language.
- 'ENTY:letter' (12): Letter like a-z.
- 'ENTY:other' (13): Other entity.
- 'ENTY:plant' (14): Plant.
- 'ENTY:product' (15): Product.
- 'ENTY:religion' (16): Religion.
- 'ENTY:sport' (17): Sport.
- 'ENTY:substance' (18): Element and substance.
- 'ENTY:symbol' (19): Symbols and sign.
- 'ENTY:techmeth' (20): Techniques and method.
- 'ENTY:termeq' (21): Equivalent term.
- 'ENTY:veh' (22): Vehicle.
- 'ENTY:word' (23): Word with a special property.
- DESCRIPTION:
- 'DESC:def' (24): Definition of something.
- 'DESC:desc' (25): Description of something.
- 'DESC:manner' (26): Manner of an action.
- 'DESC:reason' (27): Reason.
- HUMAN:
  - 'HUM:gr' (28): Group or organization of persons.
- 'HUM:ind' (29): Individual.
- 'HUM:title' (30): Title of a person.
- 'HUM:desc' (31): Description of a person.
- LOCATION:
- 'LOC:city' (32): City.
- 'LOC:country' (33): Country.
- 'LOC:mount' (34): Mountain.
- 'LOC:other' (35): Other location.
- 'LOC:state' (36): State.
- NUMERIC:
- 'NUM:code' (37): Postcode or other code.
- 'NUM:count' (38): Number of something.
- 'NUM:date' (39): Date.
- 'NUM:dist' (40): Distance, linear measure.
- 'NUM:money' (41): Price.
- 'NUM:ord' (42): Order, rank.
- 'NUM:other' (43): Other number.
  - 'NUM:period' (44): Lasting time of something.
- 'NUM:perc' (45): Percent, fraction.
- 'NUM:speed' (46): Speed.
- 'NUM:temp' (47): Temperature.
- 'NUM:volsize' (48): Size, area and volume.
- 'NUM:weight' (49): Weight.
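
The integer values map to these names through the dataset's `ClassLabel` features, so labels can be converted in either direction. A short illustrative sketch:

```
from datasets import load_dataset

trec = load_dataset("trec", split="train")
coarse = trec.features["coarse_label"]
fine = trec.features["fine_label"]

example = trec[0]
print(coarse.int2str(example["coarse_label"]))  # DESC
print(fine.int2str(example["fine_label"]))      # DESC:manner
print(fine.str2int("NUM:date"))                 # 39
```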


### Data Splits

| name |train|test|
|-------|----:|---:|
|default| 5452| 500|
| name | train | test |
|---------|------:|-----:|
| default | 5452 | 500 |
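
These counts can be verified programmatically (illustrative sketch):

```
from datasets import load_dataset

trec = load_dataset("trec")
print({split: ds.num_rows for split, ds in trec.items()})
# {'train': 5452, 'test': 500}
```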

## Dataset Creation

@@ -165,7 +243,6 @@ The data fields are the same among all splits.
year = "2001",
url = "https://www.aclweb.org/anthology/H01-1069",
}

```


2 changes: 1 addition & 1 deletion datasets/trec/dataset_infos.json
@@ -1 +1 @@
{"default": {"description": "The Text REtrieval Conference (TREC) Question Classification dataset contains 5500 labeled questions in training set and another 500 for test set. The dataset has 6 labels, 47 level-2 labels. Average length of each sentence is 10, vocabulary size of 8700.\n\nData are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and also 500 questions from TREC 10 which serves as the test set.\n", "citation": "@inproceedings{li-roth-2002-learning,\n title = \"Learning Question Classifiers\",\n author = \"Li, Xin and\n Roth, Dan\",\n booktitle = \"{COLING} 2002: The 19th International Conference on Computational Linguistics\",\n year = \"2002\",\n url = \"https://www.aclweb.org/anthology/C02-1150\",\n}\n@inproceedings{hovy-etal-2001-toward,\n title = \"Toward Semantics-Based Answer Pinpointing\",\n author = \"Hovy, Eduard and\n Gerber, Laurie and\n Hermjakob, Ulf and\n Lin, Chin-Yew and\n Ravichandran, Deepak\",\n booktitle = \"Proceedings of the First International Conference on Human Language Technology Research\",\n year = \"2001\",\n url = \"https://www.aclweb.org/anthology/H01-1069\",\n}\n", "homepage": "https://cogcomp.seas.upenn.edu/Data/QA/QC/", "license": "", "features": {"label-coarse": {"num_classes": 6, "names": ["DESC", "ENTY", "ABBR", "HUM", "NUM", "LOC"], "names_file": null, "id": null, "_type": "ClassLabel"}, "label-fine": {"num_classes": 47, "names": ["manner", "cremat", "animal", "exp", "ind", "gr", "title", "def", "date", "reason", "event", "state", "desc", "count", "other", "letter", "religion", "food", "country", "color", "termeq", "city", "body", "dismed", "mount", "money", "product", "period", "substance", "sport", "plant", "techmeth", "volsize", "instru", "abb", "speed", "word", "lang", "perc", "code", "dist", "temp", "symbol", "ord", "veh", "weight", "currency"], "names_file": null, "id": null, "_type": "ClassLabel"}, "text": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "builder_name": "trec", "config_name": "default", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 385090, "num_examples": 5452, "dataset_name": "trec"}, "test": {"name": "test", "num_bytes": 27983, "num_examples": 500, "dataset_name": "trec"}}, "download_checksums": {"https://cogcomp.seas.upenn.edu/Data/QA/QC/train_5500.label": {"num_bytes": 335858, "checksum": "9e4c8bdcaffb96ed61041bd64b564183d52793a8e91d84fc3a8646885f466ec3"}, "https://cogcomp.seas.upenn.edu/Data/QA/QC/TREC_10.label": {"num_bytes": 23354, "checksum": "033f22c028c2bbba9ca682f68ffe204dc1aa6e1cf35dd6207f2d4ca67f0d0e8e"}}, "download_size": 359212, "post_processing_size": null, "dataset_size": 413073, "size_in_bytes": 772285}}
{"default": {"description": "The Text REtrieval Conference (TREC) Question Classification dataset contains 5500 labeled questions in training set and another 500 for test set.\n\nThe dataset has 6 coarse class labels and 50 fine class labels. Average length of each sentence is 10, vocabulary size of 8700.\n\nData are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and also 500 questions from TREC 10 which serves as the test set. These questions were manually labeled.\n", "citation": "@inproceedings{li-roth-2002-learning,\n title = \"Learning Question Classifiers\",\n author = \"Li, Xin and\n Roth, Dan\",\n booktitle = \"{COLING} 2002: The 19th International Conference on Computational Linguistics\",\n year = \"2002\",\n url = \"https://www.aclweb.org/anthology/C02-1150\",\n}\n@inproceedings{hovy-etal-2001-toward,\n title = \"Toward Semantics-Based Answer Pinpointing\",\n author = \"Hovy, Eduard and\n Gerber, Laurie and\n Hermjakob, Ulf and\n Lin, Chin-Yew and\n Ravichandran, Deepak\",\n booktitle = \"Proceedings of the First International Conference on Human Language Technology Research\",\n year = \"2001\",\n url = \"https://www.aclweb.org/anthology/H01-1069\",\n}\n", "homepage": "https://cogcomp.seas.upenn.edu/Data/QA/QC/", "license": "", "features": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "coarse_label": {"num_classes": 6, "names": ["ABBR", "ENTY", "DESC", "HUM", "LOC", "NUM"], "id": null, "_type": "ClassLabel"}, "fine_label": {"num_classes": 50, "names": ["ABBR:abb", "ABBR:exp", "ENTY:animal", "ENTY:body", "ENTY:color", "ENTY:cremat", "ENTY:currency", "ENTY:dismed", "ENTY:event", "ENTY:food", "ENTY:instru", "ENTY:lang", "ENTY:letter", "ENTY:other", "ENTY:plant", "ENTY:product", "ENTY:religion", "ENTY:sport", "ENTY:substance", "ENTY:symbol", "ENTY:techmeth", "ENTY:termeq", "ENTY:veh", "ENTY:word", "DESC:def", "DESC:desc", "DESC:manner", "DESC:reason", "HUM:gr", "HUM:ind", "HUM:title", "HUM:desc", "LOC:city", "LOC:country", "LOC:mount", "LOC:other", "LOC:state", "NUM:code", "NUM:count", "NUM:date", "NUM:dist", "NUM:money", "NUM:ord", "NUM:other", "NUM:period", "NUM:perc", "NUM:speed", "NUM:temp", "NUM:volsize", "NUM:weight"], "id": null, "_type": "ClassLabel"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "trec", "config_name": "default", "version": {"version_str": "2.0.0", "description": "Fine label contains 50 classes instead of 47.", "major": 2, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 385090, "num_examples": 5452, "dataset_name": "trec"}, "test": {"name": "test", "num_bytes": 27983, "num_examples": 500, "dataset_name": "trec"}}, "download_checksums": {"https://cogcomp.seas.upenn.edu/Data/QA/QC/train_5500.label": {"num_bytes": 335858, "checksum": "9e4c8bdcaffb96ed61041bd64b564183d52793a8e91d84fc3a8646885f466ec3"}, "https://cogcomp.seas.upenn.edu/Data/QA/QC/TREC_10.label": {"num_bytes": 23354, "checksum": "033f22c028c2bbba9ca682f68ffe204dc1aa6e1cf35dd6207f2d4ca67f0d0e8e"}}, "download_size": 359212, "post_processing_size": null, "dataset_size": 413073, "size_in_bytes": 772285}}
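
The regenerated metadata above can be cross-checked against the features exposed at load time (a sketch, assuming the updated script):

```
from datasets import load_dataset

trec = load_dataset("trec", split="train")
print(trec.features["coarse_label"].num_classes)  # 6
print(trec.features["fine_label"].num_classes)    # 50
```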
151 changes: 72 additions & 79 deletions datasets/trec/trec.py
@@ -12,12 +12,22 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" The Text REtrieval Conference (TREC) Question Classification dataset."""
"""The Text REtrieval Conference (TREC) Question Classification dataset."""


import datasets


_DESCRIPTION = """\
The Text REtrieval Conference (TREC) Question Classification dataset contains 5500 labeled questions in training set and another 500 for test set.

The dataset has 6 coarse class labels and 50 fine class labels. Average length of each sentence is 10, vocabulary size of 8700.

Data are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and also 500 questions from TREC 10 which serves as the test set. These questions were manually labeled.
"""

_HOMEPAGE = "https://cogcomp.seas.upenn.edu/Data/QA/QC/"

_CITATION = """\
@inproceedings{li-roth-2002-learning,
title = "Learning Question Classifiers",
@@ -40,114 +50,98 @@
}
"""

_DESCRIPTION = """\
The Text REtrieval Conference (TREC) Question Classification dataset contains 5500 labeled questions in training set and another 500 for test set. The dataset has 6 labels, 47 level-2 labels. Average length of each sentence is 10, vocabulary size of 8700.

Data are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and also 500 questions from TREC 10 which serves as the test set.
"""

_URLs = {
"train": "https://cogcomp.seas.upenn.edu/Data/QA/QC/train_5500.label",
"test": "https://cogcomp.seas.upenn.edu/Data/QA/QC/TREC_10.label",
}

_COARSE_LABELS = ["DESC", "ENTY", "ABBR", "HUM", "NUM", "LOC"]
_COARSE_LABELS = ["ABBR", "ENTY", "DESC", "HUM", "LOC", "NUM"]

_FINE_LABELS = [
"manner",
"cremat",
"animal",
"exp",
"ind",
"gr",
"title",
"def",
"date",
"reason",
"event",
"state",
"desc",
"count",
"other",
"letter",
"religion",
"food",
"country",
"color",
"termeq",
"city",
"body",
"dismed",
"mount",
"money",
"product",
"period",
"substance",
"sport",
"plant",
"techmeth",
"volsize",
"instru",
"abb",
"speed",
"word",
"lang",
"perc",
"code",
"dist",
"temp",
"symbol",
"ord",
"veh",
"weight",
"currency",
"ABBR:abb",
"ABBR:exp",
"ENTY:animal",
"ENTY:body",
"ENTY:color",
"ENTY:cremat",
"ENTY:currency",
"ENTY:dismed",
"ENTY:event",
"ENTY:food",
"ENTY:instru",
"ENTY:lang",
"ENTY:letter",
"ENTY:other",
"ENTY:plant",
"ENTY:product",
"ENTY:religion",
"ENTY:sport",
"ENTY:substance",
"ENTY:symbol",
"ENTY:techmeth",
"ENTY:termeq",
"ENTY:veh",
"ENTY:word",
"DESC:def",
"DESC:desc",
"DESC:manner",
"DESC:reason",
"HUM:gr",
"HUM:ind",
"HUM:title",
"HUM:desc",
"LOC:city",
"LOC:country",
"LOC:mount",
"LOC:other",
"LOC:state",
"NUM:code",
"NUM:count",
"NUM:date",
"NUM:dist",
"NUM:money",
"NUM:ord",
"NUM:other",
"NUM:period",
"NUM:perc",
"NUM:speed",
"NUM:temp",
"NUM:volsize",
"NUM:weight",
]
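
# Illustrative sanity check (not part of the original script): every fine
# label is prefixed by one of the six coarse labels, and the list now
# covers all 50 classes.
assert len(_FINE_LABELS) == 50
assert {name.split(":")[0] for name in _FINE_LABELS} == set(_COARSE_LABELS)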


class Trec(datasets.GeneratorBasedBuilder):
"""TODO: Short description of my dataset."""
"""The Text REtrieval Conference (TREC) Question Classification dataset."""

VERSION = datasets.Version("1.1.0")
VERSION = datasets.Version("2.0.0", description="Fine label contains 50 classes instead of 47.")

def _info(self):
# TODO: Specifies the datasets.DatasetInfo object
return datasets.DatasetInfo(
# This is the description that will appear on the datasets page.
description=_DESCRIPTION,
# datasets.features.FeatureConnectors
features=datasets.Features(
{
"label-coarse": datasets.ClassLabel(names=_COARSE_LABELS),
"label-fine": datasets.ClassLabel(names=_FINE_LABELS),
"text": datasets.Value("string"),
"coarse_label": datasets.ClassLabel(names=_COARSE_LABELS),
"fine_label": datasets.ClassLabel(names=_FINE_LABELS),
}
),
# If there's a common (input, target) tuple from the features,
# specify them here. They'll be used if as_supervised=True in
# builder.as_dataset.
supervised_keys=None,
# Homepage of the dataset for documentation
homepage="https://cogcomp.seas.upenn.edu/Data/QA/QC/",
homepage=_HOMEPAGE,
citation=_CITATION,
)

def _split_generators(self, dl_manager):
"""Returns SplitGenerators."""
# TODO: Downloads the data and defines the splits
# dl_manager is a datasets.download.DownloadManager that can be used to
# download and extract URLs
dl_files = dl_manager.download_and_extract(_URLs)
dl_files = dl_manager.download(_URLs)
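# Note: the .label files are served as plain text rather than archives,
# so a bare download (without an extraction step) is sufficient here.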
return [
datasets.SplitGenerator(
name=datasets.Split.TRAIN,
# These kwargs will be passed to _generate_examples
gen_kwargs={
"filepath": dl_files["train"],
},
),
datasets.SplitGenerator(
name=datasets.Split.TEST,
# These kwargs will be passed to _generate_examples
gen_kwargs={
"filepath": dl_files["test"],
},
@@ -156,14 +150,13 @@ def _split_generators(self, dl_manager):

def _generate_examples(self, filepath):
"""Yields examples."""
# TODO: Yields (key, example) tuples from the dataset
with open(filepath, "rb") as f:
for id_, row in enumerate(f):
# One non-ASCII byte: sisterBADBYTEcity. We replace it with a space
label, _, text = row.replace(b"\xf0", b" ").strip().decode().partition(" ")
coarse_label, _, fine_label = label.partition(":")
fine_label, _, text = row.replace(b"\xf0", b" ").strip().decode().partition(" ")
coarse_label = fine_label.split(":")[0]
yield id_, {
"label-coarse": coarse_label,
"label-fine": fine_label,
"text": text,
"coarse_label": coarse_label,
"fine_label": fine_label,
}
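
For reference, a minimal standalone sketch of the parsing logic above, showing why the full `COARSE:fine` string is now kept as the fine label while the coarse label is derived from its prefix (assumed input format taken from the raw `.label` files):

```
# One raw line from the TREC label files: "<COARSE>:<fine> <question text>".
line = b"DESC:manner How did serfdom develop in and then leave Russia ?"

fine_label, _, text = line.replace(b"\xf0", b" ").strip().decode().partition(" ")
coarse_label = fine_label.split(":")[0]

print(fine_label)    # DESC:manner
print(coarse_label)  # DESC
print(text)          # How did serfdom develop in and then leave Russia ?
```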