Fix fine classes in trec dataset #4801

Merged · 5 commits · Aug 22, 2022

111 changes: 94 additions & 17 deletions datasets/trec/README.md
@@ -1,8 +1,24 @@
---
annotations_creators:
- expert-generated
language:
- en
paperswithcode_id: trecqa
language_creators:
- expert-generated
license:
- unknown
multilinguality:
- monolingual
pretty_name: Text Retrieval Conference Question Answering
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
paperswithcode_id: trecqa
---

# Dataset Card for "trec"
@@ -43,51 +59,113 @@ pretty_name: Text Retrieval Conference Question Answering

### Dataset Summary

The Text REtrieval Conference (TREC) Question Classification dataset contains 5500 labeled questions in training set and another 500 for test set. The dataset has 6 labels, 47 level-2 labels. Average length of each sentence is 10, vocabulary size of 8700.
The Text REtrieval Conference (TREC) Question Classification dataset contains 5500 labeled questions in training set and another 500 for test set.

Data are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and also 500 questions from TREC 10 which serves as the test set.
The dataset has 6 coarse class labels and 50 fine class labels. Average length of each sentence is 10, vocabulary size of 8700.

Data are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and also 500 questions from TREC 10 which serves as the test set. These questions were manually labeled.
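
As a quick check that the updated script loads end to end, the dataset can be pulled with `load_dataset` (a minimal sketch against version 2.0.0 of this script; the printed example matches the instance shown below):

```
from datasets import load_dataset

trec = load_dataset("trec")
print(trec["train"][0])
# {'text': 'How did serfdom develop in and then leave Russia ?',
#  'coarse_label': 2, 'fine_label': 26}
```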

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
The language in this dataset is English (`en`).

## Dataset Structure

### Data Instances

#### default

- **Size of downloaded dataset files:** 0.34 MB
- **Size of the generated dataset:** 0.39 MB
- **Total amount of disk used:** 0.74 MB

An example of 'train' looks as follows.
```
{
"label-coarse": 1,
"label-fine": 2,
"text": "What fowl grabs the spotlight after the Chinese Year of the Monkey ?"
'text': 'How did serfdom develop in and then leave Russia ?',
'coarse_label': 2,
'fine_label': 26
}
```

### Data Fields

The data fields are the same among all splits.

#### default
- `label-coarse`: a classification label, with possible values including `DESC` (0), `ENTY` (1), `ABBR` (2), `HUM` (3), `NUM` (4).
- `label-fine`: a classification label, with possible values including `manner` (0), `cremat` (1), `animal` (2), `exp` (3), `ind` (4).
- `text`: a `string` feature.
- `text` (`str`): Text of the question.
- `coarse_label` (`ClassLabel`): Coarse class label. Possible values are:
- 'ABBR' (0): Abbreviation.
- 'ENTY' (1): Entity.
- 'DESC' (2): Description and abstract concept.
- 'HUM' (3): Human being.
- 'LOC' (4): Location.
- 'NUM' (5): Numeric value.
- `fine_label` (`ClassLabel`): Fine class label. Possible values are:
- ABBREVIATION:
- 'ABBR:abb' (0): Abbreviation.
- 'ABBR:exp' (1): Expression abbreviated.
- ENTITY:
- 'ENTY:animal' (2): Animal.
- 'ENTY:body' (3): Organ of body.
- 'ENTY:color' (4): Color.
- 'ENTY:cremat' (5): Invention, book and other creative piece.
- 'ENTY:currency' (6): Currency name.
- 'ENTY:dismed' (7): Disease and medicine.
- 'ENTY:event' (8): Event.
- 'ENTY:food' (9): Food.
- 'ENTY:instru' (10): Musical instrument.
- 'ENTY:lang' (11): Language.
- 'ENTY:letter' (12): Letter like a-z.
- 'ENTY:other' (13): Other entity.
- 'ENTY:plant' (14): Plant.
- 'ENTY:product' (15): Product.
- 'ENTY:religion' (16): Religion.
- 'ENTY:sport' (17): Sport.
- 'ENTY:substance' (18): Element and substance.
- 'ENTY:symbol' (19): Symbols and sign.
- 'ENTY:techmeth' (20): Techniques and method.
- 'ENTY:termeq' (21): Equivalent term.
- 'ENTY:veh' (22): Vehicle.
- 'ENTY:word' (23): Word with a special property.
- DESCRIPTION:
- 'DESC:def' (24): Definition of something.
- 'DESC:desc' (25): Description of something.
- 'DESC:manner' (26): Manner of an action.
- 'DESC:reason' (27): Reason.
- HUMAN:
  - 'HUM:gr' (28): Group or organization of persons.
- 'HUM:ind' (29): Individual.
- 'HUM:title' (30): Title of a person.
- 'HUM:desc' (31): Description of a person.
- LOCATION:
- 'LOC:city' (32): City.
- 'LOC:country' (33): Country.
- 'LOC:mount' (34): Mountain.
- 'LOC:other' (35): Other location.
- 'LOC:state' (36): State.
- NUMERIC:
- 'NUM:code' (37): Postcode or other code.
- 'NUM:count' (38): Number of something.
- 'NUM:date' (39): Date.
- 'NUM:dist' (40): Distance, linear measure.
- 'NUM:money' (41): Price.
- 'NUM:ord' (42): Order, rank.
- 'NUM:other' (43): Other number.
  - 'NUM:period' (44): Lasting time of something.
- 'NUM:perc' (45): Percent, fraction.
- 'NUM:speed' (46): Speed.
- 'NUM:temp' (47): Temperature.
- 'NUM:volsize' (48): Size, area and volume.
- 'NUM:weight' (49): Weight.
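
The integer values map to these names through the dataset's `ClassLabel` features, so labels can be converted in either direction. A short illustrative sketch:

```
from datasets import load_dataset

trec = load_dataset("trec", split="train")
coarse = trec.features["coarse_label"]
fine = trec.features["fine_label"]

example = trec[0]
print(coarse.int2str(example["coarse_label"]))  # DESC
print(fine.int2str(example["fine_label"]))      # DESC:manner
print(fine.str2int("NUM:date"))                 # 39
```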


### Data Splits

| name |train|test|
|-------|----:|---:|
|default| 5452| 500|
| name | train | test |
|---------|------:|-----:|
| default | 5452 | 500 |
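
These counts can be verified programmatically (illustrative sketch):

```
from datasets import load_dataset

trec = load_dataset("trec")
print({split: ds.num_rows for split, ds in trec.items()})
# {'train': 5452, 'test': 500}
```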

## Dataset Creation

@@ -165,7 +243,6 @@ The data fields are the same among all splits.
year = "2001",
url = "https://www.aclweb.org/anthology/H01-1069",
}

```


2 changes: 1 addition & 1 deletion datasets/trec/dataset_infos.json
@@ -1 +1 @@
{"default": {"description": "The Text REtrieval Conference (TREC) Question Classification dataset contains 5500 labeled questions in training set and another 500 for test set. The dataset has 6 labels, 47 level-2 labels. Average length of each sentence is 10, vocabulary size of 8700.\n\nData are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and also 500 questions from TREC 10 which serves as the test set.\n", "citation": "@inproceedings{li-roth-2002-learning,\n title = \"Learning Question Classifiers\",\n author = \"Li, Xin and\n Roth, Dan\",\n booktitle = \"{COLING} 2002: The 19th International Conference on Computational Linguistics\",\n year = \"2002\",\n url = \"https://www.aclweb.org/anthology/C02-1150\",\n}\n@inproceedings{hovy-etal-2001-toward,\n title = \"Toward Semantics-Based Answer Pinpointing\",\n author = \"Hovy, Eduard and\n Gerber, Laurie and\n Hermjakob, Ulf and\n Lin, Chin-Yew and\n Ravichandran, Deepak\",\n booktitle = \"Proceedings of the First International Conference on Human Language Technology Research\",\n year = \"2001\",\n url = \"https://www.aclweb.org/anthology/H01-1069\",\n}\n", "homepage": "https://cogcomp.seas.upenn.edu/Data/QA/QC/", "license": "", "features": {"label-coarse": {"num_classes": 6, "names": ["DESC", "ENTY", "ABBR", "HUM", "NUM", "LOC"], "names_file": null, "id": null, "_type": "ClassLabel"}, "label-fine": {"num_classes": 47, "names": ["manner", "cremat", "animal", "exp", "ind", "gr", "title", "def", "date", "reason", "event", "state", "desc", "count", "other", "letter", "religion", "food", "country", "color", "termeq", "city", "body", "dismed", "mount", "money", "product", "period", "substance", "sport", "plant", "techmeth", "volsize", "instru", "abb", "speed", "word", "lang", "perc", "code", "dist", "temp", "symbol", "ord", "veh", "weight", "currency"], "names_file": null, "id": null, "_type": "ClassLabel"}, "text": {"dtype": "string", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "builder_name": "trec", "config_name": "default", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 385090, "num_examples": 5452, "dataset_name": "trec"}, "test": {"name": "test", "num_bytes": 27983, "num_examples": 500, "dataset_name": "trec"}}, "download_checksums": {"https://cogcomp.seas.upenn.edu/Data/QA/QC/train_5500.label": {"num_bytes": 335858, "checksum": "9e4c8bdcaffb96ed61041bd64b564183d52793a8e91d84fc3a8646885f466ec3"}, "https://cogcomp.seas.upenn.edu/Data/QA/QC/TREC_10.label": {"num_bytes": 23354, "checksum": "033f22c028c2bbba9ca682f68ffe204dc1aa6e1cf35dd6207f2d4ca67f0d0e8e"}}, "download_size": 359212, "post_processing_size": null, "dataset_size": 413073, "size_in_bytes": 772285}}
{"default": {"description": "The Text REtrieval Conference (TREC) Question Classification dataset contains 5500 labeled questions in training set and another 500 for test set.\n\nThe dataset has 6 coarse class labels and 50 fine class labels. Average length of each sentence is 10, vocabulary size of 8700.\n\nData are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and also 500 questions from TREC 10 which serves as the test set. These questions were manually labeled.\n", "citation": "@inproceedings{li-roth-2002-learning,\n title = \"Learning Question Classifiers\",\n author = \"Li, Xin and\n Roth, Dan\",\n booktitle = \"{COLING} 2002: The 19th International Conference on Computational Linguistics\",\n year = \"2002\",\n url = \"https://www.aclweb.org/anthology/C02-1150\",\n}\n@inproceedings{hovy-etal-2001-toward,\n title = \"Toward Semantics-Based Answer Pinpointing\",\n author = \"Hovy, Eduard and\n Gerber, Laurie and\n Hermjakob, Ulf and\n Lin, Chin-Yew and\n Ravichandran, Deepak\",\n booktitle = \"Proceedings of the First International Conference on Human Language Technology Research\",\n year = \"2001\",\n url = \"https://www.aclweb.org/anthology/H01-1069\",\n}\n", "homepage": "https://cogcomp.seas.upenn.edu/Data/QA/QC/", "license": "", "features": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "coarse_label": {"num_classes": 6, "names": ["ABBR", "ENTY", "DESC", "HUM", "LOC", "NUM"], "id": null, "_type": "ClassLabel"}, "fine_label": {"num_classes": 50, "names": ["ABBR:abb", "ABBR:exp", "ENTY:animal", "ENTY:body", "ENTY:color", "ENTY:cremat", "ENTY:currency", "ENTY:dismed", "ENTY:event", "ENTY:food", "ENTY:instru", "ENTY:lang", "ENTY:letter", "ENTY:other", "ENTY:plant", "ENTY:product", "ENTY:religion", "ENTY:sport", "ENTY:substance", "ENTY:symbol", "ENTY:techmeth", "ENTY:termeq", "ENTY:veh", "ENTY:word", "DESC:def", "DESC:desc", "DESC:manner", "DESC:reason", "HUM:gr", "HUM:ind", "HUM:title", "HUM:desc", "LOC:city", "LOC:country", "LOC:mount", "LOC:other", "LOC:state", "NUM:code", "NUM:count", "NUM:date", "NUM:dist", "NUM:money", "NUM:ord", "NUM:other", "NUM:period", "NUM:perc", "NUM:speed", "NUM:temp", "NUM:volsize", "NUM:weight"], "id": null, "_type": "ClassLabel"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "trec", "config_name": "default", "version": {"version_str": "2.0.0", "description": "Fine label contains 50 classes instead of 47.", "major": 2, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 385090, "num_examples": 5452, "dataset_name": "trec"}, "test": {"name": "test", "num_bytes": 27983, "num_examples": 500, "dataset_name": "trec"}}, "download_checksums": {"https://cogcomp.seas.upenn.edu/Data/QA/QC/train_5500.label": {"num_bytes": 335858, "checksum": "9e4c8bdcaffb96ed61041bd64b564183d52793a8e91d84fc3a8646885f466ec3"}, "https://cogcomp.seas.upenn.edu/Data/QA/QC/TREC_10.label": {"num_bytes": 23354, "checksum": "033f22c028c2bbba9ca682f68ffe204dc1aa6e1cf35dd6207f2d4ca67f0d0e8e"}}, "download_size": 359212, "post_processing_size": null, "dataset_size": 413073, "size_in_bytes": 772285}}
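
The regenerated metadata above can be cross-checked against the features exposed at load time (a sketch, assuming the updated script):

```
from datasets import load_dataset

trec = load_dataset("trec", split="train")
print(trec.features["coarse_label"].num_classes)  # 6
print(trec.features["fine_label"].num_classes)    # 50
```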
151 changes: 72 additions & 79 deletions datasets/trec/trec.py
@@ -12,12 +12,22 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" The Text REtrieval Conference (TREC) Question Classification dataset."""
"""The Text REtrieval Conference (TREC) Question Classification dataset."""


import datasets


_DESCRIPTION = """\
The Text REtrieval Conference (TREC) Question Classification dataset contains 5500 labeled questions in training set and another 500 for test set.

The dataset has 6 coarse class labels and 50 fine class labels. Average length of each sentence is 10, vocabulary size of 8700.

Data are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and also 500 questions from TREC 10 which serves as the test set. These questions were manually labeled.
"""

_HOMEPAGE = "https://cogcomp.seas.upenn.edu/Data/QA/QC/"

_CITATION = """\
@inproceedings{li-roth-2002-learning,
title = "Learning Question Classifiers",
@@ -40,114 +50,98 @@
}
"""

_DESCRIPTION = """\
The Text REtrieval Conference (TREC) Question Classification dataset contains 5500 labeled questions in training set and another 500 for test set. The dataset has 6 labels, 47 level-2 labels. Average length of each sentence is 10, vocabulary size of 8700.

Data are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and also 500 questions from TREC 10 which serves as the test set.
"""

_URLs = {
"train": "https://cogcomp.seas.upenn.edu/Data/QA/QC/train_5500.label",
"test": "https://cogcomp.seas.upenn.edu/Data/QA/QC/TREC_10.label",
}

_COARSE_LABELS = ["DESC", "ENTY", "ABBR", "HUM", "NUM", "LOC"]
_COARSE_LABELS = ["ABBR", "ENTY", "DESC", "HUM", "LOC", "NUM"]

_FINE_LABELS = [
"manner",
"cremat",
"animal",
"exp",
"ind",
"gr",
"title",
"def",
"date",
"reason",
"event",
"state",
"desc",
"count",
"other",
"letter",
"religion",
"food",
"country",
"color",
"termeq",
"city",
"body",
"dismed",
"mount",
"money",
"product",
"period",
"substance",
"sport",
"plant",
"techmeth",
"volsize",
"instru",
"abb",
"speed",
"word",
"lang",
"perc",
"code",
"dist",
"temp",
"symbol",
"ord",
"veh",
"weight",
"currency",
"ABBR:abb",
"ABBR:exp",
"ENTY:animal",
"ENTY:body",
"ENTY:color",
"ENTY:cremat",
"ENTY:currency",
"ENTY:dismed",
"ENTY:event",
"ENTY:food",
"ENTY:instru",
"ENTY:lang",
"ENTY:letter",
"ENTY:other",
"ENTY:plant",
"ENTY:product",
"ENTY:religion",
"ENTY:sport",
"ENTY:substance",
"ENTY:symbol",
"ENTY:techmeth",
"ENTY:termeq",
"ENTY:veh",
"ENTY:word",
"DESC:def",
"DESC:desc",
"DESC:manner",
"DESC:reason",
"HUM:gr",
"HUM:ind",
"HUM:title",
"HUM:desc",
"LOC:city",
"LOC:country",
"LOC:mount",
"LOC:other",
"LOC:state",
"NUM:code",
"NUM:count",
"NUM:date",
"NUM:dist",
"NUM:money",
"NUM:ord",
"NUM:other",
"NUM:period",
"NUM:perc",
"NUM:speed",
"NUM:temp",
"NUM:volsize",
"NUM:weight",
]
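
# Illustrative sanity check (not part of the original script): every fine
# label is prefixed by one of the six coarse labels, and the list now
# covers all 50 classes.
assert len(_FINE_LABELS) == 50
assert {name.split(":")[0] for name in _FINE_LABELS} == set(_COARSE_LABELS)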


class Trec(datasets.GeneratorBasedBuilder):
"""TODO: Short description of my dataset."""
"""The Text REtrieval Conference (TREC) Question Classification dataset."""

VERSION = datasets.Version("1.1.0")
VERSION = datasets.Version("2.0.0", description="Fine label contains 50 classes instead of 47.")

def _info(self):
# TODO: Specifies the datasets.DatasetInfo object
return datasets.DatasetInfo(
# This is the description that will appear on the datasets page.
description=_DESCRIPTION,
# datasets.features.FeatureConnectors
features=datasets.Features(
{
"label-coarse": datasets.ClassLabel(names=_COARSE_LABELS),
"label-fine": datasets.ClassLabel(names=_FINE_LABELS),
"text": datasets.Value("string"),
"coarse_label": datasets.ClassLabel(names=_COARSE_LABELS),
"fine_label": datasets.ClassLabel(names=_FINE_LABELS),
}
),
# If there's a common (input, target) tuple from the features,
# specify them here. They'll be used if as_supervised=True in
# builder.as_dataset.
supervised_keys=None,
# Homepage of the dataset for documentation
homepage="https://cogcomp.seas.upenn.edu/Data/QA/QC/",
homepage=_HOMEPAGE,
citation=_CITATION,
)

def _split_generators(self, dl_manager):
"""Returns SplitGenerators."""
# TODO: Downloads the data and defines the splits
# dl_manager is a datasets.download.DownloadManager that can be used to
# download and extract URLs
dl_files = dl_manager.download_and_extract(_URLs)
dl_files = dl_manager.download(_URLs)
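# Note: the .label files are served as plain text rather than archives,
# so a bare download (without an extraction step) is sufficient here.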
return [
datasets.SplitGenerator(
name=datasets.Split.TRAIN,
# These kwargs will be passed to _generate_examples
gen_kwargs={
"filepath": dl_files["train"],
},
),
datasets.SplitGenerator(
name=datasets.Split.TEST,
# These kwargs will be passed to _generate_examples
gen_kwargs={
"filepath": dl_files["test"],
},
@@ -156,14 +150,13 @@ def _split_generators(self, dl_manager):

def _generate_examples(self, filepath):
"""Yields examples."""
# TODO: Yields (key, example) tuples from the dataset
with open(filepath, "rb") as f:
for id_, row in enumerate(f):
# One non-ASCII byte: sisterBADBYTEcity. We replace it with a space
label, _, text = row.replace(b"\xf0", b" ").strip().decode().partition(" ")
coarse_label, _, fine_label = label.partition(":")
fine_label, _, text = row.replace(b"\xf0", b" ").strip().decode().partition(" ")
coarse_label = fine_label.split(":")[0]
yield id_, {
"label-coarse": coarse_label,
"label-fine": fine_label,
"text": text,
"coarse_label": coarse_label,
"fine_label": fine_label,
}
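
For reference, a minimal standalone sketch of the parsing logic above, showing why the full `COARSE:fine` string is now kept as the fine label while the coarse label is derived from its prefix (assumed input format taken from the raw `.label` files):

```
# One raw line from the TREC label files: "<COARSE>:<fine> <question text>".
line = b"DESC:manner How did serfdom develop in and then leave Russia ?"

fine_label, _, text = line.replace(b"\xf0", b" ").strip().decode().partition(" ")
coarse_label = fine_label.split(":")[0]

print(fine_label)    # DESC:manner
print(coarse_label)  # DESC
print(text)          # How did serfdom develop in and then leave Russia ?
```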