[GEM Dataset] Added TurkCorpus, an evaluation dataset for sentence simplification. (#1732)

* Added TurkCorpus, an evaluation dataset for sentence simplification that focuses on lexical paraphrasing.
* Corrected the dataset name in the config file
* Rectified formatting issues in the dataset file
* Retrigger checks
* Added YAML tags, updated README with data instances and reduced size of dummy data
Showing 4 changed files with 282 additions and 0 deletions.
@@ -0,0 +1,163 @@
---
annotations_creators:
- machine-generated
language_creators:
- found
languages:
- en
licenses:
- gnu-gpl-v3.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- conditional-text-generation
task_ids:
- text-simplification
---

# Dataset Card for TURK

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** None
- **Repository:** [TURK](https://github.com/cocoxu/simplification)
- **Paper:** [Optimizing Statistical Machine Translation for Text Simplification](https://www.aclweb.org/anthology/Q16-1029/)
- **Leaderboard:** N/A
- **Point of Contact:** [Wei Xu](mailto:wei.xu@cc.gatech.edu)

### Dataset Summary

TURK is a multi-reference dataset for the evaluation of sentence simplification in English. The dataset consists of 2,359 sentences from the [Parallel Wikipedia Simplification (PWKP) corpus](https://www.aclweb.org/anthology/C10-1152/). Each sentence is associated with 8 crowdsourced simplifications that focus only on lexical paraphrasing (no sentence splitting or deletion).
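
A minimal loading sketch (assuming the Hugging Face `datasets` library is installed; the `turk` dataset name and `simplification` configuration follow the loading script added in this commit):

```
from datasets import load_dataset

# "turk" and "simplification" follow the builder and config names in this commit.
dataset = load_dataset("turk", "simplification")

# TURK ships only validation and test splits; there is no training set.
example = dataset["validation"][0]
print(example["original"])
print(example["simplifications"][0])
```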

### Supported Tasks and Leaderboards

The dataset can be used to evaluate `text-simplification` systems. There is no leaderboard for the task.

### Languages

TURK contains English text only (BCP-47: `en`).

## Dataset Structure

### Data Instances

An instance consists of an original sentence and 8 reference simplifications that focus on lexical paraphrasing.

```
{'original': 'one side of the armed conflicts is composed mainly of the sudanese military and the janjaweed , a sudanese militia group recruited mostly from the afro-arab abbala tribes of the northern rizeigat region in sudan .',
 'simplifications': [
   'one side of the armed conflicts is made of sudanese military and the janjaweed , a sudanese militia recruited from the afro-arab abbala tribes of the northern rizeigat region in sudan .',
   'one side of the armed conflicts consist of the sudanese military and the sudanese militia group janjaweed .',
   'one side of the armed conflicts is mainly sudanese military and the janjaweed , which recruited from the afro-arab abbala tribes .',
   'one side of the armed conflicts is composed mainly of the sudanese military and the janjaweed , a sudanese militia group recruited mostly from the afro-arab abbala tribes in sudan .',
   'one side of the armed conflicts is made up mostly of the sudanese military and the janjaweed , a sudanese militia group whose recruits mostly come from the afro-arab abbala tribes from the northern rizeigat region in sudan .',
   'the sudanese military and the janjaweed make up one of the armed conflicts , mostly from the afro-arab abbal tribes in sudan .',
   'one side of the armed conflicts is composed mainly of the sudanese military and the janjaweed , a sudanese militia group recruited mostly from the afro-arab abbala tribes of the northern rizeigat regime in sudan .',
   'one side of the armed conflicts is composed mainly of the sudanese military and the janjaweed , a sudanese militia group recruited mostly from the afro-arab abbala tribes of the northern rizeigat region in sudan .']}
```

### Data Fields

- `original`: an original sentence from the source dataset
- `simplifications`: a set of 8 reference simplifications produced by crowd workers

### Data Splits

TURK does not contain a training set; many models use [WikiLarge](https://github.com/XingxingZhang/dress) (Zhang and Lapata, 2017) or [Wiki-Auto](https://github.com/chaojiang06/wiki-auto) (Jiang et al., 2020) for training.

Each input sentence has 8 associated reference simplified sentences. The 2,359 input sentences are randomly split into 2,000 validation and 359 test sentences.

|                           | Dev    | Test  | Total  |
| ------------------------- | ------ | ----- | ------ |
| Input Sentences           | 2,000  | 359   | 2,359  |
| Reference Simplifications | 16,000 | 2,872 | 18,872 |
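
A quick sanity check of the split sizes above (a sketch, reusing the `dataset` object from the loading example earlier):

```
# Split sizes as reported in the table above.
assert dataset["validation"].num_rows == 2000
assert dataset["test"].num_rows == 359

# Every input sentence carries exactly 8 reference simplifications.
assert all(len(ex["simplifications"]) == 8 for ex in dataset["test"])
```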

## Dataset Creation

### Curation Rationale

The TURK dataset was constructed to evaluate the task of text simplification. It contains multiple human-written references that focus only on lexical simplification.

### Source Data

#### Initial Data Collection and Normalization

The input sentences in the dataset are extracted from the [Parallel Wikipedia Simplification (PWKP) corpus](https://www.aclweb.org/anthology/C10-1152/).

#### Who are the source language producers?

The references are crowdsourced from Amazon Mechanical Turk. The annotators were asked to provide simplifications without losing any information or splitting the input sentence. No other demographic or compensation information is provided in the paper.

### Annotations

#### Annotation process

The instructions given to the annotators are available in the paper.

#### Who are the annotators?

The annotators are Amazon Mechanical Turk workers.

### Personal and Sensitive Information

Since the dataset is created from English Wikipedia (August 22, 2009 version), all the information contained in the dataset is already in the public domain.

## Considerations for Using the Data

### Social Impact of Dataset

The dataset advances research on text simplification by providing higher-quality validation and test data. Progress in text simplification, in turn, has the potential to increase the accessibility of written documents to wider audiences.

### Discussion of Biases

The dataset may contain some social biases, as the input sentences are based on Wikipedia. Studies have shown that English Wikipedia contains both gender biases [(Schmahl et al., 2020)](https://research.tudelft.nl/en/publications/is-wikipedia-succeeding-in-reducing-gender-bias-assessing-changes) and racial biases [(Adams et al., 2019)](https://journals.sagepub.com/doi/pdf/10.1177/2378023118823946).

### Other Known Limitations

Since the dataset contains only 2,359 sentences, all derived from Wikipedia, it covers only a small subset of the topics present on Wikipedia.

## Additional Information

### Dataset Curators

TURK was developed by researchers at the University of Pennsylvania. The work was supported by the NSF under grant IIS-1430651 and the NSF GRFP under grant 1232825.

### Licensing Information

[GNU General Public License v3.0](https://github.com/cocoxu/simplification/blob/master/LICENSE)

### Citation Information

```
@article{Xu-EtAl:2016:TACL,
  author  = {Wei Xu and Courtney Napoles and Ellie Pavlick and Quanze Chen and Chris Callison-Burch},
  title   = {Optimizing Statistical Machine Translation for Text Simplification},
  journal = {Transactions of the Association for Computational Linguistics},
  volume  = {4},
  year    = {2016},
  url     = {https://cocoxu.github.io/publications/tacl2016-smt-simplification.pdf},
  pages   = {401--415}
}
```
@@ -0,0 +1 @@
{"simplification": {"description": "TURKCorpus is a dataset for evaluating sentence simplification systems that focus on lexical paraphrasing,\nas described in \"Optimizing Statistical Machine Translation for Text Simplification\". The corpus is composed of 2000 validation and 359 test original sentences that were each simplified 8 times by different annotators.\n", "citation": " @article{Xu-EtAl:2016:TACL,\n author = {Wei Xu and Courtney Napoles and Ellie Pavlick and Quanze Chen and Chris Callison-Burch},\n title = {Optimizing Statistical Machine Translation for Text Simplification},\n journal = {Transactions of the Association for Computational Linguistics},\n volume = {4},\n year = {2016},\n url = {https://cocoxu.github.io/publications/tacl2016-smt-simplification.pdf},\n pages = {401--415}\n }\n", "homepage": "https://github.com/cocoxu/simplification", "license": "GNU General Public License v3.0", "features": {"original": {"dtype": "string", "id": null, "_type": "Value"}, "simplifications": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "builder_name": "turk", "config_name": "simplification", "version": {"version_str": "1.0.0", "description": null, "major": 1, "minor": 0, "patch": 0}, "splits": {"validation": {"name": "validation", "num_bytes": 2120187, "num_examples": 2000, "dataset_name": "turk"}, "test": {"name": "test", "num_bytes": 396378, "num_examples": 359, "dataset_name": "turk"}}, "download_checksums": {"https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/test.8turkers.tok.norm": {"num_bytes": 45291, "checksum": "5a45e4deb23524dbd06fae0bbaf4a547df8c5d982bf4c9867c0f1462ed99ac46"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/tune.8turkers.tok.norm": {"num_bytes": 242697, "checksum": "1a0a0bf500bac72486eda8816e0a64347e79bd3652daddd1289fd4eec773df00"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/tune.8turkers.tok.turk.0": {"num_bytes": 227391, "checksum": "fb7c373e88dd188e234c688e6c7ed22012658e06c5c127d4be5f19f0e66a6542"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/tune.8turkers.tok.turk.1": {"num_bytes": 227362, "checksum": "308fab45b60d36bbd0ff651245cc0ceed82654658679c27ce575c4b487827394"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/tune.8turkers.tok.turk.2": {"num_bytes": 227046, "checksum": "f428363b156759352c4240a218f5485909961c84554fd20dbcf076a4518c1f13"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/tune.8turkers.tok.turk.3": {"num_bytes": 228063, "checksum": "22a430a69b348643e4e86e33724ef8a0dc690e948827af9667d21536f7f19981"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/tune.8turkers.tok.turk.4": {"num_bytes": 226410, "checksum": "a07211cb2a493f8a6c00f3f437c826eb10d01abb354f910d278d74752c306c24"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/tune.8turkers.tok.turk.5": {"num_bytes": 226117, "checksum": "951a03c67fd726a946a7d303af6edc64b4c3aa351721c7e921bd83c5f8a7e1c6"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/tune.8turkers.tok.turk.6": {"num_bytes": 226780, "checksum": "2983e016b4a7edff749106865251653d93def0c8f4f6f30ef6800b83cc3becbb"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/tune.8turkers.tok.turk.7": {"num_bytes": 226300, "checksum": "f427962c2fa8aee00911c74b3c2c093e5b50acc70928a619d3f3225ba29f38eb"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/test.8turkers.tok.turk.0": {"num_bytes": 37584, "checksum": "33399612ddb7ec4f0cd798508ea2928a3ab9b2ec3a9e524a4d5a0da44bf1425a"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/test.8turkers.tok.turk.1": {"num_bytes": 39995, "checksum": "6ea0d23083ce25c7cceb19f4e454ddde7d8b4010243d7af2ab0a96884587e79b"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/test.8turkers.tok.turk.2": {"num_bytes": 39854, "checksum": "abe871f586783f6e2273557fbc1ed203b06e5a5c2a52da260113c939ce1e79e3"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/test.8turkers.tok.turk.3": {"num_bytes": 42606, "checksum": "b4387233b14c123c7cef8d15c2ee7c68244fedb10e6e37008c0eed782b98897e"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/test.8turkers.tok.turk.4": {"num_bytes": 42005, "checksum": "1abf53f4dc075660322be772b40cdd26545902d5a7fa8746a460ea55301dd847"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/test.8turkers.tok.turk.5": {"num_bytes": 44149, "checksum": "3bbb08c71bbf692a2b7f2b6421a833397f96574fb9d7ff1dfd2c0f52ea0c52d6"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/test.8turkers.tok.turk.6": {"num_bytes": 45780, "checksum": "d100c0a63c9a01cde27694f18275e760d3f77bcd8b46ab9f6f832e8bc37c4857"}, "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/test.8turkers.tok.turk.7": {"num_bytes": 47964, "checksum": "e1956804ef69855a83a6c214acd07373533dad31615de0254ec60e3d0dbbedac"}}, "download_size": 2443394, "post_processing_size": null, "dataset_size": 2516565, "size_in_bytes": 4959959}}
Binary file not shown.
@@ -0,0 +1,118 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""TURKCorpus: a dataset for sentence simplification evaluation."""

from __future__ import absolute_import, division, print_function

import datasets


_CITATION = """\
@article{Xu-EtAl:2016:TACL,
  author = {Wei Xu and Courtney Napoles and Ellie Pavlick and Quanze Chen and Chris Callison-Burch},
  title = {Optimizing Statistical Machine Translation for Text Simplification},
  journal = {Transactions of the Association for Computational Linguistics},
  volume = {4},
  year = {2016},
  url = {https://cocoxu.github.io/publications/tacl2016-smt-simplification.pdf},
  pages = {401--415}
}
"""

_DESCRIPTION = """\
TURKCorpus is a dataset for evaluating sentence simplification systems that focus on lexical paraphrasing,
as described in "Optimizing Statistical Machine Translation for Text Simplification". The corpus is composed
of 2000 validation and 359 test original sentences that were each simplified 8 times by different annotators.
"""

_HOMEPAGE = "https://github.com/cocoxu/simplification"

_LICENSE = "GNU General Public License v3.0"

# The corpus ships as plain-text files: one file of original (normalized, tokenized)
# sentences per split, plus 8 files of crowdsourced references per split, named
# {split}.8turkers.tok.turk.{0..7}. Upstream calls the validation split "tune".
_URL_LIST = [
    (
        "test.8turkers.tok.norm",
        "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/test.8turkers.tok.norm",
    ),
    (
        "tune.8turkers.tok.norm",
        "https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/tune.8turkers.tok.norm",
    ),
]
_URL_LIST += [
    (
        f"{spl}.8turkers.tok.turk.{i}",
        f"https://raw.githubusercontent.com/cocoxu/simplification/master/data/turkcorpus/{spl}.8turkers.tok.turk.{i}",
    )
    for spl in ["tune", "test"]
    for i in range(8)
]

_URLs = dict(_URL_LIST)


class Turk(datasets.GeneratorBasedBuilder):
    """TURKCorpus: multi-reference evaluation data for sentence simplification."""

    VERSION = datasets.Version("1.0.0")

    BUILDER_CONFIGS = [
        datasets.BuilderConfig(
            name="simplification",
            version=VERSION,
            description="A set of original sentences aligned with 8 possible simplifications for each.",
        )
    ]

    def _info(self):
        # Each example pairs one original sentence with a sequence of 8 references.
        features = datasets.Features(
            {
                "original": datasets.Value("string"),
                "simplifications": datasets.Sequence(datasets.Value("string")),
            }
        )
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            supervised_keys=None,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        # Download all 18 files at once; data_dir maps each key in _URLs to a local path.
        data_dir = dl_manager.download_and_extract(_URLs)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                gen_kwargs={
                    "filepaths": data_dir,
                    "split": "valid",
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={"filepaths": data_dir, "split": "test"},
            ),
        ]

    def _generate_examples(self, filepaths, split):
        """Yields examples."""
        # The upstream files name the validation split "tune".
        if split == "valid":
            split = "tune"
        # Open the original-sentence file followed by the 8 reference files;
        # line i across all 9 files describes the same example.
        files = [open(filepaths[f"{split}.8turkers.tok.norm"], encoding="utf-8")] + [
            open(filepaths[f"{split}.8turkers.tok.turk.{i}"], encoding="utf-8") for i in range(8)
        ]
        for id_, lines in enumerate(zip(*files)):
            yield id_, {"original": lines[0].strip(), "simplifications": [line.strip() for line in lines[1:]]}
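
To exercise the script locally, `load_dataset` can be pointed at the file itself (a sketch; the relative path `./turk.py` is an assumption about where the script is saved):

```
from datasets import load_dataset

# Load via the local loading script rather than the hub name.
dataset = load_dataset("./turk.py", "simplification")
print(dataset)  # expected: validation with 2000 rows, test with 359 rows
```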