add CoVoST2 (#1935)
* add covost2

* update script

* add dummy data

* add covost2

* update script

* add dummy data

* update script, metadata

* add dataset card

* fix formating

* slim dummy data

* Apply suggestions from code review

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* adress review comments

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
patil-suraj and lhoestq authored Feb 24, 2021
1 parent 7072e1b commit 96578ad
Showing 39 changed files with 421 additions and 0 deletions.
225 changes: 225 additions & 0 deletions datasets/covost2/README.md
@@ -0,0 +1,225 @@
---
annotations_creators:
- expert-generated
language_creators:
- crowdsourced
- expert-generated
languages:
- "fr"
- "de"
- "es"
- "ca"
- "it"
- "ru"
- "zh-CN"
- "pt"
- "fa"
- "et"
- "mn"
- "nl"
- "tr"
- "ar"
- "sv-SE"
- "lv"
- "sl"
- "ta"
- "ja"
- "id"
- "cy"
licenses:
- cc-by-nc-4.0
multilinguality:
- multilingual
size_categories:
- 100K<n<1M
source_datasets:
- extended|other-common-voice
task_categories:
- other
task_ids:
- other-other-speech-translation
---

# Dataset Card for covost2

## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/facebookresearch/covost
- **Repository:** https://github.com/facebookresearch/covost
- **Paper:** https://arxiv.org/abs/2007.10310
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** Changhan Wang (changhan@fb.com), Juan Miguel Pino (juancarabina@fb.com), Jiatao Gu (jgu@fb.com)

### Dataset Summary

CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. The dataset was created using Mozilla's open-source Common Voice database of crowdsourced voice recordings. There are 2,900 hours of speech represented in the corpus.

### Supported Tasks and Leaderboards

`speech-translation`: The dataset can be used for speech-to-text translation (ST). The model is presented with an audio file in one language and asked to produce a written translation of it in another language. The most common evaluation metric is the BLEU score. Example training and evaluation recipes can be found at https://github.com/pytorch/fairseq/blob/master/examples/speech_to_text/docs/covost_example.md.
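
As a minimal illustration of the evaluation step, the sketch below scores a list of model outputs against reference translations with sacreBLEU. The `predictions` list is a hypothetical stand-in for the output of a speech translation model; in practice the references would come from the `translation` field of the test split.

```python
# Minimal BLEU-scoring sketch with sacreBLEU (pip install sacrebleu).
# `predictions` is a placeholder for model outputs; `references` holds one
# reference stream with one reference translation per prediction.
import sacrebleu

predictions = [
    "Wenn Wasser knapp ist, verschwenden Sie es nicht.",
    "Das ist ein Beispielsatz.",
]
references = [[
    "Wenn Wasser knapp ist, verschwenden Sie es nicht.",
    "Dies ist ein Beispielsatz.",
]]

bleu = sacrebleu.corpus_bleu(predictions, references)
print(f"BLEU: {bleu.score:.2f}")
```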

### Languages

The dataset contains audio, transcriptions, and translations in the following languages: French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian, Chinese, Welsh, Catalan, Slovenian, Estonian, Indonesian, Arabic, Tamil, Portuguese, Latvian, and Japanese.

## Dataset Structure

### Data Instances

A typical data point comprises the path to the audio file, called `file`, its transcription in the source language, called `sentence`, and its translation into the target language, called `translation`, along with a speaker identifier (`client_id`) and a unique sample `id`.

```
{'client_id': 'd277a1f3904ae00b09b73122b87674e7c2c78e08120721f37b5577013ead08d1ea0c053ca5b5c2fb948df2c81f27179aef2c741057a17249205d251a8fe0e658',
'file': '/home/suraj/projects/fairseq_s2t/covst/dataset/en/clips/common_voice_en_18540003.mp3',
'id': 'common_voice_en_18540003',
'sentence': 'When water is scarce, avoid wasting it.',
'translation': 'Wenn Wasser knapp ist, verschwenden Sie es nicht.'}
```
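
A hedged loading sketch using the `datasets` library: CoVoST 2 builds on Common Voice audio, which has to be downloaded and extracted separately, so the `data_dir` value below is a placeholder for the extracted source-language Common Voice folder (the exact arguments may differ across library versions; check the dataset script if loading fails).

```python
# Sketch: loading the English->German config and inspecting one example.
# "/path/to/common_voice/en" is a placeholder for the manually downloaded
# and extracted Common Voice English release.
from datasets import load_dataset

covost = load_dataset("covost2", "en_de", data_dir="/path/to/common_voice/en")
print(covost["train"][0])  # dict with client_id, file, id, sentence, translation
```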

### Data Fields

- file: A path to the downloaded audio file in .mp3 format.

- sentence: The transcription of the audio file in the source language.

- translation: The translation of the sentence into the target language.

- client_id: An ID identifying the speaker, inherited from Common Voice.

- id: A unique ID for the data sample.
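
Since `file` only stores a path to an .mp3 clip, the waveform has to be decoded separately. A small sketch, assuming torchaudio with an mp3-capable backend is installed and `covost` was loaded as in the example above (any other audio library would work just as well):

```python
# Sketch: decoding the clip referenced by the `file` field with torchaudio.
import torchaudio

sample = covost["train"][0]
waveform, sample_rate = torchaudio.load(sample["file"])  # (channels, frames), Hz
print(waveform.shape, sample_rate)
print(sample["sentence"], "->", sample["translation"])
```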

### Data Splits

| config | train | validation | test |
|----------|--------|------------|-------|
| en_de | 289430 | 15531 | 15531 |
| en_tr | 289430 | 15531 | 15531 |
| en_fa | 289430 | 15531 | 15531 |
| en_sv-SE | 289430 | 15531 | 15531 |
| en_mn | 289430 | 15531 | 15531 |
| en_zh-CN | 289430 | 15531 | 15531 |
| en_cy | 289430 | 15531 | 15531 |
| en_ca | 289430 | 15531 | 15531 |
| en_sl | 289430 | 15531 | 15531 |
| en_et | 289430 | 15531 | 15531 |
| en_id | 289430 | 15531 | 15531 |
| en_ar | 289430 | 15531 | 15531 |
| en_ta | 289430 | 15531 | 15531 |
| en_lv | 289430 | 15531 | 15531 |
| en_ja | 289430 | 15531 | 15531 |
| fr_en | 207374 | 14760 | 14760 |
| de_en | 127834 | 13511 | 13511 |
| es_en | 79015 | 13221 | 13221 |
| ca_en | 95854 | 12730 | 12730 |
| it_en | 31698 | 8940 | 8951 |
| ru_en | 12112 | 6110 | 6300 |
| zh-CN_en | 7085 | 4843 | 4898 |
| pt_en | 9158 | 3318 | 4023 |
| fa_en | 53949 | 3445 | 3445 |
| et_en | 1782 | 1576 | 1571 |
| mn_en | 2067 | 1761 | 1759 |
| nl_en | 7108 | 1699 | 1699 |
| tr_en | 3966 | 1624 | 1629 |
| ar_en | 2283 | 1758 | 1695 |
| sv-SE_en | 2160 | 1349 | 1595 |
| lv_en | 2337 | 1125 | 1629 |
| sl_en | 1843 | 509 | 360 |
| ta_en | 1358 | 384 | 786 |
| ja_en | 1119 | 635 | 684 |
| id_en | 1243 | 792 | 844 |
| cy_en | 1241 | 690 | 690 |
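
The per-config sizes above can be checked programmatically; a minimal sketch, again assuming the matching Common Voice source audio (here French) has been downloaded to a placeholder `data_dir`:

```python
# Sketch: printing the number of examples in each split of the fr_en config.
from datasets import load_dataset

covost_fr_en = load_dataset("covost2", "fr_en", data_dir="/path/to/common_voice/fr")
for split_name, split in covost_fr_en.items():
    print(split_name, split.num_rows)
# Expected, per the table above: train 207374, validation 14760, test 14760
```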


## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

The dataset is licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).

### Citation Information

```
@misc{wang2020covost,
    title={CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus},
    author={Changhan Wang and Anne Wu and Juan Pino},
    year={2020},
    eprint={2007.10310},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

### Contributions

Thanks to [@patil-suraj](https://github.com/patil-suraj) for adding this dataset.

1 comment on commit 96578ad

@github-actions



PyArrow==0.17.1


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|--------|------------------|
| read_batch_formatted_as_numpy after write_array2d | 0.019542 / 0.011353 (0.008189) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.016835 / 0.011008 (0.005826) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.045729 / 0.038508 (0.007221) |
| read_batch_unformated after write_array2d | 0.034968 / 0.023109 (0.011859) |
| read_batch_unformated after write_flattened_sequence | 0.210031 / 0.275898 (-0.065867) |
| read_batch_unformated after write_nested_sequence | 0.232751 / 0.323480 (-0.090729) |
| read_col_formatted_as_numpy after write_array2d | 0.006688 / 0.007986 (-0.001298) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005153 / 0.004328 (0.000825) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.006553 / 0.004250 (0.002303) |
| read_col_unformated after write_array2d | 0.046752 / 0.037052 (0.009699) |
| read_col_unformated after write_flattened_sequence | 0.225854 / 0.258489 (-0.032635) |
| read_col_unformated after write_nested_sequence | 0.249770 / 0.293841 (-0.044071) |
| read_formatted_as_numpy after write_array2d | 0.173345 / 0.128546 (0.044799) |
| read_formatted_as_numpy after write_flattened_sequence | 0.140294 / 0.075646 (0.064648) |
| read_formatted_as_numpy after write_nested_sequence | 0.429129 / 0.419271 (0.009858) |
| read_unformated after write_array2d | 0.459158 / 0.043533 (0.415625) |
| read_unformated after write_flattened_sequence | 0.211435 / 0.255139 (-0.043704) |
| read_unformated after write_nested_sequence | 0.221809 / 0.283200 (-0.061390) |
| write_array2d | 1.939282 / 0.141683 (1.797599) |
| write_flattened_sequence | 1.862768 / 1.452155 (0.410613) |
| write_nested_sequence | 1.848622 / 1.492716 (0.355905) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|--------|------------------|
| select | 0.039679 / 0.037411 (0.002267) |
| shard | 0.023142 / 0.014526 (0.008616) |
| shuffle | 0.028880 / 0.176557 (-0.147676) |
| sort | 0.050026 / 0.737135 (-0.687109) |
| train_test_split | 0.041167 / 0.296338 (-0.255171) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|--------|------------------|
| read 5000 | 0.250461 / 0.215209 (0.035252) |
| read 50000 | 2.592360 / 2.077655 (0.514705) |
| read_batch 50000 10 | 1.308604 / 1.504120 (-0.195516) |
| read_batch 50000 100 | 1.157495 / 1.541195 (-0.383700) |
| read_batch 50000 1000 | 1.208361 / 1.468490 (-0.260129) |
| read_formatted numpy 5000 | 7.965218 / 4.584777 (3.380441) |
| read_formatted pandas 5000 | 6.530112 / 3.745712 (2.784400) |
| read_formatted tensorflow 5000 | 9.029087 / 5.269862 (3.759225) |
| read_formatted torch 5000 | 8.021539 / 4.565676 (3.455862) |
| read_formatted_batch numpy 5000 10 | 0.724492 / 0.424275 (0.300217) |
| read_formatted_batch numpy 5000 1000 | 0.011310 / 0.007607 (0.003703) |
| shuffled read 5000 | 0.302318 / 0.226044 (0.076274) |
| shuffled read 50000 | 3.189957 / 2.268929 (0.921028) |
| shuffled read_batch 50000 10 | 1.861057 / 55.444624 (-53.583568) |
| shuffled read_batch 50000 100 | 1.545052 / 6.876477 (-5.331425) |
| shuffled read_batch 50000 1000 | 1.554133 / 2.142072 (-0.587940) |
| shuffled read_formatted numpy 5000 | 7.582056 / 4.805227 (2.776829) |
| shuffled read_formatted_batch numpy 5000 10 | 7.509266 / 6.500664 (1.008601) |
| shuffled read_formatted_batch numpy 5000 1000 | 8.548873 / 0.075469 (8.473404) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|--------|------------------|
| filter | 11.865749 / 1.841788 (10.023961) |
| map fast-tokenizer batched | 14.934417 / 8.074308 (6.860109) |
| map identity | 21.316937 / 10.191392 (11.125545) |
| map identity batched | 0.534995 / 0.680424 (-0.145429) |
| map no-op batched | 0.315621 / 0.534201 (-0.218580) |
| map no-op batched numpy | 0.837962 / 0.579283 (0.258679) |
| map no-op batched pandas | 0.681658 / 0.434364 (0.247294) |
| map no-op batched pytorch | 0.756965 / 0.540337 (0.216627) |
| map no-op batched tensorflow | 1.677558 / 1.386936 (0.290622) |

PyArrow==1.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|--------|------------------|
| read_batch_formatted_as_numpy after write_array2d | 0.018643 / 0.011353 (0.007290) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.017352 / 0.011008 (0.006344) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.048527 / 0.038508 (0.010019) |
| read_batch_unformated after write_array2d | 0.033688 / 0.023109 (0.010579) |
| read_batch_unformated after write_flattened_sequence | 0.338174 / 0.275898 (0.062276) |
| read_batch_unformated after write_nested_sequence | 0.357501 / 0.323480 (0.034022) |
| read_col_formatted_as_numpy after write_array2d | 0.006068 / 0.007986 (-0.001917) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004936 / 0.004328 (0.000607) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007471 / 0.004250 (0.003221) |
| read_col_unformated after write_array2d | 0.054376 / 0.037052 (0.017324) |
| read_col_unformated after write_flattened_sequence | 0.321549 / 0.258489 (0.063060) |
| read_col_unformated after write_nested_sequence | 0.385638 / 0.293841 (0.091797) |
| read_formatted_as_numpy after write_array2d | 0.161405 / 0.128546 (0.032859) |
| read_formatted_as_numpy after write_flattened_sequence | 0.129707 / 0.075646 (0.054061) |
| read_formatted_as_numpy after write_nested_sequence | 0.429353 / 0.419271 (0.010082) |
| read_unformated after write_array2d | 0.440163 / 0.043533 (0.396630) |
| read_unformated after write_flattened_sequence | 0.382170 / 0.255139 (0.127031) |
| read_unformated after write_nested_sequence | 0.358569 / 0.283200 (0.075369) |
| write_array2d | 1.735512 / 0.141683 (1.593829) |
| write_flattened_sequence | 1.785187 / 1.452155 (0.333033) |
| write_nested_sequence | 1.843999 / 1.492716 (0.351283) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|--------|------------------|
| select | 0.043064 / 0.037411 (0.005652) |
| shard | 0.021709 / 0.014526 (0.007183) |
| shuffle | 0.032843 / 0.176557 (-0.143714) |
| sort | 0.047661 / 0.737135 (-0.689474) |
| train_test_split | 0.050876 / 0.296338 (-0.245462) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|--------|------------------|
| read 5000 | 0.367350 / 0.215209 (0.152141) |
| read 50000 | 3.452680 / 2.077655 (1.375025) |
| read_batch 50000 10 | 2.091445 / 1.504120 (0.587325) |
| read_batch 50000 100 | 1.922422 / 1.541195 (0.381228) |
| read_batch 50000 1000 | 1.985011 / 1.468490 (0.516521) |
| read_formatted numpy 5000 | 7.059528 / 4.584777 (2.474751) |
| read_formatted pandas 5000 | 6.292839 / 3.745712 (2.547127) |
| read_formatted tensorflow 5000 | 8.744271 / 5.269862 (3.474410) |
| read_formatted torch 5000 | 7.752327 / 4.565676 (3.186651) |
| read_formatted_batch numpy 5000 10 | 0.712702 / 0.424275 (0.288427) |
| read_formatted_batch numpy 5000 1000 | 0.010042 / 0.007607 (0.002435) |
| shuffled read 5000 | 0.380682 / 0.226044 (0.154638) |
| shuffled read 50000 | 3.924842 / 2.268929 (1.655914) |
| shuffled read_batch 50000 10 | 2.580454 / 55.444624 (-52.864170) |
| shuffled read_batch 50000 100 | 2.311259 / 6.876477 (-4.565218) |
| shuffled read_batch 50000 1000 | 2.378753 / 2.142072 (0.236681) |
| shuffled read_formatted numpy 5000 | 7.251728 / 4.805227 (2.446500) |
| shuffled read_formatted_batch numpy 5000 10 | 6.435026 / 6.500664 (-0.065638) |
| shuffled read_formatted_batch numpy 5000 1000 | 7.136216 / 0.075469 (7.060747) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|--------|------------------|
| filter | 11.985410 / 1.841788 (10.143622) |
| map fast-tokenizer batched | 15.420552 / 8.074308 (7.346244) |
| map identity | 22.382930 / 10.191392 (12.191538) |
| map identity batched | 0.793597 / 0.680424 (0.113173) |
| map no-op batched | 0.579390 / 0.534201 (0.045189) |
| map no-op batched numpy | 0.793402 / 0.579283 (0.214119) |
| map no-op batched pandas | 0.652445 / 0.434364 (0.218081) |
| map no-op batched pytorch | 0.747935 / 0.540337 (0.207598) |
| map no-op batched tensorflow | 1.584207 / 1.386936 (0.197271) |

