* add covost2
* update script
* add dummy data
* add covost2
* update script
* add dummy data
* update script, metadata
* add dataset card
* fix formating
* slim dummy data
* Apply suggestions from code review

  Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
* adress review comments

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
1 parent 7072e1b, commit 96578ad
Showing 39 changed files with 421 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,225 @@ | ||
--- | ||
annotations_creators: | ||
- expert-generated | ||
language_creators: | ||
- crowdsourced | ||
- expert-generated | ||
languages: | ||
- "fr" | ||
- "de" | ||
- "es" | ||
- "ca" | ||
- "it" | ||
- "ru" | ||
- "zh-CN" | ||
- "pt" | ||
- "fa" | ||
- "et" | ||
- "mn" | ||
- "nl" | ||
- "tr" | ||
- "ar" | ||
- "sv-SE" | ||
- "lv" | ||
- "sl" | ||
- "ta" | ||
- "ja" | ||
- "id" | ||
- "cy" | ||
licenses: | ||
- cc-by-nc-4.0 | ||
multilinguality: | ||
- multilingual | ||
size_categories: | ||
- 100K<n<1M | ||
source_datasets: | ||
- extended|other-common-voice | ||
task_categories: | ||
- other | ||
task_ids: | ||
- other-other-speech-translation | ||
--- | ||
|
||
# Dataset Card for covost2 | ||
|
||
## Table of Contents | ||
- [Dataset Description](#dataset-description) | ||
- [Dataset Summary](#dataset-summary) | ||
- [Supported Tasks](#supported-tasks-and-leaderboards) | ||
- [Languages](#languages) | ||
- [Dataset Structure](#dataset-structure) | ||
- [Data Instances](#data-instances) | ||
- [Data Fields](#data-fields) | ||
- [Data Splits](#data-splits) | ||
- [Dataset Creation](#dataset-creation) | ||
- [Curation Rationale](#curation-rationale) | ||
- [Source Data](#source-data) | ||
- [Annotations](#annotations) | ||
- [Personal and Sensitive Information](#personal-and-sensitive-information) | ||
- [Considerations for Using the Data](#considerations-for-using-the-data) | ||
- [Social Impact of Dataset](#social-impact-of-dataset) | ||
- [Discussion of Biases](#discussion-of-biases) | ||
- [Other Known Limitations](#other-known-limitations) | ||
- [Additional Information](#additional-information) | ||
- [Dataset Curators](#dataset-curators) | ||
- [Licensing Information](#licensing-information) | ||
- [Citation Information](#citation-information) | ||
- [Contributions](#contributions) | ||
|
||
## Dataset Description | ||
|
||
- **Homepage:** https://github.com/facebookresearch/covost | ||
- **Repository:** https://github.com/facebookresearch/covost | ||
- **Paper:** https://arxiv.org/abs/2007.10310 | ||
- **Leaderboard:** [Needs More Information] | ||
- **Point of Contact:** Changhan Wang (changhan@fb.com), Juan Miguel Pino (juancarabina@fb.com), Jiatao Gu (jgu@fb.com) | ||
|
||
### Dataset Summary | ||
|
||
CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English \ | ||
and from English into 15 languages. The dataset is created using Mozilla's open-source Common Voice database of \ | ||
crowdsourced voice recordings. There are 2,900 hours of speech represented in the corpus. | ||
|
||
### Supported Tasks and Leaderboards | ||
|
||
`speech-translation`: The dataset can be used for speech-to-text translation (ST). The model is presented with an audio file in one language and asked to translate the speech into written text in another language. The most common evaluation metric is the BLEU score. Examples can be found at https://github.com/pytorch/fairseq/blob/master/examples/speech_to_text/docs/covost_example.md . | ||
|
||
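To make the BLEU metric mentioned above concrete, here is a minimal sketch of a simplified corpus-level BLEU score in plain Python. It assumes one reference per hypothesis, whitespace tokenization, uniform weights over 1- to 4-grams, and no smoothing; the function names `ngrams` and `corpus_bleu` are illustrative and are not part of this dataset or of fairseq. Published results are normally computed with an established tool such as sacreBLEU, so scores from this sketch are not comparable to them.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    Assumptions: one reference per hypothesis, whitespace tokens,
    uniform n-gram weights, no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection clips each count by the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # a zero precision makes the geometric mean zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


reference = ["Wenn Wasser knapp ist, verschwenden Sie es nicht."]
print(corpus_bleu(reference, reference))
```

The score is computed over the whole corpus rather than averaged per sentence, which is why the match and total counts are accumulated before the precisions are taken.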
### Languages | ||
|
||
The dataset contains audio, transcriptions, and translations in the following languages: French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian, Chinese, Welsh, Catalan, Slovenian, Estonian, Indonesian, Arabic, Tamil, Portuguese, Latvian, and Japanese. | ||
|
||
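For quick lookups, the language codes in the metadata header can be paired with the language names listed above. The pairing below is a sketch based on the usual meaning of these codes; the name `LANGUAGE_NAMES` is illustrative and does not appear in the dataset script.

```python
# Language codes from the card's metadata header, paired with the
# language names listed in this section (pairing is an assumption
# based on the standard meaning of each code).
LANGUAGE_NAMES = {
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "ca": "Catalan",
    "it": "Italian",
    "ru": "Russian",
    "zh-CN": "Chinese",
    "pt": "Portuguese",
    "fa": "Persian",
    "et": "Estonian",
    "mn": "Mongolian",
    "nl": "Dutch",
    "tr": "Turkish",
    "ar": "Arabic",
    "sv-SE": "Swedish",
    "lv": "Latvian",
    "sl": "Slovenian",
    "ta": "Tamil",
    "ja": "Japanese",
    "id": "Indonesian",
    "cy": "Welsh",
}

# The summary mentions 21 languages translated into English.
print(len(LANGUAGE_NAMES), "languages, e.g. cy =", LANGUAGE_NAMES["cy"])
```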
## Dataset Structure | ||
|
||
### Data Instances | ||
|
||
A typical data point comprises the path to the audio file, called `file`; its transcription, called `sentence`; and the translation into the target language, called `translation`. | ||
|
||
``` | ||
{'client_id': 'd277a1f3904ae00b09b73122b87674e7c2c78e08120721f37b5577013ead08d1ea0c053ca5b5c2fb948df2c81f27179aef2c741057a17249205d251a8fe0e658', | ||
'file': '/home/suraj/projects/fairseq_s2t/covst/dataset/en/clips/common_voice_en_18540003.mp3', | ||
'id': 'common_voice_en_18540003', | ||
'sentence': 'When water is scarce, avoid wasting it.', | ||
'translation': 'Wenn Wasser knapp ist, verschwenden Sie es nicht.'} | ||
``` | ||
|
||
### Data Fields | ||
|
||
- file: A path to the downloaded audio file in .mp3 format. | ||
|
||
- sentence: The transcription of the audio file in the source language. | ||
|
||
- translation: The translation of the transcription into the target language. | ||
|
||
- id: A unique ID for the data sample. | ||
|
||
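As an illustration of the record layout, the sketch below types one example and checks that a record carries every key shown in the example instance. `CovostExample` and `missing_fields` are hypothetical helper names, and the `client_id` and `file` values are shortened placeholders rather than real data.

```python
from typing import TypedDict


class CovostExample(TypedDict):
    """Hypothetical type for one record; keys follow the example instance."""
    client_id: str
    file: str
    id: str
    sentence: str
    translation: str


def missing_fields(record: dict) -> list:
    """Return the expected keys that are absent from a record, sorted."""
    return sorted(set(CovostExample.__annotations__) - set(record))


example: CovostExample = {
    "client_id": "placeholder-client-id",             # shortened placeholder
    "file": "clips/common_voice_en_18540003.mp3",     # shortened placeholder path
    "id": "common_voice_en_18540003",
    "sentence": "When water is scarce, avoid wasting it.",
    "translation": "Wenn Wasser knapp ist, verschwenden Sie es nicht.",
}

print(missing_fields(example))
```

A check like this is mainly useful as a guard before training code assumes a field exists.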
### Data Splits | ||
|
||
| config | train | validation | test | | ||
|----------|--------|------------|-------| | ||
| en_de | 289430 | 15531 | 15531 | | ||
| en_tr | 289430 | 15531 | 15531 | | ||
| en_fa | 289430 | 15531 | 15531 | | ||
| en_sv-SE | 289430 | 15531 | 15531 | | ||
| en_mn | 289430 | 15531 | 15531 | | ||
| en_zh-CN | 289430 | 15531 | 15531 | | ||
| en_cy | 289430 | 15531 | 15531 | | ||
| en_ca | 289430 | 15531 | 15531 | | ||
| en_sl | 289430 | 15531 | 15531 | | ||
| en_et | 289430 | 15531 | 15531 | | ||
| en_id | 289430 | 15531 | 15531 | | ||
| en_ar | 289430 | 15531 | 15531 | | ||
| en_ta | 289430 | 15531 | 15531 | | ||
| en_lv | 289430 | 15531 | 15531 | | ||
| en_ja | 289430 | 15531 | 15531 | | ||
| fr_en | 207374 | 14760 | 14760 | | ||
| de_en | 127834 | 13511 | 13511 | | ||
| es_en | 79015 | 13221 | 13221 | | ||
| ca_en | 95854 | 12730 | 12730 | | ||
| it_en | 31698 | 8940 | 8951 | | ||
| ru_en | 12112 | 6110 | 6300 | | ||
| zh-CN_en | 7085 | 4843 | 4898 | | ||
| pt_en | 9158 | 3318 | 4023 | | ||
| fa_en | 53949 | 3445 | 3445 | | ||
| et_en | 1782 | 1576 | 1571 | | ||
| mn_en | 2067 | 1761 | 1759 | | ||
| nl_en | 7108 | 1699 | 1699 | | ||
| tr_en | 3966 | 1624 | 1629 | | ||
| ar_en | 2283 | 1758 | 1695 | | ||
| sv-SE_en | 2160 | 1349 | 1595 | | ||
| lv_en | 2337 | 1125 | 1629 | | ||
| sl_en | 1843 | 509 | 360 | | ||
| ta_en | 1358 | 384 | 786 | | ||
| ja_en | 1119 | 635 | 684 | | ||
| id_en | 1243 | 792 | 844 | | ||
| cy_en | 1241 | 690 | 690 | | ||
|
||
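The config names in the table follow a `<source>_<target>` pattern, so the translation direction can be recovered by splitting on the underscore. The sketch below copies a few rows from the table and shows the split together with a per-config total; `SPLIT_SIZES`, `split_config`, and `total_examples` are illustrative names, not part of the dataset script.

```python
# A few rows copied from the table above: config -> (train, validation, test).
SPLIT_SIZES = {
    "en_de": (289430, 15531, 15531),
    "fr_en": (207374, 14760, 14760),
    "zh-CN_en": (7085, 4843, 4898),
    "cy_en": (1241, 690, 690),
}


def split_config(name):
    """Split a config name such as 'zh-CN_en' into (source, target) codes.

    Region suffixes use a hyphen ('zh-CN', 'sv-SE'), so the single
    underscore always separates source from target.
    """
    source, target = name.split("_")
    return source, target


def total_examples(name):
    """Sum the train, validation, and test sizes for one config."""
    return sum(SPLIT_SIZES[name])


for config in SPLIT_SIZES:
    source, target = split_config(config)
    print(f"{source} -> {target}: {total_examples(config)} examples")
```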
|
||
## Dataset Creation | ||
|
||
### Curation Rationale | ||
|
||
[Needs More Information] | ||
|
||
### Source Data | ||
|
||
#### Initial Data Collection and Normalization | ||
|
||
[Needs More Information] | ||
|
||
#### Who are the source language producers? | ||
|
||
[Needs More Information] | ||
|
||
### Annotations | ||
|
||
#### Annotation process | ||
|
||
[Needs More Information] | ||
|
||
#### Who are the annotators? | ||
|
||
[Needs More Information] | ||
|
||
### Personal and Sensitive Information | ||
|
||
[Needs More Information] | ||
|
||
## Considerations for Using the Data | ||
|
||
### Social Impact of Dataset | ||
|
||
[Needs More Information] | ||
|
||
### Discussion of Biases | ||
|
||
[Needs More Information] | ||
|
||
### Other Known Limitations | ||
|
||
[Needs More Information] | ||
|
||
## Additional Information | ||
|
||
### Dataset Curators | ||
|
||
[Needs More Information] | ||
|
||
### Licensing Information | ||
|
||
cc-by-nc-4.0 | ||
|
||
### Citation Information | ||
|
||
``` | ||
@misc{wang2020covost, | ||
title={CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus}, | ||
author={Changhan Wang and Anne Wu and Juan Pino}, | ||
year={2020}, | ||
eprint={2007.10310}, | ||
archivePrefix={arXiv}, | ||
primaryClass={cs.CL} | ||
} | ||
``` | ||
|
||
### Contributions | ||
|
||
Thanks to [@patil-suraj](https://github.com/patil-suraj) for adding this dataset. |