add CoVoST2 (#1935)
* add covost2

* update script

* add dummy data

* add covost2

* update script

* add dummy data

* update script, metadata

* add dataset card

* fix formating

* slim dummy data

* Apply suggestions from code review

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* adress review comments

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
patil-suraj and lhoestq authored Feb 24, 2021
1 parent 7072e1b commit 96578ad
Showing 39 changed files with 421 additions and 0 deletions.
225 changes: 225 additions & 0 deletions datasets/covost2/README.md
@@ -0,0 +1,225 @@
---
annotations_creators:
- expert-generated
language_creators:
- crowdsourced
- expert-generated
languages:
- "fr"
- "de"
- "es"
- "ca"
- "it"
- "ru"
- "zh-CN"
- "pt"
- "fa"
- "et"
- "mn"
- "nl"
- "tr"
- "ar"
- "sv-SE"
- "lv"
- "sl"
- "ta"
- "ja"
- "id"
- "cy"
licenses:
- cc-by-nc-4.0
multilinguality:
- multilingual
size_categories:
- 100K<n<1M
source_datasets:
- extended|other-common-voice
task_categories:
- other
task_ids:
- other-other-speech-translation
---

# Dataset Card for covost2

## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/facebookresearch/covost
- **Repository:** https://github.com/facebookresearch/covost
- **Paper:** https://arxiv.org/abs/2007.10310
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** Changhan Wang (changhan@fb.com), Juan Miguel Pino (juancarabina@fb.com), Jiatao Gu (jgu@fb.com)

### Dataset Summary

CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. The dataset was created using Mozilla's open-source Common Voice database of crowdsourced voice recordings. There are 2,900 hours of speech represented in the corpus.

### Supported Tasks and Leaderboards

`speech-translation`: The dataset can be used for speech-to-text translation (ST). The model is presented with an audio file in one language and asked to produce a written translation of it in another language. The most common evaluation metric is the BLEU score. Example training and evaluation recipes can be found at https://github.com/pytorch/fairseq/blob/master/examples/speech_to_text/docs/covost_example.md.
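
As a minimal illustration of the evaluation step, the sketch below scores a list of model outputs against reference translations with sacreBLEU. The `predictions` list is a hypothetical stand-in for the output of a speech translation model; in practice the references would come from the `translation` field of the test split.

```python
# Minimal BLEU-scoring sketch with sacreBLEU (pip install sacrebleu).
# `predictions` is a placeholder for model outputs; `references` holds one
# reference stream with one reference translation per prediction.
import sacrebleu

predictions = [
    "Wenn Wasser knapp ist, verschwenden Sie es nicht.",
    "Das ist ein Beispielsatz.",
]
references = [[
    "Wenn Wasser knapp ist, verschwenden Sie es nicht.",
    "Dies ist ein Beispielsatz.",
]]

bleu = sacrebleu.corpus_bleu(predictions, references)
print(f"BLEU: {bleu.score:.2f}")
```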

### Languages

The dataset contains audio, transcriptions, and translations in the following languages: French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian, Chinese, Welsh, Catalan, Slovenian, Estonian, Indonesian, Arabic, Tamil, Portuguese, Latvian, and Japanese.

## Dataset Structure

### Data Instances

A typical data point comprises the path to the audio file, called `file`, its transcription in the source language, called `sentence`, and its translation into the target language, called `translation`, along with a speaker identifier (`client_id`) and a unique sample `id`.

```
{'client_id': 'd277a1f3904ae00b09b73122b87674e7c2c78e08120721f37b5577013ead08d1ea0c053ca5b5c2fb948df2c81f27179aef2c741057a17249205d251a8fe0e658',
'file': '/home/suraj/projects/fairseq_s2t/covst/dataset/en/clips/common_voice_en_18540003.mp3',
'id': 'common_voice_en_18540003',
'sentence': 'When water is scarce, avoid wasting it.',
'translation': 'Wenn Wasser knapp ist, verschwenden Sie es nicht.'}
```
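
A hedged loading sketch using the `datasets` library: CoVoST 2 builds on Common Voice audio, which has to be downloaded and extracted separately, so the `data_dir` value below is a placeholder for the extracted source-language Common Voice folder (the exact arguments may differ across library versions; check the dataset script if loading fails).

```python
# Sketch: loading the English->German config and inspecting one example.
# "/path/to/common_voice/en" is a placeholder for the manually downloaded
# and extracted Common Voice English release.
from datasets import load_dataset

covost = load_dataset("covost2", "en_de", data_dir="/path/to/common_voice/en")
print(covost["train"][0])  # dict with client_id, file, id, sentence, translation
```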

### Data Fields

- file: A path to the downloaded audio file in .mp3 format.

- sentence: The transcription of the audio file in the source language.

- translation: The translation of the sentence into the target language.

- client_id: An ID identifying the speaker, inherited from Common Voice.

- id: A unique ID for the data sample.
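
Since `file` only stores a path to an .mp3 clip, the waveform has to be decoded separately. A small sketch, assuming torchaudio with an mp3-capable backend is installed and `covost` was loaded as in the example above (any other audio library would work just as well):

```python
# Sketch: decoding the clip referenced by the `file` field with torchaudio.
import torchaudio

sample = covost["train"][0]
waveform, sample_rate = torchaudio.load(sample["file"])  # (channels, frames), Hz
print(waveform.shape, sample_rate)
print(sample["sentence"], "->", sample["translation"])
```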

### Data Splits

| config | train | validation | test |
|----------|--------|------------|-------|
| en_de | 289430 | 15531 | 15531 |
| en_tr | 289430 | 15531 | 15531 |
| en_fa | 289430 | 15531 | 15531 |
| en_sv-SE | 289430 | 15531 | 15531 |
| en_mn | 289430 | 15531 | 15531 |
| en_zh-CN | 289430 | 15531 | 15531 |
| en_cy | 289430 | 15531 | 15531 |
| en_ca | 289430 | 15531 | 15531 |
| en_sl | 289430 | 15531 | 15531 |
| en_et | 289430 | 15531 | 15531 |
| en_id | 289430 | 15531 | 15531 |
| en_ar | 289430 | 15531 | 15531 |
| en_ta | 289430 | 15531 | 15531 |
| en_lv | 289430 | 15531 | 15531 |
| en_ja | 289430 | 15531 | 15531 |
| fr_en | 207374 | 14760 | 14760 |
| de_en | 127834 | 13511 | 13511 |
| es_en | 79015 | 13221 | 13221 |
| ca_en | 95854 | 12730 | 12730 |
| it_en | 31698 | 8940 | 8951 |
| ru_en | 12112 | 6110 | 6300 |
| zh-CN_en | 7085 | 4843 | 4898 |
| pt_en | 9158 | 3318 | 4023 |
| fa_en | 53949 | 3445 | 3445 |
| et_en | 1782 | 1576 | 1571 |
| mn_en | 2067 | 1761 | 1759 |
| nl_en | 7108 | 1699 | 1699 |
| tr_en | 3966 | 1624 | 1629 |
| ar_en | 2283 | 1758 | 1695 |
| sv-SE_en | 2160 | 1349 | 1595 |
| lv_en | 2337 | 1125 | 1629 |
| sl_en | 1843 | 509 | 360 |
| ta_en | 1358 | 384 | 786 |
| ja_en | 1119 | 635 | 684 |
| id_en | 1243 | 792 | 844 |
| cy_en | 1241 | 690 | 690 |
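
The per-config sizes above can be checked programmatically; a minimal sketch, again assuming the matching Common Voice source audio (here French) has been downloaded to a placeholder `data_dir`:

```python
# Sketch: printing the number of examples in each split of the fr_en config.
from datasets import load_dataset

covost_fr_en = load_dataset("covost2", "fr_en", data_dir="/path/to/common_voice/fr")
for split_name, split in covost_fr_en.items():
    print(split_name, split.num_rows)
# Expected, per the table above: train 207374, validation 14760, test 14760
```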


## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

The dataset is licensed under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).

### Citation Information

```
@misc{wang2020covost,
    title={CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus},
    author={Changhan Wang and Anne Wu and Juan Pino},
    year={2020},
    eprint={2007.10310},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

### Contributions

Thanks to [@patil-suraj](https://github.com/patil-suraj) for adding this dataset.

1 comment on commit 96578ad

@github-actions



PyArrow==0.17.1


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|--------|------------------|
| read_batch_formatted_as_numpy after write_array2d | 0.019542 / 0.011353 (0.008189) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.016835 / 0.011008 (0.005826) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.045729 / 0.038508 (0.007221) |
| read_batch_unformated after write_array2d | 0.034968 / 0.023109 (0.011859) |
| read_batch_unformated after write_flattened_sequence | 0.210031 / 0.275898 (-0.065867) |
| read_batch_unformated after write_nested_sequence | 0.232751 / 0.323480 (-0.090729) |
| read_col_formatted_as_numpy after write_array2d | 0.006688 / 0.007986 (-0.001298) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005153 / 0.004328 (0.000825) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.006553 / 0.004250 (0.002303) |
| read_col_unformated after write_array2d | 0.046752 / 0.037052 (0.009699) |
| read_col_unformated after write_flattened_sequence | 0.225854 / 0.258489 (-0.032635) |
| read_col_unformated after write_nested_sequence | 0.249770 / 0.293841 (-0.044071) |
| read_formatted_as_numpy after write_array2d | 0.173345 / 0.128546 (0.044799) |
| read_formatted_as_numpy after write_flattened_sequence | 0.140294 / 0.075646 (0.064648) |
| read_formatted_as_numpy after write_nested_sequence | 0.429129 / 0.419271 (0.009858) |
| read_unformated after write_array2d | 0.459158 / 0.043533 (0.415625) |
| read_unformated after write_flattened_sequence | 0.211435 / 0.255139 (-0.043704) |
| read_unformated after write_nested_sequence | 0.221809 / 0.283200 (-0.061390) |
| write_array2d | 1.939282 / 0.141683 (1.797599) |
| write_flattened_sequence | 1.862768 / 1.452155 (0.410613) |
| write_nested_sequence | 1.848622 / 1.492716 (0.355905) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|--------|------------------|
| select | 0.039679 / 0.037411 (0.002267) |
| shard | 0.023142 / 0.014526 (0.008616) |
| shuffle | 0.028880 / 0.176557 (-0.147676) |
| sort | 0.050026 / 0.737135 (-0.687109) |
| train_test_split | 0.041167 / 0.296338 (-0.255171) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|--------|------------------|
| read 5000 | 0.250461 / 0.215209 (0.035252) |
| read 50000 | 2.592360 / 2.077655 (0.514705) |
| read_batch 50000 10 | 1.308604 / 1.504120 (-0.195516) |
| read_batch 50000 100 | 1.157495 / 1.541195 (-0.383700) |
| read_batch 50000 1000 | 1.208361 / 1.468490 (-0.260129) |
| read_formatted numpy 5000 | 7.965218 / 4.584777 (3.380441) |
| read_formatted pandas 5000 | 6.530112 / 3.745712 (2.784400) |
| read_formatted tensorflow 5000 | 9.029087 / 5.269862 (3.759225) |
| read_formatted torch 5000 | 8.021539 / 4.565676 (3.455862) |
| read_formatted_batch numpy 5000 10 | 0.724492 / 0.424275 (0.300217) |
| read_formatted_batch numpy 5000 1000 | 0.011310 / 0.007607 (0.003703) |
| shuffled read 5000 | 0.302318 / 0.226044 (0.076274) |
| shuffled read 50000 | 3.189957 / 2.268929 (0.921028) |
| shuffled read_batch 50000 10 | 1.861057 / 55.444624 (-53.583568) |
| shuffled read_batch 50000 100 | 1.545052 / 6.876477 (-5.331425) |
| shuffled read_batch 50000 1000 | 1.554133 / 2.142072 (-0.587940) |
| shuffled read_formatted numpy 5000 | 7.582056 / 4.805227 (2.776829) |
| shuffled read_formatted_batch numpy 5000 10 | 7.509266 / 6.500664 (1.008601) |
| shuffled read_formatted_batch numpy 5000 1000 | 8.548873 / 0.075469 (8.473404) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|--------|------------------|
| filter | 11.865749 / 1.841788 (10.023961) |
| map fast-tokenizer batched | 14.934417 / 8.074308 (6.860109) |
| map identity | 21.316937 / 10.191392 (11.125545) |
| map identity batched | 0.534995 / 0.680424 (-0.145429) |
| map no-op batched | 0.315621 / 0.534201 (-0.218580) |
| map no-op batched numpy | 0.837962 / 0.579283 (0.258679) |
| map no-op batched pandas | 0.681658 / 0.434364 (0.247294) |
| map no-op batched pytorch | 0.756965 / 0.540337 (0.216627) |
| map no-op batched tensorflow | 1.677558 / 1.386936 (0.290622) |

PyArrow==1.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|--------|------------------|
| read_batch_formatted_as_numpy after write_array2d | 0.018643 / 0.011353 (0.007290) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.017352 / 0.011008 (0.006344) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.048527 / 0.038508 (0.010019) |
| read_batch_unformated after write_array2d | 0.033688 / 0.023109 (0.010579) |
| read_batch_unformated after write_flattened_sequence | 0.338174 / 0.275898 (0.062276) |
| read_batch_unformated after write_nested_sequence | 0.357501 / 0.323480 (0.034022) |
| read_col_formatted_as_numpy after write_array2d | 0.006068 / 0.007986 (-0.001917) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004936 / 0.004328 (0.000607) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007471 / 0.004250 (0.003221) |
| read_col_unformated after write_array2d | 0.054376 / 0.037052 (0.017324) |
| read_col_unformated after write_flattened_sequence | 0.321549 / 0.258489 (0.063060) |
| read_col_unformated after write_nested_sequence | 0.385638 / 0.293841 (0.091797) |
| read_formatted_as_numpy after write_array2d | 0.161405 / 0.128546 (0.032859) |
| read_formatted_as_numpy after write_flattened_sequence | 0.129707 / 0.075646 (0.054061) |
| read_formatted_as_numpy after write_nested_sequence | 0.429353 / 0.419271 (0.010082) |
| read_unformated after write_array2d | 0.440163 / 0.043533 (0.396630) |
| read_unformated after write_flattened_sequence | 0.382170 / 0.255139 (0.127031) |
| read_unformated after write_nested_sequence | 0.358569 / 0.283200 (0.075369) |
| write_array2d | 1.735512 / 0.141683 (1.593829) |
| write_flattened_sequence | 1.785187 / 1.452155 (0.333033) |
| write_nested_sequence | 1.843999 / 1.492716 (0.351283) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|--------|------------------|
| select | 0.043064 / 0.037411 (0.005652) |
| shard | 0.021709 / 0.014526 (0.007183) |
| shuffle | 0.032843 / 0.176557 (-0.143714) |
| sort | 0.047661 / 0.737135 (-0.689474) |
| train_test_split | 0.050876 / 0.296338 (-0.245462) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|--------|------------------|
| read 5000 | 0.367350 / 0.215209 (0.152141) |
| read 50000 | 3.452680 / 2.077655 (1.375025) |
| read_batch 50000 10 | 2.091445 / 1.504120 (0.587325) |
| read_batch 50000 100 | 1.922422 / 1.541195 (0.381228) |
| read_batch 50000 1000 | 1.985011 / 1.468490 (0.516521) |
| read_formatted numpy 5000 | 7.059528 / 4.584777 (2.474751) |
| read_formatted pandas 5000 | 6.292839 / 3.745712 (2.547127) |
| read_formatted tensorflow 5000 | 8.744271 / 5.269862 (3.474410) |
| read_formatted torch 5000 | 7.752327 / 4.565676 (3.186651) |
| read_formatted_batch numpy 5000 10 | 0.712702 / 0.424275 (0.288427) |
| read_formatted_batch numpy 5000 1000 | 0.010042 / 0.007607 (0.002435) |
| shuffled read 5000 | 0.380682 / 0.226044 (0.154638) |
| shuffled read 50000 | 3.924842 / 2.268929 (1.655914) |
| shuffled read_batch 50000 10 | 2.580454 / 55.444624 (-52.864170) |
| shuffled read_batch 50000 100 | 2.311259 / 6.876477 (-4.565218) |
| shuffled read_batch 50000 1000 | 2.378753 / 2.142072 (0.236681) |
| shuffled read_formatted numpy 5000 | 7.251728 / 4.805227 (2.446500) |
| shuffled read_formatted_batch numpy 5000 10 | 6.435026 / 6.500664 (-0.065638) |
| shuffled read_formatted_batch numpy 5000 1000 | 7.136216 / 0.075469 (7.060747) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|--------|------------------|
| filter | 11.985410 / 1.841788 (10.143622) |
| map fast-tokenizer batched | 15.420552 / 8.074308 (7.346244) |
| map identity | 22.382930 / 10.191392 (12.191538) |
| map identity batched | 0.793597 / 0.680424 (0.113173) |
| map no-op batched | 0.579390 / 0.534201 (0.045189) |
| map no-op batched numpy | 0.793402 / 0.579283 (0.214119) |
| map no-op batched pandas | 0.652445 / 0.434364 (0.218081) |
| map no-op batched pytorch | 0.747935 / 0.540337 (0.207598) |
| map no-op batched tensorflow | 1.584207 / 1.386936 (0.197271) |

