Llamacha/IWSLT2024_Quechua_data

IWSLT2024 - Low-resource Speech Translation Track: Quechua-Spanish Parallel Corpus

Main repository for the sharing of Quechua-Spanish Speech Translation data as part of the low-resource shared task at IWSLT 2024.

TEST DATA NOW AVAILABLE FOR THE IWSLT 2024 CONSTRAINED TASK

IWSLT 2024 TEST DATA

Parallel data for the constrained task

This corpus is a small extract of the Siminchik corpus (Cardenas et al., 2018), a Quechua speech corpus built from several radio recordings. The recordings have been transcribed and translated into Spanish. The clean speech data totals 1 hour and 40 minutes of recording time. It can be found in the que_spa_constrained folder, which contains three sub-folders: training, valid, and test. The test folder will be made visible after the submissions have been received.

The raw text transcriptions are located in que_spa_constrained/<split>/txt/<split>.<lang>.

True-cased Spanish target translations are found in que_spa_constrained/<split>/txt/<split>.spa.tc.

True-casing was done with a sacremoses Truecaser model trained on the Spanish side of WMT13 EN-ES.
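The path layout above can be sketched in a few lines of Python. This is only an illustration, not part of the repository: the helper names `text_path` and `load_parallel` are hypothetical, the `que` extension for the Quechua side is an assumption (only `spa` and `spa.tc` are named above), and the files are assumed to be UTF-8 with one segment per line, aligned across languages.

```python
from pathlib import Path

# Sub-folder names as listed for que_spa_constrained.
SPLITS = ("training", "valid", "test")


def text_path(root, split, lang, truecased=False):
    """Build <root>/<split>/txt/<split>.<lang>, with a .tc suffix if truecased."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    suffix = f"{lang}.tc" if truecased else lang
    return Path(root) / split / "txt" / f"{split}.{suffix}"


def load_parallel(root, split, src="que", tgt="spa", truecased_target=False):
    """Return a list of (source, target) line pairs for one split.

    Assumption: both files are line-aligned; "que" is a guessed extension.
    """
    src_file = text_path(root, split, src)
    tgt_file = text_path(root, split, tgt, truecased=truecased_target)
    src_lines = src_file.read_text(encoding="utf-8").splitlines()
    tgt_lines = tgt_file.read_text(encoding="utf-8").splitlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"{src_file} has {len(src_lines)} lines, "
            f"{tgt_file} has {len(tgt_lines)}"
        )
    return list(zip(src_lines, tgt_lines))


if __name__ == "__main__":
    root = Path("que_spa_constrained")
    if root.exists():
        pairs = load_parallel(root, "valid", truecased_target=True)
        print(len(pairs), "segment pairs in valid")
```

The line-count check is there because a silent misalignment between the two files would pair every later segment with the wrong translation.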

Additional audio data for the unconstrained task - ADDITIONAL DATA 1

In addition to the 1 hour and 40 minutes of Quechua audio aligned with Spanish translations, we also provide participants with a corpus of 48 hours of fully transcribed Quechua audio without translations for the unconstrained task. The audio and corresponding transcriptions are a larger extract from the Siminchik data set. We hope this data can be used directly to help develop speech recognition components for the unconstrained task. The data can be downloaded directly from here: Unconstrained QUE-SPA Additional Audio 1.

Please note: Participants are not required to use this data but are free to use it under the license below.

Citation

@article{cardenas2018siminchik,
  title={Siminchik: A speech corpus for preservation of southern quechua},
  author={Cardenas, Ronald and Zevallos, Rodolfo and Baquerizo, Reynaldo and Camacho, Luis},
  journal={ISI-NLP 2},
  pages={21},
  year={2018}
}

Additional Parallel Machine Translation Text data for the constrained task

As part of the constrained task, we allow the use of machine translation parallel text from previous work. Participants are not required to use this data either.

The data is found in this repository in the folder additional_mt_text. It was extracted from the JW300 and Hinantin websites and used in the work cited below. Please cite that work if you use this data.

Citation

@article{ortega2020neural,
  title={Neural machine translation with a polysynthetic low resource language},
  author={Ortega, John E and Castro Mamani, Richard and Cho, Kyunghyun},
  journal={Machine Translation},
  volume={34},
  number={4},
  pages={325--346},
  year={2020},
  publisher={Springer}
}

License

All audio recordings are property of Siminchikkunarayku and Llamacha.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

Acknowledgements

Part of this work has been funded by AmericasNLP-2022, John E. Ortega, and Llamacha. Special thanks to Eva Mühlbauer, Maximilian Torres, and Anku Kichka for their support.
