MuPe Life Stories is a new publicly available dataset of 289 life story interviews (365 hours) of spontaneous speech in Brazilian Portuguese, featuring a broad range of speakers who vary in age, education, and regional accent.
Resource | Hugging Face |
---|---|
Dataset | https://huggingface.co/datasets/nilc-nlp/CORAA-NURC-SP-Audio-Corpus |
ASR model | https://huggingface.co/nilc-nlp/distil-whisper-coraa-mupe-asr |
Leal, S.E.; Candido Junior, A.; Marcacini, R.; Casanova, E.; Gonçalves, O.; Soares, A.; Lima, R.; Gris, L.; Aluísio, S.M. MuPe Life Stories Dataset: Spontaneous Speech in Brazilian Portuguese with a Case Study Evaluation on ASR Bias against Speakers Groups and Topic Modeling. Proceedings of the 31st International Conference on Computational Linguistics (COLING) (2025).
@inProceedings{Leal2025Coling,
author={Sidney Leal
and Arnaldo Candido Jr.
and Ricardo Marcacini
and Edresson Casanova
and Odilon Gonçalves
and Anderson Soares
and Rodrigo Lima
and Lucas Gris
and Sandra Alu{\'i}sio},
title={MuPe Life Stories Dataset: Spontaneous Speech in Brazilian Portuguese with a Case Study Evaluation on ASR Bias against Speakers Groups and Topic Modeling},
booktitle={Proceedings of the 31st International Conference on Computational Linguistics (COLING)},
year={2025}
}
This work was carried out at the Center for Artificial Intelligence (C4AI-USP), with support from the São Paulo Research Foundation (FAPESP grant #2019/07665-4) and from the IBM Corporation. This project was also supported by the Ministry of Science, Technology and Innovation, with resources from Law No. 8.248, of October 23, 1991, within the scope of PPI-SOFTEX, coordinated by Softex and published as Residence in TIC 13, DOU 01245.010222/2022-44.
Because of the file sizes, the full dataset is hosted on Hugging Face. Here you will find all the CSV files with metadata for the training, validation, and test subsets, along with a sample of the audio files.
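To get a quick look at one of the metadata CSV files before downloading the audio, something like the sketch below may help. It relies only on the Python standard library and makes no assumption about column names. It does assume the files are comma-separated, UTF-8 encoded, and start with a header row; the helper name `summarize_metadata` is hypothetical and not part of this project.

```python
import csv
import sys
from pathlib import Path


def summarize_metadata(csv_path):
    """Return (column_names, row_count) for one metadata CSV file.

    Assumes a comma-separated, UTF-8 file whose first row is a header.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Count the data rows; the header row is consumed by DictReader.
        row_count = sum(1 for _ in reader)
        # fieldnames is None for an empty file, so fall back to an empty list.
        columns = list(reader.fieldnames or [])
    return columns, row_count


if __name__ == "__main__":
    # Pass one or more CSV paths on the command line.
    for path in sys.argv[1:]:
        columns, rows = summarize_metadata(path)
        print(f"{path}: {rows} rows, columns = {columns}")
```

Running the script with the paths of the training, validation, and test CSV files prints the header and number of rows of each, which is a simple way to check the subset sizes.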