MuPe Life Stories Dataset

MuPe Life Stories is a publicly available dataset of 289 life story interviews (365 hours) of spontaneous speech in Brazilian Portuguese, with speakers who vary in age, education, and regional accent.

Dataset

Hugging Face
https://huggingface.co/datasets/nilc-nlp/CORAA-NURC-SP-Audio-Corpus

Model

Hugging Face
https://huggingface.co/nilc-nlp/distil-whisper-coraa-mupe-asr
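
As a rough sketch of how the model linked above might be used for transcription, the snippet below loads it through the `pipeline` helper of the Hugging Face `transformers` library. It assumes that the checkpoint is published in a format the `automatic-speech-recognition` pipeline can load, that `transformers`, a backend such as PyTorch, and `ffmpeg` are installed, and that the audio path passed on the command line is supplied by the reader. None of this is confirmed by the repository itself.

```python
import sys

# Model identifier taken from the Hugging Face URL above.
MODEL_ID = "nilc-nlp/distil-whisper-coraa-mupe-asr"


def transcribe(audio_path, chunk_length_s=30):
    """Transcribe one audio file and return the recognized text.

    Assumption: the checkpoint is compatible with the transformers
    automatic-speech-recognition pipeline.
    """
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        chunk_length_s=chunk_length_s,  # process long recordings in windows
    )
    result = asr(audio_path)
    return result["text"]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python transcribe.py path/to/audio.wav (path is hypothetical)
    print(transcribe(sys.argv[1]))
```

The 30-second window is an illustrative default for long interview recordings, not a setting documented by the authors.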

Citation

Leal, S.E.; Candido Junior, A.; Marcacini, R.; Casanova, E.; Gonçalves, O.; Soares, A.; Lima, R.; Gris, L.; Aluísio, S.M. MuPe Life Stories Dataset: Spontaneous Speech in Brazilian Portuguese with a Case Study Evaluation on ASR Bias against Speakers Groups and Topic Modeling. Proceedings of the 31st International Conference on Computational Linguistics (COLING) (2025).

@inProceedings{Leal2025Coling,
author={Sidney Leal
   and Arnaldo Candido Jr.
   and Ricardo Marcacini
   and Edresson Casanova
   and Odilon Gonçalves
   and Anderson Soares
   and Rodrigo Lima
   and Lucas Gris
   and Sandra Alu{\'i}sio},
title={MuPe Life Stories Dataset: Spontaneous Speech in Brazilian Portuguese with a Case Study Evaluation on ASR Bias against Speakers Groups and Topic Modeling},
booktitle={Proceedings of the 31st International Conference on Computational Linguistics (COLING)},
year={2025}
}

Sponsors / Funding

This work was carried out at the Center for Artificial Intelligence (C4AI-USP), with support from the São Paulo Research Foundation (FAPESP grant #2019/07665-4) and from the IBM Corporation. This project was also supported by the Ministry of Science, Technology and Innovation, with resources of Law No. 8.248, of October 23, 1991, within the scope of PPI-SOFTEX, coordinated by Softex and published as Residence in TIC 13, DOU 01245.010222/2022-44.

Examples

Because of the file sizes, the full dataset is hosted on Hugging Face. This repository contains the CSV metadata files for the training, validation, and test subsets, along with a sample of the audio files.
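
To get a quick overview of the metadata files, a minimal sketch along these lines could report the number of rows and the column names of each CSV. It assumes the files are named `train.csv`, `validation.csv`, and `test.csv`, are comma-separated, are UTF-8 encoded, and have a header row. The column names are not documented here, so the script only prints whatever it finds.

```python
import csv
from pathlib import Path


def summarize_metadata(csv_path):
    """Return (row_count, column_names) for one metadata CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)  # assumes a comma-separated header row
        row_count = sum(1 for _ in reader)
        columns = list(reader.fieldnames or [])
    return row_count, columns


if __name__ == "__main__":
    # Assumed file names; adjust if the repository layout differs.
    for name in ("train.csv", "validation.csv", "test.csv"):
        path = Path(name)
        if not path.exists():
            print(f"{name}: not found")
            continue
        rows, columns = summarize_metadata(path)
        print(f"{name}: {rows} rows, columns = {columns}")
```

Run from the repository root, this prints one summary line per subset and skips any file that is missing.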

Repository contents:

scripts
test/pc_ma_hv010
README.md
test.csv
train.csv
validation.csv
