nilc-nlp/coling-mupe-asr

MuPe Life Stories Dataset

MuPe Life Stories is a publicly available dataset of 289 life story interviews (365 hours) of spontaneous speech in Brazilian Portuguese. Its speakers vary widely in age, education, and regional accent.

Dataset

Hugging Face
https://huggingface.co/datasets/nilc-nlp/CORAA-NURC-SP-Audio-Corpus

Model

Hugging Face
https://huggingface.co/nilc-nlp/distil-whisper-coraa-mupe-asr
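
As a rough illustration of how the model above might be used, the sketch below wraps the Hugging Face `transformers` speech recognition pipeline in a small helper. It assumes the checkpoint loads through the standard `automatic-speech-recognition` pipeline, which this README does not confirm, and that `transformers`, a backend such as PyTorch, and `ffmpeg` are installed. The helper name `transcribe` is hypothetical and not part of this repository.

```python
# Model identifier taken from the Hugging Face URL above.
MODEL_ID = "nilc-nlp/distil-whisper-coraa-mupe-asr"


def transcribe(audio_path, model_id=MODEL_ID):
    """Transcribe one short audio file and return the recognized text.

    Assumption: the checkpoint is compatible with the generic
    automatic-speech-recognition pipeline from transformers.
    """
    # Imported lazily so that defining this helper does not require
    # transformers to be installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    result = asr(audio_path)  # dict with a "text" key
    return result["text"]
```

Calling `transcribe("sample.wav")`, where `sample.wav` is a placeholder path to a short clip, would download the model on first use and return the transcription as a string.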

Citation

Leal, S.E.; Candido Junior, A.; Marcacini, R.; Casanova, E.; Gonçalves, O.; Soares, A.; Lima, R.; Gris, L.; Aluísio, S.M. MuPe Life Stories Dataset: Spontaneous Speech in Brazilian Portuguese with a Case Study Evaluation on ASR Bias against Speakers Groups and Topic Modeling. Proceedings of the 31st International Conference on Computational Linguistics (COLING) (2025).

@inProceedings{Leal2025Coling,
author={Sidney Leal
   and Arnaldo Candido Jr.
   and Ricardo Marcacini
   and Edresson Casanova
   and Odilon Gonçalves
   and Anderson Soares
   and Rodrigo Lima
   and Lucas Gris
   and Sandra Alu{\'i}sio},
title={MuPe Life Stories Dataset: Spontaneous Speech in Brazilian Portuguese with a Case Study Evaluation on ASR Bias against Speakers Groups and Topic Modeling},
booktitle={Proceedings of the 31st International Conference on Computational Linguistics (COLING)},
year={2025}
}

Sponsors / Funding

This work was carried out at the Center for Artificial Intelligence (C4AI-USP), with support from the São Paulo Research Foundation (FAPESP grant #2019/07665-4) and the IBM Corporation. This project was also supported by the Ministry of Science, Technology and Innovation, with resources of Law No. 8.248, of October 23, 1991, within the scope of PPI-SOFTEX, coordinated by Softex and published as Residence in TIC 13, DOU 01245.010222/2022-44.

Examples

Because of the file sizes, the full dataset is hosted on Hugging Face. This repository contains the CSV files with metadata for the training, validation, and test subsets, along with a sample of the audio files.
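
The metadata CSV files can be inspected with the Python standard library alone. The following is a minimal sketch that sums audio duration per speaker group. The column names (`duration`, `accent`) and the sample rows are hypothetical, chosen only for illustration, so check the header of the actual CSV files before adapting it.

```python
import csv
import io
from collections import defaultdict


def hours_per_group(csv_text, group_col, duration_col):
    """Sum per-row durations (in seconds) by group and return hours."""
    totals = defaultdict(float)
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        totals[row[group_col]] += float(row[duration_col])
    return {group: seconds / 3600.0 for group, seconds in totals.items()}


# Hypothetical metadata rows; the real files may use other column names.
sample = """file_path,text,duration,accent
audio/a.wav,bom dia,1800,SP
audio/b.wav,boa tarde,5400,BA
audio/c.wav,boa noite,1800,SP
"""

print(hours_per_group(sample, "accent", "duration"))
# → {'SP': 1.0, 'BA': 1.5}
```

To run it on a file from this repository, read the file contents into a string (for example with `open(path, encoding="utf-8").read()`) and pass the real column names.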
