RoDia: A Dataset for Romanian Dialect Identification from Speech

Dialect identification is a critical task in speech processing and language technology, benefiting applications such as speech recognition and speaker verification. While most studies have focused on dialect identification in widely spoken languages, low-resource languages such as Romanian have received limited attention.

To address this research gap, we introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data.

Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging.
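
For readers comparing against the two reported metrics, the following is a minimal sketch of how macro and micro F1 differ for single-label multi-class predictions. It is not the evaluation code used for the baselines; the function name `f1_scores` and the placeholder labels are assumptions made for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro F1: unweighted mean of per-class F1, so small classes count equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro F1: F1 over pooled counts; equals accuracy in the single-label case.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Placeholder labels "A" and "B" stand in for dialect regions (hypothetical).
macro, micro = f1_scores(["A", "A", "A", "B"], ["A", "A", "B", "B"])
```

Because macro F1 averages per-class scores without weighting by class size, a gap between the two metrics usually suggests that performance is uneven across classes.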

We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification.

Dataset

The dataset is available in the data directory. The train.csv and test.csv files define the train/test split. We also include gender and age statistics.
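
As a starting point for inspecting the split files, here is a minimal sketch that counts rows per label in a CSV file with a header row. The column layout of train.csv and test.csv is not documented here, so the column name `dialect` and the sample rows below are purely hypothetical; check the actual header before use.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file, label_column):
    """Count rows per value of `label_column` in an open CSV file with a header row."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# In-memory stand-in for a split file; column names and values are hypothetical.
sample = io.StringIO("path,dialect\na.wav,A\nb.wav,B\nc.wav,A\n")
counts = label_counts(sample, "dialect")
```

With the real files, the same function could be called as `label_counts(open("train.csv", newline="", encoding="utf-8"), "<label column>")`, substituting the actual label column name.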

