Here, we present the Brazilian Portuguese Speech Emotion Recognition (SER) Task. This task aims to motivate SER research in our community, mainly by discussing theoretical and practical aspects of speech emotion recognition, pre-processing and feature extraction, and machine learning models for Brazilian Portuguese.
We provide a dataset, CORAA SER version 1.0, composed of approximately 50 minutes of audio segments labeled with three classes: neutral, non-neutral female, and non-neutral male. The neutral class represents audio segments with no well-defined emotional state, while the non-neutral classes represent segments associated with one of the primary emotional states in the speaker's speech. The dataset was built from the C-ORAL-BRASIL I corpus.
The corpus consists of audio segments of informal, spontaneous Brazilian Portuguese speech. The non-neutral classes were labeled based on paralinguistic elements (laughing, crying, etc.). Participants may use pre-trained models and external data, as long as the original C-ORAL-BRASIL corpus (or its variants) is not used for model training.
In this task, participants must train their own models using acoustic audio features; a training set is provided. The participants' models will be evaluated on a test set, which will be made publicly available after the challenge.
The training audio segments are available in the data_train.zip file.
Audio files are named according to their label: <file-id>_<label>.wav. Check the baselines for examples of reading and pre-processing the training set.
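As a rough illustration of this naming convention, the Python sketch below builds (path, label) pairs from the training filenames. The data_train/ directory name and the assumption that the label is everything after the first underscore in the stem are ours, not part of the official task setup.

```python
# Illustrative sketch: index the training files by parsing labels from names.
# Assumes data_train.zip was extracted into a local "data_train/" folder and
# that <file-id> itself contains no underscore -- both assumptions.
from collections import Counter
from pathlib import Path

def load_train_index(train_dir="data_train"):
    samples = []
    for wav_path in sorted(Path(train_dir).glob("*.wav")):
        file_id, label = wav_path.stem.split("_", 1)
        samples.append((str(wav_path), label))
    return samples

if __name__ == "__main__":
    index = load_train_index()
    print(Counter(label for _, label in index))  # class distribution
```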
Test audio segments are available in the test_ser.zip file.
The ground truth and other metadata are available in the test_ser_metadata.csv file.
We present two simple baselines as examples of how to pre-process audio segments, extract features, and train models for emotion recognition.
The first baseline uses a set of prosodic audio features for emotion classification.
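To make this concrete, the snippet below is a minimal sketch of a prosodic front end using librosa (an assumed dependency); the specific features and summary statistics are illustrative choices, not the baseline's actual configuration.

```python
# Illustrative prosodic feature extraction with librosa (assumed dependency).
# The feature set (pitch/energy statistics, duration, voicing ratio) is a
# sketch, not the actual baseline configuration.
import numpy as np
import librosa

def prosodic_features(path):
    y, sr = librosa.load(path, sr=16000)
    # Pitch (F0) contour via probabilistic YIN; NaN marks unvoiced frames.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    f0 = f0[~np.isnan(f0)]
    # Short-term energy contour.
    rms = librosa.feature.rms(y=y)[0]
    # Summarize the contours into a fixed-size feature vector.
    return np.array([
        f0.mean() if f0.size else 0.0,
        f0.std() if f0.size else 0.0,
        rms.mean(),
        rms.std(),
        len(y) / sr,                # segment duration in seconds
        float(voiced_flag.mean()),  # proportion of voiced frames
    ])
```

Fixed-size vectors like these can then be fed to any standard classifier, such as an SVM or a random forest.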
In the second baseline, we use the Wav2Vec model to extract features (i.e., embeddings) from the audio segments. These features can then be used to train a speech emotion recognition classifier.
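For reference, here is a minimal sketch of this kind of embedding extraction with the HuggingFace transformers library (assumed dependency); the facebook/wav2vec2-base checkpoint and the mean-pooling step are illustrative assumptions and may differ from the baseline's configuration.

```python
# Illustrative Wav2Vec 2.0 embedding extraction with HuggingFace transformers
# (assumed dependency). The checkpoint name and mean pooling are assumptions,
# not necessarily what the baseline uses.
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-base"  # assumed checkpoint
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2Model.from_pretrained(CHECKPOINT)
model.eval()

def wav2vec_embedding(path):
    y, sr = librosa.load(path, sr=16000)  # Wav2Vec 2.0 expects 16 kHz audio
    inputs = feature_extractor(y, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, frames, dim)
    # Mean-pool over time to obtain one fixed-size embedding per segment.
    return hidden.mean(dim=1).squeeze(0).numpy()
```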
Each participant can submit up to three models. Models will be evaluated using the Macro F1 score.
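Macro F1 is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of its frequency. With scikit-learn, it corresponds to the following call (the label strings below are placeholders for illustration only):

```python
# Macro F1: unweighted mean of per-class F1 scores (scikit-learn assumed).
from sklearn.metrics import f1_score

# Placeholder labels purely for illustration.
y_true = ["neutral", "non-neutral-female", "neutral", "non-neutral-male"]
y_pred = ["neutral", "neutral", "neutral", "non-neutral-male"]
print(f1_score(y_true, y_pred, average="macro"))
```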
The S&ER 2022 Workshop is co-located with the 15th edition of the International Conference on the Computational Processing of Portuguese (PROPOR 2022).
Workshop website: https://sites.google.com/view/ser2022/home