This repository aims at the development of audio technologies using Wav2vec 2.0, such as Automatic Speech Recognition (ASR), for the Brazilian Portuguese language.
This repository contains code and fine-tuned Wav2vec checkpoints for Brazilian Portuguese, including some useful scripts to download and preprocess transcribed data.
Wav2vec 2.0 learns speech representations on unlabeled data as described in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020). For more information about Wav2vec, please access the official repository.
- Add CORAA to the BP Dataset (BP Dataset Version 2);
- Release BP Dataset V2 fine-tuned models;
- Fine-tune using the XLS-R 300M, XLS-R 1B and XLS-R 2B models.
We provide several Wav2vec fine-tuned models for ASR. For a more detailed description of how we finetuned these models, please check the paper Brazilian Portuguese Speech Recognition Using Wav2vec 2.0.
Our latest model is bp_400. It was fine-tuned using the 400h filtered version of the BP Dataset (see Brazilian Portuguese (BP) Dataset Version 1 below). The results against each gathered dataset are shown below.
Model name | Pretrained model | Fairseq model | Dict | Hugging Face link |
---|---|---|---|---|
bp_400 | XLSR-53 | fairseq | dict | hugging face |
bp_400_xls-r-300M | XLS-R-300M | fairseq | dict | hugging face |
Model name | Pretrained model | Fairseq model | Dict | Hugging Face link |
---|---|---|---|---|
bp_500 | XLSR-53 | fairseq | dict | hugging face |
bp_500_10k | VoxPopuli 10k BASE | fairseq | dict | hugging face |
bp_500_100k | VoxPopuli 100k BASE | fairseq | dict | hugging face |
Model name | Pretrained model | Fairseq model | Dict | Hugging Face link |
---|---|---|---|---|
bp_cetuc_100 | XLSR-53 | fairseq | dict | hugging face |
bp_commonvoice_100 | XLSR-53 | fairseq | dict | hugging face |
bp_commonvoice_10 | XLSR-53 | fairseq | dict | hugging face |
bp_lapsbm_1 | XLSR-53 | fairseq | dict | hugging face |
bp_mls_100 | XLSR-53 | fairseq | dict | hugging face |
bp_sid_10 | XLSR-53 | fairseq | dict | hugging face |
bp_tedx_100 | XLSR-53 | fairseq | dict | hugging face |
bp_voxforge_1 | XLSR-53 | fairseq | dict | hugging face |
We provide other Wav2vec checkpoints. These models were trained using all the data available at the time, including the dev and test subsets of each dataset. Only the Common Voice dev and test sets were held out, to validate and test the model, respectively.
Datasets used for training | Fairseq model | Dict | Hugging Face link |
---|---|---|---|
CETUC + CV 6.1 (only train) + LaPS BM + MLS + VoxForge | fairseq | dict | hugging face |
CETUC + CV 6.1 (all validated) + LaPS BM + MLS + VoxForge | | | hugging face |
Model | CETUC | CV | LaPS | MLS | SID | TEDx | VF | AVG |
---|---|---|---|---|---|---|---|---|
bp_400 | 0.052 | 0.140 | 0.074 | 0.117 | 0.121 | 0.245 | 0.118 | 0.124 |
bp_400_xls-r-300M | 0.048 | 0.123 | 0.068 | 0.111 | 0.084 | 0.207 | 0.095 | 0.105 |
bp_500 | 0.052 | 0.137 | 0.032 | 0.118 | 0.095 | 0.236 | 0.082* | 0.112 |
bp_500-base10k_voxpopuli | 0.120 | 0.249 | 0.039 | 0.227 | 0.169 | 0.349 | 0.116* | 0.181 |
bp_500-base100k_voxpopuli | 0.074 | 0.174 | 0.032 | 0.182 | 0.181 | 0.349 | 0.111* | 0.157 |
bp_cetuc_100** | 0.446 | 0.856 | 0.089 | 0.967 | 1.172 | 0.929 | 0.902 | 0.765 |
bp_commonvoice_100 | 0.088 | 0.126 | 0.121 | 0.173 | 0.177 | 0.424 | 0.145 | 0.179 |
bp_commonvoice_10 | 0.133 | 0.189 | 0.165 | 0.189 | 0.247 | 0.474 | 0.251 | 0.235 |
bp_lapsbm_1 | 0.111 | 0.418 | 0.145 | 0.299 | 0.562 | 0.580 | 0.469 | 0.369 |
bp_mls_100 | 0.192 | 0.260 | 0.162 | 0.163 | 0.268 | 0.492 | 0.268 | 0.257 |
bp_sid_10 | 0.186 | 0.327 | 0.207 | 0.505 | 0.124 | 0.835 | 0.472 | 0.379 |
bp_tedx_100 | 0.138 | 0.369 | 0.169 | 0.165 | 0.794 | 0.222 | 0.395 | 0.321 |
bp_voxforge_1 | 0.468 | 0.608 | 0.503 | 0.505 | 0.717 | 0.731 | 0.561 | 0.584 |
* We found a problem with the dataset used in these experiments regarding the VoxForge subset. In this test set, some speakers were also present in the training set (which explains the lower WER). The final version of the dataset does not have such contamination.
** We did not perform validation in the subset experiments. CETUC has little variety in its transcriptions, so this model may be overfitted.
Text | Transcription |
---|---|
alguém sabe a que horas começa o jantar | alguém sabe a que horas começo jantar |
lila covas ainda não sabe o que vai fazer no fundo | lilacovas ainda não sabe o que vai fazer no fundo |
que tal um pouco desse bom spaghetti | quetá um pouco deste bom ispaguete |
hong kong em cantonês significa porto perfumado | rongkong en cantones significa porto perfumado |
vamos hackear esse problema | vamos rackar esse problema |
apenas a poucos metros há uma estação de ônibus | apenas ha poucos metros á uma estação de ônibus |
relâmpago e trovão sempre andam juntos | relampagotrevão sempre andam juntos |
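The scores in the results table are word error rates (WER). As a minimal sketch of the metric, assuming plain whitespace tokenization and no extra text normalization, the helper below computes WER for a single reference/transcription pair. The function name `word_error_rate` is illustrative and not part of this repository, and the scoring actually used for the table may differ in details such as normalization or corpus-level aggregation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# First example from the table above: one substitution plus one deletion
# over eight reference words.
print(word_error_rate(
    "alguém sabe a que horas começa o jantar",
    "alguém sabe a que horas começo jantar",
))
```

Note that a missing space, as in "lilacovas", counts as one substitution plus one deletion, so segmentation mistakes are penalized twice under this metric.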
Datasets provided:
- CETUC: contains approximately 145 hours of Brazilian Portuguese speech distributed among 50 male and 50 female speakers, each pronouncing approximately 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus.
- Common Voice 7.0: a project proposed by the Mozilla Foundation with the goal of creating a wide-open dataset in different languages. In this project, volunteers donate and validate speech using the official site.
- Lapsbm: "Falabrasil - UFPA" is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. It contains 35 speakers (10 females), each one pronouncing 20 unique sentences, totaling 700 utterances in Brazilian Portuguese. The audio was recorded at 22.05 kHz without environment control.
- Multilingual Librispeech (MLS): a massive dataset available in many languages. The MLS is based on audiobook recordings in the public domain like LibriVox. The dataset contains a total of 6k hours of transcribed data in many languages. The set in Portuguese used in this work (mostly Brazilian variant) has approximately 284 hours of speech, obtained from 55 audiobooks read by 62 speakers.
- Multilingual TEDx: a collection of audio recordings from TEDx talks in 8 source languages. The Portuguese set (mostly Brazilian Portuguese variant) contains 164 hours of transcribed speech.
- Sidney (SID): contains 5,777 utterances recorded by 72 speakers (20 women) from 17 to 59 years old, with fields such as place of birth, age, gender, education, and occupation.
- VoxForge: a project with the goal of building open datasets for acoustic models. The corpus contains approximately 100 speakers and 4,130 utterances of Brazilian Portuguese, with sample rates varying from 16 kHz to 44.1 kHz.
These datasets were combined to build a larger Brazilian Portuguese dataset (BP Dataset). All data was used for training except Common Voice dev/test sets, which were used for validation/test respectively. We also made test sets for all the gathered datasets.
Dataset | Train | Valid | Test |
---|---|---|---|
CETUC | 93.9h | -- | 5.4h |
Common Voice | 37.6h | 8.9h | 9.5h |
LaPS BM | 0.8h | -- | 0.1h |
MLS | 161.0h | -- | 3.7h |
Multilingual TEDx (Portuguese) | 144.2h | -- | 1.8h |
SID | 5.0h | -- | 1.0h |
VoxForge | 2.8h | -- | 0.1h |
Total | 437.2h | 8.9h | 21.6h |
You can download the datasets individually using the scripts in the scripts/ directory. The scripts will create the respective dev and test sets automatically.
python scripts/mls.py
If you want to join several datasets, execute the script join_datasets at scripts/:
python scripts/join_datasets.py /path/to/dataset1/train /path/to/dataset2/train ... --output-dir data/my_dataset --output-name train
After joining datasets, you might have some degree of transcription contamination. To remove all transcriptions present in a specific subset (for example, test subset), you can use the filter_dataset script:
python scripts/filter_datasets.py /path/to/my_dataset/train /path/to/dataset1/test /path/to/dataset2/test --output-dir data/my_dataset --output-name my_filtered_train
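The idea behind this filtering step can be sketched in a few lines. This is not the repository's implementation: the function name `filter_training_set`, the `(audio_path, transcription)` tuple layout, and the lowercase/whitespace normalization are all assumptions made for illustration.

```python
def filter_training_set(train, held_out):
    """Drop training utterances whose transcription occurs in a held-out set.

    Both arguments are lists of (audio_path, transcription) tuples
    (a hypothetical layout, chosen only for this sketch).
    """
    def normalize(text):
        # Assumed normalization: lowercase and collapse whitespace.
        return " ".join(text.lower().split())

    blocked = {normalize(text) for _, text in held_out}
    return [(path, text) for path, text in train
            if normalize(text) not in blocked]


train = [
    ("train/a.wav", "vamos hackear esse problema"),
    ("train/b.wav", "relâmpago e trovão sempre andam juntos"),
]
test = [("test/x.wav", "Vamos hackear  esse problema")]

print(filter_training_set(train, test))
```

Only the second training utterance survives here, because the first one shares its transcription with the test set.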
Alternatively, download the raw datasets using the links below:
- https://igormq.github.io/datasets/
- https://commonvoice.mozilla.org/
- http://www.openslr.org/94/
- http://www.openslr.org/100
The BP Dataset is assembled from several Brazilian Portuguese datasets. We used the original test set of each gathered dataset as an individual test set. For the datasets without a test set, we created one by selecting 5% of the unique male and female speakers. Additionally, we removed from the final training set every transcription that appears in a test set. We also excluded audio longer than 30 seconds.
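The speaker-based split and the duration filter described above could look roughly like the following. This is a sketch under assumptions, not the code used to build the BP Dataset: the function names, the `speaker -> gender` mapping, the `duration` field, the rounding rule, and the minimum of one test speaker per gender are all hypothetical.

```python
import random
from collections import defaultdict


def split_speakers(speakers, test_fraction=0.05, seed=0):
    """Pick roughly `test_fraction` of the speakers of each gender for testing.

    `speakers` maps speaker id -> gender label (assumed layout).
    Returns (train_ids, test_ids) as disjoint sets.
    """
    by_gender = defaultdict(list)
    for speaker_id, gender in sorted(speakers.items()):
        by_gender[gender].append(speaker_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    test_ids = set()
    for _, ids in sorted(by_gender.items()):
        # Assumption: keep at least one test speaker per gender.
        n_test = max(1, round(len(ids) * test_fraction))
        test_ids.update(rng.sample(ids, n_test))
    return set(speakers) - test_ids, test_ids


def drop_long_audio(utterances, max_seconds=30.0):
    """Keep utterances whose `duration` (in seconds) is at most `max_seconds`."""
    return [u for u in utterances if u["duration"] <= max_seconds]


speakers = {f"m{i:02d}": "m" for i in range(20)}
speakers.update({f"f{i:02d}": "f" for i in range(20)})
train_ids, test_ids = split_speakers(speakers)
print(len(train_ids), len(test_ids))
```

Splitting by speaker rather than by utterance keeps every test speaker unseen during training, which avoids the kind of speaker overlap reported for the VoxForge subset of the earlier dataset version.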
If you run the provided scripts, you might generate a slightly different version of the BP dataset. If you want to use the same files used to train, validate and test our models, you can download the metadata here.
Our first attempt to build a larger dataset for BP produced a 500-hour dataset. However, we found some problems with the VoxForge subset. We also found some transcriptions of the test sets present in the training set. The models trained with this version of the dataset (bp_500) remain available.
Language models can improve the ASR output. To use with fairseq, you will need to install flashlight python bindings. You will also need a lexicon containing the possible words.
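As a rough illustration of what such a lexicon contains, the sketch below builds one from a word list. It assumes a letter-level vocabulary with `|` as the word-boundary token, which is the convention in fairseq's wav2vec letter dictionaries; check the dict file shipped with the model you use before relying on it. The function name `build_lexicon` is hypothetical.

```python
def build_lexicon(words):
    """Return lexicon lines of the form '<word>\\t<l e t t e r s> |'.

    Assumes each word is spelled letter by letter and that '|' marks the
    end of a word, as in letter-based wav2vec dictionaries.
    """
    lines = []
    for word in sorted(set(words)):
        spelling = " ".join(word) + " |"
        lines.append(f"{word}\t{spelling}")
    return lines


corpus = "alguém sabe a que horas começa o jantar"
for line in build_lexicon(corpus.split()):
    print(line)
```

Every letter used in the lexicon should also appear in the acoustic model's dictionary; words containing other symbols may need to be dropped or normalized first.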
You can download some KenLM models here. They are compatible with the flashlight decoder.
Model name | Fairseq model | Dict |
---|---|---|
BP Transformer LM | fairseq model | dict |
Wikipedia Transformer LM | fairseq model | dict |
Wikipedia Pruned Transformer LM | fairseq model | dict |
If you want to use Wav2Vec2_PyCTCDecode with Transformers to decode the Hugging Face models, the Ken LM models provided above might not work. In this case, you should train your own following the instructions here, or use one of the two models trained with BP Dataset and Wikipedia below:
- To finetune the model, first install fairseq and its dependencies.
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install -e .
- Download a pre-trained model (see pretrained models).
- Create or use a configuration file (see the configs/ directory).
- Fine-tune the model by executing fairseq-hydra-train:
root=/path/to/wav2vec4bp
fairseq-hydra-train \
task.data=$root/data/my_dataset \
checkpoint.save_dir=$root/checkpoints/stt/my_model_name \
model.w2v_path=$root/xlsr_53_56k.pt \
common.tensorboard_logdir=$root/logs/stt/my_model_name \
--config-dir $root/configs \
--config-name my_configuration_file_name
To fine-tune Wav2vec, you will need to download a pre-trained model first.
- XLSR-53 (large) (recommended)
- VoxPopuli 10k (base)
- VoxPopuli 10k (large)
- VoxPopuli 100k (base)
- VoxPopuli 100k (large)
- XLS-R 300M
- XLS-R 1B
- XLS-R 2B
To fine-tune the model with Hugging Face instead, you can use the Wav2vec-wrapper repository.
To train a language model, one can use a Transformer LM or KenLM.
First, install KenLM.
git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4
Then create a text file and run the following command:
./kenlm/build/bin/lmplz -o 5 <text.txt > path_to_lm.arpa
To train a Transformer LM, first prepare and preprocess train, valid and test text files:
TEXT=path/to/dataset
fairseq-preprocess \
--only-source \
--trainpref $TEXT/train.tokens \
--validpref $TEXT/valid.tokens \
--testpref $TEXT/test.tokens \
--destdir data/text/$dataset \
--workers 20
Then train the model:
fairseq-train --task language_modeling \
data/text/$dataset \
--save-dir checkpoints/transformer_lms/$name \
--arch transformer_lm --share-decoder-input-output-embed \
--dropout 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
--lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
--tokens-per-sample 512 --sample-break-mode none \
--max-tokens 1024 --update-freq 32 \
--fp16 \
--max-update 50000
We recommend using a docker container, such as flml/flashlight, to easily finetune and test your models.