Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation
Code associated with the paper: Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation.
Abstract: The performance of automatic speech recognition (ASR) systems has advanced substantially in recent years, particularly for languages for which a large amount of transcribed speech is available. Unfortunately, for low-resource languages, such as minority languages, regional languages or dialects, ASR performance generally remains much lower. In this study, we investigate whether data augmentation techniques could help improve low-resource ASR performance, focusing on four typologically diverse minority languages or language variants (West Germanic: Gronings, West-Frisian; Malayo-Polynesian: Besemah, Nasal). For all four languages, we examine the use of self-training, where an ASR system trained with the available human-transcribed data is used to generate transcriptions, which are then combined with the original data to train a new ASR system. For Gronings, for which there was a pre-existing text-to-speech (TTS) system available, we also examined the use of TTS to generate ASR training data from text-only sources. We find that using a self-training approach consistently yields improved performance (a relative WER reduction up to 20.5% compared to using an ASR system trained on 24 minutes of manually transcribed speech). The performance gain from TTS augmentation for Gronings was even stronger (up to 25.5% relative reduction in WER compared to a system based on 24 minutes of manually transcribed speech). In sum, our results show the benefit of using self-training or (if possible) TTS-generated data as an efficient solution to overcome the limitations of data availability for resource-scarce languages in order to improve ASR performance.
git clone https://github.com/Bartelds/asr-augmentation.git
cd asr-augmentation
pip install -r requirements.txt
In this repository, we provide a small demonstration dataset for Gronings in data/gos-demo, based on the dataset released by San et al. (2021). This data is also available on the Hugging Face Hub.
The full datasets used in the experiments for Gronings, Besemah, and Nasal are available on Zenodo. The FAME! ASR corpus for West-Frisian can be obtained by emailing the authors.
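For quick experimentation, the demo data can also be loaded with the `datasets` library. A minimal sketch, assuming a Hub dataset ID of `bartelds/gos-demo` (hypothetical; check the Hugging Face Hub for the actual repository name):

```python
# Minimal sketch: load the Gronings demo data from the Hugging Face Hub.
# NOTE: "bartelds/gos-demo" is a hypothetical dataset ID used for illustration.
from datasets import load_dataset

ds = load_dataset("bartelds/gos-demo")
print(ds)              # available splits and features
print(ds["train"][0])  # first example (audio array + transcription)
```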
The code and (pre-trained/fine-tuned) models are released under the Apache 2.0 license, as indicated in the LICENSE file. The Gronings, Nasal, and Besemah ASR datasets are released under the CC-BY 4.0 license.
To continue pre-training (CPT) on target-domain data:
accelerate config
accelerate launch --mixed_precision fp16 \
src/cpt/run_wav2vec2_pretraining_no_trainer.py \
--dataset_name=google/fleurs \
--dataset_config_names nl_nl \
--dataset_split_names train \
--output_dir=checkpoints \
--max_train_steps=100000 \
--num_warmup_steps=10000 \
--gradient_accumulation_steps=1 \
--learning_rate=1e-5 \
--weight_decay=0.01 \
--max_duration_in_seconds=30.0 \
--min_duration_in_seconds=1.0 \
--model_name_or_path=facebook/wav2vec2-xls-r-300m \
--logging_steps=100 \
--saving_steps=1000 \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--adam_beta1=0.9 \
--adam_beta2=0.98 \
--adam_epsilon=1e-06 \
--gradient_checkpointing \
--wandb_project_name=your-project \
--wandb_run_name=your-run
src/cpt/run_wav2vec2_pretraining_no_trainer.py was adapted from code originally developed by The HuggingFace Inc. team (available under the Apache License, Version 2.0). In this example, the FLEURS dataset (licensed under CC-BY 4.0) is used as the target-domain data.
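Once CPT has finished, the resulting model can be loaded with the standard `transformers` API for inspection or subsequent fine-tuning. A minimal sketch, assuming the script saved the final model to the `--output_dir` used above:

```python
# Minimal sketch: load a continued-pre-training checkpoint.
from transformers import Wav2Vec2ForPreTraining

# "checkpoints" corresponds to the --output_dir passed to the CPT script;
# intermediate checkpoints are written every --saving_steps steps.
model = Wav2Vec2ForPreTraining.from_pretrained("checkpoints")
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```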
To fine-tune a wav2vec 2.0-based model:
python src/ft/train.py --config=src/ft/config.yaml \
data.base_path=data/gos-demo/ \
trainargs.output_dir=gos-demo \
data.train_tsv=train.tsv \
data.eval_tsv=dev.tsv \
data.subset_train.mins=24 \
data.subset_train.seed=4892 \
trainargs.save_steps=500 \
trainargs.load_best_model_at_end=False \
trainargs.per_device_train_batch_size=8 \
trainargs.per_device_eval_batch_size=8 \
trainargs.gradient_accumulation_steps=4
The parameters specified in config.yaml can be changed using dot notation for nested keys. For example, the amount of data used for fine-tuning can be set to 48 minutes using data.subset_train.mins=48. More information is available in the repository of our ComputEL-6 paper.
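The `*.tsv` files referenced by `data.train_tsv` and `data.eval_tsv` pair audio files with their transcriptions. A sketch of the expected layout (column names are illustrative; check the files in `data/gos-demo` for the exact schema):

```
path	transcription
audio/utterance_001.wav	<transcription of utterance 1>
audio/utterance_002.wav	<transcription of utterance 2>
```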
Self-training (ST)
To transcribe unlabeled speech recordings using a wav2vec 2.0-based model:
python src/augmentation/st/aug-st.py --config=src/augmentation/st/config.yaml \
data.base_path=data/gos-demo \
data.transcribe_tsv=train.tsv \
w2v2.model.pretrained_model_name_or_path=bartelds/gos-gpum-cp0_adp0_24m_1e-5_cp-13000
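Internally, transcription amounts to greedy CTC decoding of the wav2vec 2.0 logits. A minimal, self-contained sketch using the standard `transformers` API and the teacher model from the command above (the audio path is hypothetical; wav2vec 2.0 expects 16 kHz mono input):

```python
# Minimal sketch: transcribe a single recording with a fine-tuned wav2vec 2.0 model.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "bartelds/gos-gpum-cp0_adp0_24m_1e-5_cp-13000"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

wav, sr = torchaudio.load("data/gos-demo/audio/utterance_001.wav")  # hypothetical path
wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)   # resample + downmix to mono

inputs = processor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])  # greedy CTC decoding
```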
Text-to-speech (TTS)
To generate synthetic speech from Gronings texts using an existing TTS system (e.g., a FastSpeech2-based system):
python src/augmentation/tts/aug-tts.py --file=data/gos-demo/train.tsv
Supply either a config file or a file with transcriptions.
You can override config parameters using the dot notation mentioned above. For example: python aug-st.py --config=CONFIG_FILE.yml w2v2.model.pretrained_model_name_or_path=your_teacher_model.
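The generated transcriptions (ST) or synthetic recordings (TTS) are then combined with the manually transcribed data and used to fine-tune a new model, as described in the paper. A minimal sketch of that merging step, assuming both files share the same columns (the augmented file name is hypothetical):

```python
# Minimal sketch: merge original and augmented training data into a single TSV.
import pandas as pd

orig = pd.read_csv("data/gos-demo/train.tsv", sep="\t")
aug = pd.read_csv("data/gos-demo/train_aug.tsv", sep="\t")  # hypothetical output of aug-st.py / aug-tts.py

combined = pd.concat([orig, aug], ignore_index=True)
combined.to_csv("data/gos-demo/train_combined.tsv", sep="\t", index=False)
```

Pointing data.train_tsv at the combined file then reruns fine-tuning on the enlarged training set.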
To compute word error rates (WERs) for fine-tuned models:
python eval.py --config=config.yaml \
trainargs.output_dir=test \
env.WANDB_MODE=offline \
data.base_path=data/gos-demo/ \
data.train_tsv=dev.tsv \
data.eval_tsv=test.tsv \
data.subset_train.seed=4892 \
w2v2.model.pretrained_model_name_or_path=bartelds/gos-gpu1-cp0_adp0_192m_5e-4_cp-12500 \
trainargs.per_device_train_batch_size=8 \
trainargs.per_device_eval_batch_size=8 \
trainargs.gradient_accumulation_steps=4
Similarly, you can override config parameters. For example: python eval.py --config=CONFIG_FILE.yml w2v2.model.pretrained_model_name_or_path=another_w2v2_model.
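WER can also be computed directly from reference/hypothesis pairs, e.g. with the `jiwer` package (a minimal sketch; `eval.py` may apply different text normalization):

```python
# Minimal sketch: compute the word error rate for a few utterance pairs.
import jiwer

references = ["this is a test", "another sentence"]
hypotheses = ["this is test", "another sentence"]

# Global WER = total word errors / total reference words = 1 / 6 ≈ 0.167
print(jiwer.wer(references, hypotheses))
```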
All pre-trained and fine-tuned models are available on the Hugging Face Hub 🤗.
Model | Data (min) | Learning rate | Checkpoint (steps) | CPT | Test WER |
---|---|---|---|---|---|
Hugging Face | 24 | 5e-5 | 13000 | yes | 0.301 |
Hugging Face | 24 | 1e-5 | 13000 | no | 0.332 |
Hugging Face | 48 | 1e-5 | 12000 | yes | 0.252 |
Hugging Face | 48 | 1e-4 | 10000 | no | 0.252 |
Hugging Face | 96 | 1e-4 | 11500 | yes | 0.193 |
Hugging Face | 96 | 5e-4 | 11000 | no | 0.202 |
Hugging Face | 192 | 1e-5 | 12000 | yes | 0.144 |
Hugging Face | 192 | 5e-4 | 12500 | no | 0.155 |
Data Augmentation:
Model | Data (min) | Learning rate | Checkpoint (steps) | CPT | Test WER |
---|---|---|---|---|---|
Hugging Face | 24 + 168 ST | 5e-4 | 12500 | yes | 0.282 |
Hugging Face | 24 + 168 ST | 1e-5 | 13000 | no | 0.286 |
Hugging Face | 24 + 2 x 168 ST | 1e-5 | 11000 | no | 0.281 |
Hugging Face | 24 + 4 x 168 ST | 1e-4 | 12000 | no | 0.264 |
Hugging Face | 24 + 168 TTS | 1e-4 | 12500 | no | 0.204 |
Hugging Face | 24 + 2 x 168 TTS | 5e-5 | 11000 | no | 0.209 |
Hugging Face | 24 + 4 x 168 TTS | 5e-4 | 12500 | no | 0.198 |
Hugging Face | 48 + 144 ST | 5e-4 | 11000 | yes | 0.226 |
Hugging Face | 48 + 144 ST | 5e-5 | 8000 | no | 0.230 |
Hugging Face | 96 + 96 ST | 5e-4 | 10000 | yes | 0.183 |
Hugging Face | 96 + 96 ST | 1e-4 | 11000 | no | 0.183 |
Model | Data (min) | Learning rate | Checkpoint (steps) | Test WER |
---|---|---|---|---|
Hugging Face | 24 | 5e-4 | 12000 | 0.457 |
Hugging Face | 48 | 1e-4 | 10000 | 0.382 |
Hugging Face | 96 | 5e-4 | 12500 | 0.307 |
Hugging Face | 192 | 1e-5 | 10000 | 0.261 |
Data Augmentation:
Model | Data (min) | Learning rate | Checkpoint (steps) | CPT | Test WER |
---|---|---|---|---|---|
Hugging Face | 24 + 168 ST | 5e-5 | 12500 | no | 0.428 |
Hugging Face | 48 + 144 ST | 5e-5 | 11500 | no | 0.352 |
Hugging Face | 96 + 96 ST | 5e-4 | 13000 | no | 0.289 |
Model | Data (min) | Learning rate | Checkpoint (steps) | Test WER |
---|---|---|---|---|
Hugging Face | 24 | 1e-5 | 11500 | 0.517 |
Hugging Face | 48 | 5e-5 | 11500 | 0.423 |
Hugging Face | 96 | 1e-5 | 13000 | 0.359 |
Hugging Face | 192 | 5e-4 | 12000 | 0.316 |
Data Augmentation:
Model | Data (min) | Learning rate | Checkpoint (steps) | CPT | Test WER |
---|---|---|---|---|---|
Hugging Face | 24 + 168 ST | 1e-5 | 13000 | no | 0.471 |
Hugging Face | 48 + 144 ST | 5e-4 | 13000 | no | 0.398 |
Hugging Face | 96 + 96 ST | 1e-4 | 13000 | no | 0.359 |
Model | Data (min) | Learning rate | Checkpoint (steps) | Test WER |
---|---|---|---|---|
Hugging Face | 24 | 1e-4 | 10500 | 0.591 |
Hugging Face | 48 | 5e-5 | 13000 | 0.509 |
Hugging Face | 96 | 1e-5 | 12500 | 0.453 |
Hugging Face | 192 | 1e-5 | 12000 | 0.413 |
Data Augmentation:
Model | Data (min) | Learning rate | Checkpoint (steps) | CPT | Test WER |
---|---|---|---|---|---|
Hugging Face | 24 + 168 ST | 1e-4 | 12000 | no | 0.552 |
Hugging Face | 48 + 144 ST | 5e-4 | 11000 | no | 0.485 |
Hugging Face | 96 + 96 ST | 5e-4 | 13000 | no | 0.437 |