SpeechBrain Performance Report

This document provides an overview of the performance achieved on key datasets and tasks supported by SpeechBrain.

AISHELL-1 Dataset

ASR

Model	Checkpoints	HuggingFace	Test-CER
`recipes/AISHELL-1/ASR/CTC/hparams/train_with_wav2vec.yaml`	here	here	5.06%
`recipes/AISHELL-1/ASR/seq2seq/hparams/train.yaml`	here	-	7.51%
`recipes/AISHELL-1/ASR/transformer/hparams/train_ASR_transformer.yaml`	here	here	6.04%
`recipes/AISHELL-1/ASR/transformer/hparams/train_ASR_transformer_with_wav2vect.yaml`	here	here	5.58%
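The CER, WER, and PER figures throughout this report are edit-distance error rates: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, as a percentage of the reference length. The sketch below illustrates that definition for a single utterance. It is not SpeechBrain's scoring code, and the function names (`edit_distance`, `error_rate`, `wer`, `cer`) are hypothetical. Corpus-level scores are normally obtained by summing edits and reference lengths over all utterances before dividing.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to hyp[:j]
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate: tokens are individual characters."""
    return error_rate(list(ref), list(hyp))
```

For example, `wer("a b c d", "a c d")` counts one deleted word out of four reference words. Because insertions count as errors, the rate can exceed 100%.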

Aishell1Mix Dataset

Separation

Model	Checkpoints	HuggingFace	SI-SNRi
`recipes/Aishell1Mix/separation/hparams/sepformer-aishell1mix2.yaml`	here	-	13.4dB
`recipes/Aishell1Mix/separation/hparams/sepformer-aishell1mix3.yaml`	here	-	11.2dB
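The separation tables report SI-SNRi, the improvement in scale-invariant signal-to-noise ratio (in dB) of the separated estimate over the unprocessed mixture. The following is a minimal pure-Python sketch of the standard definition for one source, assuming equal-length signals given as lists of floats. The names `si_snr` and `si_snr_improvement` are illustrative and are not SpeechBrain's API; the recipes compute the metric with their own tensor-based code.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB of `estimate` with respect to `reference`."""
    n = len(reference)
    if len(estimate) != n or n == 0:
        raise ValueError("signals must be non-empty and of equal length")
    # Remove the mean so that a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to obtain the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Everything not explained by the reference is treated as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the estimate over the raw mixture, in dB."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```

Rescaling the estimate by a constant leaves the score essentially unchanged, which is why the metric is called scale-invariant. For mixtures of several speakers, the per-source values are typically averaged after choosing the best assignment of estimates to references.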

BinauralWSJ0Mix Dataset

Separation

Model	Checkpoints	HuggingFace	SI-SNRi
`recipes/BinauralWSJ0Mix/separation/hparams/convtasnet-cross.yaml`	here	-	12.39dB
`recipes/BinauralWSJ0Mix/separation/hparams/convtasnet-independent.yaml`	here	-	11.90dB
`recipes/BinauralWSJ0Mix/separation/hparams/convtasnet-parallel-noise.yaml`	here	-	18.25dB
`recipes/BinauralWSJ0Mix/separation/hparams/convtasnet-parallel-reverb.yaml`	here	-	6.95dB
`recipes/BinauralWSJ0Mix/separation/hparams/convtasnet-parallel.yaml`	here	-	16.93dB

CVSS Dataset

S2ST

Model	Checkpoints	HuggingFace	Test-sacrebleu
`recipes/CVSS/S2ST/hparams/train_fr-en.yaml`	here	here	24.47

CommonLanguage Dataset

Language-id

Model	Checkpoints	HuggingFace	Error
`recipes/CommonLanguage/lang_id/hparams/train_ecapa_tdnn.yaml`	here	here	15.1%

CommonVoice Dataset

ASR-seq2seq

Model	Checkpoints	HuggingFace	Test-WER
`recipes/CommonVoice/ASR/seq2seq/hparams/train_de.yaml`	here	here	12.25%
`recipes/CommonVoice/ASR/seq2seq/hparams/train_en.yaml`	here	here	23.88%
`recipes/CommonVoice/ASR/seq2seq/hparams/train_fr.yaml`	here	here	14.88%
`recipes/CommonVoice/ASR/seq2seq/hparams/train_it.yaml`	here	here	17.02%
`recipes/CommonVoice/ASR/seq2seq/hparams/train_rw.yaml`	here	here	29.22%
`recipes/CommonVoice/ASR/seq2seq/hparams/train_es.yaml`	here	here	14.77%

ASR-CTC

Model	Checkpoints	HuggingFace	Test-WER
`recipes/CommonVoice/ASR/CTC/hparams/train_en_with_wav2vec.yaml`	here	-	16.16%
`recipes/CommonVoice/ASR/CTC/hparams/train_fr_with_wav2vec.yaml`	here	here	9.71%
`recipes/CommonVoice/ASR/CTC/hparams/train_it_with_wav2vec.yaml`	here	here	7.99%
`recipes/CommonVoice/ASR/CTC/hparams/train_rw_with_wav2vec.yaml`	here	here	22.52%
`recipes/CommonVoice/ASR/CTC/hparams/train_de_with_wav2vec.yaml`	here	here	8.39%
`recipes/CommonVoice/ASR/CTC/hparams/train_ar_with_wav2vec.yaml`	here	here	28.53%
`recipes/CommonVoice/ASR/CTC/hparams/train_es_with_wav2vec.yaml`	here	here	12.67%
`recipes/CommonVoice/ASR/CTC/hparams/train_pt_with_wav2vec.yaml`	here	here	21.69%
`recipes/CommonVoice/ASR/CTC/hparams/train_zh-CN_with_wav2vec.yaml`	here	here	23.17%

ASR-transformer

Model	Checkpoints	HuggingFace	Test-WER
`recipes/CommonVoice/ASR/transformer/hparams/train_hf_whisper.yaml`	-	-	16.96%

DNS Dataset

Enhancement

Model	Checkpoints	HuggingFace	valid-PESQ	test-SIG	test-BAK	test-OVRL
`recipes/DNS/enhancement/hparams/sepformer-dns-16k.yaml`	here	here	2.06	2.999	3.076	2.437

DVoice Dataset

ASR-CTC

Model	Checkpoints	HuggingFace	Test-WER
`recipes/DVoice/ASR/CTC/hparams/train_amh_with_wav2vec.yaml`	here	here	24.92%
`recipes/DVoice/ASR/CTC/hparams/train_dar_with_wav2vec.yaml`	here	here	18.28%
`recipes/DVoice/ASR/CTC/hparams/train_fon_with_wav2vec.yaml`	here	here	9.00%
`recipes/DVoice/ASR/CTC/hparams/train_sw_with_wav2vec.yaml`	here	here	23.16%
`recipes/DVoice/ASR/CTC/hparams/train_wol_with_wav2vec.yaml`	here	here	16.05%

Multilingual-ASR-CTC

Model	Checkpoints	HuggingFace	WER-Darija	WER-Swahili	WER-Fongbe	WER-Wolof	WER-Amharic
`recipes/DVoice/ASR/CTC/hparams/train_multi_with_wav2vec.yaml`	here	-	13.27%	29.31%	10.26%	21.54%	31.15%

ESC50 Dataset

SoundClassification

Model	Checkpoints	HuggingFace	Accuracy
`recipes/ESC50/classification/hparams/cnn14.yaml`	here	-	82%
`recipes/ESC50/classification/hparams/conv2d.yaml`	here	-	75%

Fisher-Callhome-Spanish Dataset

Speech_Translation

Model	Checkpoints	HuggingFace	Test-sacrebleu
`recipes/Fisher-Callhome-Spanish/ST/transformer/hparams/transformer.yaml`	here	-	47.31
`recipes/Fisher-Callhome-Spanish/ST/transformer/hparams/conformer.yaml`	here	-	48.04

Google-speech-commands Dataset

Command_recognition

Model	Checkpoints	HuggingFace	Test-accuracy
`recipes/Google-speech-commands/hparams/xvect.yaml`	here	here	97.43%
`recipes/Google-speech-commands/hparams/xvect_leaf.yaml`	here	-	96.79%

IEMOCAP Dataset

Emotion_recognition

Model	Checkpoints	HuggingFace	Test-Accuracy
`recipes/IEMOCAP/emotion_recognition/hparams/train_with_wav2vec2.yaml`	here	here	65.7%
`recipes/IEMOCAP/emotion_recognition/hparams/train.yaml`	here	-	77.0%

IWSLT22_lowresource Dataset

Speech_Translation

Model	Checkpoints	HuggingFace	Test-BLEU
`recipes/IWSLT22_lowresource/AST/transformer/hparams/train_w2v2_mbart_st.yaml`	here	-	7.73
`recipes/IWSLT22_lowresource/AST/transformer/hparams/train_w2v2_nllb_st.yaml`	here	-	8.70
`recipes/IWSLT22_lowresource/AST/transformer/hparams/train_samu_mbart_st.yaml`	here	-	10.28
`recipes/IWSLT22_lowresource/AST/transformer/hparams/train_samu_nllb_st.yaml`	here	-	11.32

KsponSpeech Dataset

ASR

Model	Checkpoints	HuggingFace	clean-WER	others-WER
`recipes/KsponSpeech/ASR/transformer/hparams/conformer_medium.yaml`	here	here	20.78%	25.73%

LibriMix Dataset

Separation

Model	Checkpoints	HuggingFace	SI-SNR
`recipes/LibriMix/separation/hparams/sepformer-libri2mix.yaml`	here	-	20.4dB
`recipes/LibriMix/separation/hparams/sepformer-libri3mix.yaml`	here	-	19.0dB

LibriParty Dataset

VAD

Model	Checkpoints	HuggingFace	Test-Precision	Recall	F-Score
`recipes/LibriParty/VAD/hparams/train.yaml`	here	here	0.9518	0.9437	0.9477

LibriSpeech Dataset

ASR-Transformers

Model	Checkpoints	HuggingFace	Test_clean-WER	Test_other-WER
`recipes/LibriSpeech/ASR/transformer/hparams/conformer_small.yaml`	here	here	2.49%	6.10%
`recipes/LibriSpeech/ASR/transformer/hparams/transformer.yaml`	here	here	2.27%	5.53%
`recipes/LibriSpeech/ASR/transformer/hparams/conformer_large.yaml`	here	-	2.01%	4.52%
`recipes/LibriSpeech/ASR/transformer/hparams/branchformer_large.yaml`	here	-	2.04%	4.12%
`recipes/LibriSpeech/ASR/transformer/hparams/hyperconformer_22M.yaml`	here	-	2.23%	4.54%
`recipes/LibriSpeech/ASR/transformer/hparams/hyperconformer_8M.yaml`	here	-	2.55%	6.61%
`recipes/LibriSpeech/ASR/transformer/hparams/hyperbranchformer_25M.yaml`	-	-	2.36%	6.89%
`recipes/LibriSpeech/ASR/transformer/hparams/hyperbranchformer_13M.yaml`	-	-	2.54%	6.58%
`recipes/LibriSpeech/ASR/transformer/hparams/train_hf_whisper.yaml`	-	-	-	-
`recipes/LibriSpeech/ASR/transformer/hparams/bayesspeech.yaml`	here	-	2.84%	6.27%

ASR-Transducers

Model	Checkpoints	HuggingFace	Test_clean-WER	Test_other-WER
`recipes/LibriSpeech/ASR/transducer/hparams/conformer_transducer.yaml`	here	-	2.72%	6.47%

ASR-CTC

Model	Checkpoints	HuggingFace	Test_clean-WER	Test_other-WER
`recipes/LibriSpeech/ASR/CTC/hparams/train_hf_wav2vec.yaml`	here	here	1.65%	3.67%
`recipes/LibriSpeech/ASR/CTC/hparams/train_hf_wav2vec_transformer_rescoring.yaml`	here	-	1.57%	3.37%

G2P

Model	Checkpoints	HuggingFace	PER-Test
`recipes/LibriSpeech/G2P/hparams/hparams_g2p_rnn.yaml`	here	-	2.72%
`recipes/LibriSpeech/G2P/hparams/hparams_g2p_transformer.yaml`	here	here	2.89%

ASR-Seq2Seq

Model	Checkpoints	HuggingFace	Test_clean-WER	Test_other-WER
`recipes/LibriSpeech/ASR/seq2seq/hparams/train_BPE_5000.yaml`	here	here	2.89%	8.09%

MEDIA Dataset

ASR

Model	Checkpoints	HuggingFace	Test-ChER	Test-CER
`recipes/MEDIA/ASR/CTC/hparams/train_hf_wav2vec.yaml`	-	here	7.78%	4.78%

SLU

Model	Checkpoints	HuggingFace	Test-ChER	Test-CER	Test-CVER
`recipes/MEDIA/SLU/CTC/hparams/train_hf_wav2vec_full.yaml`	-	here	7.46%	20.10%	31.41%
`recipes/MEDIA/SLU/CTC/hparams/train_hf_wav2vec_relax.yaml`	-	here	7.78%	24.88%	35.77%

MultiWOZ Dataset

Response-Generation

Model	Checkpoints	HuggingFace	Test-PPL	Test_BLEU-4
`recipes/MultiWOZ/response_generation/gpt/hparams/train_gpt.yaml`	here	here	4.01	2.54e-04
`recipes/MultiWOZ/response_generation/llama2/hparams/train_llama2.yaml`	here	here	2.90	7.45e-04

REAL-M Dataset

Sisnr-estimation

Model	Checkpoints	HuggingFace	L1-Error
`recipes/REAL-M/sisnr-estimation/hparams/pool_sisnrestimator.yaml`	here	here	1.71dB

RescueSpeech Dataset

ASR+enhancement

Model	Checkpoints	HuggingFace	SISNRi	SDRi	PESQ	STOI	WER
`recipes/RescueSpeech/ASR/noise-robust/hparams/robust_asr_16k.yaml`	here	here	7.482	8.011	2.083	0.854	45.29%

SLURP Dataset

SLU

Model	Checkpoints	HuggingFace	scenario-accuracy	action-accuracy	intent-accuracy
`recipes/SLURP/NLU/hparams/train.yaml`	here	-	90.81%	88.29%	87.28%
`recipes/SLURP/direct/hparams/train.yaml`	here	-	81.73%	77.11%	75.05%
`recipes/SLURP/direct/hparams/train_with_wav2vec2.yaml`	here	here	91.24%	88.47%	87.55%

Switchboard Dataset

ASR

Model	Checkpoints	HuggingFace	Swbd-WER	Callhome-WER	Eval2000-WER
`recipes/Switchboard/ASR/CTC/hparams/train_with_wav2vec.yaml`	-	here	8.76%	14.67%	11.78%
`recipes/Switchboard/ASR/seq2seq/hparams/train_BPE_2000.yaml`	-	here	16.90%	25.12%	20.71%
`recipes/Switchboard/ASR/transformer/hparams/transformer.yaml`	-	here	9.80%	17.89%	13.94%

TIMIT Dataset

ASR

Model	Checkpoints	HuggingFace	Test-PER
`recipes/TIMIT/ASR/CTC/hparams/train.yaml`	here	-	14.78%
`recipes/TIMIT/ASR/seq2seq/hparams/train.yaml`	here	-	14.07%
`recipes/TIMIT/ASR/seq2seq/hparams/train_with_wav2vec2.yaml`	here	-	8.04%
`recipes/TIMIT/ASR/transducer/hparams/train.yaml`	here	-	14.12%
`recipes/TIMIT/ASR/transducer/hparams/train_wav2vec.yaml`	here	-	8.91%

Tedlium2 Dataset

ASR

Model	Checkpoints	HuggingFace	Test-WER_No_LM
`recipes/Tedlium2/ASR/transformer/hparams/branchformer_large.yaml`	here	here	8.11%

UrbanSound8k Dataset

SoundClassification

Model	Checkpoints	HuggingFace	Accuracy
`recipes/UrbanSound8k/SoundClassification/hparams/train_ecapa_tdnn.yaml`	here	here	75.4%

Voicebank Dataset

Dereverberation

Model	Checkpoints	HuggingFace	PESQ
`recipes/Voicebank/dereverb/MetricGAN-U/hparams/train_dereverb.yaml`	here	-	2.07
`recipes/Voicebank/dereverb/spectral_mask/hparams/train.yaml`	here	-	2.35

ASR

Model	Checkpoints	HuggingFace	Test-PER
`recipes/Voicebank/ASR/CTC/hparams/train.yaml`	here	-	10.12%

ASR+enhancement

Model	Checkpoints	HuggingFace	PESQ	COVL	test-WER
`recipes/Voicebank/MTL/ASR_enhance/hparams/robust_asr.yaml`	here	here	3.05	3.74	2.80%

Enhancement

Model	Checkpoints	HuggingFace	PESQ
`recipes/Voicebank/enhance/MetricGAN/hparams/train.yaml`	here	here	3.15
`recipes/Voicebank/enhance/SEGAN/hparams/train.yaml`	here	-	2.38
`recipes/Voicebank/enhance/spectral_mask/hparams/train.yaml`	here	-	2.65

VoxCeleb Dataset

Speaker_recognition

Model	Checkpoints	HuggingFace	EER
`recipes/VoxCeleb/SpeakerRec/hparams/train_ecapa_tdnn.yaml`	here	here	0.80%
`recipes/VoxCeleb/SpeakerRec/hparams/train_x_vectors.yaml`	here	here	3.23%
`recipes/VoxCeleb/SpeakerRec/hparams/train_resnet.yaml`	here	here	0.95%

VoxLingua107 Dataset

Language-id

Model	Checkpoints	HuggingFace	Accuracy
`recipes/VoxLingua107/lang_id/hparams/train_ecapa.yaml`	here	here	93.3%

VoxPopuli Dataset

WHAMandWHAMR Dataset

Separation

Model	Checkpoints	HuggingFace	SI-SNR
`recipes/WHAMandWHAMR/separation/hparams/sepformer-wham.yaml`	here	here	16.5dB
`recipes/WHAMandWHAMR/separation/hparams/sepformer-whamr.yaml`	here	here	14.0dB

Enhancement

Model	Checkpoints	HuggingFace	SI-SNR	PESQ
`recipes/WHAMandWHAMR/enhancement/hparams/sepformer-wham.yaml`	here	here	14.4dB	3.05
`recipes/WHAMandWHAMR/enhancement/hparams/sepformer-whamr.yaml`	here	here	10.6dB	2.84

WSJ0Mix Dataset

Separation (2mix)

Model	Checkpoints	HuggingFace	SI-SNRi
`recipes/WSJ0Mix/separation/hparams/convtasnet.yaml`	here	-	14.8dB
`recipes/WSJ0Mix/separation/hparams/dprnn.yaml`	here	-	18.5dB
`recipes/WSJ0Mix/separation/hparams/resepformer.yaml`	here	here	18.6dB
`recipes/WSJ0Mix/separation/hparams/sepformer.yaml`	here	here	22.4dB
`recipes/WSJ0Mix/separation/hparams/skim.yaml`	here	here	18.1dB

ZaionEmotionDataset Dataset

Emotion_Diarization

Model	Checkpoints	HuggingFace	EDER
`recipes/ZaionEmotionDataset/emotion_diarization/hparams/train.yaml`	here	here	30.2%

fluent-speech-commands Dataset

SLU

Model	Checkpoints	HuggingFace	Test-accuracy
`recipes/fluent-speech-commands/direct/hparams/train.yaml`	here	-	99.60%

timers-and-such Dataset

SLU

Model	Checkpoints	HuggingFace	Accuracy-Test_real
`recipes/timers-and-such/decoupled/hparams/train_TAS_LM.yaml`	here	-	46.8%
`recipes/timers-and-such/direct/hparams/train.yaml`	here	here	77.5%
`recipes/timers-and-such/direct/hparams/train_with_wav2vec2.yaml`	here	-	94.0%
`recipes/timers-and-such/multistage/hparams/train_TAS_LM.yaml`	here	-	72.6%
