MOS-Bench is a benchmark for evaluating the generalization abilities of subjective speech quality assessment (SSQA) models. SHEET stands for the Speech Human Evaluation Estimation Toolkit, and it was designed to conduct research experiments with MOS-Bench.
- MOS-Bench is the first large-scale collection of training and testing datasets for SSQA, covering a wide range of domains, including synthetic speech from text-to-speech (TTS), voice conversion (VC), and singing voice synthesis (SVS) systems, as well as speech distorted with artificial and real noise, clipping, transmission, reverberation, etc. Researchers can use the testing sets to benchmark their SSQA models.
- This repository aims to provide training recipes. While there are many off-the-shelf speech quality evaluators, such as DNSMOS, SpeechMOS, and speechmetrics, most of them do not provide training recipes and are therefore not research-oriented. Newcomers may use this repo as a starting point for SSQA research.
MOS-Bench currently contains 7 training sets and 12 test sets. Below is a screenshot of the summary table from our paper. For more details, please see our paper or egs/README.md.
Models
- LDNet
  - Original repo link: https://github.com/unilight/LDNet
  - Paper link: [arXiv]
  - Example config: egs/bvcc/conf/ldnet-ml.yaml
- SSL-MOS
  - Original repo link: https://github.com/nii-yamagishilab/mos-finetune-ssl/tree/main
  - Paper link: [arXiv]
  - Example config: egs/bvcc/conf/ssl-mos-wav2vec2.yaml
  - Notes: We made some modifications to the original implementation. Please see our paper for more details.
- UTMOS (strong learner)
  - Original repo link: https://github.com/sarulab-speech/UTMOS22/tree/master/strong
  - Paper link: [arXiv]
  - Example config: egs/bvcc/conf/utmos-strong.yaml
  - Notes: After discussion with the first author of UTMOS, Takaaki, we feel that UTMOS = SSL-MOS + listener modeling + contrastive loss + several model architecture and training differences. Takaaki also felt that the phoneme and reference features are not really helpful for UTMOS strong alone. Therefore, we did not implement every component of UTMOS strong; for instance, we did not use domain ID or data augmentation. A rough sketch of the shared core idea is given after this list.
- Modified AlignNet
  - Original repo link: https://github.com/NTIA/alignnet
  - Paper link: [arXiv]
  - Example config: egs/bvcc+nisqa+pstn+singmos+somos+tencent+tmhint-qi/conf/alignnet-wav2vec2.yaml
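To make the SSL-MOS and UTMOS notes above more concrete, here is a minimal PyTorch sketch of the shared core idea: an SSL encoder whose frame-level outputs are mean-pooled and passed through a linear head, trained with an L1 loss plus a pairwise contrastive (rank-difference) loss. This is an illustration under our own assumptions (it uses HuggingFace `transformers` and `facebook/wav2vec2-base` for brevity, whereas SHEET relies on S3PRL), not the actual code or configs in this repo.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # assumption: illustration only; SHEET itself uses S3PRL


class TinySSLMOS(nn.Module):
    """Minimal SSL-MOS-style predictor: SSL encoder -> mean pooling -> linear head."""

    def __init__(self, ssl_name="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(ssl_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, wav):  # wav: [batch, num_samples], 16 kHz mono assumed
        frames = self.encoder(wav).last_hidden_state  # [batch, frames, hidden]
        pooled = frames.mean(dim=1)                   # utterance-level embedding
        return self.head(pooled).squeeze(-1)          # [batch] predicted scores


def contrastive_loss(pred, target, margin=0.1):
    """UTMOS-style pairwise loss: the difference between two predicted scores
    should match the difference between their ground-truth scores (within a margin)."""
    pred_diff = pred.unsqueeze(0) - pred.unsqueeze(1)      # [batch, batch]
    target_diff = target.unsqueeze(0) - target.unsqueeze(1)
    return torch.clamp((pred_diff - target_diff).abs() - margin, min=0).mean()


# toy training step with random audio and made-up MOS labels
model = TinySSLMOS()
wav = torch.randn(4, 16000)                     # four random 1-second "utterances"
mos = torch.tensor([3.2, 4.1, 2.5, 3.8])
pred = model(wav)
loss = nn.functional.l1_loss(pred, mos) + contrastive_loss(pred, mos)
loss.backward()
```

Listener modeling (as in LDNet and UTMOS) would additionally condition the head on a listener embedding; that part is omitted here for brevity.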
Features
- Modeling
  - Listener modeling
  - Self-supervised learning (SSL) based encoder, supported by S3PRL
    - Find the complete list of supported SSL models here.
- Training
  - Automatic best-n model saving and early stopping based on a given validation criterion
  - Visualization, including the predicted score distribution and scatter plots of utterance-level and system-level scores
  - Model averaging (see the sketch below)
  - Model ensembling by stacking
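As an illustration of the model averaging feature, the following is a minimal sketch of checkpoint averaging: the parameters of the best-n saved checkpoints are averaged into a single model. The file paths are hypothetical and this is not SHEET's actual implementation.

```python
import torch


def average_checkpoints(ckpt_paths):
    """Average the parameters of several checkpoints.

    Assumes each file stores a plain state dict of floating-point tensors.
    """
    avg_state = None
    for path in ckpt_paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg_state:
                avg_state[k] += state[k].float()
    return {k: v / len(ckpt_paths) for k, v in avg_state.items()}


# usage sketch (hypothetical paths): average the best 5 checkpoints, then load them
# averaged = average_checkpoints([f"exp/checkpoint-best{i}.pth" for i in range(1, 6)])
# model.load_state_dict(averaged)
```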
If you want to train your own SSQA model, you are in the right place! This is the main purpose of SHEET.
We provide complete experiment recipes, i.e., sets of scripts to download and process the datasets, and to train and evaluate models. This structure originated from Kaldi, and is also used in many speech processing repositories (ESPnet, ParallelWaveGAN, etc.).
Please follow the installation instructions first, then see egs/README.md for how to start.
If you only want to benchmark your own SSQA model, we provide scripts to conveniently collect the test sets. These scripts can be run on Linux-like platforms with only basic Python requirements, so you do not need to install heavy packages like PyTorch.
Please see the related section in egs/README.md for detailed instructions.
We utilize torch.hub to provide a convenient way to load pre-trained SSQA models and predict scores for wav files or torch tensors.
>>> import torch
# load the pre-trained model
>>> predictor = torch.hub.load("unilight/sheet:v0.1.0", "default", trust_repo=True, force_reload=True)
# you can either provide a path to your wav file
>>> predictor.predict(wav_path="/path/to/wav/file.wav")
3.6066928
# or provide a torch tensor with shape [num_samples]
>>> predictor.predict(wav=torch.rand(16000))
1.5806346
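Building on the interface above, here is a small hypothetical helper that scores every wav file in a directory with the same predict(wav_path=...) call; the directory path is just a placeholder.

```python
import glob
import os

import torch

# load the pre-trained predictor once and reuse it (same call as above)
predictor = torch.hub.load("unilight/sheet:v0.1.0", "default", trust_repo=True)

# hypothetical directory of wav files to score
for wav_path in sorted(glob.glob(os.path.join("/path/to/wav/dir", "*.wav"))):
    score = predictor.predict(wav_path=wav_path)
    print(f"{wav_path}\t{score:.3f}")
```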
Or you can try out our HuggingFace Spaces Demo!
You don't need to prepare an environment (using conda, etc.) first. The following commands will automatically construct a virtual environment in tools/. When you run the recipes, the scripts will automatically activate the virtual environment.
git clone https://github.com/unilight/sheet.git
cd sheet/tools
make
If you use the training scripts, benchmarking scripts or pre-trained models from this project, please consider citing the following paper.
@article{huang2024,
title={MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models},
author={Wen-Chin Huang and Erica Cooper and Tomoki Toda},
year={2024},
eprint={2411.03715},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2411.03715},
}
This repo is greatly inspired by the following repos; in fact, many code snippets are taken directly from them.
Wen-Chin Huang
Toda Laboratory, Nagoya University
E-mail: wen.chinhuang@g.sp.m.is.nagoya-u.ac.jp