ATTEST is a powerful evaluation framework designed to streamline the analysis of (synthesized) speech by integrating a variety of metrics across multiple dimensions. It consolidates speech evaluation into five distinct categories, each equipped with a set of metrics to thoroughly assess various aspects of speech quality:
- Speech Intelligibility: Focuses on how accurately a TTS model reproduces the intended text, emphasizing the clarity with which spoken words are understood. The primary metrics used include CER (Character Error Rate), WER (Word Error Rate), and PER (Phoneme Error Rate).
- Speech Prosody: Assesses the naturalness and expressiveness of speech prosody, using pitch analysis metrics such as voicing decision error (VDE), gross pitch error (GPE), F0 frame error (FFE), and logarithmic fundamental frequency root mean square error (logF0 RMSE). ATTEST supports multiple pitch extraction engines, including Parselmouth, PyWorld, and CREPE.
- Speaker Similarity: Measures how closely the synthesized voice matches the target speaker, crucial for applications like voice cloning. Metrics for this include speaker similarity based on a comparison of embeddings obtained using the ECAPA-TDNN speaker verification model.
- Signal Quality: Analyzes the overall audio quality and intelligibility of the speech signal with metrics like PESQ, STOI, and TorchAudio-Squim.
- MOS Prediction: Uses metrics such as UTMOS and SpeechBERTScore to predict Mean Opinion Scores, simulating subjective listening tests through objective analysis.
By organizing metrics into five categories, ATTEST makes it easier to assess different qualities of speech. This structure helps developers see where a TTS model performs well and where it needs improvement, providing insights for further refinement.
Before installing, check the extra packages section as you may want to expand the default requirements.
Set up a local environment using Python 3.10:
python3.10 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
- PyTorch Installation: If you encounter issues with installing PyTorch, please refer to the official PyTorch installation guide.
- Espeak Phonemizer: To use the `espeak-phonemizer` backend for phonemization, you need to install the `espeak-ng` system dependency. Detailed installation instructions are available in the Phonemizer installation guide.
- Nemo Text Normalization for CER, WER, PER, Character Distance, and Phoneme Distance: Install `nemo-text-processing==0.2.2rc0`. By default, text normalization is not applied when computing metrics that rely on text comparison. If you encounter issues on macOS or Windows, refer to the official installation guide.
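For example, on a Debian/Ubuntu system the optional extras above can be added on top of the base requirements roughly as follows (the package manager command is an assumption; only the `nemo-text-processing` pin comes from this guide):

```bash
# Optional: text normalization backend for CER, WER, PER, character distance, and phoneme distance
pip install nemo-text-processing==0.2.2rc0

# Optional: system dependency for the espeak-phonemizer backend (Debian/Ubuntu example)
sudo apt-get install espeak-ng
```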
To start the application with a user interface (UI), use the following command:
python3 run.py ui
ATTEST offers three primary methods for evaluation and analysis:
- Evaluate: Use this method to analyze a single project in detail. It provides both a general overview and detailed information on individual examples.
- Compare: Use this method for comparing two projects side by side. It provides both a general overview and detailed information on individual examples.
- Multiple Compare: Use this method for comparing several projects at once. It provides an overview of the selected metrics in both table and graph formats.
In the context of ATTEST, a "project" refers to a dataset containing real or synthetic recordings. Each project is organized into two main folders:
- meta: Contains a filelist.txt file, which describes the recordings. This file has two columns separated by a | symbol:
  - The first column provides the relative path to the audio file within the wavs folder.
  - The second column contains the text that is spoken in the corresponding audio file.
- wavs: Contains the actual audio files. These files should be in WAV format with a single audio channel (mono) and a sampling rate of 22050 Hz.
Projects can be organized into groups. For example, in the egs/Demo_ZSTTS directory, Demo_ZSTTS serves as the group name, containing multiple related projects.
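For illustration, a project inside the Demo_ZSTTS group might be laid out as follows (the project name, file names, and sentences are hypothetical; only the meta/wavs structure and the | separator are prescribed):

```
egs/Demo_ZSTTS/MyTTS/
├── meta/
│   └── filelist.txt
└── wavs/
    ├── 0001.wav
    └── 0002.wav
```

with meta/filelist.txt containing lines such as:

```
0001.wav|The quick brown fox jumps over the lazy dog.
0002.wav|Speech synthesis has improved considerably in recent years.
```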
ATTEST provides a variety of features to evaluate different aspects of speech quality. These features are categorized into two types:
- Metric: A numerical value computed from the sample or by comparing the sample with a reference sample.
- Attribute: A property extracted from the sample. It can be text, audio, an image, a numerical value, or a more complex object.
You can enable and disable features in the sidebar of the UI. The set of features enabled by default is configured in attest/config/config.yaml.
Below is a list of the metrics available in ATTEST. Each metric also has an identifier that can be used with the command line interface (CLI). Some metrics require a reference sample, and some can be accelerated with a GPU, as indicated.
- MOS Prediction
  - UTMOS (CLI Identifier: `utmos`, GPU preferred): Predicts the Mean Opinion Score (MOS) to assess the overall perceived quality of synthesized speech.
  - SpeechBERTScore (CLI Identifier: `speech_bert_score`, Reference required, GPU preferred): Measures the similarity between synthesized and reference speech by comparing their contextualized embeddings derived from a WavLM model.
  - Squim MOS (CLI Identifier: `squim_mos`, Reference required, GPU preferred): Estimation of the subjective Mean Opinion Score (MOS) for speech enhancement from Torchaudio.
- Speech intelligibility
  - CER (Character Error Rate) (CLI Identifier: `cer`, GPU preferred): The percentage of characters that were incorrectly predicted by the Whisper speech recognition model compared to the original text.
  - WER (Word Error Rate) (CLI Identifier: `wer`, GPU preferred): The percentage of words that were incorrectly predicted by the Whisper speech recognition model compared to the original text.
  - PER (Phoneme Error Rate) (CLI Identifier: `per`, GPU preferred): The percentage of phonemes that were incorrectly predicted, calculated by using the Whisper speech recognition model and grapheme-to-phoneme (G2P) conversion to compare the phonemes of the original text and the transcription.
  - Character distance (CLI Identifier: `character_distance`, GPU preferred): The number of distinct symbols between the original text and the transcription obtained from the Whisper speech recognition model.
  - Phoneme distance (CLI Identifier: `phoneme_distance`, GPU preferred): The number of distinct phonemes between the original text and the transcription obtained from the Whisper speech recognition model.
- Speech intonation
  - VDE (CLI Identifier: `vde`, Reference required, GPU preferred if the torchcrepe model is used for pitch extraction): Voicing Decision Error, as defined in "Reducing F0 Frame Error of F0 tracking algorithms under noisy conditions with an unvoiced/voiced classification frontend".
  - GPE (CLI Identifier: `gpe`, Reference required, GPU preferred if the torchcrepe model is used for pitch extraction): Gross Pitch Error, as defined in the same paper.
  - FFE (CLI Identifier: `ffe`, Reference required, GPU preferred if the torchcrepe model is used for pitch extraction): F0 Frame Error, as defined in the same paper.
  - logF0 RMSE (CLI Identifier: `logf0_rmse`, Reference required, GPU preferred if the torchcrepe model is used for pitch extraction): The root mean square error of the logarithmic fundamental frequency (F0) between synthesized and reference speech.
- Signal quality
  - Squim STOI (CLI Identifier: `squim_stoi`, GPU preferred): Reference-free estimation of Short-Time Objective Intelligibility (STOI) from Torchaudio.
  - Squim PESQ (CLI Identifier: `squim_pesq`, GPU preferred): Reference-free estimation of Wideband Perceptual Estimation of Speech Quality (PESQ) from Torchaudio.
  - Squim SI-SDR (CLI Identifier: `squim_sisdr`, GPU preferred): Reference-free estimation of Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) from Torchaudio.
- Speaker similarity
  - Speaker Similarity (ECAPA-TDNN) (CLI Identifier: `sim_ecapa`, Reference required, GPU preferred): Calculates the similarity between synthesized and reference voices using the ECAPA-TDNN speaker verification model to assess how closely the synthesized voice matches the target speaker.
Below is a list of the attributes available in ATTEST. Each attribute also has an identifier that can be used with the command line interface (CLI).
- Audio (CLI Identifier: `audio`): The actual audio waveform.
- Text (CLI Identifier: `text`): The original written text intended to be synthesized or spoken in the audio.
- Text (normalized) (CLI Identifier: `text_norm`): The processed version of the text; the normalization method is specified in the config (for the CLI) or the settings tab (UI).
- Text phonemes (CLI Identifier: `text_phonemes`): The phonetic transcription of the text.
- Transcript (CLI Identifier: `transcript`): The text derived from ASR.
- Transcript phonemes (CLI Identifier: `transcript_phonemes`): The phonetic transcription of the ASR-generated transcript.
- Grapheme pronunciation speed (CLI Identifier: `pronunciation_speed`): The rate at which characters or letters are spoken in the audio, measured in characters per second.
- Phoneme pronunciation speed (CLI Identifier: `pronunciation_speed_phonemes`): The pronunciation rate, measured in phonemes per second.
- Audio duration (CLI Identifier: `audio_duration`): The duration of the actual audio waveform.
- Speech duration (CLI Identifier: `speech_duration`): The duration of the actual spoken content within the audio, excluding silence at the beginning and at the end.
- Silence in the beginning (CLI Identifier: `silence_begin`): The duration of silence at the beginning.
- Silence in the end (CLI Identifier: `silence_end`): The duration of silence at the end.
- Pitch mean (CLI Identifier: `pitch_mean`, GPU preferred if the torchcrepe model is used for pitch extraction): The average fundamental frequency (F0) across the speech sample, indicating the overall pitch level.
- Pitch std (CLI Identifier: `pitch_std`, GPU preferred if the torchcrepe model is used for pitch extraction): The standard deviation of the fundamental frequency, reflecting pitch variability.
- Pitch plot (CLI Identifier: `pitch_plot`, GPU preferred if the torchcrepe model is used for pitch extraction): A visual representation of the pitch contour over time, illustrating intonation patterns and prosody.
- Wavelet prosody plot (CLI Identifier: `wavelet_prosody`): A graphical depiction of prosodic features such as pitch, energy, and wavelets, generated using the wavelet_prosody_toolkit project.
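Attributes are requested through the same CLI as metrics, using their identifiers (the command form is taken from the CLI examples later in this guide; the particular attribute selection and project path are only an illustration):

```bash
# Compute a few attributes for the demo project
python3 run.py evaluate --project egs/Demo_ZSTTS/StyleTTS2 --features audio_duration pitch_mean pronunciation_speed
```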
ATTEST provides metrics that vary in language compatibility:
- Language-independent metrics: Metrics such as VDE, GPE, FFE, and logF0 RMSE are language-independent, as they reflect properties unrelated to any specific language.
- Applicable to all languages: Metrics like UTMOS, SpeechBERTScore, Speaker Similarity (ECAPA-TDNN), and the Squim family (STOI, PESQ, SI-SDR, MOS) use components trained primarily on English data. However, since these metrics reflect largely language-independent properties, they can generalize to audio in other languages.
- Language-specific metrics: Metrics that rely on ASR and grapheme-to-phoneme conversion, such as CER, WER, PER, character distance, and phoneme distance, are limited by the language coverage of the underlying models.
Refer to the metrics table for a detailed view of language compatibility.
This section provides detailed examples of how to use the ATTEST CLI for specific tasks.
- Evaluate a single project using specific features:
python3 run.py evaluate --project <your_project> --features <first feature> <second feature> ... <last feature> [--output <output_file>]
Replace `<your_project>` with the path to your project and `<first feature>`, `<second feature>`, etc., with the names of the features you want to evaluate. If the `--output` option is specified, the results will be saved in JSON format to the given file.

Example: You can evaluate the egs/Demo_ZSTTS/StyleTTS2 project using the UTMOS metric as follows:
python3 run.py evaluate --project egs/Demo_ZSTTS/StyleTTS2 --features utmos
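To also persist the results, the same command accepts the `--output` option described above (the feature combination and output file name here are arbitrary):

```bash
python3 run.py evaluate --project egs/Demo_ZSTTS/StyleTTS2 --features utmos wer --output evaluate_results.json
```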
- Compare two projects by specifying the features to be evaluated:
python3 run.py compare --project1 <project_1> --project2 <project_2> --features <first feature> <second feature> ... <last feature> [--output <output_file>]
Replace `<project_1>` and `<project_2>` with the paths to your projects and `<first feature>`, `<second feature>`, etc., with the names of the features you want to compare. If the `--output` option is specified, the results will be saved in JSON format to the given file.

Example: You can compare the project egs/Demo_ZSTTS/StyleTTS2 with the reference project egs/Demo_ZSTTS/Reference using the UTMOS and Speaker Similarity metrics and save the results to `compare_results.json` as follows:

python3 run.py compare --project1 egs/Demo_ZSTTS/Reference --project2 egs/Demo_ZSTTS/StyleTTS2 --features utmos sim_ecapa --output compare_results.json
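The exact JSON schema depends on the selected features, so a quick way to inspect a saved file is to pretty-print it (standard Python tooling, not part of ATTEST):

```bash
python3 -m json.tool compare_results.json
```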
- Compare multiple projects simultaneously across various features:
python3 run.py multiple_compare --projects <project_1> <project_2> ... <last project> --features <first feature> <second feature> ... <last feature> [--output <output_file>]
Replace `<project_1>`, `<project_2>`, etc., with the paths to your projects and `<first feature>`, `<second feature>`, etc., with the names of the features you want to compare. If the `--output` option is specified, the results will be saved in JSON format to the given file.

Example: You can compare the projects egs/Demo_ZSTTS/StyleTTS2 and egs/Demo_ZSTTS/XTTSv2 with the reference project egs/Demo_ZSTTS/Reference using the UTMOS and Speaker Similarity metrics as follows:
python3 run.py multiple_compare --projects egs/Demo_ZSTTS/Reference egs/Demo_ZSTTS/StyleTTS2 egs/Demo_ZSTTS/XTTSv2 --features utmos sim_ecapa
- Comparing multiple projects for reports:
  - Use the UI multiple compare method and export the results in CSV, LaTeX, or Markdown format. This method also displays histograms of the mean score per project for each metric, which can be included in reports as a visual representation of the performance differences.
- Analyzing audio quality of a single project:
  - Use the UI evaluate and compare methods (if reference recordings are available). You can sort the results by a specific metric or by the difference between two metrics, which helps identify where two models deviate from each other.
- Processing a large number of projects (e.g., comparing different speech prompts for ZSTTS):
  - Use CLI commands to compute metrics and collect the results in JSON format, as in the batch sketch below.
- Filtering datasets for TTS model training:
  - Compute metrics one by one using CLI commands.
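A rough sketch of such a batch run over all projects in a group (the loop, feature choice, and output naming are assumptions; only the `evaluate` CLI itself is documented above):

```bash
# Evaluate every project in the Demo_ZSTTS group and save per-project JSON results
for proj in egs/Demo_ZSTTS/*/; do
    name=$(basename "$proj")
    python3 run.py evaluate --project "$proj" --features utmos --output "results_${name}.json"
done
```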
We welcome contributions to improve ATTEST! Whether you're fixing bugs, adding new features, adding new benchmarks, or improving documentation, your help is appreciated.
To contribute:
- Fork the repository.
- Create a new branch for your feature or bugfix.
- Make your changes and commit them.
- Submit a pull request with a detailed explanation of your changes.
ATTEST is built upon and integrates various tools, libraries, and models. We would like to acknowledge the following projects for their contributions:
- Streamlit: Powers the user interface.
- wavelet_prosody_toolkit: Used for generating wavelet prosody plots.
- Discrete Speech Metrics: Utilized for the SpeechBERTScore metric.
- UTMOS: Used for the UTMOS metric.
- Whisper: Used as the ASR and forced alignment engine.
- WavLM: Contributes to the SpeechBERTScore metric.
- SpeechBrain: Used for speaker similarity using the ECAPA-TDNN model.
- OpenPhonemizer: Used as a grapheme-to-phoneme (G2P) engine.
- Phonemizer: Used as a grapheme-to-phoneme (G2P) engine.
- torchcrepe: Used as a pitch extraction engine.
- Parselmouth: Used as a pitch extraction engine.
- PyWorld: Used as a pitch extraction engine.
- VDE, GPE, FFE, logF0 RMSE: Used as speech intonation metrics.
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for the full license text.
This project uses third-party libraries and code, which are distributed under their respective licenses. A list of these dependencies and their licenses can be found in the NOTICE file.
If you find our work useful in your research, please cite the following paper:
@inproceedings{obukhov24_interspeech,
title = {ATTEST: an analytics tool for the testing and evaluation of speech technologies},
author = {Dmitrii Obukhov and Marcel {de Korte} and Andrey Adaschik},
year = {2024},
booktitle = {Interspeech 2024},
pages = {3646--3647},
}