SpeechScore is a wrapper designed for assessing speech quality. It includes a collection of commonly used speech quality metrics, as listed below:
Index | Metrics | Short Description | Externel Link |
---|---|---|---|
1. | BSSEval {ISR, SAR, SDR} | ISR (Source Image-to-Spatial distortion Ratio) measures preservation/distortion of target source. SDR (Source-to-Distortion Ratio) measures global quality. SAR (Source-to-Artefact Ratio) measures the presence of additional artificial noise | (See the official museval page) |
2. | {CBAK, COVL, CSIG} | CSIG predicts the signal distortion mean opinion score (MOS), CBAK measures background intrusiveness, and COVL measures speech quality. CSIG, CBAK, and COVL are ranged from 1 to 5 | See paper: Evaluation of Objective Quality Measures for Speech Enhancement |
3. | DNSMOS {BAK, OVRL, SIG, P808_MOS} | DNSMOS (Deep Noise Suppression Mean Opinion Score) measures the overall quality of the audio clip based on the ITU-T Rec. P.808 subjective evaluation. It outputs 4 scores: i) speech quality (SIG), ii) background noise quality (BAK), iii) the overall quality (OVRL), and iv) the P808_MOS of the audio. DNSMOS does not require clean references. | See paper: Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors and github page |
4. | FWSEGSNR | FWSEGSNR (Frequency-Weighted SEGmental SNR) is commonly used for evaluating dereverberation performance | See paper: Evaluation of Objective Quality Measures for Speech Enhancement |
5. | LLR | LLR (Log Likelihood Ratio) measures how well an estimated speech signal matches the target (clean) signal in terms of their short-term spectral characteristics. | See paper: Evaluation of Objective Quality Measures for Speech Enhancement |
6. | LSD | LSD (Log-Spectral Distance) measures the spectral differences between a clean reference signal and a processed speech signal. | See github page |
7. | MCD | MCD (Mel-Cepstral Distortion) measures the difference between the mel-cepstral coefficients (MCCs) of an estimated speech signal and the target (clean) speech signal. | See github page |
8. | NB_PESQ | NB-PESQ (NarrowBand Perceptual Evaluation of Speech Quality) meaures speech quality that reflects human auditory perception. It is defined in the ITU-T Recommendation P.862 and is developed for assessing narrowband speech codecs and enhancement algorithms. | See github page |
9. | PESQ | PESQ (Perceptual Evaluation of Speech Quality) assesses the quality of speech signals to mimic human perception. It is standardized by the International Telecommunication Union (ITU-T P.862) and is widely used in evaluating telecommunication systems and speech enhancement algorithms. | See github page |
10. | SISDR | SI-SDR (Scale-Invariant Signal-to-Distortion Ratio) quantifies the ratio between the power of the target signal component and the residual distortion. It measures how well an estimated speech signal matches the target (clean) speech signal, while being invariant to differences in scale. | See paper: SDR - half-baked or well done? |
11. | SNR | SNR (Signal-to-Noise Ratio) is a fundamental metric used in speech quality measurement to evaluate the relative level of the desired speech signal compared to unwanted noise. It quantifies the clarity and intelligibility of speech in decibels (dB). | See paper: An effective quality evaluation protocol for speech enhancement algorithms |
12. | SRMR | SRMR (Speech-to-Reverberation Modulation Energy Ratio) evaluates the ratio of speech-dominant modulation energy to reverberation-dominant modulation energy. It quantifies the impact of reverberation on the quality and intelligibility of speech signals. SRMR does not require clean references. | See SRMRpy and SRMR Toolbox |
13. | SSNR | SSNR (Segmental Signal-to-Noise Ratio) is an extension of SNR (Signal-to-Noise Ratio) and for evaluating the quality of speech signals in shorter segments or frames. It is calculated by dividing the power of the clean speech signal by the power of the noise signal, computed over small segments of the speech signal. | See paper: An effective quality evaluation protocol for speech enhancement algorithms |
14. | STOI | STOI (Short-Time Objective Intelligibility Index) measures speech quality and intelligibility by operateing on short-time segments of the speech signal and computes a score between 0 and 1. | See github page |
If you haven't created a Conda environment for ClearerVoice-Studio yet, follow steps 1 and 2. Otherwise, skip directly to step 3.
- Clone the Repository
git clone https://github.com/modelscope/ClearerVoice-Studio.git
- Create Conda Environment
cd ClearerVoice-Studio
conda create -n ClearerVoice-Studio python=3.8
conda activate ClearerVoice-Studio
pip install -r requirements.txt
- Run demo script
cd speechscore
python demo.py
or use the following script:
# Import pprint for pretty-printing the results in a more readable format
import pprint
# Import the SpeechScore class to evaluate speech quality metrics
from speechscore import SpeechScore
# Main block to ensure the code runs only when executed directly
if __name__ == '__main__':
# Initialize a SpeechScore object with a list of score metrics to be evaluated
# Supports any subsets of the list
mySpeechScore = SpeechScore([
'SRMR', 'PESQ', 'NB_PESQ', 'STOI', 'SISDR',
'FWSEGSNR', 'LSD', 'BSSEval', 'DNSMOS',
'SNR', 'SSNR', 'LLR', 'CSIG', 'CBAK',
'COVL', 'MCD'
])
# Call the SpeechScore object to evaluate the speech metrics between 'noisy' and 'clean' audio
# Arguments:
# - {test_path, reference_path} supports audio directories or audio paths (.wav or .flac)
# - window (float): seconds, set None to specify no windowing (process the full audio)
# - score_rate (int): specifies the sampling rate at which the metrics should be computed
# - return_mean (bool): set True to specify that the mean score for each metric should be returned
print('score for a signle wav file')
scores = mySpeechScore(test_path='audios/noisy.wav', reference_path='audios/clean.wav', window=None, score_rate=16000, return_mean=False)
# Pretty-print the resulting scores in a readable format
pprint.pprint(scores)
print('score for wav directories')
scores = mySpeechScore(test_path='audios/noisy/', reference_path='audios/clean/', window=None, score_rate=16000, return_mean=True)
# Pretty-print the resulting scores in a readable format
pprint.pprint(scores)
# Print only the resulting mean scores in a readable format
#pprint.pprint(scores['Mean_Score'])
The results should be looking like below:
score for a signle wav file
{'BSSEval': {'ISR': 22.74466768594831,
'SAR': -0.1921607960486258,
'SDR': -0.23921670199308115},
'CBAK': 1.5908301020179343,
'COVL': 1.5702204013203889,
'CSIG': 2.3259366746377066,
'DNSMOS': {'BAK': 1.3532928733331306,
'OVRL': 1.3714771994335782,
'P808_MOS': 2.354834,
'SIG': 1.8698058813241407},
'FWSEGSNR': 6.414399025759913,
'LLR': 0.85330075,
'LSD': 2.136734818644327,
'MCD': 11.013451521306235,
'NB_PESQ': 1.2447538375854492,
'PESQ': 1.0545592308044434,
'SISDR': -0.23707451176264824,
'SNR': -0.9504614142497447,
'SRMR': 6.202590182397157,
'SSNR': -0.6363067113236048,
'STOI': 0.8003376411051097}
score for wav directories
{'Mean_Score': {'BSSEval': {'ISR': 23.728811184378372,
'SAR': 4.839625092004951,
'SDR': 4.9270216975279135},
'CBAK': 1.9391528046230797,
'COVL': 1.5400270840455588,
'CSIG': 2.1286157747587344,
'DNSMOS': {'BAK': 1.9004402577440938,
'OVRL': 1.860621534493506,
'P808_MOS': 2.5821499824523926,
'SIG': 2.679913397827385},
'FWSEGSNR': 9.079539440199582,
'LLR': 1.1992616951465607,
'LSD': 2.0045290996104748,
'MCD': 8.916492705343465,
'NB_PESQ': 1.431145429611206,
'PESQ': 1.141619324684143,
'SISDR': 4.778657656271212,
'SNR': 4.571920494312266,
'SRMR': 9.221118316293268,
'SSNR': 2.9965604574762796,
'STOI': 0.8585249663711918},
'audio_1.wav': {'BSSEval': {'ISR': 22.74466768594831,
'SAR': -0.1921607960486258,
'SDR': -0.23921670199308115},
'CBAK': 1.5908301020179343,
'COVL': 1.5702204013203889,
'CSIG': 2.3259366746377066,
'DNSMOS': {'BAK': 1.3532928733331306,
'OVRL': 1.3714771994335782,
'P808_MOS': 2.354834,
'SIG': 1.8698058813241407},
'FWSEGSNR': 6.414399025759913,
'LLR': 0.85330075,
'LSD': 2.136734818644327,
'MCD': 11.013451521306235,
'NB_PESQ': 1.2447538375854492,
'PESQ': 1.0545592308044434,
'SISDR': -0.23707451176264824,
'SNR': -0.9504614142497447,
'SRMR': 6.202590182397157,
'SSNR': -0.6363067113236048,
'STOI': 0.8003376411051097},
'audio_2.wav': {'BSSEval': {'ISR': 24.712954682808437,
'SAR': 9.871410980058528,
'SDR': 10.093260097048908},
'CBAK': 2.287475507228225,
'COVL': 1.509833766770729,
'CSIG': 1.9312948748797627,
'DNSMOS': {'BAK': 2.4475876421550566,
'OVRL': 2.349765869553434,
'P808_MOS': 2.809466,
'SIG': 3.490020914330629},
'FWSEGSNR': 11.744679854639253,
'LLR': 1.5452226,
'LSD': 1.8723233805766222,
'MCD': 6.819533889380694,
'NB_PESQ': 1.617537021636963,
'PESQ': 1.2286794185638428,
'SISDR': 9.794389824305073,
'SNR': 10.094302402874277,
'SRMR': 12.23964645018938,
'SSNR': 6.629427626276164,
'STOI': 0.9167122916372739}}
Any subset of the full score list is supported, specify your score list using the following objective:
mySpeechScore = SpeechScore(['.'])
We referred to speechmetrics, DNSMOS , BSSEval, pymcd, pystoi, PESQ, and segan_pytorch for implementing this repository.