👉🏻HuggingFace Space Demo👈🏻 | 👉🏻ModelScope Space Demo👈🏻
ClearVoice offers a unified inference platform for speech enhancement, speech separation, and audio-visual target speaker extraction. It is designed to simplify the adoption of our pre-trained models for your speech-processing needs or their integration into your projects. Currently, we provide the following pre-trained models:
Tasks (Sampling rate) | Models (HuggingFace Links) |
---|---|
Speech Enhancement (16kHz & 48kHz) | MossFormer2_SE_48K (link), FRCRN_SE_16K (link), MossFormerGAN_SE_16K (link) |
Speech Separation (16kHz) | MossFormer2_SS_16K (link) |
Audio-Visual Target Speaker Extraction (16kHz) | AV_MossFormer2_TSE_16K (link) |
You don't need to manually download the pre-trained models—they are automatically fetched during inference.
If you haven't created a Conda environment for ClearerVoice-Studio yet, follow steps 1 and 2. Otherwise, skip directly to step 3.
- Clone the Repository

```bash
git clone https://github.com/modelscope/ClearerVoice-Studio.git
```

- Create Conda Environment

```bash
cd ClearerVoice-Studio
conda create -n ClearerVoice-Studio python=3.8
conda activate ClearerVoice-Studio
pip install -r requirements.txt
```
Python 3.9, 3.10, and 3.12 should also work.
Note: During installation on Ubuntu and Windows, if you run into prerequisite issues with the C++ build environment, or need to update pip, setuptools, and wheel, please resolve these by installing or updating the tools manually (thanks @RichardQin1).
- Run Demo

```bash
cd clearvoice
python demo.py
```

or

```bash
cd clearvoice
python demo_with_more_comments.py
```
- You may activate each demo case by setting its flag to `True` in `demo.py` and `demo_with_more_comments.py`.
- Supported audio formats: `.flac`, `.wav`
- Supported video formats: `.avi`, `.mp4`, `.mov`, `.webm`
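Before batch processing a directory, it can help to filter out files whose extensions ClearVoice does not accept. A minimal sketch of such a check (the helper below is our own illustration, not part of the ClearVoice API):

```python
from pathlib import Path

# Extensions listed above; this helper is illustrative, not part of ClearVoice.
AUDIO_EXTS = {'.flac', '.wav'}
VIDEO_EXTS = {'.avi', '.mp4', '.mov', '.webm'}

def supported_files(directory, kind='audio'):
    """Return sorted paths in `directory` with a ClearVoice-supported extension."""
    exts = AUDIO_EXTS if kind == 'audio' else VIDEO_EXTS
    return sorted(p for p in Path(directory).iterdir()
                  if p.suffix.lower() in exts)
```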
- Use Scripts

Use the `MossFormer2_SE_48K` model for the full-band (48kHz) speech enhancement task:
```python
from clearvoice import ClearVoice

myClearVoice = ClearVoice(task='speech_enhancement', model_names=['MossFormer2_SE_48K'])

# process a single wave file
output_wav = myClearVoice(input_path='samples/input.wav', online_write=False)
myClearVoice.write(output_wav, output_path='samples/output_MossFormer2_SE_48K.wav')

# process a directory of wave files
myClearVoice(input_path='samples/path_to_input_wavs', online_write=True, output_path='samples/path_to_output_wavs')

# process a list file (.scp) of wave paths
myClearVoice(input_path='samples/scp/audio_samples.scp', online_write=True, output_path='samples/path_to_output_wavs_scp')
```
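The exact `.scp` layout ClearVoice expects is not spelled out here; a common Kaldi-style convention is one audio path per line. Assuming that convention, a list file can be produced and read back like this (both helpers are illustrative, not part of the ClearVoice API):

```python
def write_scp(wav_paths, scp_path):
    """Write one audio path per line (assumed Kaldi-style list format)."""
    with open(scp_path, 'w') as f:
        for p in wav_paths:
            f.write(f"{p}\n")

def read_scp(scp_path):
    """Read the path list back, skipping blank lines."""
    with open(scp_path) as f:
        return [line.strip() for line in f if line.strip()]
```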
Parameter Description:

- `task`: Choose one of the three tasks: `speech_enhancement`, `speech_separation`, or `target_speaker_extraction`
- `model_names`: List of model names; choose one or more models for the task
- `input_path`: Path to the input audio/video file, an input audio/video directory, or a list file (`.scp`)
- `online_write`: Set to `True` to save the enhanced/separated audio/video directly to local files during processing; otherwise, the enhanced/separated audio is returned. (`False` is only supported for `speech_enhancement` and `speech_separation` when processing a single wave file)
- `output_path`: Path to a file or a directory where the enhanced/separated audio/video is saved
A detailed Chinese-language tutorial is available here: https://stable-learn.com/zh/clearvoice-studio-tutorial
Speech enhancement models: We evaluated our released speech enhancement models on two popular benchmarks: the VoiceBank+DEMAND test set (16kHz & 48kHz) and the DNS-Challenge-2020 (Interspeech) test set (non-reverb, 16kHz). Unlike most published papers, which tailor each model to each test set, our evaluation uses unified models on both test sets. The evaluation metrics are generated by SpeechScore.
VoiceBank+DEMAND test set (tested at 16kHz)
Model | PESQ | NB_PESQ | CBAK | COVL | CSIG | STOI | SISDR | SNR | SRMR | SSNR | P808_MOS | SIG | BAK | OVRL | ISR | SAR | SDR | FWSEGSNR | LLR | LSD | MCD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Noisy | 1.97 | 3.32 | 2.79 | 2.70 | 3.32 | 0.92 | 8.44 | 9.35 | 7.81 | 6.13 | 3.05 | 3.37 | 3.32 | 2.79 | 28.11 | 8.53 | 8.44 | 14.77 | 0.78 | 1.40 | 4.15 |
FRCRN_SE_16K | 3.23 | 3.86 | 3.47 | 3.83 | 4.29 | 0.95 | 19.22 | 19.16 | 9.21 | 7.60 | 3.59 | 3.46 | 4.11 | 3.20 | 12.66 | 21.16 | 11.71 | 20.76 | 0.37 | 0.98 | 0.56 |
MossFormerGAN_SE_16K | 3.47 | 3.96 | 3.50 | 3.73 | 4.40 | 0.96 | 19.45 | 19.36 | 9.07 | 9.09 | 3.57 | 3.50 | 4.09 | 3.23 | 25.98 | 21.18 | 19.42 | 20.20 | 0.34 | 0.79 | 0.70 |
MossFormer2_SE_48K | 3.16 | 3.77 | 3.32 | 3.58 | 4.14 | 0.95 | 19.38 | 19.22 | 9.61 | 6.86 | 3.53 | 3.50 | 4.07 | 3.22 | 12.05 | 21.84 | 11.47 | 16.69 | 0.57 | 1.72 | 0.62 |
DNS-Challenge-2020 test set (tested at 16kHz)
Model | PESQ | NB_PESQ | CBAK | COVL | CSIG | STOI | SISDR | SNR | SRMR | SSNR | P808_MOS | SIG | BAK | OVRL | ISR | SAR | SDR | FWSEGSNR | LLR | LSD | MCD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Noisy | 1.58 | 2.16 | 2.66 | 2.06 | 2.72 | 0.91 | 9.07 | 9.95 | 6.13 | 9.35 | 3.15 | 3.39 | 2.61 | 2.48 | 34.57 | 9.09 | 9.06 | 15.87 | 1.07 | 1.88 | 6.42 |
FRCRN_SE_16K | 3.24 | 3.66 | 3.76 | 3.63 | 4.31 | 0.98 | 19.99 | 19.89 | 8.77 | 7.60 | 4.03 | 3.58 | 4.15 | 3.33 | 8.90 | 20.14 | 7.93 | 22.59 | 0.50 | 1.69 | 0.97 |
MossFormerGAN_SE_16K | 3.57 | 3.88 | 3.93 | 3.92 | 4.56 | 0.98 | 20.60 | 20.44 | 8.68 | 14.03 | 4.05 | 3.58 | 4.18 | 3.36 | 8.88 | 20.81 | 7.98 | 21.62 | 0.45 | 1.65 | 0.89 |
MossFormer2_SE_48K | 2.94 | 3.45 | 3.36 | 2.94 | 3.47 | 0.97 | 17.75 | 17.65 | 9.26 | 11.86 | 3.92 | 3.51 | 4.13 | 3.26 | 8.55 | 18.40 | 7.48 | 16.10 | 0.98 | 3.02 | 1.15 |
VoiceBank+DEMAND test set (tested at 48kHz) (we also include our evaluations of other open-source models using SpeechScore)
Model | PESQ | NB_PESQ | CBAK | COVL | CSIG | STOI | SISDR | SNR | SRMR | SSNR | P808_MOS | SIG | BAK | OVRL | ISR | SAR | SDR | FWSEGSNR | LLR | LSD | MCD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Noisy | 1.97 | 2.87 | 2.79 | 2.70 | 3.32 | 0.92 | 8.39 | 9.30 | 7.81 | 6.13 | 3.07 | 3.35 | 3.12 | 2.69 | 33.75 | 8.42 | 8.39 | 13.98 | 0.75 | 1.45 | 5.41 |
MossFormer2_SE_48K | 3.15 | 3.77 | 3.33 | 3.64 | 4.23 | 0.95 | 19.36 | 19.22 | 9.61 | 7.03 | 3.53 | 3.41 | 4.10 | 3.15 | 4.08 | 21.23 | 4.06 | 14.45 | NA | 1.86 | 0.53 |
Resemble_enhance | 2.84 | 3.58 | 3.14 | NA | NA | 0.94 | 12.42 | 12.79 | 9.08 | 7.07 | 3.53 | 3.42 | 3.99 | 3.12 | 13.62 | 12.66 | 10.31 | 14.56 | 1.50 | 1.66 | 1.54 |
DeepFilterNet | 3.03 | 3.71 | 3.29 | 3.55 | 4.20 | 0.94 | 15.71 | 15.66 | 9.66 | 7.19 | 3.47 | 3.40 | 4.00 | 3.10 | 28.01 | 16.20 | 15.79 | 15.69 | 0.55 | 0.94 | 1.77 |
- Resemble_enhance (Github) is an open-source 44.1kHz pure speech enhancement platform from Resemble-AI, released in 2023; we resampled the audio to 48kHz before evaluation.
- DeepFilterNet (Github) is a low-complexity speech enhancement framework for full-band audio (48kHz) based on deep filtering.
Note: We observed anomalies in two speech metrics, LLR and LSD, after processing with the 48kHz models. We will investigate further to identify the cause.
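For reference, log-spectral distance (LSD) is commonly defined as the RMS difference between the log power spectra of the reference and estimate, averaged over frames; SpeechScore's exact implementation (windowing, framing, base of the log) may differ. A minimal NumPy sketch of that common definition:

```python
import numpy as np

def lsd(ref, est, n_fft=512, hop=256, eps=1e-10):
    """Log-spectral distance between two equal-length signals.
    Common textbook definition; SpeechScore's implementation may differ."""
    def log_power_spec(x):
        frames = [x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)]
        spec = np.abs(np.fft.rfft(np.stack(frames) * np.hanning(n_fft), axis=-1)) ** 2
        return np.log10(spec + eps)
    R, E = log_power_spec(ref), log_power_spec(est)
    # RMS over frequency bins, then mean over frames
    return float(np.mean(np.sqrt(np.mean((R - E) ** 2, axis=-1))))
```

An identical reference and estimate give an LSD of 0; any spectral mismatch pushes it above 0, so unexpectedly large values on the 48kHz outputs are worth a closer look.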