WavLMMSDD

This repository combines WavLM, a speech representation model from Microsoft, with MSDD (Multi-Scale Diarization Decoder), a speaker diarization approach from NVIDIA. By pairing WavLM's feature extraction with MSDD's clustering and segmentation, the project identifies multiple speakers in audio streams, including noisy recordings and overlapping speech.

In particular, this setup uses the Diarization MSDD Telephonic model (diar_msdd_telephonic in NeMo), making it well-suited for telephony or call center environments where speech overlap and background noise are common. Use this repository as a starting point for projects that demand robust speaker diarization in environments where speech overlap or varied audio conditions are critical factors.

Note: If you would like to contribute to this repository, please read CONTRIBUTING.md first.

Architecture

Features

Key Capabilities

WavLM-Based Embeddings: Leverages WavLM to generate high-quality speech representations, improving speaker identification.
Multi-Scale Diarization (MSDD): Applies multi-scale inference for precise speaker segmentation, even with overlapping speech.
Scalable Pipeline: Modular design allows easy integration and customization for various diarization tasks or research experiments.
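
To make the multi-scale idea concrete, the sketch below cuts a recording into overlapping windows at several window lengths, which is the kind of segmentation a multi-scale decoder consumes. It is a minimal illustration, not this package's implementation: the function name `multiscale_segments` is hypothetical, and the window and shift values are illustrative assumptions (the values actually used live in `diar_infer_telephonic.yaml`).

```python
from typing import Dict, List, Tuple


def multiscale_segments(
    duration: float,
    window_lengths: List[float],
    shift_lengths: List[float],
) -> Dict[int, List[Tuple[float, float]]]:
    """Return (start, end) windows in seconds for each scale index."""
    segments: Dict[int, List[Tuple[float, float]]] = {}
    for scale, (window, shift) in enumerate(zip(window_lengths, shift_lengths)):
        windows: List[Tuple[float, float]] = []
        step = 0
        while step * shift < duration:
            start = step * shift
            # Clip the last window so it never runs past the recording.
            end = min(start + window, duration)
            windows.append((start, end))
            if end >= duration:
                break
            step += 1
        segments[scale] = windows
    return segments


# Illustrative values only: a 3-second clip at a coarse and a fine scale.
example = multiscale_segments(3.0, [1.5, 0.5], [0.75, 0.25])
for scale, windows in example.items():
    print(f"scale {scale}: {len(windows)} windows")
```

Longer windows give more stable speaker embeddings, while shorter windows give finer time resolution; a multi-scale decoder weighs the two against each other.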

Models

WavLM-Base-Plus for Speaker Verification
WavLM-Base for Speaker Verification
Nvidia NeMo Diarization MSDD Telephonic

Reports

Benchmark

Below is an example benchmark comparing two models on the AMI Corpus. Both use the Multi-Scale Diarization Decoder (MSDD), each with a different embedding backbone: TitaNet or WavLM.

INFO

Experiments were conducted on an NVIDIA GeForce RTX 3060 using CUDA 12.6 (Driver Version 560.35.03).

We randomly selected 10 samples from the AMI Corpus (Array1-01.tar.gz), each 60 seconds long.

MSDD + TitaNet and MSDD + WavLMBasePlus refer to using TitaNet vs. WavLM-Base-Plus as the speaker-embedding model.

For a detailed Jupyter notebook demonstrating how this benchmark was performed, see:
> notebook/benchmark.ipynb

Model	DER	FA	MISS	CER	Duration(sec)
MSDD + TitaNet	0.9963	0.0010	0.9946	0.0015	644
MSDD + WavLMBasePlus	0.9961	0.0010	0.9946	0.0016	18

DER: Diarization Error Rate
FA: False Alarm Rate
MISS: Missed Detection Rate
CER: Confusion Error Rate
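
For readers new to these metrics, the sketch below shows the standard way DER is composed from its three error types: the false-alarm, missed, and confused speech durations are summed and divided by the total reference speech duration. The function name and the sample durations are hypothetical, and the benchmark notebook may score differently (for example with a collar or different overlap handling), so treat this as a definition aid rather than a reproduction of the table.

```python
def diarization_error_rate(
    false_alarm: float,
    missed: float,
    confusion: float,
    total_reference: float,
) -> float:
    """Standard DER: summed error durations over total reference speech time."""
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (false_alarm + missed + confusion) / total_reference


# Hypothetical durations in seconds, not taken from the benchmark.
der = diarization_error_rate(
    false_alarm=1.5, missed=3.0, confusion=1.5, total_reference=60.0
)
print(f"DER = {der:.3f}")
```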

Installation

The Python Package Index (PyPI)

pip install wavlmmsdd

Usage

# Standard library imports
from typing import Annotated

# Local imports
from wavlmmsdd.audio.diarization.diarize import Diarizer
from wavlmmsdd.audio.feature.embedding import WavLMSV
from wavlmmsdd.audio.preprocess.resample import Resample
from wavlmmsdd.audio.preprocess.convert import Convert
from wavlmmsdd.audio.utils.utils import Build

def main() -> Annotated[None, "No return value"]:
    """
    Demonstrate the audio processing workflow from a WAV file
    to a diarization result.

    This function performs the following steps:
    1. Resamples the audio to 16 kHz.
    2. Converts the audio to mono.
    3. Builds a manifest file.
    4. Obtains embeddings.
    5. Runs diarization.

    Returns
    -------
    None

    Examples
    --------
    >>> main()
    No direct output is produced, but the specified audio file is
    processed and the results are saved or printed as logs.
    """
    
    # Audio Path
    audio_path = "audio.wav"

    # Resample to 16 kHz
    resampler = Resample(audio_file=audio_path)
    wave_16k, sr_16k = resampler.to_16k()

    # Convert to Mono
    converter = Convert(waveform=wave_16k, sample_rate=sr_16k)
    converter.to_mono()
    saved_path = converter.save()

    # Build Manifest File
    builder = Build(saved_path)
    manifest_path = builder.manifest()

    # Embedding
    embedder = WavLMSV()

    # Diarization
    diarizer = Diarizer(embedding=embedder, manifest_path=manifest_path)
    diarizer.run()

if __name__ == "__main__":
    main()
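
Step 3 above builds a manifest file. What `Build.manifest()` writes is not shown in this README, so the sketch below only illustrates the JSON-lines manifest layout that NeMo's diarization inference documents, with one JSON object per audio file. The helper name `write_manifest` is hypothetical and is not part of `wavlmmsdd`.

```python
import json
from pathlib import Path


def write_manifest(audio_path: str, manifest_path: str) -> str:
    """Write a one-line NeMo-style diarization manifest and return its path."""
    entry = {
        "audio_filepath": str(Path(audio_path).resolve()),
        "offset": 0,
        "duration": None,       # None lets the whole file be processed
        "label": "infer",
        "text": "-",
        "num_speakers": None,   # None when the speaker count is unknown
        "rttm_filepath": None,  # optional reference RTTM for scoring
        "uem_filepath": None,   # optional UEM file limiting scored regions
    }
    with open(manifest_path, "w", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return manifest_path
```

Calling `write_manifest("audio.wav", "manifest.json")` would produce a single-line file; a batch manifest would hold one such line per recording.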

File Structure

.
├── .data
│   └── example
│       └── ae.wav
├── .docs
│   ├── documentation
│   │   ├── CONTRIBUTING.md
│   │   └── RESOURCES.md
│   └── img
│       └── architecture
│           ├── WavLMMSDDArchitecture.drawio
│           └── WavLMMSDDArchitecture.gif
├── environment.yaml
├── .github
│   ├── CODEOWNERS
│   └── workflows
│       └── pypi.yaml
├── .gitignore
├── LICENSE
├── MANIFEST.in
├── notebook
│   └── benchmark.ipynb
├── pyproject.toml
├── README.md
├── requirements.txt
└── src
    └── wavlmmsdd
        ├── audio
        │   ├── config
        │   │   ├── config.yaml
        │   │   ├── diar_infer_telephonic.yaml
        │   │   └── schema.py
        │   ├── diarization
        │   │   └── diarize.py
        │   ├── feature
        │   │   └── embedding.py
        │   ├── preprocess
        │   │   ├── convert.py
        │   │   └── resample.py
        │   └── utils
        │       └── utils.py
        └── main.py

18 directories, 24 files

Version Control System

Releases

v0.1.0 .zip
v0.1.0 .tar.gz

Branches

main
develop

Upcoming

WavLM Large: Integrate the WavLM Large model.

Documentation

Licence

LICENSE

Links

Team

Bunyamin Ergen

Contact

Mail

Citation

@software{WavLMMSDD,
  author       = {Bunyamin Ergen},
  title        = {{WavLMMSDD}},
  year         = {2025},
  month        = {02},
  url          = {https://github.com/bunyaminergen/WavLMMSDD},
  version      = {v0.1.0},
}
