Awesome-Speaker-Diarization

A curated list of papers on speaker diarization (SD).

If you find a paper that is missing from this list, please open an issue or, preferably, a pull request.

Overview
Reviews
EEND (End-to-End Neural Diarization)-based
- Simulated Dataset
- Post-Processing
Using Target Speaker Embedding
Clustering-based
- Embedding
- VBx
- Scoring
Online
Self-Supervised
Multitask
- With Separation
- With ASR
Multi-Channel
Measurement
Multi-Modal
- With NLP
- With Vision
Challenge
- VoxSRC (VoxCeleb Speaker Recognition Challenge)
- MISP (Multimodal Information Based Speech Processing) (ICASSP Challenge)

Overview

DIHARD Keynote Session: The yellow brick road of diarization, challenges and other neural paths [Slides] [Video]

Reviews

“A review of speaker diarization: Recent advances with deep learning”, in Computer Speech & Language, Volume 72, 2023. (USC) [Paper]
"An Experimental Review of Speaker Diarization methods with application to Two-Speaker Conversational Telephone Speech recordings", in Computer Speech & Language, 2023. [Paper]
"Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning," in Submitted to IEEE/ACM TASLP, 2024. [Paper]

EEND (End-to-End Neural Diarization)-based

BLSTM-EEND: "End-to-End Neural Speaker Diarization with Permutation-Free Objectives", in Proc. Interspeech, 2019. (Hitachi) [Paper]
SA-EEND (1): “End-to-End Neural Speaker Diarization with Self-attention”, in Proc. ASRU, 2019. (Hitachi) [Paper] [Code] [Pytorch] [Review]
SA-EEND (2): “End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification”, in arXiv:2003.02966, 2020. (Hitachi) [Paper] [Review]
SC-EEND: "Neural Speaker Diarization with Speaker-Wise Chain Rule", in arXiv:2006.01796, 2020. (Hitachi) [Paper] [Review]
EEND-EDA (1): “End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors”, in Proc. Interspeech, 2020. (Hitachi) [Paper] [Review] [Code]
EEND-EDA (2): “Encoder-Decoder Based Attractor Calculation for End-to-End Neural Diarization”, in IEEE/ACM TASLP, 2022. (Hitachi) [Paper] [Review] [Code]
CB-EEND: "End-to-end Neural Diarization: From Transformer to Conformer", in Proc. Interspeech, 2021. (Amazon) [Paper] [Review]
TDCN-SA: "End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings", in Proc. ICASSP, 2021. (Google) [Paper] [Review]
"End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection", in Proc. IEEE SLT, 2021. (Hitachi) [Paper]
EEND-VC (1): "Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds", in Proc. ICASSP, 2021. (NTT) [Paper] [Review] [Code]
EEND-VC (2): "Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech", in Proc. Interspeech, 2021. (NTT) [Paper] [Review] [Code]
"Robust End-to-End Speaker Diarization with Conformer and Additive Margin Penalty," in Proc. Interspeech, 2021. (Fano Labs) [Paper]
EEND-GLA: "Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors", in Proc. ASRU, 2021. (Hitachi) [Paper] [Review]
"DIVE: End-to-end Speech Diarization via Iterative Speaker Embedding", in Proc. ICASSP, 2022. (Google) [Paper]
RX-EEND: “Auxiliary Loss of Transformer with Residual Connection for End-to-End Speaker Diarization”, in Proc. ICASSP, 2022. (GIST) [Paper] [Review]
"End-to-end speaker diarization with transformer", in Proc. arXiv, 2022. [Paper]
EEND-VC-iGMM: "Tight integration of neural and clustering-based diarization through deep unfolding of infinite Gaussian mixture model", in Proc. ICASSP, 2022. (NTT) [Paper]
EDA-RC: "Robust End-to-end Speaker Diarization with Generic Neural Clustering", in Proc. Interspeech, 2022. (SJTU) [Paper]
EEND-NAA: "End-to-End Neural Speaker Diarization with an Iterative Refinement of Non-Autoregressive Attention-based Attractors", in Proc. Interspeech, 2022. (JHU) [Paper]
Graph-PIT: "Utterance-by-utterance overlap-aware neural diarization with Graph-PIT", in Proc. Interspeech, 2022. (NTT) [Paper] [Code]
"Efficient Transformers for End-to-End Neural Speaker Diarization", in Proc. IberSPEECH, 2022. [Paper]
"Improving Transformer-based End-to-End Speaker Diarization by Assigning Auxiliary Losses to Attention Heads", in Proc. ICASSP, 2023. (HU) [Paper]
EEND-NA: “Neural Diarization with Non-Autoregressive Intermediate Attractors”, in Proc. ICASSP, 2023. (LINE) [Paper]
EEND-EDA-SpkAtt: "Towards End-to-end Speaker Diarization in the Wild", in arXiv:2211.01299v1, 2022. [Paper]
"TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization", in Proc. ICASSP, 2023. (Alibaba) [Paper] [Code]
EEND-IAAE: "End-to-end neural speaker diarization with an iterative adaptive attractor estimation," in Neural Networks, Elsevier. [Paper] [Code]
"Improving End-to-End Neural Diarization Using Conversational Summary Representations", in Proc. Interspeech, 2023. (Fano Labs) [Paper]
AED-EEND: “Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor”, in Proc. Interspeech, 2023. (SJTU) [Paper] [Review]
"Self-Distillation into Self-Attention Heads for Improving Transformer-based End-to-End Neural Speaker Diarization", in Proc. Interspeech, 2023. (HU) [Paper]
"Powerset Multi-class Cross Entropy Loss for Neural Speaker Diarization", in Proc. Interspeech, 2023. (Pyannote) [Paper] [Code]
"End-to-End Neural Speaker Diarization with Absolute Speaker Loss", in Proc. Interspeech, 2023. (Pyannote) [Paper]
"Blueprint Separable Subsampling and Aggregate Feature Conformer-Based End-to-End Neural Diarization", in Electronics, 2023. [Paper]
EEND-TA: "Transformer Attractors for Robust and Efficient End-to-End Neural Diarization," in Proc. ASRU, 2023. (Fano Labs) [Paper]
"Robust End-to-End Diarization with Domain Adaptive Training and Multi-Task Learning," in Proc. ASRU, 2023. (Fano Labs) [Paper]
"NTT speaker diarization system for CHiME-7: multi-domain, multi-microphone End-to-end and vector clustering diarization," in Proc. ICASSP, 2024. (NTT) [Paper]
AED-EEND-EE: "Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer," in IEEE/ACM TASLP, 2024. (SJTU) [Paper] [Review]
"DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors," in IEEE/ACM TASLP, 2024. (BUT) [Paper] [Code] [Review]
"EEND-DEMUX: End-to-End Neural Speaker Diarization via Demultiplexed Speaker Embeddings," in Submitted to IEEE SPL, 2024. (SNU) [Paper] [Review]
"EEND-M2F: Masked-attention mask transformers for speaker diarization," in arXiv:2401.12600, 2024. (Fano Labs) [Paper] [Review]
EEND-NAA (2): "End-to-End Neural Speaker Diarization with Non-Autoregressive Attractors", in IEEE/ACM TASLP, 2024. (JHU) [Paper]
"From Modular to End-to-End Speaker Diarization," Ph.D. thesis, 2024. (BUT) [Paper]

Related Speaker information

"Do End-to-End Neural Diarization Attractors Need to Encode Speaker Characteristic Information?," in Proc. Odyssey, 2024. [Paper]
"Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios," in Proc. Odyssey, 2024. [Paper]

Simulated Dataset

Concat-and-sum: “End-to-End Neural Speaker Diarization with Permutation-Free Objectives”, in Proc. Interspeech, 2019. [Paper]
“From simulated mixtures to simulated conversations as training data for end-to-end neural diarization”, in Proc. Interspeech, 2022. (BUT) [Paper] [Code] [Review]
Markov selection: “Improving the naturalness of simulated conversations for end-to-end neural diarization”, in Proc. Odyssey, 2022. (Hitachi) [Paper]
"Multi-Speaker and Wide-Band Simulated Conversations as Training Data for End-to-End Neural Diarization", in Proc. ICASSP, 2023. (BUT) [Paper] [Code] [Review]
EEND-EDA-SpkAtt: "Towards End-to-end Speaker Diarization in the Wild", in arXiv:2211.01299v1, 2022. [Paper]
"Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation," in CHiME-7 Workshop, 2023. (NVIDIA) [Paper]
"Enhancing low-latency speaker diarization with spatial dictionary learning," in Proc. ICASSP, 2024. (NTU) [Paper] [Poster]
"Improving Neural Diarization through Speaker Attribute Attractors and Local Dependency Modeling," in Proc. ICASSP, 2024. (OSU) [Paper]

Post-Processing

EENDasP: "End-to-End Speaker Diarization as Post-Processing", in Proc. ICASSP, 2021. (Hitachi) [Paper] [Review] [Code]
Dover-Lap: "DOVER-Lap: A Method for Combining Overlap-aware Diarization Outputs", in Proc. IEEE SLT, 2021. (JHU) [Paper] [Review] [Code]
"DiaCorrect: Error Correction Back-end For Speaker Diarization," in Proc. ICASSP, 2024. (BUT) [Paper] [Code]

Using Target Speaker Embedding

TS-VAD: "Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario", in Proc. Interspeech, 2020. [Paper] [Code] [PPT]
“The STC system for the CHiME-6 challenge,” in CHiME Workshop, 2020. [Paper]
SEND (1): "Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information," in arXiv:2111.13694, 2021. (Alibaba) [Paper]
SEND (2): "Speaker Embedding-aware Neural Diarization: an Efficient Framework for Overlapping Speech Diarization in Meeting Scenarios," in arXiv:2203.09767, 2022. (Alibaba) [Paper]
MTEAD: "Multi-target Filter and Detector for Unknown-number Speaker Diarization", in IEEE SPL, 2022. [Paper]
SOND: "Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis", in Proc. EMNLP, 2022. (Alibaba) [Paper] [Code]
EDA-TS-VAD: “Target Speaker Voice Activity Detection with Transformers and Its Integration with End-to-End Neural Diarization”, in Proc. ICASSP, 2023. (Microsoft) [Paper]
Seq2Seq-TS-VAD: “Target-Speaker Voice Activity Detection via Sequence-to-Sequence Prediction”, in Proc. ICASSP, 2023. (DKU) [Paper] [Review]
QM-TS-VAD: "Unsupervised Adaptation with Quality-Aware Masking to Improve Target-Speaker Voice Activity Detection for Speaker Diarization", in Proc. Interspeech, 2023. (USTC) [Paper]
"ANSD-MA-MSE: Adaptive Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding," in IEEE/ACM TASLP, 2023. (USTC) [Paper] [Code]
NSD-MS2S: "Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture," in Proc. ICASSP, 2024. (USTC) [Paper] [Code]
PET-TSVAD: "Profile-Error-Tolerant Target-Speaker Voice Activity Detection," in Proc. ICASSP, 2024. (Microsoft) [Paper]

Target Speech Diarization

PTSD: "Prompt-driven Target Speech Diarization," in Proc. ICASSP, 2024. (NUS) [Paper]

With Separation or Target Speaker Extraction

"Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis," in Proc. SLT, 2021. (JHU) [Paper] [Blog] [Review]
EEND-SS: "Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers", in Proc. SLT, 2022. (CMU) [Paper] [Review]
"TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings", in IEEE/ACM TASLP, 2024. [Paper]
"Continuous Target Speech Extraction: Enhancing Personalized Diarization and Extraction on Complex Recordings," in arXiv:2401.15993, 2024. (Tencent) [Paper] [Demo]
"PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings," in Proc. Odyssey, 2024. [Paper] [Code]
MC-EEND: "Multi-channel Conversational Speaker Separation via Neural Diarization," in IEEE/ACM TASLP, 2024. (OSU) [Paper]
"USED: Universal Speaker Extraction and Diarization," in submitted to IEEE/ACM TASLP, 2024. (CUHK) [Paper] [Demo] [Util] [Review]
"Neural Blind Source Separation and Diarization for Distant Speech Recognition," in Proc. Interspeech, 2024. (AIST) [Paper]
"TalTech-IRIT-LIS Speaker and Language Diarization Systems for DISPLACE 2024," in Proc. Interspeech, 2024. (Pyannote) [Paper]

Multi-Channel

"Multi-Channel End-to-End Neural Diarization with Distributed Microphones", in Proc. ICASSP, 2022. (Hitachi) [Paper]
"Multi-Channel Speaker Diarization Using Spatial Features for Meetings", in Proc. ICASSP, 2022. (Tencent) [Paper]
"Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization," in Proc. IEEE SLT, 2023. (Hitachi) [Paper]
"Semi-supervised multi-channel speaker diarization with cross-channel attention", in Proc. ASRU, 2023. (USTC) [Paper]
"UniX-Encoder: A Universal X-Channel Speech Encoder for Ad-Hoc Microphone Array Speech Processing," in arXiv:2310.16367, 2024. (JHU, Tencent) [Paper]
"Channel-Combination Algorithms for Robust Distant Voice Activity and Overlapped Speech Detection," in IEEE/ACM TASLP, 2024. [Paper]
"A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition," in Proc. ICASSP, 2024. (USTC) [Paper]
MC-EEND: "Multi-channel Conversational Speaker Separation via Neural Diarization," in IEEE/ACM TASLP, 2024. (OSU) [Paper]
"ASoBO: Attentive Beamformer Selection for Distant Speaker Diarization in Meetings," in Proc. Interspeech, 2024. (LIUM) [Paper]

Online

"Supervised online diarization with sample mean loss for multi-domain data", in Proc. ICASSP, 2020 [Paper] [Code]
"Online End-to-End Neural Diarization with Speaker-Tracing Buffer", in Proc. IEEE SLT, 2021. (Hitachi) [Paper]
BW-EDA-EEND: "BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers", in Proc. Interspeech, 2021. (Amazon) [Paper]
FS-EEND: "Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers", in Proc. Interspeech, 2021. (Hitachi) [Paper] [Review]
Diart: "Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation", in Proc. ASRU, 2021. [Paper] [Code]
"Low-Latency Online Speaker Diarization with Graph-Based Label Generation", in Proc. Odyssey, 2022. (DKU) [Paper]
EEND-GLA: "Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors", in IEEE/ACM TASLP, 2022. (Hitachi) [Paper]
Online TS-VAD: "Online Target Speaker Voice Activity Detection for Speaker Diarization", in Proc. Interspeech, 2022. (DKU) [Paper]
"Absolute decision corrupts absolutely: conservative online speaker diarisation", in Proc. ICASSP, 2023. (Naver) [Paper]
"A Reinforcement Learning Framework for Online Speaker Diarization", in Under Review. NeruIPS, 2023. (CU) [Paper]
OTS-VAD: "End-to-end Online Speaker Diarization with Target Speaker Tracking," in Submitted IEEE/ACM TASLP, 2023. (DKU) [Paper]
FS-EEND: "Frame-wise streaming end-to-end speaker diarization with non-autoregressive self-attention-based attractors," in Proc. ICASSP, 2024. (Hangzhou) [Paper] [Code]
"Online speaker diarization of meetings guided by speech separation," in Proc. ICASSP, 2024. (LTCI) [Paper] [Code]
"Interrelate Training and Clustering for Online Speaker Diarization," in IEEE/ACM TASLP, 2024. [Paper]

Clustering-based

UIS-RNN: "Fully Supervised Speaker Diarization" (Google) [Paper] [Code]
DNC: "Discriminative Neural Clustering for Speaker Diarisation", in Proc. IEEE SLT, 2019. [Paper] [Code] [Review]
Pyannote: "pyannote.audio: neural building blocks for speaker diarization", in Proc. ICASSP, 2020. (CNRS) [Paper] [Code] [Video]
NME-SC: “Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap”, IEEE SPL, 2019. [Paper] [Code]
Resegmentation with VB: “Overlap-Aware Diarization: Resegmentation Using Neural End-to-End Overlapped Speech Detection”, in Proc. ICASSP, 2020. [Paper]
Pyannote 2.0: "End-to-end speaker segmentation for overlap-aware resegmentation", in Proc. Interspeech, 2021. (CNRS) [Paper] [Code] [Video]
UMAP-Leiden: "Reformulating Speaker Diarization as Community Detection With Emphasis On Topological Structure", in Proc. ICASSP, 2022. (Alibaba) [Paper]
SCALE: "Spectral Clustering-aware Learning of Embeddings for Speaker Diarisation", in Proc. ICASSP, 2023. (CAM) [Paper]
SHARC: "Supervised Hierarchical Clustering using Graph Neural Networks for Speaker Diarization", in Proc. ICASSP, 2023. (IISC) [Paper]
CDGCN: "Community Detection Graph Convolutional Network for Overlap-Aware Speaker Diarization," in Proc. ICASSP, 2023. (XMU) [Paper]
"Pyannote.Audio 2.1: Speaker Diarization Pipeline: Principle, Benchmark and Recipe", in Proc. Interspeech, 2023. (CNRS) [Paper]
GADEC: "Graph attention-based deep embedded clustering for speaker diarization", in Speech Communication, 2023. (NJUPT) [Paper]
"Overlap-aware End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization," submitted to IEEE/ACM TASLP, 2024. [Paper]
"Apollo's Unheard Voices: Graph Attention Networks for Speaker Diarization and Clustering for Fearless Steps Apollo Collection," in Proc. ICASSP, 2024. (UTD) [Paper]
"Multi-View Speaker Embedding Learning for Enhanced Stability and Discriminability," in Proc. ICASSP, 2024. (Tsinghua) [Paper]
"Towards Unsupervised Speaker Diarization System for Multilingual Telephone Calls Using Pre-trained Whisper Model and Mixture of Sparse Autoencoders," in arXiv:2407.01963, 2024. [Paper]
"Investigating Confidence Estimation Measures for Speaker Diarization," in Proc. Interspeech, 2024. [Paper]
"Once more Diarization: Improving meeting transcription systems through segment-level speaker reassignment," in Proc. Interspeech, 2024. (PU) [Paper]

Embedding (With Clustering)

"Multi-Scale Speaker Diarization With Neural Affinity Score Fusion", in Proc. ICASSP, 2021. (USC) [Paper]
AA+DR+NS: "Adapting Speaker Embeddings for Speaker Diarisation", in Proc. Interspeech, 2021. (Naver) [Paper] [Review]
GAT+AA: "Multi-scale speaker embedding-based graph attention networks for speaker diarisation", in Proc. ICASSP, 2022. (Naver) [Paper]
MSDD: "Multi-scale Speaker Diarization with Dynamic Scale Weighting", in Proc. Interspeech, 2022. (NVIDIA) [Paper] [Code] [Blog]
"In Search of Strong Embedding Extractors For Speaker Diarization", in Proc. ICASSP, 2023. (Naver) [Paper] [Review]
PRISM: "PRISM: Pre-trained Indeterminate Speaker Representation Model for Speaker Diarization and Speaker Verification", in Proc. Interspeech, 2022. (Alibaba) [Paper]
DR-DESA: "Advancing the dimensionality reduction of speaker embeddings for speaker diarisation: disentangling noise and informing speech activity", in Proc. ICASSP, 2023. (Naver) [Paper] [Review]
HEE: "High-resolution embedding extractor for speaker diarisation", in Proc. ICASSP, 2023. (Naver) [Paper] [Review]
"Frame-wise and overlap-robust speaker embeddings for meeting diarization", in Proc. ICASSP, 2023. (PU) [Paper] [Review]
"A Teacher-Student approach for extracting informative speaker embeddings from speech mixtures", in Proc. Interspeech, 2023. (PU) [Paper]
"Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios", in Proc. ICASSP, 2024. (PU) [Paper] [Review]
"Speaker Embeddings With Weakly Supervised Voice Activity Detection For Efficient Speaker Diarization," in Proc. Odyssey, 2024. (IDLab) [Paper]

With Speaker Identification

"Uncertainty Quantification in Machine Learning for Joint Speaker Diarization and Identification, in Submitted to IEEE/ACM TASLP, 2024. [Paper]

Speaker Recognition & Verification

"Xi-Vector Embedding for Speaker Recognition," in IEEE, SPL. (A*STAR) [Paper] [Review]
"Build a SRE Challenge System: Lessons from VoxSRC 2022 and CNSRC 2022," in Proc. Interspeech, 2023. (SJTU) [Paper]
RecXi: "Disentangling Voice and Content with Self-Supervision for Speaker Recognition," in Proc. NeurIPS, 2023. (A*STAR) [Paper]
"ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings," in Proc. ASRU, 2023. (IDLab) [Paper] [Model] [Review]
"Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification," in ICASSP, 2024. (Naver) [Paper]
"Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition," in ICASSP, 2024. (CUHK) [Paper]

Scoring

LSTM scoring: "LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization", in Proc. Interspeech, 2019. (DKU) [Paper]
"Self-Attentive Similarity Measurement Strategies in Speaker Diarization", in Proc. Interspeech, 2020. (DKU) [Paper]
“Similarity Measurement of Segment-Level Speaker Embeddings in Speaker Diarization”, IEEE/ACM TASLP, 2023. (DKU) [Paper]

Variational Bayes and HMM

VBx Series

"Speaker Diarization based on Bayesian HMM with Eigenvoice Priors", in Proc. Odyssey, 2018. (BUT) [Paper]
"VB-HMM Speaker Diarization with Enhanced and Refined Segment Representation", in Proc. Odyssey, 2018. (Tsinghua) [Paper]
“Analysis of Speaker Diarization Based on Bayesian HMM With Eigenvoice Priors”, IEEE/ACM TASLP, 2019. (BUT) [Paper]
"BUT System Description for DIHARD Speech Diarization Challenge 2019", in arXiv:1910.08847, 2019. (BUT) [Paper]
"Bayesian HMM Based x-Vector Clustering for Speaker Diarization", in Proc. Interspeech, 2019. (BUT) [Paper]
"Optimizing Bayesian Hmm Based X-Vector Clustering for the Second Dihard Speech Diarization Challenge", in Proc. ICASSP, 2020. (BUT) [Paper]
"Analysis of the but Diarization System for Voxconverse Challenge", in Proc. ICASSP, 2021. (BUT) [Paper] [Code]
"Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks", in Computer Speech & Language, 2022. (BUT) [Paper]
MS-VBx: "Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization", in Proc. Interspeech, 2023. (NTT) [Paper]
DVBx: "Discriminative Training of VBx Diarization", in Proc. ICASSP, 2024. (BUT) [Paper] [Code]

Variational Bayes

"Variational Bayesian methods for audio indexing", in Proc. ICMI-MLMI, 2005. [Paper]
"Bayesian analysis of speaker diarization with eigenvoice priors", in CRIM, Montreal, Technical Report, 2008. [Paper]
"Unsupervised Methods for Speaker Diarization: An Integrated and Iterative Approach", IEEE/ACM TASLP, 2013. [Paper]
"Diarization resegmentation in the factor analysis subspace", in Proc. ICASSP, 2015. [Paper]
"Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge", in Proc. Interspeech, 2018. [Paper]

Normalization

"Analysis of i-vector length normalization in speaker recognition systems", in Proc. Interspeech, 2011. [Paper]

PLDA (Probabilistic Linear Discriminant Analysis)

"The speaker partitioning problem", in Proc. Odyssey, 2018. [Paper]
"Discriminatively trained probabilistic linear discriminant analysis for speaker verification", in Proc. ICASSP, 2021. [Paper]
"Speaker diarization with plda i-vector scoring and unsupervised calibration", in Proc. IEEE SLT, 2014. [Paper]
"Iterative PLDA Adaptation for Speaker Diarization", in Proc. Interspeech, 2016. [Paper]
"Domain Adaptation of PLDA Models in Broadcast Diarization by Means of Unsupervised Speaker Clustering, in Proc. Interspeech, 2017. [Paper]
"Estimation of the Number of Speakers with Variational Bayesian PLDA in the DIHARD Diarization Challenge", in Proc. Interspeech, 2018. [Paper]
DCA-PLDA: "A Speaker Verification Backend with Robust Performance across Conditions", in Computer Speech & Language, 2022. [Paper] [Code]
"Generalized domain adaptation framework for parametric back-end in speaker recognition", in arXiv:2305.15567, 2023. [Paper]

With ASR

"Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR," in Proc. ICASSP, 2022. [Paper]
"Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription", in Proc. Interspeech, 2022. [Paper]
"Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator", in Proc. Interspeech, 2023. (CUHK) [Paper]
"Multi-resolution Approach to Identification of Spoken Languages and to Improve Overall Language Diarization System using Whisper Model", in Proc. Interspeech, 2023.
"Speaker Diarization for ASR Output with T-vectors: A Sequence Classification Approach", in Proc. Interspeech, 2023. [Paper]
"Lexical Speaker Error Correction: Leveraging Language Models for Speaker Diarization Error Correction", in Proc. Interspeech, 2023. (Amazon) [Paper]
"Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach,", in Proc. ICASSP, 2024. (NVIDIA) [Paper]
WEEND: "Towards Word-Level End-to-End Neural Speaker Diarization with Auxiliary Network," in arXiv:2309.08489, 2024. (Google) [Paper] [Supplementary]
"One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition", in Proc. ICASSP, 2024. (CMU) [Paper]
"Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization," in arXiv:2309.16482, 2024. (PU) [Paper]
"Joint Inference of Speaker Diarization and ASR with Multi-Stage Information Sharing," in Proc. ICASSP, 2024. (DKU) [Paper]
"Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers," in Proc. ICASSP, 2024. (Idiap) [Paper]
"A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition," in Proc. ICASSP, 2024. (USTC) [Paper]
Speaker-attributed ASR
- "SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR," in Proc. ASRU, 2023. (Alibaba) [Paper]
- "Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition," in arXiv:2312.10959, 2024. (NICT) [Paper]
- "On Speaker Attribution with SURT," in Proc. Odyssey, 2024. (JHU) [Paper]
- "Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications," in Proc. Odyssey, 2024. (CNRS) [Paper]

Language Diarization

"End-to-End Spoken Language Diarization with Wav2vec Embeddings", in Proc. Interspeech, 2023. [Paper] [Code]
"Multi-resolution Approach to Identification of Spoken Languages and To Improve Overall Language Diarization System Using Whisper Model," in Proc. Interspeech, 2023. [Paper]

With NLP (LLM)

"Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization", in Proc. ACL, 2023. (Alibaba) [Paper]
MMSCD: "Encoder-decoder multimodal speaker change detection", in Proc. Interspeech, 2023. (Naver) [Paper]
"Aligning Speakers: Evaluating and Visualizing Text-based Diarization Using Efficient Multiple Sequence Alignment", in Proc. ICTAI, 2023. [Paper]
"DiariST: Streaming Speech Translation with Speaker Diarization," in Proc. ICASSP, 2024. (Microsoft) [Paper] [Code]
JPCP: "Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation," in arXiv:2309.10456, 2024. (Alibaba) [Paper]
"DiarizationLM: Speaker Diarization Post-Processing with Large Language Models," in Submitted to ICLR, 2024. (Google) [Paper] [Code]
"LLM-based speaker diarization correction: A generalizable approach," in Submitted to IEEE/ACM TASLP, 2024. [Paper]
"AG-LSEC: Audio Grounded Lexical Speaker Error Correction," in Proc. Interspeech, 2024. (Amazon) [Paper]

With Vision

"Who said that?: Audio-visual speaker diarisation of real-world meetings", in Proc. Interspeech, 2019. (Naver) [Paper]
"Self-supervised learning for audio-visual speaker diarization", in Proc. ICASSP, 2020. (Tencent) [Paper] [Blog]
AVA-AVD (AVR-Net): "AVA-AVD: Audio-Visual Speaker Diarization in the Wild", in Proc. ACM MM, 2022. [Paper] [Code] [Video]
"End-to-End Audio-Visual Neural Speaker Diarization", in Proc. Interspeech, 2022. (USTC) [Paper] [Code] [Review]
DyViSE: "DyViSE: Dynamic Vision-Guided Speaker Embedding for Audio-Visual Speaker Diarization", in Proc. MMSP, 2022. (THU) [Paper] [Code]
"Audio-Visual Speaker Diarization in the Framework of Multi-User Human-Robot Interaction", in Proc. ICASSP, 2023. [Paper]
STHG: "Spatial-Temporal Heterogeneous Graph Learning for Advanced Audio-Visual Diarization", in Proc. CVPR, 2023. (Intel) [Paper]
"Speaker Diarization of Scripted Audiovisual Content," in arXiv:2308.02160, 2024. (Amazon) [Paper]
"Uncertainty-Guided End-to-End Audio-Visual Speaker Diarization for Far-Field Recordings," in Proc. ACM MM, 2023. [Paper]
"Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization," in Springer Computer Science proceedings, 2023. [Paper]
EEND-EDA++: "Late Audio-Visual Fusion for In-The-Wild Speaker Diarization," in arXiv:2211.01299v2, 2023. [Paper]
"AFL-Net: Integrating Audio, Facial, and Lip Modalities with Cross-Attention for Robust Speaker Diarization in the Wild," in Proc. ICASSP, 2024. (Tencent) [Paper] [Demos]
"Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation," in Proc. AAAI, 2024. (Tencent) [Paper]
"Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization," in Submitted to IEEE/ACM TASLP. (DKU) [Paper]
"3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization," in arXiv:2403.19971, 2024. (Alibaba) [Paper] [Code]
"Target Speech Diarization with Multimodal Prompts," in Submitted to IEEE/ACM TASLP, 2024. (NUS) [Paper]
MFV-KSD: "Multi-Stage Face-Voice Association Learning with Keynote Speaker Diarization," in Submitted to ACM MM, 2024. [Paper] [Code]

Related Spoofing

"Spoof Diarization: "What Spoofed When" in Partially Spoofed Audio," in Proc. Interspeech, 2024. (IITK) [Paper]

Related TTS

Speaker Anonymization

"A Benchmark for Multi-speaker Anonymization," in Submitted to IEEE/ACM TASLP, 2024. (SIT) [Paper] [Code]

Singing Diarization

"Song Data Cleansing for End-to-End Neural Singer Diarization Using Neural Analysis and Synthesis Framework," in Proc. Interspeech, 2024. (LY) [Paper]

With Emotion

"Speech Emotion Diarization: Which Emotion Appears When?," in Proc. ASRU, 2023. (Zaion) [Paper]
"EmoDiarize: Speaker Diarization and Emotion Identification from Speech Signals using Convolutional Neural Networks," in arxiv:2310.12851, 2023. [Paper]
"ED-TTS: Multi-scale Emotion Modeling using Cross-domain Emotion Diarization for Emotional Speech Synthesis, in Proc. ICASSP, 2024. [Paper]

Personal VAD

"Personal VAD: Speaker-Conditioned Voice Activity Detection", in Proc. Odyssey, 2020. (Google) [Paper]
"SVVAD: Personal Voice Activity Detection for Speaker Verification", in Proc. Interspeech, 2023. [Paper]

VAD & OSD & SCD

"Overlapped Speech Detection in Broadcast Streams Using X-vectors," in Proc. Interspeech, 2022. [Paper]
"Overlapped speech and gender detection with WavLM pre-trained features," in Proc. Interspeech, 2022. [Paper]
"Microphone Array Channel Combination Algorithms for Overlapped Speech Detection," in Proc. Interspeech, 2022. [Paper]
"Multitask Detection of Speaker Changes, Overlapping Speech and Voice Activity Using wav2vec 2.0," in Proc. ICASSP, 2023. [Paper] [Code]
"Semantic VAD: Low-Latency Voice Activity Detection for Speech Interaction," in Proc. Interspeech, 2023. [Paper]
"Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains," in arxiv:2307.13012, 2023. [Paper]
"Advancing the study of Large-Scale Learning in Overlapped Speech Detection," in arXiv:2308.05987, 2023. [Paper]
"USM-SCD: Multilingual Speaker Change Detection Based on Large Pretrained Foundation Models," in Proc. ICASSP, 2024. (Google) [Paper]
"Channel-Combination Algorithms for Robust Distant Voice Activity and Overlapped Speech Detection," in IEEE/ACM TASLP, 2024. [Paper]

Dataset

Voxconverse: "Spot the conversation: speaker diarisation in the wild", in Proc. Interspeech, 2020. (VGG, Naver) [Paper] [Code] [Dataset]
"MSDWild: Multi-modal Speaker Diarization Dataset in the Wild", in Proc. Interspeech, 2022. [Paper] [Dataset]
"LibriMix: An Open-Source Dataset for Generalizable Speech Separation," in arXiv:2005.11262, 2020. [Paper] [Code]
Ego4D: "Around the World in 3,000 Hours of Egocentric Video," in Proc. CVPR, 2022. (Meta) [Paper] [Code] [Page]
AliMeeting: "Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge," in Proc. ICASSP, 2022. (Alibaba) [Paper] [Dataset] [Code]
"VoxBlink: X-Large Speaker Verification Dataset on Camera", in Proc. ICASSP, 2024. [Paper] [Dataset]
"NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription," in arXiv:2401.08887, 2024. (MS) [Paper]
"A Comparative Analysis of Speaker Diarization Models: Creating a Dataset for German Dialectal Speech," in Proc. ACL, 2024. [Paper]
"Conversations in the wild: Data collection, automatic generation and evaluation," in Computer Speech & Language, 2025. [Paper]
"ALLIES: A Speech Corpus for Segmentation, Speaker Diarization, Speech Recognition and Speaker Change Detection," in Proc. ACL, 2024. (LIUM) [Paper]""

Self-Supervised

“Self-supervised Speaker Diarization”, in Proc. Interspeech, 2022. [Paper]
CSDA: "Continual Self-Supervised Domain Adaptation for End-to-End Speaker Diarization", in Proc. IEEE SLT, 2022. (CNRS) [Paper] [Code]

Semi-Supervised

"Active Learning Based Constrained Clustering For Speaker Diarization", in IEEE/ACM TASLP, 2017. (UT) [Paper]

Measurement

BER: “Balanced Error Rate For Speaker Diarization”, in arXiv:2211.04304, 2022. [Paper] [Code]

Child-Adult

"Robust Self Supervised Speech Embeddings for Child-Adult Classification in Interactions involving Children with Autism," in Proc. Interspeech, 2023. [Paper]
"Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions," in Proc. Interspeech, 2024. (USC) [Paper]

Challenge

VoxSRC (VoxCeleb Speaker Recognition Challenge)

VoxSRC-20 Track4

[Workshop]

1st: Microsoft [Tech Report] [Video]
2nd: BUT [Tech Report] [Video]
3rd: DKU [Tech Report]

VoxSRC-21 Track4

[Workshop]

1st: DKU [Tech Report] [Video]
2nd: Bytedance [Tech Report] [Video]
3rd: Tencent [Tech Report]

VoxSRC-22 Track4

[Paper] [Workshop]

1st: DKU [Tech Report] [Slide] [Video]
2nd: KristonAI [Tech Report] [Slide] [Video]
3rd: GIST [Tech Report] [Slide] [Video] [Review]

VoxSRC-23 Track4

[Paper] [Workshop]

1st: DKU [Tech Report] [Slide] [Video]
2nd: KrispAI [Tech Report] [Slide] [Video]
3rd: Pyannote [Tech Report] [Slide] [Video]
4th: GIST [Tech Report]
Wespeaker [Tech Report]

M2MeT (Multi-channel Multi-party Meeting Transcription Grand Challenge)

2022 M2MeT

[Introduction Paper] [Summary Paper] [Dataset-AliMeeting] [Code]

1st: DKU [Paper]
2nd: CUHK-TENCENT [Paper]

MISP (Multimodal Information Based Speech Processing)

2022 MISP Track1

[Introduction Paper] [Page] [Baseline Code]

1st: WHU-Alibaba [Paper] [Review]
2nd: SJTU [Paper]
3rd: NPU-ASLP [Paper]

DIHARD

2020 DIHARD III

[Page] [Paper] [Program]

Track1

1st: USTC [Paper] [Slides] [Video]
2nd: Hitachi [Paper] [Slide] [Video]
3rd: Naver Clova [Paper] [Slide] [Video]

Track2

1st: USTC-NELSLIP [Paper] [Slides] [Video]
2nd: Hitachi [Paper] [Slide] [Video]
3rd: DKU [Paper] [Slide] [Video]

Etc.

"End-to-end speaker diarization system for the third dihard challenge system description," in DIHARD III Tech. Report, 2021

The DISPLACE Challenge 2023

"The DISPLACE Challenge 2023 - DIarization of SPeaker and LAnguage in Conversational Environments," in Proc. Interspeech, 2023. [Paper] [Page]
"The SpeeD--ZevoTech submission at DISPLACE 2023," in Proc. Interspeech, 2023. [Paper]

MERLIon CCS Challenge 2023

"MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization," in Proc. Interspeech, 2023. [Paper] [Page]

CHiME-6

[Overview] [Paper]

ICMC-ASR Grand Challenge (ICASSP2024)

"ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge," 2023. [Paper]
"The NUS-HLT System for ICASSP2024 ICMC-ASR Grand Challenge," in Technical Report, 2023. [Paper]

The Second DISPLACE

"The Second DISPLACE Challenge : DIarization of SPeaker and LAnguage in Conversational Environments," in Proc. Interspeech, 2024. [Paper]"

CHiME-8

"The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization," 2024. [Paper]

Other Awesome-list

https://github.com/wq2012/awesome-diarization

https://github.com/xyxCalvin/awesome-speaker-diarization
