LibriSpeechMix is the dataset used in "Serialized Output Training for End-to-End Overlapped Speech Recognition" and "Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers" for evaluating multi-talker speech recognition systems. The dataset is derived from the LibriSpeech "dev_clean" and "test_clean" sets.
- Notable features
    - Consists of partially overlapped speech utterances (instead of the commonly used fully overlapped utterances), which is closer to real scenarios.
- Designed for ASR evaluation.
    - The dataset comprises single-speaker, two-speaker-mixture, and three-speaker-mixture subsets. Each utterance in the original LibriSpeech evaluation data is used exactly N times in the N-speaker subset, which allows WERs to be compared across the three speaker-number conditions.
- Each mixed audio does not contain multiple utterances of the same speaker.
    - Includes the information for speaker profile extraction, which is suitable for speaker-attributed automatic speech recognition (SA-ASR) experiments.
- The dataset was used for the papers listed below.
- Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka: Serialized Output Training for End-to-End Overlapped Speech Recognition, Proc. Interspeech, pp. 2797-2801, 2020. [pdf]
    - Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka: Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers. Proc. Interspeech, pp. 36-40, 2020. [pdf]
- Interested readers are also referred to the following related papers.
- Naoyuki Kanda, Zhong Meng, Liang Lu, Yashesh Gaur, Xiaofei Wang, Zhuo Chen, Takuya Yoshioka: Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR, arXiv:2011.02921, 2020. [pdf]
- Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka: Investigation of End-To-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings. Proc. SLT, 2021 (to appear). [pdf]
- Linux
- python3
- flac
The following commands first download the LibriSpeech evaluation data ("dev_clean" and "test_clean") and then generate the mixed audio.
$ pip install soundfile librosa numpy
$ bash run.sh
The mixed audio files are generated under the ./data/ directory according to the information in the *.jsonl files.
list/
├── dev-clean-1mix.jsonl
├── dev-clean-2mix.jsonl
├── dev-clean-3mix.jsonl
├── test-clean-1mix.jsonl
├── test-clean-2mix.jsonl
└── test-clean-3mix.jsonl
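Each of these manifests is a JSON Lines file with one mixture per line. As a minimal sketch (assuming it is run from the repository root after `bash run.sh`), a manifest can be loaded like this:

```python
# Minimal sketch: each line of a manifest is one JSON object describing a mixture.
import json

with open("list/dev-clean-2mix.jsonl") as f:
    entries = [json.loads(line) for line in f]

print(len(entries))                              # number of mixtures in the list
print(entries[0]["id"], entries[0]["speakers"])  # id and speaker ids of the first mixture
```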
Element | Type | Meaning |
---|---|---|
id | Required | Utterance id |
mixed_wav | Required | Path to the mixed audio (NOTE: relative path from ./data/) |
texts | Required | Transcription of each utterance in the mixed audio |
speaker_profile | Option for SA-ASR | Audio list for speaker profile extraction (NOTE: relative path from ./data/) |
speaker_profile_index | Option for SA-ASR | Index of the speaker profile corresponding to each utterance in the mixed audio |
wavs | | Original wav files used to generate the mixed audio (NOTE: relative path from ./data/) |
delays | | Delay (in seconds) applied to each utterance before mixing the signals |
speakers | | Speaker id of each utterance |
durations | | Duration (in seconds) of each original wav file |
genders | | Gender of the speaker of each utterance in the mixed audio |
{
"id": "dev-clean-2mix/dev-clean-2mix-0000",
"mixed_wav": "dev-clean-2mix/dev-clean-2mix-0000.wav",
"texts": [
"MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL",
"THAT ENCHANTMENT HAD POSSESSED HIM USURPING AS IT WERE THE THRONE OF HIS LIFE AND DISPLACING IT WHEN IT CEASED HE WAS NOT HIS OWN MASTER"],
"speaker_profile": [
["dev-clean/6241/61943/6241-61943-0008.wav", "dev-clean/6241/61946/6241-61946-0001.wav"],
["dev-clean/174/84280/174-84280-0011.wav", "dev-clean/174/50561/174-50561-0005.wav"],
["dev-clean/1988/147956/1988-147956-0007.wav", "dev-clean/1988/24833/1988-24833-0011.wav"],
["dev-clean/7850/281318/7850-281318-0016.wav", "dev-clean/7850/286674/7850-286674-0013.wav"],
["dev-clean/1919/142785/1919-142785-0024.wav", "dev-clean/1919/142785/1919-142785-0034.wav"],
["dev-clean/6295/244435/6295-244435-0023.wav", "dev-clean/6295/64301/6295-64301-0002.wav"],
["dev-clean/2428/83699/2428-83699-0005.wav", "dev-clean/2428/83705/2428-83705-0025.wav"],
["dev-clean/1272/141231/1272-141231-0022.wav", "dev-clean/1272/128104/1272-128104-0005.wav"]],
"speaker_profile_index": [7, 5],
"wavs": ["dev-clean/1272/128104/1272-128104-0000.wav", "dev-clean/6295/64301/6295-64301-0026.wav"],
"delays": [0.0, 4.469242864375414],
"speakers": ["1272", "6295"],
"durations": [5.855, 10.43],
"genders": ["m", "m"]
}
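The sketch below illustrates how the fields of such an entry relate to each other: the original wavs are delayed by their delays values and summed to form the mixture, and speaker_profile_index points into speaker_profile. This is only an illustration, not the official run.sh logic, which may differ in details such as sample scaling.

```python
# Illustrative sketch: reconstruct a mixture from a jsonl entry and resolve speaker profiles.
# Paths in the entry are relative to ./data/ (see the table above).
# NOTE: the actual mixing performed by run.sh may differ in details (e.g. scaling).
import json
import numpy as np
import soundfile as sf

DATA = "./data"

with open("list/dev-clean-2mix.jsonl") as f:
    entry = json.loads(f.readline())  # first mixture in the manifest

# Delay each original utterance by its "delays" value and sum the padded signals.
padded = []
for wav, delay in zip(entry["wavs"], entry["delays"]):
    x, sr = sf.read(f"{DATA}/{wav}")
    padded.append(np.concatenate([np.zeros(int(round(delay * sr))), x]))

mixture = np.zeros(max(len(x) for x in padded))
for x in padded:
    mixture[: len(x)] += x

# speaker_profile_index[i] selects the enrollment audio list of the i-th speaker
# in the mixture from speaker_profile.
for spk, idx in zip(entry["speakers"], entry["speaker_profile_index"]):
    print(spk, entry["speaker_profile"][idx])
```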
- The ./list/optional/ directory contains optional *.jsonl files with different speaker profile settings for SA-ASR; a small path-construction sketch follows this list.
- Each file has a name of [dev|test]-clean-[1|2|3]mix-8prof-[1|2|5|10]utt.jsonl.
- [dev|test] indicates if this is development data or test data
    - [1|2|3]mix indicates the number of utterances (i.e., speakers) in each mixed audio
    - [1|2|5|10]utt indicates the number of utterances used to extract the speaker profile of each speaker
- Files with the suffix '-8prof-2utt.jsonl' are identical to the files in the ./list/ directory.
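As a small sketch of the naming pattern described above, the path of an optional manifest can be built from the desired split, mixture size, and number of profile utterances. The helper below is hypothetical and not part of the repository:

```python
# Hypothetical helper (not part of the repository): build the path of an optional
# manifest from the [dev|test]-clean-[1|2|3]mix-8prof-[1|2|5|10]utt.jsonl naming pattern.
from pathlib import Path

def optional_manifest(split="dev", n_mix=2, n_utt=10, root="list/optional"):
    return Path(root) / f"{split}-clean-{n_mix}mix-8prof-{n_utt}utt.jsonl"

print(optional_manifest())  # list/optional/dev-clean-2mix-8prof-10utt.jsonl
```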
@inproceedings{kanda2020serialized,
title={Serialized Output Training for End-to-End Overlapped Speech Recognition},
author={Kanda, Naoyuki and Gaur, Yashesh and Wang, Xiaofei and Meng, Zhong and Yoshioka, Takuya},
booktitle={Proc. Interspeech},
pages={2797--2801},
year={2020}
}
@inproceedings{kanda2020joint,
title={Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers},
author={Kanda, Naoyuki and Gaur, Yashesh and Wang, Xiaofei and Meng, Zhong and Chen, Zhuo and Zhou, Tianyan and Yoshioka, Takuya},
booktitle={Proc. Interspeech},
pages={36--40},
year={2020}
}