Arguably the largest public Russian STT dataset to date:
- ~7m utterances (1-2m with less perfect annotation, see #7);
- ~7000 hours;
- 855 GB (in `.wav` format, int16);
- (new!) A new domain - radio;
- (new!) A larger YouTube dataset with 1000+ additional hours;
- (new!) A small (300 hours) YouTube dataset downloaded in maximum quality;
- (new!) 18 hours in 3 validation sets for YouTube / books / public calls with ground truth annotation;
Prove us wrong! Open issues, collaborate, submit a PR, contribute, share your datasets! Let's make STT in Russian (and more) as open and available as CV models.
Planned releases:
- 1000-10,000 additional hours of books;
- Data quality distillation and improvement / annotation improvement;
- EVEN MOAR DATA (give us your ideas where to find it!);
- 1000+ additional hours of YouTube;
- Some validation / test sets;
- Plain benchmarks, "bad files";
- Mp3 torrent;
- Wav torrent;
- Radio set;
- ... and more!
Table of contents
- Dataset composition
- Downloads
- Annotation methodology
- Audio normalization
- Disk db methodology
- Helper functions
- Contacts
- Acknowledgements
- FAQ
- License
- Donations
Dataset | Utterances | Hours | GB | Avg s / chars | Comment | Annotation | Quality/noise |
---|---|---|---|---|---|---|---|
audiobook_2 | 1,149,404 | 1,511 | 162 | 4.7s / 56 | Books | Alignment (*) | 95% / crisp |
radio_2 | 651,645 | 1,439 | 154 | 7.95s / 110 | Radio | Alignment (*) | TBC, should be high |
public_youtube1120 | 1,410,979 | 1,104 | 237 | 2.82s / 34 | YouTube videos | Subtitles | 95% / ~crisp |
public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | YouTube videos | Subtitles | 95% / ~crisp |
tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS 4 voices | 100% / crisp |
asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy |
public_youtube1120_hq | 369,245 | 291 | 31 | 2.84s / 37 | YouTube videos HQ sound | Subtitles | 95% / ~crisp |
asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp |
asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp |
public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | YouTube videos | Subtitles | 95% / ~crisp |
ru_RU | 5,826 | 17 | 2 | 11s / 12 | Public dataset | Alignment | 99% / crisp |
voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
asr_calls_2_val | 12,950 | 7.7 | 2 | 2.15s / 34 | Phone calls | Manual annotation | 99% / crisp |
public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp |
buriy_audiobooks_2_val | 7,850 | 4.9 | 1 | 2.25s / 31 | Books | Manual annotation | 99% / crisp |
public_youtube700_val | 7,311 | 4.5 | 1 | 2.2s / 35 | YouTube videos | Manual annotation | 99% / crisp |
Total | 7,117,271 | 6,812 | 855 |
(*) Automatic alignment
This alignment was performed using Yuri's alignment tool. Contact him if you need alignment for your own dataset.
New train datasets added:
- 1,439 hours radio_2;
- 1,104 hours public_youtube1120;
- 291 hours public_youtube1120_hq;
New validation datasets added:
- 8 hours asr_calls_2_val;
- 5 hours buriy_audiobooks_2_val;
- 5 hours public_youtube700_val;
Also shared a wav version via torrent.
Added the forgotten txt files to mp3 archives. Updating the torrent.
Torrent created and uploaded to academictorrents.
Quickly converted the dataset to MP3 thanks to the community! Waiting for our account for academic torrents to be approved. v0.4 will boast MP3 download links.
If you want to support the project, you can:
- Help us with hosting (create a mirror) / provide a reliable node for torrent;
- Help us with writing some helper functions;
- Donate (each coffee pays for several full downloads) / use our DO referral link to help;
We are converting the dataset to MP3 now.
Please contact us using the contacts below if you would like to help.
Save us a couple of bucks, download via torrent:
You can download separate files via torrent. Try several torrent clients if some do not work.
Meta data file.
Dataset | GB, wav | GB, mp3 | Wav | Mp3 | Source | Manifest |
---|---|---|---|---|---|---|
audiobook_2 | 162 | 21.0 | torrent | part1 | Sources from the Internet + alignment | link |
radio_2 | 154 | 25.7 | torrent | part1 | Radio | link |
public_youtube1120 | 237 | 32.4 | torrent | part1 | YouTube videos | link |
asr_public_phone_calls_2 | 66 | 7.5 | torrent | part1 | Sources from the Internet + ASR | link |
public_youtube1120_hq | 31 | 8.6 | torrent | part1 | YouTube videos | link |
asr_public_stories_2 | 9 | 1.1 | torrent | part1 | Sources from the Internet + alignment | link |
tts_russian_addresses_rhvoice_4voices | 80.9 | 9.9 | torrent | part1 | TTS | link |
public_youtube700 | 75.0 | 9.6 | torrent | part1 | YouTube videos | link |
asr_public_phone_calls_1 | 22.7 | 2.6 | torrent | part1 | Sources from the Internet + ASR | link |
asr_public_stories_1 | 4.1 | 0.5 | torrent | part1 | Public stories | link |
public_series_1 | 1.9 | 0.2 | torrent | part1 | Public series | link |
ru_RU | 1.9 | 0.2 | torrent | part1 | Caito.de dataset | link |
voxforge_ru | 1.9 | 0.2 | torrent | part1 | Voxforge dataset | link |
russian_single | 0.9 | 0.1 | torrent | part1 | Russian single speaker dataset | link |
asr_calls_2_val | 2 | 0.2 | torrent | part1 | Sources from the Internet | link |
public_lecture_1 | 0.7 | 0.1 | torrent | part1 | Sources from the Internet + manual | link |
buriy_audiobooks_2_val | 1 | 0.15 | torrent | part1 | Books + manual | link |
public_youtube700_val | 2 | 0.13 | torrent | part1 | YouTube videos + manual | link |
Total | 855 | 87.5 |
- Download each dataset separately:
Via wget
wget https://ru-open-stt.ams3.digitaloceanspaces.com/some_file
For multi-threaded downloads, use aria2 with the `-x` flag, e.g.:
aria2c -c -x5 https://ru-open-stt.ams3.digitaloceanspaces.com/some_file
If necessary, merge chunks like this:
cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz
- Download the meta data and manifests for each dataset:
- Merge files (where applicable), unpack and enjoy!
Including links to deprecated files.
md5sum /path/to/downloaded/file
type | md5sum | file |
---|---|---|
audio | f24e21c69c03062d667caf0f055244f2 | asr_public_stories_2_mp3.tar.gz |
audio | a6f888c53d7cbded85ab51627ef57c96 | asr_public_phone_calls_1_mp3.tar.gz |
audio | f707e34f488c62af2e3142085ff595ad | asr_public_phone_calls_2_mp3.tar.gz |
audio | baa491ed0b526b2a989b8c4a8897429d | asr_public_stories_1_mp3.tar.gz |
audio | 42b9c8c2e31100d6c5b972c9ac000167 | private_buriy_audiobooks_2_mp3.tar.gz |
audio | 7a5704721012fafa115e7316e5f6e058 | public_lecture_1_mp3.tar.gz |
audio | 16cf820330f9f8b388395d777b2331ac | public_series_1_mp3.tar.gz |
audio | dd048e7110c0c852c353759dad8fec0f | public_youtube700_mp3.tar.gz |
audio | 579e9d98bd159a27d3573641edee69b0 | ru_ru_mp3.tar.gz |
audio | 177b041594684623ec7d038613e1330d | russian_single_mp3.tar.gz |
audio | d7ce4c4116dcc655be2b466f82c98b6e | tts_russian_addresses_rhvoice_4voices_mp3.tar.gz |
audio | 25ea6d9e249a242ecc217acc28c8077b | voxforge_ru_mp3.tar.gz |
audio | 97cd6b56ba1eb5088bc5643dce054028 | asr_calls_2_val_mp3.tar.gz |
audio | 69a465e218fc1f597f7b5da836952d9d | radio_2_mp3.tar.gz |
audio | 0cc0f50db85ec4271696b4eb03a2203c | buriy_audiobooks_2_val_mp3.tar.gz |
audio | f5d2e3d13b47e1566ba0b021f00788cf | public_youtube1120_hq_mp3.tar.gz |
audio | 12eb78a9ab7c3d39bbe2842b8d6550ca | public_youtube1120_mp3.tar.gz |
audio | f6b6034e1e91d9a0a5069fc9ad2ed545 | public_youtube700_val_mp3.tar.gz |
manifest | b0ce7564ba90b121aeb13aada73a6e30 | asr_public_phone_calls_1.csv |
manifest | 6867d14dfdec1f9e9b8ca2f1de9ceda6 | asr_public_phone_calls_2.csv |
manifest | 0bdd77e15172e654d9a1999a86e92c7f | asr_public_stories_1.csv |
manifest | f388013039d94dc36970547944db51c7 | asr_public_stories_2.csv |
manifest | 3b67e27c1429593cccbf7c516c4b582d | private_buriy_audiobooks_2.csv |
manifest | 04027c20eb3aff05f6067957ecff856b | public_lecture_1.csv |
manifest | 89da3f1b6afcd4d4936662ceabf3033e | public_series_1.csv |
manifest | a81dfb018c88d0ecd5194ab3d8ff6c95 | public_youtube700.csv |
manifest | c858f020729c34ba0ab525bbb8950d0c | ru_RU.csv |
manifest | 0275525914825dec663fd53390fdc9a0 | russian_single.csv |
manifest | 52f406f4e30fcc8c634f992befd91beb | tts_russian_addresses_rhvoice_4voices.csv |
audio | 7533581bb26975212817bcacb25546d0 | asr_public_stories_2.tar.gz |
manifest | 0cdbd085ffa6dab4bfdce7c3ed31fcfe | asr_calls_2_val.csv |
manifest | 4e0b73e0d00374482a0f2286acf314a0 | buriy_audiobooks_2_val.csv |
manifest | 6b9ce6828a55d2741d51bc3503345db5 | public_youtube1120.csv |
manifest | 33040a25cad99e70a81e9e54ff8c758e | public_youtube1120_hq.csv |
manifest | 525bd20802e529dcabf9e44345a50d0b | public_youtube700_val.csv |
manifest | 2996fe938cdfb37dc6e359e4384c9bfe | radio_2.csv |
You can use this script or this script with this config file. Please check the config first. You can also contribute a similar script in python.
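If you would like to contribute such a python script, a minimal sketch could look like the one below. The base URL is the one used above; the archive list, output folder and exact file paths are assumptions, so double-check them against the tables before use.

import hashlib
import urllib.request
from pathlib import Path

BASE_URL = 'https://ru-open-stt.ams3.digitaloceanspaces.com/'
OUT_DIR = Path('ru_open_stt_downloads')  # hypothetical output folder

# archive names and md5sums taken from the checksum table above
FILES = {
    'voxforge_ru_mp3.tar.gz': '25ea6d9e249a242ecc217acc28c8077b',
}

def md5(path, chunk_size=2 ** 20):
    # compute the md5 checksum of a file in chunks
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

OUT_DIR.mkdir(parents=True, exist_ok=True)
for name, expected in FILES.items():
    target = OUT_DIR / name
    if not target.exists():
        urllib.request.urlretrieve(BASE_URL + name, str(target))
    assert md5(target) == expected, 'md5 mismatch for {}'.format(name)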
The dataset is compiled using open domain sources. Some audio types are annotated automatically and verified statistically / using heuristics.
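The verification code itself is not published here; purely as an illustration, one such heuristic might flag utterances whose characters-per-second rate looks implausible. The manifest column names and thresholds below are assumptions:

import pandas as pd

# assumed manifest layout: wav_path, text, duration in seconds (no header row)
df = pd.read_csv('path/to/manifest.csv',
                 names=['wav_path', 'text', 'duration'])

# characters per second of speech; thresholds are purely illustrative
cps = df['text'].str.len() / df['duration']
suspicious = df[(cps < 2) | (cps > 30)]
print('{} of {} utterances look suspicious'.format(len(suspicious), len(df)))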
All files are normalized for easier / faster runtime augmentations and processing as follows:
- Converted to mono, if necessary;
- Converted to 16 kHz sampling rate, if necessary;
- Stored as 16-bit integers;
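A minimal sketch of such a conversion, assuming pydub / ffmpeg and hypothetical file names (the actual preprocessing pipeline may have differed):

from pydub import AudioSegment

sound = AudioSegment.from_file('raw_input.mp3')  # hypothetical source file
sound = sound.set_channels(1)                    # mono
sound = sound.set_frame_rate(16000)              # 16 kHz
sound = sound.set_sample_width(2)                # 16-bit integers
sound.export('normalized.wav', format='wav')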
Each audio file is hashed. Its hash is used to create a folder hierarchy for faster filesystem operations.
import hashlib
from pathlib import Path

# wav is a mono int16 numpy array; root_folder is the dataset root
target_format = 'wav'
wavb = wav.tobytes()
f_hash = hashlib.sha1(wavb).hexdigest()
store_path = Path(root_folder,
                  f_hash[0],
                  f_hash[1:3],
                  f_hash[3:15] + '.' + target_format)
Use helper functions from here for easier work with manifest files.
See example
from utils.open_stt_utils import read_manifest
manifest_df = read_manifest('path/to/manifest.csv')
See example
from utils.open_stt_utils import (plain_merge_manifests,
                                  check_files,
                                  save_manifest)
train_manifests = [
    'path/to/manifest1.csv',
    'path/to/manifest2.csv',
]
train_manifest = plain_merge_manifests(train_manifests,
                                       MIN_DURATION=0.1,
                                       MAX_DURATION=100)
check_files(train_manifest)
save_manifest(train_manifest,
              'my_manifest.csv')
Please contact us here or just create a GitHub issue!
Authors (in alphabetic order):
- Anna Slizhikova;
- Alexander Veysov;
- Diliara Nurtdinova;
- Dmitry Voronin;
- Yuri Baburov;
This repo would not be possible without these people:
- Many thanks to akreal for helping to encode the initial bulk of the data into mp3;
- 18 hours of ground truth annotation datasets for validation are a courtesy of activebc;
Kudos!
Mostly we used pydub (via ffmpeg) to convert to MP3.
We omitted blank files (YouTube mostly).
We used the following parameters:
- 16 kHz;
- 32 kbps;
- Mono;
Usually 128-192 kbps is enough for music at a 44.1 kHz sampling rate, and 64-96 kbps is enough for speech.
But here we have mono 16 kHz audio and usually only one speaker, so 32 kbps was a good choice.
We did not use other formats like `.ogg`, because `.mp3` is much more popular.
See example
from pydub import AudioSegment

# temp_path is the source wav file, store_mp3_path is the target mp3 path
sound = AudioSegment.from_file(temp_path,
                               format="wav")
file_handle = sound.export(store_mp3_path,
                           format="mp3",
                           parameters=["-ar", "16000", "-ac", "1"],
                           bitrate="32k")
It is up to you, but to save space and spare CPU during training, we would suggest the following pipeline to extract the files:
See example
# you can also use pydub, torchaudio, sox or whatever
# we ended up using scipy for speed
# this example also includes hashing step which is not necessary
import librosa
import hashlib
import numpy as np
from pathlib import Path
from scipy.io import wavfile
def save_wav_diskdb(wav,
                    root_folder='../data/ru_open_stt/',
                    target_sr=16000):
    assert type(wav) == np.ndarray
    assert wav.dtype == np.dtype('int16')
    assert len(wav.shape) == 1
    target_format = 'wav'
    wavb = wav.tobytes()
    f_hash = hashlib.sha1(wavb).hexdigest()
    store_path = Path(root_folder,
                      f_hash[0],
                      f_hash[1:3],
                      f_hash[3:15] + '.' + target_format)
    store_path.parent.mkdir(parents=True,
                            exist_ok=True)
    wavfile.write(filename=str(store_path),
                  rate=target_sr,
                  data=wav)
    return str(store_path)
root_folder = '../data/'
# save to int16, mono, 16 kHz to save space
target_dtype = np.dtype('int16')
target_sr = 16000
# librosa reads mp3
wav, sr = librosa.load(source_mp3_path,
                       mono=True,
                       sr=target_sr)
# librosa converts to float32 by default
wav = (wav * 32767).astype(target_dtype)  # cast to int
wav_path = save_wav_diskdb(wav,
                           root_folder=root_folder,
                           target_sr=target_sr)
Even though OGG is considered better for speech at higher compression rates, we opted for a more conventional, well-known format.
See example
import numpy as np
from scipy.io import wavfile

# path points at a 16-bit wav file; scale to float32 in [-1, 1]
sample_rate, sound = wavfile.read(path)
abs_max = np.abs(sound).max()
sound = sound.astype('float32')
if abs_max > 0:
    sound *= 1 / abs_max
We are not altruists; life just is not a zero-sum game.
Consider the progress in computer vision that was made possible by:
- Public datasets;
- Public pre-trained models;
- Open source frameworks;
- Open research;
STT does not enjoy the same attention from the ML community because it is data-hungry and public datasets are lacking, especially for languages other than English. Ultimately this leads to a worse-off situation for the general community.
- Blank files in the YouTube dataset. Removed in the mp3 archive; metadata not cleaned yet;
- Some files have low amplitude values / crash with torchaudio;
- Looks like scipy does not always write metadata when saving wavs (or you should save an (N, 1)-shaped array) - this can be fixed as shown above;
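A minimal sketch of the (N, 1) workaround mentioned in the last point (the file name is a placeholder and the array here is dummy data):

import numpy as np
from scipy.io import wavfile

wav = np.zeros(16000, dtype=np.int16)  # placeholder for a real int16 waveform
# write a (N, 1) shaped array instead of a flat one
wavfile.write('fixed.wav', 16000, wav.reshape(-1, 1))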
License:
- cc-by-nc and commercial usage available after agreement with dataset authors;
- Except for radio_2, which is public domain;
- Except for VoxForge, its license is GNU GPL 3.0;
- Except for the Caito.de dataset, its license is here.
Donate (each coffee pays for several full downloads) / use our DO referral link to help.