Arguably the largest public Russian STT dataset to date:
- ~16m utterances (1-2m with less perfect annotation, see #7);
- ~20 000 hours;
- 2.3 TB (in `.wav` format, int16);
- (new!) A new domain: public speech;
- (new!) A huge Radio dataset update with 10 000+ hours;
Prove us wrong! Open issues, collaborate, submit a PR, contribute, share your datasets! Let's make STT in Russian (and more) as open and available as CV models.
Planned releases:
- Refine and publish speaker labels, probably add speakers for old datasets;
- Improve / re-upload some of the existing datasets, refine the STT labels;
- Probably add new languages;
- Add pre-trained models;
- Dataset composition
- Downloads
- Annotation methodology
- Audio normalization
- Disk db methodology
- Helper functions
- Contacts
- Acknowledgements
- FAQ
- License
- Donations
Dataset | Utterances | Hours | GB | Avg s / chars | Comment | Annotation | Quality/noise |
---|---|---|---|---|---|---|---|
radio_v4 | 7,603,192 | 10,430 | 1,195 | 4.94s / 68 | Radio | Alignment (*) | 95% / crisp |
public_speech | 1,700,060 | 2,709 | 301 | 5.73s / 79 | Public speech | Alignment (*) | 95% / crisp |
audiobook_2 | 1,149,404 | 1,511 | 162 | 4.7s / 56 | Books | Alignment (*) | 95% / crisp |
radio_2 | 651,645 | 1,439 | 154 | 7.95s / 110 | Radio | Alignment (*) | TBC, should be high |
public_youtube1120 | 1,410,979 | 1,104 | 237 | 2.82s / 34 | Youtube videos | Subtitles | 95% / ~crisp |
public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | 95% / ~crisp |
tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS 4 voices | 100% / crisp |
asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy |
public_youtube1120_hq | 369,245 | 291 | 31 | 2.84s / 37 | YouTube videos HQ sound | Subtitles | 95% / ~crisp |
asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
radio_v4_add | 92,679 | 157 | 18 | 6.1s / 80 | Radio | Alignment (*) | 95% / crisp |
asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp |
asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp |
public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
asr_calls_2_val | 12,950 | 7.7 | 2 | 2.15s / 34 | Phone calls | Manual annotation | 99% / crisp |
public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp |
buriy_audiobooks_2_val | 7,850 | 4.9 | 1 | 2.25s / 31 | Books | Manual annotation | 99% / crisp |
public_youtube700_val | 7,311 | 4.5 | 1 | 2.2s / 35 | Youtube videos | Manual annotation | 99% / crisp |
Total | 16,513,202 | 20,108 | 2,369 | | | | |
(*) Automatic alignment
This alignment was performed using Yuri's alignment tool. Contact him if you need alignment for your own dataset.
New train datasets added:
- 10,430 hours radio_v4;
- 2,709 hours public_speech;
- 154 hours radio_v4_add;
- 5% sample of all new datasets with annotation.
New train datasets added:
- 1,439 hours radio_2;
- 1,104 hours public_youtube1120;
- 291 hours public_youtube1120_hq;
New validation datasets added:
- 8 hours asr_calls_2_val;
- 5 hours buriy_audiobooks_2_val;
- 5 hours public_youtube700_val;
Also shared a wav version via torrent.
Added the forgotten txt files to mp3 archives. Updating the torrent.
Torrent created and uploaded to academictorrents.
Quickly converted the dataset to MP3 thanks to the community! Waiting for our account for academic torrents to be approved. v0.4 will boast MP3 download links.
If you want to support the project, you can:
- Help us with hosting (create a mirror) / provide a reliable node for torrent;
- Help us with writing some helper functions;
- Donate (each coffee pays for several full downloads) / use our DO referral link to help;
We are converting the dataset to MP3 now.
Please contact us using the contacts below if you would like to help.
Save us a couple of bucks, download via torrent:
You can download separate files via torrent.
Try several torrent clients if some do not work.
It looks like, due to the large chunk size, most conventional torrent clients just fail silently.
This is not a problem (re-calculating the torrent takes a lot of time, and some people have already downloaded it); use aria2 instead:
apt update
apt install aria2
# list the torrent files
aria2c --show-files ru_open_stt_wav_v10.torrent
# download only one file
aria2c --select-file=4 ru_open_stt_wav_v10.torrent
# for more options visit
# https://aria2.github.io/manual/en/html/aria2c.html#basic-options
# https://aria2.github.io/manual/en/html/aria2c.html#bittorrent-metalink-options
# https://aria2.github.io/manual/en/html/aria2c.html#bittorrent-specific-options
If you are using Windows, you may use the Windows Subsystem for Linux to run these commands.
All WAV files can be downloaded ONLY via torrent
Dataset | GB, wav | GB, mp3 | Links | Source | Manifest |
---|---|---|---|---|---|
radio_v4 | 1059 | 263 | mp3+txt | Radio | manifest file |
public_speech | 257 | 38.5 | mp3+txt | Sources from the Internet + alignment | manifest file |
radio_v4_add | 15.7 | 2.2 | mp3+txt | Radio | manifest file |
5% of radio_v4 + public_speech | - | 15.3 | mp3+txt | - | manifest file |
audiobook_2 | 162 | 21.0 | mp3+txt | Sources from the Internet + alignment | manifest file |
radio_2 | 154 | 25.7 | mp3+txt | Radio | manifest file |
public_youtube1120 | 237 | 32.4 | mp3+txt | YouTube videos | manifest file |
asr_public_phone_calls_2 | 66 | 7.5 | mp3+txt | Sources from the Internet + ASR | manifest file |
public_youtube1120_hq | 31 | 8.6 | mp3+txt | YouTube videos | manifest file |
asr_public_stories_2 | 9 | 1.1 | mp3+txt | Sources from the Internet + alignment | manifest file |
tts_russian_addresses_rhvoice_4voices | 80.9 | 9.9 | mp3+txt | TTS | manifest file |
public_youtube700 | 75.0 | 9.6 | mp3+txt | YouTube videos | manifest file |
asr_public_phone_calls_1 | 22.7 | 2.6 | mp3+txt | Sources from the Internet + ASR | manifest file |
asr_public_stories_1 | 4.1 | 0.5 | mp3+txt | Public stories | manifest file |
public_series_1 | 1.9 | 0.2 | mp3+txt | Public series | manifest file |
asr_calls_2_val | 2 | 0.2 | mp3+txt | Sources from the Internet | manifest file |
public_lecture_1 | 0.7 | 0.1 | mp3+txt | Sources from the Internet + manual | manifest file |
buriy_audiobooks_2_val | 1 | 0.15 | mp3+txt | Books + manual | manifest file |
public_youtube700_val | 2 | 0.13 | mp3+txt | YouTube videos + manual | manifest file |
Total | 2,186 | 391 | | | |
- Download via `download.sh` or `download.py` with this config file. Please check the config first.
- Download each dataset separately:
Via wget
wget https://ru-open-stt.ams3.digitaloceanspaces.com/some_file
For multi-threaded downloads, use aria2 with the `-x` flag, e.g.:
aria2c -c -x5 https://ru-open-stt.ams3.digitaloceanspaces.com/some_file
If necessary, merge chunks like this:
cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz
- Download the metadata and manifests for each dataset:
- Merge files (where applicable), unpack and enjoy!
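For reference, here is a minimal Python sketch of the same merge-and-unpack step (the archive name and target path are placeholders; adjust them to the files you actually downloaded):

```python
import glob
import shutil
import tarfile

archive = 'ru_open_stt_v01.tar.gz'  # placeholder name

# merge the downloaded chunks (equivalent to `cat archive_* > archive`)
with open(archive, 'wb') as merged:
    for chunk in sorted(glob.glob(archive + '_*')):
        with open(chunk, 'rb') as part:
            shutil.copyfileobj(part, merged)

# unpack the merged archive
with tarfile.open(archive, 'r:gz') as tar:
    tar.extractall(path='../data/ru_open_stt/')
```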
The dataset is compiled using open domain sources. Some audio types are annotated automatically and verified statistically / using heuristics.
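Purely as an illustration (this is not the project's actual verification code), one such heuristic could flag utterances whose characters-per-second ratio is implausible; the column names and thresholds below are assumptions:

```python
import pandas as pd

# hypothetical manifest slice; real column names may differ
df = pd.DataFrame({
    'text':     ['привет мир', 'да', 'очень длинная фраза с плохим выравниванием'],
    'duration': [1.2, 3.0, 0.4],  # seconds
})

# characters per second; extreme values often point at misaligned annotation
cps = df['text'].str.len() / df['duration']
suspicious = df[(cps < 2) | (cps > 35)]
print(suspicious)
```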
All files are normalized for easier / faster runtime augmentations and processing as follows:
- Converted to mono, if necessary;
- Converted to 16 kHz sampling rate, if necessary;
- Stored as 16-bit integers;
Each audio file is hashed. The hash is used to create a folder hierarchy for faster filesystem operation.
import hashlib
from pathlib import Path

# `wav` is a mono, 16 kHz, int16 numpy array and `root_folder` is the dataset root
# (see the full example in the disk db section below)
target_format = 'wav'
wavb = wav.tobytes()
f_hash = hashlib.sha1(wavb).hexdigest()
store_path = Path(root_folder,
                  f_hash[0],
                  f_hash[1:3],
                  f_hash[3:15]+'.'+target_format)
Use helper functions from here for easier work with manifest files.
See example
from utils.open_stt_utils import read_manifest
manifest_df = read_manifest('path/to/manifest.csv')
See example
from utils.open_stt_utils import (plain_merge_manifests,
check_files,
save_manifest)
train_manifests = [
'path/to/manifest1.csv',
'path/to/manifest2.csv',
]
train_manifest = plain_merge_manifests(train_manifests,
MIN_DURATION=0.1,
MAX_DURATION=100)
check_files(train_manifest)
save_manifest(train_manifest,
'my_manifest.csv')
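As a quick follow-up sanity check, you can total up the durations in a manifest. A minimal sketch, assuming the manifest DataFrame exposes a `duration` column in seconds (inspect the actual columns first):

```python
from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/manifest.csv')
# the `duration` column name is an assumption -- check manifest_df.columns
total_hours = manifest_df['duration'].sum() / 3600
print('{} utterances, {:.1f} hours'.format(len(manifest_df), total_hours))
```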
Please contact us here or just create a GitHub issue!
Authors (in alphabetic order):
- Anna Slizhikova;
- Alexander Veysov;
- Diliara Nurtdinova;
- Dmitry Voronin;
- Yuri Baburov;
This repo would not be possible without these people:
- Many thanks to akreal for helping to encode the initial bulk of the data into mp3;
- 18 hours of ground truth annotation datasets for validation are a courtesy of activebc;
Kudos!
Mostly we used pydub (via ffmpeg) or sox (a much faster option) to convert to MP3.
We omitted blank files (YouTube mostly).
We used the following parameters:
- 16kHz;
- 32 kbps;
- Mono;
Usually 128-192 kbps is enough for music at a 44.1 kHz sampling rate, and 64-96 kbps is enough for speech.
But here we have mono, 16 kHz audio, usually with only one speaker, so 32 kbps was a good choice.
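As a rough back-of-the-envelope estimate (ignoring VBR variation and the txt files shipped in the archives), 32 kbps works out to about 14 MB per hour of audio:

```python
# rough storage estimate for mono 16 kHz speech encoded at 32 kbps
kbps = 32
hours = 20108                            # total hours from the table above
mb_per_hour = kbps / 8 * 3600 / 1000     # ~14.4 MB per hour
total_gb = mb_per_hour * hours / 1000    # ~290 GB for the whole corpus
print('~{:.1f} MB/hour, ~{:.0f} GB in total'.format(mb_per_hour, total_gb))
```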
We did not use other formats like `.ogg` because `.mp3` is much more popular.
See example `pydub`
from pydub import AudioSegment

# temp_path is the source wav file, store_mp3_path is the target mp3 path
sound = AudioSegment.from_file(temp_path,
                               format="wav")
file_handle = sound.export(store_mp3_path,
                           format="mp3",
                           parameters=["-ar", "16000", "-ac", "1"],
                           bitrate="32k")
See example `sox`
import subprocess

# -C 32.01: MP3 at ~32 kbps (the fractional part selects encoder quality); -c 1: mono
cmd = 'sox "{}" -C 32.01 -c 1 "{}"'.format(
    wav_path,
    store_mp3_path)
res = subprocess.call([cmd], shell=True)
if res != 0:
    print('Problems with {}'.format(wav_path))
It is up to you, but to save space and spare CPU during training, we would suggest the following pipeline to extract the files:
See example
# you can also use pydub, torchaudio, sox or whatever
# we ended up using scipy for speed
# this example also includes hashing step which is not necessary
import librosa
import hashlib
import numpy as np
from pathlib import Path
from scipy.io import wavfile
def save_wav_diskdb(wav,
root_folder='../data/ru_open_stt/',
target_sr=16000):
assert type(wav) == np.ndarray
assert wav.dtype == np.dtype('int16')
assert len(wav.shape)==1
target_format = 'wav'
wavb = wav.tobytes()
f_hash = hashlib.sha1(wavb).hexdigest()
store_path = Path(root_folder,
f_hash[0],
f_hash[1:3],
f_hash[3:15]+'.'+target_format)
store_path.parent.mkdir(parents=True,
exist_ok=True)
wavfile.write(filename=str(store_path),
rate=target_sr,
data=wav)
return str(store_path)
root_folder = '../data/'
# save to int16, mono, 16 kHz to save space
target_dtype = np.dtype('int16')
target_sr = 16000
# librosa reads mp3
wav, sr = librosa.load(source_mp3_path,
mono=True,
sr=target_sr)
# librosa converts to float32 by default
wav = (wav * 32767).astype(target_dtype) # cast to int
wav_path = save_wav_diskdb(wav,
root_folder=root_folder,
target_sr=target_sr)
Even though OGG / Opus is considered to be better for speech at higher compression rates, we opted for a more conventional, well-known format.
The LPCNet codec also boasts ultra-low-bitrate speech compression, but we decided to opt for a more familiar format to avoid worrying about actually losing signal in compression.
See example
import numpy as np
from scipy.io import wavfile

# peak-normalize an int16 wav to float32 in [-1, 1]
sample_rate, sound = wavfile.read(path)
abs_max = np.abs(sound).max()
sound = sound.astype('float32')
if abs_max > 0:
    sound *= 1 / abs_max
We are not altruists; life just is not a zero-sum game.
Consider the progress in computer vision that was made possible by:
- Public datasets;
- Public pre-trained models;
- Open source frameworks;
- Open research;
STT does not enjoy the same attention from the ML community because it is data-hungry and public datasets are lacking, especially for languages other than English. Ultimately this leads to a worse-off situation for the general community.
- Speaker labels coming soon;
- Validation sets for new domains: Radio/Public Speech will be added in next releases.
CC-BY-NC; commercial usage is available after agreement with the dataset authors.
Donate (each coffee pays for several full downloads), donate via open_collective, or just use our DO referral link to help.