This repository has been archived by the owner on Oct 10, 2022. It is now read-only.

Added medium-sized YouTube dataset and TTS dataset

Pre-release
snakers4 released this 26 Apr 05:32
· 69 commits to master since this release

Key changes:

  • The storage format was changed to an on-disk DB with hashes;
  • Added a 700-hour YouTube dataset;
  • Added a 700+ hour TTS dataset with Russian addresses;
  • Added some utils for working with manifests;
  • Added manifest files for easier porting into your ASR application;
  • Discarded the previous links;
  • The dataset format will be uniform from now on, and new "datasets" will simply be added.

Coming soon:

  • Large (1,500 hours) phone call dataset;
  • Large (1,500 hours) YouTube dataset;
  • ... and more.

Dataset composition

Dataset                      | Utterances | Hours   | GB  | Avg. length / chars | Comment           | Annotation    | Quality / noise
-----------------------------|------------|---------|-----|---------------------|-------------------|---------------|----------------
asr_public_phone_calls_2 (*) |            | 1,500 * |     |                     | Coming soon       |               |
public_youtube1500 (*)       |            | 1,500 * |     |                     | Coming soon       |               |
tts_russian_addresses        | 1,741,838  | 754     | 81  | 1.6s / 20           | Russian addresses | TTS, 4 voices | 100% / crisp
public_youtube700            | 759,483    | 701     | 75  | 3.3s / 43           | Youtube videos    | Subtitles     | >95% / ~crisp
asr_public_phone_calls_1     | 233,868    | 211     | 23  | 3.3s / 29           | Phone calls       | ASR           | 70% / noisy
asr_public_stories_1         | 46,142     | 38      | 4   | 3.0s / 30           | Books             | ASR           | 70% / crisp
public_series_1              | 20,243     | 17      | 2   | 3.1s / 38           | Youtube videos    | Subtitles     | 95% / ~crisp
ru_RU                        | 5,826      | 17      | 2   | 10.8s / 12          | Public dataset    | Alignment     | 99% / crisp
voxforge_ru                  | 8,344      | 17      | 2   | 7.5s / 77           | Public dataset    | Reading       | 100% / crisp
russian_single               | 3,357      | 9       | 1   | 9.3s / 102          | Public dataset    | Alignment     | 99% / crisp
public_lecture_1             | 6,803      | 6       | 1   | 3.4s / 47           | Lectures          | Subtitles     | >95% / crisp
Total                        | 2,825,904  | 1,771   | 190 |                     |                   |               |
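
The per-dataset figures can be cross-checked against each other. The sketch below is not part of the released utils; it copies the utterance counts, hours, and reported average lengths from the composition table and recomputes the average utterance length as hours * 3600 / utterances. The function name `average_length_seconds` is illustrative. Because the hour values are rounded to whole hours, the recomputed lengths only roughly match the reported ones, and the summed hours come out one below the stated total of 1,771, presumably for the same reason.

```python
# Rows copied from the composition table:
# (dataset, utterances, hours, reported average length in seconds).
ROWS = [
    ("tts_russian_addresses", 1_741_838, 754, 1.6),
    ("public_youtube700", 759_483, 701, 3.3),
    ("asr_public_phone_calls_1", 233_868, 211, 3.3),
    ("asr_public_stories_1", 46_142, 38, 3.0),
    ("public_series_1", 20_243, 17, 3.1),
    ("ru_RU", 5_826, 17, 10.8),
    ("voxforge_ru", 8_344, 17, 7.5),
    ("russian_single", 3_357, 9, 9.3),
    ("public_lecture_1", 6_803, 6, 3.4),
]


def average_length_seconds(utterances: int, hours: float) -> float:
    """Average utterance length implied by the total hours and utterance count."""
    return hours * 3600 / utterances


total_utterances = sum(row[1] for row in ROWS)
total_hours = sum(row[2] for row in ROWS)

if __name__ == "__main__":
    for name, utterances, hours, reported in ROWS:
        derived = average_length_seconds(utterances, hours)
        print(f"{name:28s} derived {derived:5.2f}s, reported {reported:5.1f}s")
    print(f"utterances: {total_utterances:,}  hours: {total_hours:,}")
```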

Links

Metadata file.

Dataset                               | GB   | GB, compressed | Audio                      | Source                         | Manifest
--------------------------------------|------|----------------|----------------------------|--------------------------------|---------
tts_russian_addresses_rhvoice_4voices | 80.9 | 67.0           | part1, part2, part3, part4 | TTS                            | link
public_youtube700                     | 75.0 | 67.0           | part1, part2, part3, part4 | YouTube videos                 | link
asr_public_phone_calls_1              | 22.7 | 19.0           | part1                      | ASR + public phone calls       | link
asr_public_stories_1                  | 4.1  | 3.8            | part1                      | Public stories                 | link
public_series_1                       | 1.9  | 1.7            | part1                      | Public series                  | link
ru_RU                                 | 1.9  | 1.4            | part1                      | Caito.de dataset               | link
voxforge_ru                           | 1.9  | 1.5            | part1                      | Voxforge dataset               | link
russian_single                        | 0.9  | 0.7            | part1                      | Russian single speaker dataset | link
public_lecture_1                      | 0.7  | 0.6            | part1                      | Public lectures                | link
Total                                 | 190  | 163            |                            |                                |
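
When planning a download, it can help to estimate the disk space needed. The sketch below is an illustration rather than part of the release; it copies the sizes from the links table and sums them. It assumes, as a worst case, that the compressed archives stay on disk until extraction finishes, so the peak requirement is the compressed size plus the unpacked size. The names `SIZES_GB` and `peak_disk_gb` are hypothetical.

```python
# Sizes copied from the links table: dataset -> (unpacked GB, compressed GB).
SIZES_GB = {
    "tts_russian_addresses_rhvoice_4voices": (80.9, 67.0),
    "public_youtube700": (75.0, 67.0),
    "asr_public_phone_calls_1": (22.7, 19.0),
    "asr_public_stories_1": (4.1, 3.8),
    "public_series_1": (1.9, 1.7),
    "ru_RU": (1.9, 1.4),
    "voxforge_ru": (1.9, 1.5),
    "russian_single": (0.9, 0.7),
    "public_lecture_1": (0.7, 0.6),
}


def peak_disk_gb(names):
    """Worst-case space: archives and unpacked data on disk at the same time."""
    return sum(SIZES_GB[n][0] + SIZES_GB[n][1] for n in names)


total_unpacked = sum(raw for raw, _ in SIZES_GB.values())
total_compressed = sum(packed for _, packed in SIZES_GB.values())

if __name__ == "__main__":
    print(f"unpacked:   {total_unpacked:.1f} GB")
    print(f"compressed: {total_compressed:.1f} GB")
    print(f"peak (all): {peak_disk_gb(SIZES_GB):.1f} GB")
    # A smaller selection, e.g. only the three small public datasets:
    print(f"peak (small): {peak_disk_gb(['ru_RU', 'voxforge_ru', 'russian_single']):.1f} GB")
```

Deleting each archive right after it is unpacked lowers the peak to roughly the unpacked size plus the largest single archive still on disk.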