This repository has been archived by the owner on Oct 10, 2022. It is now read-only.
Releases: snakers4/open_stt
Direct Download Links
OPUS torrent micro release
- Dataset conversion to OPUS
- OPUS torrent - https://academictorrents.com/details/95b4cab0f99850e119114c8b6df00193ab5fa34f
- OPUS helpers and build instructions - https://github.com/snakers4/open_stt/#how-to-open-opus
- Coming soon - new unlimited direct links
- Further reading links
Finally a v1.0 release with 3x more data
The largest Russian STT dataset to date
- ~16m utterances;
- ~20 000 hours;
- 2.3 TB of data (in .wav format, int16);
- A wide variety of practical, close to real-life domains;
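As a rough sanity check, the headline figures above agree with each other if the audio is assumed to be 16 kHz mono. The sample rate and channel count are not stated in these notes, so both are assumptions, and `estimate_wav_bytes` is a hypothetical helper written for this illustration.

```python
def estimate_wav_bytes(hours: float, sample_rate: int = 16_000,
                       channels: int = 1, bytes_per_sample: int = 2) -> float:
    """Estimate the raw PCM payload size, ignoring the small per-file WAV header.

    sample_rate and channels are assumptions; int16 means 2 bytes per sample.
    """
    seconds = hours * 3600
    return seconds * sample_rate * channels * bytes_per_sample


size_tb = estimate_wav_bytes(20_000) / 1e12  # decimal terabytes
print(f"{size_tb:.2f} TB")  # prints 2.30 TB
```

Under these assumptions, ~20 000 hours of int16 audio comes to about 2.3 TB, which matches the size quoted in the list.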
Major highlights
- ~3 000 hours of a completely new domain - public speech;
- A huge Radio dataset update with 10 000+ hours;
- A 5% demo version of new Radio/Public Speech datasets;
- Vastly improved dataset normalization;
- Overall annotation quality is improved:
  - Upstream model quality improvement;
  - No more "dangling" letters;
  - Improved voice activity detection;
See the above TLDR bullets;
Next steps
- Major past error clean-up planned in 1.1;
- Refine and publish speaker labels, probably add speakers for old datasets;
- Improve / re-upload some of the existing datasets, refine the STT labels;
- Probably add new languages;
- Add pre-trained models;
New major release - radio / youtube / data quality distillation
TLDR:
- 855 GB (in .wav format, int16), non-archived;
- (new!) A new domain - radio;
- (new!) A larger YouTube dataset with 1000+ additional hours;
- (new!) A small (300 hours) YouTube dataset downloaded in maximum quality;
- (new!) 18 hours in 3 validation sets for YouTube / books / public calls with ground truth annotations;
- See the distilled files with "bad" data in this issue;
Added full WAV torrent release
Fixed issues with no txt files in torrents
Added txt files to torrents and direct archives.
Updated torrents.
Added torrent
Added link to a torrent download.
Dataset conversion to MP3
Key changes:
- Converted the majority of the dataset to MP3;
- Added a download script with md5 hashes;
- Fixed license;
- Added items to FAQ and common issues;
THE MAJORITY OF WAV LINKS WILL BE DELETED SOON.
Coming soon:
- Download via torrent;
- Large (1,500 hours) YouTube dataset;
- ... and more;
Dataset composition
Dataset | Utterances | Hours | GB | Av s/chars | Comment | Annotation | Quality/noise |
---|---|---|---|---|---|---|---|
public_youtube1500 (*) | | 1,500 | | | * Coming soon | | |
audiobook_2 | 1,149,404 | 1,511 | 166 | 4.7s / 56 | Books | Alignment (*) | 95% / crisp |
public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | 95% / ~crisp |
tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS 4 voices | 100% / crisp |
asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy |
asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp |
asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp |
public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
ru_RU | 5,826 | 17 | 2 | 11s / 12 | Public dataset | Alignment | 99% / crisp |
voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp |
Total | 4,657,291 | 3,961 | 431 |
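The average-duration part of the "Av s/chars" column can be reproduced from the Utterances and Hours columns. The sketch below uses three rows copied from the table above; `avg_seconds` is a hypothetical helper, and because the Hours column is rounded, small deviations on the smaller datasets are expected.

```python
# (utterances, hours) copied from the table above
rows = {
    "audiobook_2": (1_149_404, 1_511),
    "public_youtube700": (759_483, 701),
    "tts_russian_addresses": (1_741_838, 754),
}


def avg_seconds(utterances: int, hours: float) -> float:
    """Mean utterance duration in seconds."""
    return hours * 3600 / utterances


for name, (utts, hours) in rows.items():
    print(f"{name}: {avg_seconds(utts, hours):.1f}s")
```

For these three rows the results round to 4.7s, 3.3s and 1.6s, in line with the table.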
Links
Metadata file.
Dataset | GB, wav | GB, mp3 | Wav | Mp3 | Source | Manifest |
---|---|---|---|---|---|---|
audiobook_2 | 166 | 21.0 | down | part1 | Sources from the Internet + alignment | link |
asr_public_phone_calls_2 | 66 | 7.5 | down | part1 | Sources from the Internet + ASR | link |
asr_public_stories_2 | 9 (7.5) | NA | part1 | NA | Sources from the Internet + alignment | link |
tts_russian_addresses_rhvoice_4voices | 80.9 | 9.9 | down | part1 | TTS | link |
public_youtube700 | 75.0 | 9.6 | down | part1 | YouTube videos | link |
asr_public_phone_calls_1 | 22.7 | 2.6 | down | part1 | Sources from the Internet + ASR | link |
asr_public_stories_1 | 4.1 | 0.5 | down | part1 | Public stories | link |
public_series_1 | 1.9 | 0.2 | down | part1 | Public series | link |
ru_RU | 1.9 | 0.2 | down | part1 | Caito.de dataset | link |
voxforge_ru | 1.9 | 0.2 | down | part1 | Voxforge dataset | link |
russian_single | 0.9 | 0.1 | down | part1 | Russian single speaker dataset | link |
public_lecture_1 | 0.7 | 0.1 | down | part1 | Sources from the Internet | link |
Total | 431 | 52 |
Added large audio book corpus, large phone call database, asr stories
Key changes:
- Added datasets: 1,500 hours of aligned books, 600+ hours of phone calls, 78 hours of ASR stories;
- Formatting changes;
- Added license;
- Added items to FAQ and common issues;
Coming soon:
- Large (1,500 hours) YouTube dataset;
- ... and more;
Dataset composition
Dataset | Utterances | Hours | GB | Av s/chars | Comment | Annotation | Quality/noise |
---|---|---|---|---|---|---|---|
public_youtube1500 (*) | | 1,500 | | | * Coming soon | | |
audiobook_2 | 1,149,404 | 1,511 | 166 | 4.7s / 56 | Books | Alignment | 99% / crisp |
audiobook_1 | 196,666 | 237 | 26 | 4.3s / 50 | Books | Alignment | 99% / crisp |
public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | 95% / ~crisp |
tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS 4 voices | 100% / crisp |
asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy |
asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp |
asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp |
public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
ru_RU | 5,826 | 17 | 2 | 11s / 12 | Public dataset | Alignment | 99% / crisp |
voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp |
Total | 4,853,957 | 4,198 | 457 |
Links
Metadata file.
Dataset | GB | GB, compressed | Audio | Source | Manifest |
---|---|---|---|---|---|
audiobook_1 | 26 | 20.8 | part1 | Public books + alignment | link |
audiobook_2 | 166 | 131.7 | part1, part2, part3, part4, part5, part6, part7 | Public books + alignment | link |
asr_public_phone_calls_2 | 66 | 51.7 | part1, part2, part3 | ASR + public phone calls | link |
asr_public_stories_2 | 9 | 7.5 | part1 | Public books + alignment | link |
tts_russian_addresses_rhvoice_4voices | 80.9 | 67.0 | part1, part2, part3, part4 | TTS | link |
public_youtube700 | 75.0 | 67.0 | part1, part2, part3, part4 | YouTube videos | link |
asr_public_phone_calls_1 | 22.7 | 19.0 | part1 | ASR + public phone calls | link |
asr_public_stories_1 | 4.1 | 3.8 | part1 | Public stories | link |
public_series_1 | 1.9 | 1.7 | part1 | Public series | link |
ru_RU | 1.9 | 1.4 | part1 | Caito.de dataset | link |
voxforge_ru | 1.9 | 1.5 | part1 | Voxforge dataset | link |
russian_single | 0.9 | 0.7 | part1 | Russian single speaker dataset | link |
public_lecture_1 | 0.7 | 0.6 | part1 | Public lectures | link |
Total | 457 | 374 |
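Several archives in this table are split into multiple parts, and the checksum list further down includes a file named `asr_public_phone_calls_2.tar.gz_aa`, which looks like the output of the Unix `split` tool. Assuming that is how the parts were made, they can be rejoined by concatenating them in name order. This is a minimal sketch, not the project's own tooling; `join_parts` and the file names in the usage comment are illustrative.

```python
import shutil
from pathlib import Path


def join_parts(parts_dir: Path, prefix: str, output: Path) -> int:
    """Concatenate files named `<prefix>aa`, `<prefix>ab`, ... into `output`.

    Assumes the parts sort correctly by name, as `split` output does.
    Returns the number of parts that were joined.
    """
    parts = sorted(p for p in parts_dir.glob(prefix + "*") if p != output)
    with output.open("wb") as dst:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, dst)  # streams, so large parts are fine
    return len(parts)


# Example call (hypothetical local paths):
# join_parts(Path("."), "asr_public_phone_calls_2.tar.gz_",
#            Path("asr_public_phone_calls_2.tar.gz"))
```

Checking the md5 of each part before joining makes it easier to tell which download is corrupted.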
Check md5sum
md5sum /path/to/downloaded/file
type | md5sum | file |
---|---|---|
manifest | b0ce7564ba90b121aeb13aada73a6e30 | asr_public_phone_calls_1.csv |
manifest | 6867d14dfdec1f9e9b8ca2f1de9ceda6 | asr_public_phone_calls_2.csv |
manifest | 0bdd77e15172e654d9a1999a86e92c7f | asr_public_stories_1.csv |
manifest | f388013039d94dc36970547944db51c7 | asr_public_stories_2.csv |
manifest | 697738331b6021890c29a0d415d0f22d | private_buriy_audiobooks_1.csv |
manifest | 3b67e27c1429593cccbf7c516c4b582d | private_buriy_audiobooks_2.csv |
manifest | 04027c20eb3aff05f6067957ecff856b | public_lecture_1.csv |
manifest | 89da3f1b6afcd4d4936662ceabf3033e | public_series_1.csv |
manifest | a81dfb018c88d0ecd5194ab3d8ff6c95 | public_youtube700.csv |
manifest | c858f020729c34ba0ab525bbb8950d0c | ru_RU.csv |
manifest | 0275525914825dec663fd53390fdc9a0 | russian_single.csv |
manifest | 52f406f4e30fcc8c634f992befd91beb | tts_russian_addresses_rhvoice_4voices.csv |
audio | a5496898ee78654bf398ec6df71540d7 | asr_public_phone_calls_1.tar.gz |
audio | e4df5ef50787384648b59f5a87edc0c6 | asr_public_phone_calls_2.tar.gz |
audio | 97594127a922df8a7bcc2eecd2470805 | asr_public_phone_calls_2.tar.gz_aa |
audio | f9b6475f0f2898b16d9e6e0e648fb531 | asr_public_... |
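The same check can be scripted when there are many files to verify. The sketch below computes an MD5 digest in chunks, so multi-gigabyte archives do not need to fit in memory, and compares it with a value from the table above. The function names are illustrative and not part of the repository's download script.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_md5: str) -> bool:
    """Compare a file's digest with the expected value from the table."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example call (the path is hypothetical; the hash is taken from the table):
# verify("asr_public_phone_calls_1.csv", "b0ce7564ba90b121aeb13aada73a6e30")
```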
Added medium-sized YouTube dataset and TTS dataset
Key changes:
- The storage format was changed to an on-disk DB with hashes;
- Added a 700 hour YouTube dataset;
- Added a 700+ hour TTS dataset with Russian addresses;
- Added some utils to work with manifests;
- Added manifest files for easier porting into your ASR application;
- Discarded previous links;
- The dataset format will be uniform from now on; new "datasets" will simply be added;
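These notes do not describe the manifest layout. As an illustration only, the sketch below assumes a header-less CSV in which each row is `wav_path,text_path,duration_in_seconds`; check the actual files before relying on this. `total_hours` and the sample paths are hypothetical.

```python
import csv
import io


def total_hours(manifest_file) -> float:
    """Sum the third column (assumed: duration in seconds) and convert to hours."""
    reader = csv.reader(manifest_file)
    return sum(float(row[2]) for row in reader if row) / 3600


# Tiny in-memory stand-in for a real manifest such as public_youtube700.csv
sample = io.StringIO(
    "a/0001.wav,a/0001.txt,3.5\n"
    "a/0002.wav,a/0002.txt,1.9\n"
)
print(f"{total_hours(sample):.4f} h")
```

If the assumed layout holds, running this over a real manifest should give a figure close to the Hours column of the dataset composition table.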
Coming soon:
- Large (1,500 hours) phone call dataset;
- Large (1,500 hours) YouTube dataset;
- ... and more;
Dataset composition
Dataset | Utterances | Hours | GB | Av len/chars | Comment | Annotation | Quality/noise |
---|---|---|---|---|---|---|---|
asr_public_phone_calls_2 (*) | | 1,500 | | | * Coming soon | | |
public_youtube1500 (*) | | 1,500 | | | * Coming soon | | |
tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS, 4 voices | 100% / crisp |
public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | >95% / ~crisp |
asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 70% / crisp |
public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
ru_RU | 5,826 | 17 | 2 | 10.8s / 12 | Public dataset | Alignment | 99% / crisp |
voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | >95% / crisp |
Total | 2,825,904 | 1,771 | 190 |
Links
Metadata file.
Dataset | GB | GB, compressed | Audio | Source | Manifest |
---|---|---|---|---|---|
tts_russian_addresses_rhvoice_4voices | 80.9 | 67.0 | part1, part2, part3, part4 | TTS | link |
public_youtube700 | 75.0 | 67.0 | part1, part2, part3, part4 | YouTube videos | link |
asr_public_phone_calls_1 | 22.7 | 19.0 | part1 | ASR + public phone calls | link |
asr_public_stories_1 | 4.1 | 3.8 | part1 | Public stories | link |
public_series_1 | 1.9 | 1.7 | part1 | Public series | link |
ru_RU | 1.9 | 1.4 | part1 | Caito.de dataset | link |
voxforge_ru | 1.9 | 1.5 | part1 | Voxforge dataset | link |
russian_single | 0.9 | 0.7 | part1 | Russian single speaker dataset | link |
public_lecture_1 | 0.7 | 0.6 | part1 | Public lectures | link |
Total | 190 | 163 |