~3 000 hours of a completely new domain - public speech;
A huge Radio dataset update with 10 000+ hours;
A 5% demo version of new Radio/Public Speech datasets;
Vastly improved dataset normalization;
Overall annotation quality is improved:
- Upstream model quality improvement;
- No more "dangling" letters;
- Improved voice activity detection;
  See the above TLDR bullets;

Next steps

A major clean-up of past errors is planned for 1.1;
Refine and publish speaker labels, probably add speakers for old datasets;
Improve / re-upload some of the existing datasets, refine the STT labels;
Probably add new languages;
Add pre-trained models;

02 Jul 06:33

snakers4

v0.5-beta

9ca61e4

New major release - radio / youtube / data quality distillation

Pre-release

TLDR:

855 GB (in .wav format in int16) non archived;
(new!) A new domain - radio;
(new!) A larger YouTube dataset with 1000+ additional hours;
(new!) A small (300 hours) YouTube dataset downloaded in maximum quality;
(new!) 18 hours in 3 validation sets for YouTube / books / public calls with ground truth annotations;
See the distilled files with "bad" data in this issue;

19 May 10:21

snakers4

v0.4.3-alpha

05d2232

Added full WAV torrent release

Pre-release

An MP3 version of the dataset;
A WAV version of the dataset;

13 May 09:43

snakers4

v0.4.2-alpha

09dd47d

Fixed issues with no txt files in torrents

Pre-release

Added txt files to torrents and direct archives.
Updated torrents.

12 May 17:02

snakers4

v0.4.1-alpha

158e672

Added torrent

Pre-release

Added link to a torrent download.

10 May 15:00

snakers4

v0.4-alpha

328915c

Dataset conversion to MP3

Pre-release

Key changes:

Converted the majority of the dataset to MP3;
Added a download script with md5 hashes built in;
Fixed license;
Added items to FAQ and common issues;

THE MAJORITY OF WAV LINKS WILL BE DELETED SOON.

Coming soon:

Download via torrent;
Large (1,500 hours) YouTube dataset;
... and more)

Dataset composition

Dataset	Utterances	Hours	GB	Avg s / chars	Comment	Annotation	Quality/noise
public_youtube1500 (*)		1,500			* Coming soon
audiobook_2	1,149,404	1,511	166	4.7s / 56	Books	Alignment (*)	95% / crisp
public_youtube700	759,483	701	75	3.3s / 43	Youtube videos	Subtitles	95% / ~crisp
tts_russian_addresses	1,741,838	754	81	1.6s / 20	Russian addresses	TTS 4 voices	100% / crisp
asr_public_phone_calls_2	603,797	601	66	3.6s / 37	Phone calls	ASR	70% / noisy
asr_public_phone_calls_1	233,868	211	23	3.3s / 29	Phone calls	ASR	70% / noisy
asr_public_stories_2	78,186	78	9	3.5s / 43	Books	ASR	80% / crisp
asr_public_stories_1	46,142	38	4	3.0s / 30	Books	ASR	80% / crisp
public_series_1	20,243	17	2	3.1s / 38	Youtube videos	Subtitles	95% / ~crisp
ru_RU	5,826	17	2	11s / 12	Public dataset	Alignment	99% / crisp
voxforge_ru	8,344	17	2	7.5s / 77	Public dataset	Reading	100% / crisp
russian_single	3,357	9	1	9.3s / 102	Public dataset	Alignment	99% / crisp
public_lecture_1	6,803	6	1	3.4s / 47	Lectures	Subtitles	95% / crisp
Total	4,657,291	3,961	431
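The Total row can be cross-checked against the per-dataset rows. Below is a minimal Python sketch with the figures copied by hand from the table above; the names `ROWS`, `REPORTED_TOTAL` and `column_sums` are illustrative and not part of the repository. Hours and GB are rounded per row, so only the utterance count is expected to match exactly.

```python
# (dataset, utterances, hours, GB) copied from the v0.4 composition table.
ROWS = [
    ("audiobook_2", 1_149_404, 1_511, 166),
    ("public_youtube700", 759_483, 701, 75),
    ("tts_russian_addresses", 1_741_838, 754, 81),
    ("asr_public_phone_calls_2", 603_797, 601, 66),
    ("asr_public_phone_calls_1", 233_868, 211, 23),
    ("asr_public_stories_2", 78_186, 78, 9),
    ("asr_public_stories_1", 46_142, 38, 4),
    ("public_series_1", 20_243, 17, 2),
    ("ru_RU", 5_826, 17, 2),
    ("voxforge_ru", 8_344, 17, 2),
    ("russian_single", 3_357, 9, 1),
    ("public_lecture_1", 6_803, 6, 1),
]

# Totals as printed in the table: utterances, hours, GB.
REPORTED_TOTAL = (4_657_291, 3_961, 431)


def column_sums(rows):
    """Return (utterances, hours, GB) summed over all dataset rows."""
    utterances = sum(row[1] for row in rows)
    hours = sum(row[2] for row in rows)
    gb = sum(row[3] for row in rows)
    return utterances, hours, gb


if __name__ == "__main__":
    labels = ("utterances", "hours", "GB")
    for label, got, reported in zip(labels, column_sums(ROWS), REPORTED_TOTAL):
        # Small differences in hours / GB come from per-row rounding.
        print(f"{label}: summed={got} reported={reported} diff={got - reported}")
```

The planned public_youtube1500 row is left out because it has no utterance or size figures yet.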

Links

Metadata file.

Dataset	GB, wav	GB, mp3	Wav	Mp3	Source	Manifest
audiobook_2	166	21.0	down	part1	Sources from the Internet + alignment	link
asr_public_phone_calls_2	66	7.5	down	part1	Sources from the Internet + ASR	link
asr_public_stories_2	9 (7.5)	NA	part1	NA	Sources from the Internet + alignment	link
tts_russian_addresses_rhvoice_4voices	80.9	9.9	down	part1	TTS	link
public_youtube700	75.0	9.6	down	part1	YouTube videos	link
asr_public_phone_calls_1	22.7	2.6	down	part1	Sources from the Internet + ASR	link
asr_public_stories_1	4.1	0.5	down	part1	Public stories	link
public_series_1	1.9	0.2	down	part1	Public series	link
ru_RU	1.9	0.2	down	part1	Caito.de dataset	link
voxforge_ru	1.9	0.2	down	part1	Voxforge dataset	link
russian_single	0.9	0.1	down	part1	Russian single speaker dataset	link
public_lecture_1	0.7	0.1	down	part1	Sources from the Internet	link
Total	431	52
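To estimate how much disk space the MP3 conversion saves, the two size columns can be compared directly. The sketch below is only an illustration: the sizes are copied from the table above, and `SIZES_GB` and `compression_ratio` are hypothetical names, not part of the download script.

```python
# (wav GB, mp3 GB) pairs copied from the links table; rows marked NA are skipped.
SIZES_GB = {
    "audiobook_2": (166.0, 21.0),
    "asr_public_phone_calls_2": (66.0, 7.5),
    "tts_russian_addresses_rhvoice_4voices": (80.9, 9.9),
    "public_youtube700": (75.0, 9.6),
    "asr_public_phone_calls_1": (22.7, 2.6),
    "asr_public_stories_1": (4.1, 0.5),
}


def compression_ratio(wav_gb: float, mp3_gb: float) -> float:
    """Return how many times smaller the MP3 copy is than the WAV copy."""
    if mp3_gb <= 0:
        raise ValueError("mp3 size must be positive")
    return wav_gb / mp3_gb


if __name__ == "__main__":
    for name, (wav_gb, mp3_gb) in SIZES_GB.items():
        print(f"{name}: {compression_ratio(wav_gb, mp3_gb):.1f}x smaller as MP3")
    # Totals row of the table: 431 GB of WAV versus 52 GB of MP3.
    print(f"total: {compression_ratio(431, 52):.1f}x smaller as MP3")
```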

30 Apr 08:45

snakers4

v0.3-alpha

dd6ac59

Added large audio book corpus, large phone call database, asr stories

Pre-release

Key changes:

Added datasets: 1,500 hours of aligned books, 600+ hours of phone calls, 78 hours of ASR stories;
Formatting changes;
Added license;
Added items to FAQ and common issues;

Coming soon:

Large (1,500 hours) YouTube dataset;
... and more)

Dataset composition

Dataset	Utterances	Hours	GB	Avg s / chars	Comment	Annotation	Quality/noise
public_youtube1500 (*)		1,500			* Coming soon
audiobook_2	1,149,404	1,511	166	4.7s / 56	Books	Alignment	99% / crisp
audiobook_1	196,666	237	26	4.3s / 50	Books	Alignment	99% / crisp
public_youtube700	759,483	701	75	3.3s / 43	Youtube videos	Subtitles	95% / ~crisp
tts_russian_addresses	1,741,838	754	81	1.6s / 20	Russian addresses	TTS 4 voices	100% / crisp
asr_public_phone_calls_2	603,797	601	66	3.6s / 37	Phone calls	ASR	70% / noisy
asr_public_phone_calls_1	233,868	211	23	3.3s / 29	Phone calls	ASR	70% / noisy
asr_public_stories_2	78,186	78	9	3.5s / 43	Books	ASR	80% / crisp
asr_public_stories_1	46,142	38	4	3.0s / 30	Books	ASR	80% / crisp
public_series_1	20,243	17	2	3.1s / 38	Youtube videos	Subtitles	95% / ~crisp
ru_RU	5,826	17	2	11s / 12	Public dataset	Alignment	99% / crisp
voxforge_ru	8,344	17	2	7.5s / 77	Public dataset	Reading	100% / crisp
russian_single	3,357	9	1	9.3s / 102	Public dataset	Alignment	99% / crisp
public_lecture_1	6,803	6	1	3.4s / 47	Lectures	Subtitles	95% / crisp
Total	4,853,957	4,198	457

Links

Metadata file.

Dataset	GB	GB, compressed	Audio	Source	Manifest
audiobook_1	26	20.8	part1	Public books + alignment	link
audiobook_2	166	131.7	part1, part2, part3, part4, part5, part6, part7	Public books + alignment	link
asr_public_phone_calls_2	66	51.7	part1, part2, part3	ASR + public phone calls	link
asr_public_stories_2	9	7.5	part1	Public books + alignment	link
tts_russian_addresses_rhvoice_4voices	80.9	67.0	part1, part2, part3, part4	TTS	link
public_youtube700	75.0	67.0	part1, part2, part3, part4	YouTube videos	link
asr_public_phone_calls_1	22.7	19.0	part1	ASR + public phone calls	link
asr_public_stories_1	4.1	3.8	part1	Public stories	link
public_series_1	1.9	1.7	part1	Public series	link
ru_RU	1.9	1.4	part1	Caito.de dataset	link
voxforge_ru	1.9	1.5	part1	Voxforge dataset	link
russian_single	0.9	0.7	part1	Russian single speaker dataset	link
public_lecture_1	0.7	0.6	part1	Public lectures	link
Total	457	374

Check md5sum

md5sum /path/to/downloaded/file

type	md5sum	file
manifest	b0ce7564ba90b121aeb13aada73a6e30	asr_public_phone_calls_1.csv
manifest	6867d14dfdec1f9e9b8ca2f1de9ceda6	asr_public_phone_calls_2.csv
manifest	0bdd77e15172e654d9a1999a86e92c7f	asr_public_stories_1.csv
manifest	f388013039d94dc36970547944db51c7	asr_public_stories_2.csv
manifest	697738331b6021890c29a0d415d0f22d	private_buriy_audiobooks_1.csv
manifest	3b67e27c1429593cccbf7c516c4b582d	private_buriy_audiobooks_2.csv
manifest	04027c20eb3aff05f6067957ecff856b	public_lecture_1.csv
manifest	89da3f1b6afcd4d4936662ceabf3033e	public_series_1.csv
manifest	a81dfb018c88d0ecd5194ab3d8ff6c95	public_youtube700.csv
manifest	c858f020729c34ba0ab525bbb8950d0c	ru_RU.csv
manifest	0275525914825dec663fd53390fdc9a0	russian_single.csv
manifest	52f406f4e30fcc8c634f992befd91beb	tts_russian_addresses_rhvoice_4voices.csv
audio	a5496898ee78654bf398ec6df71540d7	asr_public_phone_calls_1.tar.gz
audio	e4df5ef50787384648b59f5a87edc0c6	asr_public_phone_calls_2.tar.gz
audio	97594127a922df8a7bcc2eecd2470805	asr_public_phone_calls_2.tar.gz_aa
audio	f9b6475f0f2898b16d9e6e0e648fb531	asr_public_...
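The same check can be scripted when many files need verifying. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify` are illustrative and are not taken from the repository's download script. The expected hash is whatever the table above lists for the file in question.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_md5: str) -> bool:
    """Compare a file's md5 with the expected value, ignoring letter case."""
    return md5_of_file(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    # Example with the first manifest row of the table; the local path is an assumption.
    target = Path("asr_public_phone_calls_1.csv")
    if target.exists():
        ok = verify(target, "b0ce7564ba90b121aeb13aada73a6e30")
        print(f"{target}: {'OK' if ok else 'MISMATCH'}")
```

Reading in chunks matters here because some of the audio archives are tens of gigabytes in size.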

26 Apr 05:32

snakers4

v0.2-alpha

f14a883

Added medium-sized YouTube dataset and TTS dataset

Pre-release

Key changes:

The storage format was changed to on-disk DB with hashes;
Added a 700 hour YouTube dataset;
Added a 700+ hour TTS dataset with Russian addresses;
Added some utils to work with manifests;
Added manifest files for easier porting into your ASR application;
Discarded previous links;
The dataset format will be uniform from now on; new "datasets" will simply be added;
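As an illustration of how a manifest might be consumed, here is a minimal Python sketch. The column layout (audio path, transcript path, duration in seconds, no header row) is an assumption made for this example and should be checked against the actual manifest files; `read_manifest` and `total_hours` are hypothetical helpers, not the utils shipped with the release.

```python
import csv
import io


def read_manifest(lines):
    """Yield (wav_path, text_path, duration_seconds) from manifest rows.

    Assumes a header-less CSV whose first three columns are the audio path,
    the transcript path and the duration in seconds (an assumed layout).
    """
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        wav_path, text_path, duration = row[:3]
        yield wav_path, text_path, float(duration)


def total_hours(entries):
    """Sum the durations of manifest entries and convert seconds to hours."""
    return sum(duration for _, _, duration in entries) / 3600.0


# Two made-up rows standing in for a real manifest file.
sample = io.StringIO("a/0001.wav,a/0001.txt,3.5\nb/0002.wav,b/0002.txt,1.5\n")
entries = list(read_manifest(sample))

if __name__ == "__main__":
    print(f"{len(entries)} utterances, {total_hours(entries):.6f} hours")
```

For a real file, pass an open file handle (opened with `newline=""`) to `read_manifest` instead of the in-memory sample.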

Coming soon:

Large (1,500 hours) phone call dataset;
Large (1,500 hours) YouTube dataset;
... and more)

Dataset composition

Dataset	Utterances	Hours	GB	Avg s / chars	Comment	Annotation	Quality/noise
asr_public_phone_calls_2 (*)		1,500			* Coming soon
public_youtube1500 (*)		1,500			* Coming soon
tts_russian_addresses	1,741,838	754	81	1.6s / 20	Russian addresses	TTS, 4 voices	100% / crisp
public_youtube700	759,483	701	75	3.3s / 43	Youtube videos	Subtitles	>95% / ~crisp
asr_public_phone_calls_1	233,868	211	23	3.3s / 29	Phone calls	ASR	70% / noisy
asr_public_stories_1	46,142	38	4	3.0s / 30	Books	ASR	70% / crisp
public_series_1	20,243	17	2	3.1s / 38	Youtube videos	Subtitles	95% / ~crisp
ru_RU	5,826	17	2	10.8s / 12	Public dataset	Alignment	99% / crisp
voxforge_ru	8,344	17	2	7.5s / 77	Public dataset	Reading	100% / crisp
russian_single	3,357	9	1	9.3s / 102	Public dataset	Alignment	99% / crisp
public_lecture_1	6,803	6	1	3.4s / 47	Lectures	Subtitles	>95% / crisp
Total	2,825,904	1,771	190

Links

Metadata file.

Dataset	GB	GB, compressed	Audio	Source	Manifest
tts_russian_addresses_rhvoice_4voices	80.9	67.0	part1, part2, part3, part4	TTS	link
public_youtube700	75.0	67.0	part1, part2, part3, part4	YouTube videos	link
asr_public_phone_calls_1	22.7	19.0	part1	ASR + public phone calls	link
asr_public_stories_1	4.1	3.8	part1	Public stories	link
public_series_1	1.9	1.7	part1	Public series	link
ru_RU	1.9	1.4	part1	Caito.de dataset	link
voxforge_ru	1.9	1.5	part1	Voxforge dataset	link
russian_single	0.9	0.7	part1	Russian single speaker dataset	link
public_lecture_1	0.7	0.6	part1	Public lectures	link
Total	190	163


Releases: snakers4/open_stt
