This repository has been archived by the owner on Oct 10, 2022. It is now read-only.

Finally a v1.0 release with 3x more data

Pre-release

snakers4 released this 05 Nov 07:16

· 30 commits to master since this release

cb2cf21

The largest Russian STT dataset to date

~16M utterances;
~20 000 hours;
2.3 TB of data (in .wav format, int16);
A wide variety of practical, close to real-life domains;
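
As a rough sanity check, the headline figures above are consistent with each other if the audio is 16 kHz mono. That sample rate and channel count are assumptions for this sketch and are not stated in these notes; the function names are hypothetical too.

```python
# Back-of-the-envelope check of the headline numbers in this release.
SAMPLE_RATE_HZ = 16_000  # assumption: not stated in these notes
BYTES_PER_SAMPLE = 2     # int16, as stated above
CHANNELS = 1             # assumption: mono audio


def raw_audio_bytes(hours, sample_rate_hz=SAMPLE_RATE_HZ,
                    bytes_per_sample=BYTES_PER_SAMPLE, channels=CHANNELS):
    """Size of uncompressed PCM audio, ignoring the small .wav headers."""
    return int(hours * 3600 * sample_rate_hz * bytes_per_sample * channels)


def mean_utterance_seconds(hours, utterances):
    """Average utterance length implied by total hours and utterance count."""
    return hours * 3600 / utterances


total_bytes = raw_audio_bytes(20_000)
print(f"{total_bytes / 1e12:.2f} TB")                          # 2.30 TB
print(f"{mean_utterance_seconds(20_000, 16_000_000):.1f} s")   # 4.5 s
```

Under those assumptions, ~20 000 hours comes to about 2.3 TB, and ~16M utterances implies an average utterance of roughly 4.5 seconds.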

Major highlights

~3 000 hours of a completely new domain - public speech;
A huge Radio dataset update with 10 000+ hours;
A 5% demo version of new Radio/Public Speech datasets;
Vastly improved dataset normalization;
Overall annotation quality is improved:
- Upstream model quality improvement;
- No more "dangling" letters;
- Improved voice activity detection;
  See the above TLDR bullets;

Next steps

A major clean-up of past errors is planned for 1.1;
Refine and publish speaker labels, probably add speakers for old datasets;
Improve / re-upload some of the existing datasets, refine the STT labels;
Probably add new languages;
Add pre-trained models;
