This repository has been archived by the owner on Oct 10, 2022. It is now read-only.

Finally a v1.0 release with 3x more data

Pre-release
@snakers4 snakers4 released this 05 Nov 07:16
· 30 commits to master since this release

The largest Russian STT dataset to date

  • ~16 million utterances;
  • ~20 000 hours;
  • 2.3 TB of data (in .wav format, int16);
  • A wide variety of practical, close to real-life domains;
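
The headline numbers above can be sanity-checked against each other. The sketch below estimates the raw size of int16 .wav audio from its duration. The 16 kHz sample rate and mono channel count are assumptions (the notes only state int16 .wav), the function name is made up for illustration, and per-file WAV headers are ignored since they are negligible at this scale.

```python
def wav_payload_bytes(hours, sample_rate_hz=16_000, bytes_per_sample=2, channels=1):
    """Approximate size of uncompressed PCM audio, ignoring WAV headers.

    sample_rate_hz and channels are assumed values, not stated in the release notes;
    bytes_per_sample=2 corresponds to int16.
    """
    seconds = hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels


total_hours = 20_000        # "~20 000 hours"
total_utterances = 16e6     # "~16 million utterances"

total_bytes = wav_payload_bytes(total_hours)
total_tb = total_bytes / 1e12  # decimal terabytes

# Average utterance length implied by the two headline figures
avg_utterance_seconds = total_hours * 3600 / total_utterances

print(f"estimated size: {total_tb:.2f} TB")
print(f"average utterance: {avg_utterance_seconds:.1f} s")
```

Under these assumptions the estimate comes out at roughly 2.3 TB, consistent with the stated size, and the implied average utterance is a few seconds long.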

Major highlights

  • ~3 000 hours of a completely new domain - public speech;
  • A huge Radio dataset update with 10 000+ hours;
  • A 5% demo version of new Radio/Public Speech datasets;
  • Vastly improved dataset normalization;
  • Overall annotation quality is improved:
    • Upstream model quality improvement;
    • No more "dangling" letters;
    • Improved voice activity detection;
      See the above TLDR bullets;

Next steps

  • A major clean-up of past errors is planned for 1.1;
  • Refine and publish speaker labels, probably add speakers for old datasets;
  • Improve / re-upload some of the existing datasets, refine the STT labels;
  • Probably add new languages;
  • Add pre-trained models;