Files with poor annotation #7
So far we believe that 15-20% of our files may be of poor annotation quality
@buriy
stats by source from
From both files: 518560 utterances
For the public: all of this is already outdated and will be updated
New round of data distillation
A somewhat more detailed file listing the files with poor annotation, along with some metadata
It looks like ~2M utterances out of 7M are to be discarded this way. To use these files, note that this is a multi-part zip file: public_exclude_file_v5.zip
For those looking to download and unzip these exclude files correctly:
wget https://github.com/snakers4/open_stt/files/3348311/public_exclude_file_v5.zip
wget https://github.com/snakers4/open_stt/files/3348314/public_exclude_file_v5.z01.zip
wget https://github.com/snakers4/open_stt/files/3348312/public_exclude_file_v5.z02.zip
wget https://github.com/snakers4/open_stt/files/3348313/public_exclude_file_v5.z03.zip
mv public_exclude_file_v5.z01.zip public_exclude_file_v5.z01
mv public_exclude_file_v5.z02.zip public_exclude_file_v5.z02
mv public_exclude_file_v5.z03.zip public_exclude_file_v5.z03
cat public_exclude_file_v5.z01 public_exclude_file_v5.z02 public_exclude_file_v5.z03 public_exclude_file_v5.zip > public_exclude_files_v5_.zip
unzip public_exclude_files_v5_.zip
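Once unpacked, the list can be applied to a training manifest. The following is only a minimal sketch: it assumes (this thread does not confirm it) that both the exclude file and your manifest are comma-separated with the wav path in the first column. The function name `filter_manifest` and the file names in the usage line are hypothetical.

```shell
# Hypothetical helper: print the manifest rows whose first column (assumed to
# be the wav path) does not appear in the first column of the exclude file.
filter_manifest() {
    exclude_csv=$1
    manifest_csv=$2
    awk -F, '
        # First file: remember every excluded path.
        FILENAME == ARGV[1] { bad[$1] = 1; next }
        # Second file: keep only rows whose path was not excluded.
        !($1 in bad)
    ' "$exclude_csv" "$manifest_csv"
}
```

Usage would look like `filter_manifest exclude.csv manifest.csv > manifest_clean.csv`. Header rows, if present, are treated like any other row, so adjust the column index and separator to the real file layout.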
Yeah, I guess the README.md requires some refining
Here's the complete exclude file for v5:
@buriy There are a few files smaller than 20 KB, among which
@vadimkantorov that's a known issue: the file is 44 bytes long, which is the size of a .wav header.
Actually this is because the files really are empty, but whatever)
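Header-only files like these can be located without opening them, purely by size. A minimal sketch, taking the 44-byte header size from the comment above as given; the function name `find_empty_wavs` is hypothetical:

```shell
# List .wav files of at most 44 bytes under a directory (default: current one).
# 44 bytes is the header size quoted in this thread; such files hold no audio.
find_empty_wavs() {
    # "-size -45c" matches files strictly smaller than 45 bytes.
    find "${1:-.}" -type f -name '*.wav' -size -45c
}
```

For example, `find_empty_wavs youtube_1120 > empty_wavs.txt` would collect the candidates into a list that can be reviewed before excluding anything.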
Yeah, I forgot to exclude the bad files for youtube_1120
Hope it comes online soon! :)
I would point to #5 (comment) for discussion of this issue
Both .csv files have paths from the youtube_1120 dataset. What is the difference between them?
Does exclude_df_youtube_1120.zip have more files than public_exclude_file_v5.tar.gz, or is it the same? UPD:
From time to time I will be posting here lists of files to be excluded from the dataset
Such lists are obtained by training models and sifting through files with higher than expected CER
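The selection step described above can be approximated once per-file CER values are available. This is a rough sketch and not the maintainers' actual pipeline: it assumes a hypothetical file with lines of the form `path,cer` produced by your own model, and the default threshold of 0.5 is an arbitrary placeholder.

```shell
# Hypothetical helper: from lines of the form "path,cer", print the paths whose
# CER is strictly above the threshold (second argument, default 0.5).
list_high_cer() {
    # "+ 0" forces a numeric comparison; a non-numeric header row counts as 0.
    awk -F, -v thr="${2:-0.5}" '$2 + 0 > thr + 0 { print $1 }' "$1"
}
```

For example, `list_high_cer cer_per_file.csv 0.4 > exclude_candidates.txt`. What counts as "higher than expected" depends on the model and the data source, so the threshold would need tuning per source.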