Files with poor annotation #7

snakers4 · 2019-05-14T06:13:06Z

From time to time I will post lists of files to be excluded from the dataset here.
These lists are obtained by training models and sifting through files with a higher than expected CER.
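
For readers unfamiliar with the metric, here is a minimal sketch of how files could be flagged by CER (character error rate: the character-level edit distance between the annotation and the model transcript, divided by the annotation length). The record layout, the function names, and the threshold are assumptions for illustration only; the thread does not show the actual pipeline.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        # Convention chosen for this sketch: empty reference is either perfect or fully wrong.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def flag_suspicious(records, threshold):
    """Return paths whose CER exceeds `threshold`.

    `records` is an iterable of (path, annotation, model_transcript) tuples;
    this layout is hypothetical.
    """
    return [path for path, ref, hyp in records if cer(ref, hyp) > threshold]
```

A high CER does not by itself prove the annotation is wrong: the model may simply be under-performing on that file, so the threshold is a judgment call.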

snakers4 · 2019-05-14T07:48:45Z

So far we believe that 15-20% of our files may be of poor annotation quality
We will not be excluding them from the dataset for now, but we will be posting such lists here

snakers4 · 2019-05-14T13:33:53Z

@buriy
These are the files in the file db that most likely have poor annotation, according to my model

bad_trainval_v03.zip
bad_public_train_v03.zip

vadimkantorov · 2019-06-14T12:15:46Z

Stats by source from bad_trainval_v03.zip and bad_public_train_v03.zip:

bad_public_train_v03.csv:

Counter({
'asr_public_phone_calls_2': 170911, 
'private_buriy_audiobooks_2': 128318, 
'public_youtube700': 115683, 
'asr_public_phone_calls_1': 83432, 
'public_series_1': 4823, 
'asr_public_stories_2': 2902, 
'asr_public_stories_1': 2225, 
'tts_russian_addresses_rhvoice_4voices': 235, 
'public_lecture_1': 201, 
'voxforge_ru': 190, 
'ru_ru': 99, 
'russian_single': 55
})

bad_trainval_v03.csv:

Counter({
'private_buriy_audiobooks_2': 4895, 
'public_youtube700': 4217, 
'public_series_1': 185, 
'private_buriy_audiobooks_1': 166, 
'public_lecture_1': 15, 
'voxforge_ru': 7, 
'ru_ru': 6
})

From both files: 518560 utterances
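
Per-source counts like the ones above can be produced with a few lines of Python. This is only a sketch: it assumes each path looks like `ru_open_stt/<source>/...` or `<source>/...`, and the helper names are hypothetical. The actual column layout of the .csv files is not shown in this thread.

```python
from collections import Counter


def source_of(path: str) -> str:
    """Return the source directory of a file path, e.g. 'public_youtube700'."""
    parts = path.strip("/").split("/")
    # Assumption: an optional 'ru_open_stt' root precedes the source directory.
    if len(parts) > 1 and parts[0] == "ru_open_stt":
        return parts[1]
    return parts[0]


def count_by_source(paths):
    """Count how many paths belong to each source directory."""
    return Counter(source_of(p) for p in paths)
```

Calling `count_by_source(...)` on the path column of each file and adding the two resulting counters gives combined totals.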

snakers4 · 2019-06-30T08:54:28Z

Note for the public: all of this is already outdated and should (and will) be updated

snakers4 · 2019-07-02T04:49:38Z

New round of data distillation

A somewhat more detailed file that points out files with poor annotation, along with some metadata

CER threshold;
One of the best CERs so far;

It looks like ~2M utterances out of 7M are to be discarded this way
Pretty good yield for annotation without spending money

To use these files, note that this is a multi-part zip archive
You have to rename the part files from *.z0?.zip to *.z0?

public_exclude_file_v5.zip
public_exclude_file_v5.z02.zip
public_exclude_file_v5.z03.zip
public_exclude_file_v5.z01.zip

vadimkantorov · 2019-07-07T10:17:06Z

For those looking to download and unzip these exclude files correctly:

wget https://github.com/snakers4/open_stt/files/3348311/public_exclude_file_v5.zip
wget https://github.com/snakers4/open_stt/files/3348314/public_exclude_file_v5.z01.zip
wget https://github.com/snakers4/open_stt/files/3348312/public_exclude_file_v5.z02.zip
wget https://github.com/snakers4/open_stt/files/3348313/public_exclude_file_v5.z03.zip

mv public_exclude_file_v5.z01.zip public_exclude_file_v5.z01
mv public_exclude_file_v5.z02.zip public_exclude_file_v5.z02
mv public_exclude_file_v5.z03.zip public_exclude_file_v5.z03

cat public_exclude_file_v5.z01 public_exclude_file_v5.z02 public_exclude_file_v5.z03 public_exclude_file_v5.zip > public_exclude_files_v5_.zip

unzip public_exclude_files_v5_.zip

snakers4 · 2019-07-07T11:03:39Z

Yeah, I guess the README.md requires some refining

buriy · 2019-07-10T17:51:30Z

Here's the complete exclude file for v5:
https://github.com/snakers4/open_stt/releases/download/v0.5-beta/public_exclude_file_v5.tar.gz

vadimkantorov · 2019-07-10T19:39:25Z

@buriy There are a few files less than 20Kb, among which ru_open_stt/public_youtube700/d/a3/9a3ee5e6b4b0.wav fails to load with scipy.io.wavfile. It would be nice if you could exclude them in the next update of the exclude file.

buriy · 2019-07-10T20:01:02Z

@vadimkantorov that's a known issue: this file is 44 bytes long, which is the size of a .wav header.
scipy.io.wavfile refuses to load empty files.
We'll look into it at some point later.
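
Until such files are added to the exclude list, they can be skipped locally. Below is a minimal sketch that treats any .wav file no larger than the 44-byte header as empty; the function names are hypothetical, and it only checks file size rather than parsing the header.

```python
import os

WAV_HEADER_BYTES = 44  # header size mentioned above; a file this small holds no samples


def is_header_only(path: str) -> bool:
    """True when the file is no larger than a bare WAV header."""
    return os.path.getsize(path) <= WAV_HEADER_BYTES


def find_empty_wavs(root: str):
    """Yield paths of .wav files under `root` that contain no audio data."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".wav"):
                full_path = os.path.join(dirpath, name)
                if is_header_only(full_path):
                    yield full_path
```

The resulting paths can be appended to whatever exclude list is already in use before loading audio.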

snakers4 · 2019-07-10T20:08:43Z

Actually this is due to really empty files but whatever)

snakers4 · 2019-07-11T20:38:58Z

Yeah I forgot to exclude bad files for youtube_1120

snakers4 · 2019-07-11T20:39:37Z

vadimkantorov · 2019-07-12T10:39:35Z

hope it comes online soon! :)

snakers4 · 2019-07-12T12:45:00Z

Exclude file for YouTube1120

Compared with the previous YouTube dataset, this one is much more challenging

To be on the safe side, I would exclude all files with current CER>0.4 (for this dataset ~40% of files fall above that, compared with 25-30% before)

Such files usually fall into 3 categories

1/3 - just plain wrong annotation
1/3 - correct annotation, but very noisy domain
1/3 - under-performing network

exclude_df_youtube_1120.zip
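
A minimal sketch of applying the CER>0.4 cut-off when building a training manifest. The column names `path` and `cer` are assumptions, since the actual layout of the exclude .csv is not shown in this thread; adjust them to the real header.

```python
import csv


def load_exclude_set(csv_file, cer_threshold=0.4, path_col="path", cer_col="cer"):
    """Collect paths whose CER is strictly above the threshold.

    `csv_file` is an open text file (or any iterable of lines) with a header row.
    The column names are hypothetical defaults.
    """
    reader = csv.DictReader(csv_file)
    return {row[path_col] for row in reader if float(row[cer_col]) > cer_threshold}


def filter_manifest(paths, excluded):
    """Keep only the paths that are not in the exclude set, preserving order."""
    return [p for p in paths if p not in excluded]
```

Keeping the threshold as a parameter makes it easy to try a stricter or looser cut-off per dataset.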

snakers4 · 2019-07-17T10:07:48Z

For discussion, I would refer this issue to #5 (comment)

Oktai15 · 2020-01-13T14:37:27Z

@buriy @snakers4

Here's the complete exclude file for v5:
https://github.com/snakers4/open_stt/releases/download/v0.5-beta/public_exclude_file_v5.tar.gz

exclude_df_youtube_1120.zip

Both .csv files have paths from youtube_1120 dataset. What is the difference between them?

Oktai15 · 2020-01-13T14:39:01Z

Does exclude_df_youtube_1120.zip have more files than public_exclude_file_v5.tar.gz, or are they the same?

UPD: grep public_youtube1120/ public_exclude_file_v5.csv | wc -l gives 191020 lines, but exclude_df_youtube_1120.csv has 541872 lines
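
Line counts alone do not show whether one list contains the other. A sketch of a set-based comparison, assuming the paths have already been read from each .csv into plain Python iterables (the function name and the returned keys are hypothetical):

```python
def compare_path_sets(paths_a, paths_b, prefix="public_youtube1120/"):
    """Compare two exclude lists, restricted to paths containing `prefix`.

    Mirrors the grep above: a path is kept when `prefix` occurs anywhere in it.
    """
    a = {p for p in paths_a if prefix in p}
    b = {p for p in paths_b if prefix in p}
    return {
        "only_a": len(a - b),  # listed in the first file only
        "only_b": len(b - a),  # listed in the second file only
        "both": len(a & b),    # listed in both files
    }
```

If `only_a` is zero, every matching path in the first list also appears in the second.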

snakers4 self-assigned this May 14, 2019

snakers4 mentioned this issue May 15, 2019

Some benchmarks on the datasets #5

Closed

snakers4 added the bug and enhancement labels and removed the bug label May 29, 2019

snakers4 mentioned this issue Jun 30, 2019

Speakers id. #1

Closed

buriy added a commit that referenced this issue Jul 2, 2019

Added link to issue #7 with worse annotated files.

224718e

snakers4 closed this as completed Jul 17, 2019
