This repository has been archived by the owner on Oct 10, 2022. It is now read-only.

Some benchmarks on the datasets #5

Closed
snakers4 opened this issue May 9, 2019 · 14 comments

snakers4 (Owner) commented May 9, 2019

Below I will post some of the results on the public part of the dataset, both train and validation.

Hope this will inspire the community to share their results and models.

snakers4 self-assigned this on May 9, 2019
snakers4 (Owner Author) commented May 9, 2019

@akreal As you requested, here are some results from a more or less well-trained model (csv + feather formats):
share_results_1.zip

You can open the feather file like this:

import pandas as pd
df = pd.read_feather('../data/share_results_1.feather')

Overall, this model is not overfitted and there is no post-processing yet.
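For a quick per-dataset summary, the shared results can be aggregated with pandas; a minimal sketch, assuming hypothetical column names ('dataset', 'cer', 'wer') that are not confirmed by the file itself:

import pandas as pd

df = pd.read_feather('../data/share_results_1.feather')

# 'dataset', 'cer' and 'wer' are assumed column names used for illustration only;
# inspect df.columns for the actual schema.
summary = (df.groupby('dataset')[['cer', 'wer']]
             .mean()
             .sort_values('cer'))
print(summary)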

akreal (Contributor) commented May 9, 2019

Perfect, thank you!

snakers4 (Owner Author) commented May 9, 2019

As you can see, the model is not fully fitted yet (we are still in the exploratory phase), but it already works perfectly on some of the easier datasets.

[chart attached]

Obviously, I excluded the following datasets from the file:

  • private datasets
  • TTS (too easy; they are included for diversity only)
  • ASR datasets (you cannot really use them for validation)

snakers4 (Owner Author) commented

Now if we exclude "bad" files from here, we will get more interesting results.
I cannot say that all of these files have poor annotation, but the majority do.

share_results_v02.zip

[chart attached]

snakers4 (Owner Author) commented

I have almost finished collecting v05 and searching hyper-parameters; I will be posting new benchmarks and new data soon.

snakers4 added the "benchmark" (Model runs, tests, accuracy reports, convergence) label on May 29, 2019
snakers4 mentioned this issue on Jun 30, 2019
m1ckyro5a commented
@snakers4 What model did you use for the benchmark?

snakers4 (Owner Author) commented Jul 3, 2019

@m1ckyro5a A wav2letter-inspired fork of a fork of deepspeech.pytorch.

m1ckyro5a commented
@snakers4 How about deepspeech2? Which model is better?

snakers4 (Owner Author) commented Jul 3, 2019

It is hard to tell yet. For us, performance is currently limited more by the data than by the model. Of course, we compared some models side by side (CNN vs. RNN), only to find that RNNs were a bit better given the same number of weight updates, but slower overall.

Some benchmarks we ran on LibriSpeech:
network_bench.xlsx

snakers4 (Owner Author) commented Jul 17, 2019

From now on I will structure the benchmark files a bit:

  • Path
  • Annotation / prediction
  • CER, WER
  • File path in the file db

Please note that the exclusion files in #7 were previously based on these benchmarks as well.

All charts show CER.
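For reference, CER (character error rate) is conventionally the character-level edit distance divided by the reference length, and WER is the same at word level; a minimal sketch of that standard definition, not the repository's actual implementation:

def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance between two sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)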

Dataset benchmark v05

Model: CNN trained with CTC loss, tuning with phonemes
(benchmark files attached)

Youtube (TED talks are much cleaner)
[chart: youtube]

Audio books (notice the second normal bump)
[chart: books]

TTS
[chart: tts]

Academic datasets
[chart: academic]

ASR datasets (pranks are very noisy by default)
[chart: asr]

Radio (quite a good fit as well)
[chart: radio]

Strict exclude file for distillation

An idea on how to set thresholds:

CLEAN_THRESHOLDS = {
    # very strict conditions, datasets are clean, no problem
    'tts_russian_addresses_rhvoice_4voices':0.2,
    'private_buriy_audiobooks_2':0.1,
    
    # strict conditions, datasets vary
    'public_youtube700':0.2,
    'public_youtube1120':0.2,
    'public_youtube1120_hq':0.2,
    'public_lecture_1':0.2,
    'public_series_1':0.2,
    
    # strict conditions, dataset mostly clean
    'radio_2':0.2,

    # very strict conditions, datasets are dirty
    'asr_public_phone_calls_1':0.2,
    'asr_public_phone_calls_2':0.2,
    'asr_public_stories_1':0.2,
    'asr_public_stories_2':0.2,
    
    # mostly just to filter outliers
    'ru_tts':0.4,
    'ru_ru':0.4,
    'voxforge_ru':0.4,
    'russian_single':0.4
}
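
A minimal sketch of how such per-dataset thresholds could be turned into an exclude list from the benchmark file; the column names ('dataset', 'path', 'cer') and the output file name are assumptions for illustration, not the actual schema of the shared CSVs:

import pandas as pd

df = pd.read_csv('benchmark_v05_public.csv')

def build_exclude_list(df, thresholds):
    # Collect the paths of files whose CER exceeds the per-dataset threshold.
    excluded = []
    for dataset, max_cer in thresholds.items():
        bad = df[(df['dataset'] == dataset) & (df['cer'] > max_cer)]
        excluded.append(bad['path'])
    return pd.concat(excluded, ignore_index=True)

exclude = build_exclude_list(df, CLEAN_THRESHOLDS)
exclude.to_csv('exclude_list.csv', index=False)  # output name is illustrative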

snakers4 (Owner Author) commented

Also a comment: the model was not over-fitted; it was selected for optimal generalization.

vadimkantorov commented Jul 20, 2019

https://ru-open-stt.ams3.digitaloceanspaces.com/benchmark_v05_public.csv.zip is in fact a gzip-compressed file (not a zip archive), so it should be decompressed with zcat benchmark_v05_public.csv.zip > benchmark_v05_public.csv

unzip fails with:

 $ unzip benchmark_v05_public.csv.zip
Archive:  benchmark_v05_public.csv.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of benchmark_v05_public.csv.zip or
        benchmark_v05_public.csv.zip.zip, and cannot find benchmark_v05_public.csv.zip.ZIP, period.

After gzip decompression, the first line contains some weird stuff:

$ head -n 1 benchmark_v05_public.csv
data/dataset_cleaning/benchmark_v05_public.csv0000644000175000001441656463430613513563560021050 0ustar  kerasusers
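
The "0ustar" marker in that line looks like a tar header, so the download may actually be a gzipped tar rather than a bare gzipped CSV. If that guess is right, something like this would extract the CSV cleanly (an assumption based on the visible header, not verified):

import tarfile

# Assumes the file is really a .tar.gz; the member path is taken from the
# header line shown above.
with tarfile.open('benchmark_v05_public.csv.zip', mode='r:gz') as tar:
    tar.extractall('.')
# Expected result: data/dataset_cleaning/benchmark_v05_public.csv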

johnnych7027 commented
Hi! Which datasets have speaker labels?
Is there any information on which release the speaker labels will appear in?
Thanks a lot!

snakers4 (Owner Author) commented

We decided not to update and/or maintain these, for a number of reasons.
