This repository has been archived by the owner on Oct 10, 2022. It is now read-only.

Some benchmarks on the datasets #5

Closed
snakers4 opened this issue May 9, 2019 · 14 comments

snakers4 (Owner) commented May 9, 2019

Below I will post some of the results on the public part of the dataset, both train and validation.

Hope this will inspire the community to share their results and models.

snakers4 self-assigned this on May 9, 2019
snakers4 (Owner Author) commented May 9, 2019

@akreal As you requested, here are some results from a more or less well-trained model (csv + feather formats):
share_results_1.zip

You can open the feather file like this:

import pandas as pd
df = pd.read_feather('../data/share_results_1.feather')

Overall, this model is not overfitted and there is no post-processing yet.
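For a quick per-dataset summary, the shared results can be aggregated with pandas; a minimal sketch, assuming hypothetical column names ('dataset', 'cer', 'wer') that are not confirmed by the file itself:

import pandas as pd

df = pd.read_feather('../data/share_results_1.feather')

# 'dataset', 'cer' and 'wer' are assumed column names used for illustration only;
# inspect df.columns for the actual schema.
summary = (df.groupby('dataset')[['cer', 'wer']]
             .mean()
             .sort_values('cer'))
print(summary)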

akreal (Contributor) commented May 9, 2019

Perfect, thank you!

snakers4 (Owner Author) commented May 9, 2019

As you can see, the model is not fully fitted yet (we are still in the exploratory phase), but it already works perfectly on some of the easier datasets.

[chart attached]

Obviously, I excluded the following datasets from the file:

  • private datasets
  • TTS (too easy; they are included for diversity only)
  • ASR datasets (you cannot really use them for validation)

snakers4 (Owner Author) commented

Now if we exclude "bad" files from here, we will get more interesting results.
I cannot say that all of these files have poor annotation, but the majority do.

share_results_v02.zip

[chart attached]

snakers4 (Owner Author) commented

I have almost finished collecting v05 and searching hyper-parameters; I will be posting new benchmarks and new data soon.

snakers4 added the "benchmark" (Model runs, tests, accuracy reports, convergence) label on May 29, 2019
snakers4 mentioned this issue on Jun 30, 2019
m1ckyro5a commented
@snakers4 What model did you use for the benchmark?

snakers4 (Owner Author) commented Jul 3, 2019

@m1ckyro5a A wav2letter-inspired fork of a fork of deepspeech.pytorch.

m1ckyro5a commented
@snakers4 How about deepspeech2? Which model is better?

snakers4 (Owner Author) commented Jul 3, 2019

It is hard to tell yet. For us, performance is currently limited more by the data than by the model. Of course, we compared some models side by side (CNN vs. RNN), only to find that RNNs were a bit better given the same number of weight updates, but slower overall.

Some benchmarks we ran on LibriSpeech:
network_bench.xlsx

snakers4 (Owner Author) commented Jul 17, 2019

From now on I will structure the benchmark files a bit:

  • Path
  • Annotation / prediction
  • CER, WER
  • File path in the file db

Please note that the exclusion files in #7 were previously based on these benchmarks as well.

All charts show CER.
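For reference, CER (character error rate) is conventionally the character-level edit distance divided by the reference length, and WER is the same at word level; a minimal sketch of that standard definition, not the repository's actual implementation:

def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance between two sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)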

Dataset benchmark v05

Model: CNN trained with CTC loss, tuning with phonemes
(benchmark files attached)

Youtube (TED talks are much cleaner)
[chart: youtube]

Audio books (notice the second normal bump)
[chart: books]

TTS
[chart: tts]

Academic datasets
[chart: academic]

ASR datasets (pranks are very noisy by default)
[chart: asr]

Radio (quite a good fit as well)
[chart: radio]

Strict exclude file for distillation

An idea on how to set thresholds:

CLEAN_THRESHOLDS = {
    # very strict conditions, datasets are clean, no problem
    'tts_russian_addresses_rhvoice_4voices':0.2,
    'private_buriy_audiobooks_2':0.1,
    
    # strict conditions, datasets vary
    'public_youtube700':0.2,
    'public_youtube1120':0.2,
    'public_youtube1120_hq':0.2,
    'public_lecture_1':0.2,
    'public_series_1':0.2,
    
    # strict conditions, dataset mostly clean
    'radio_2':0.2,

    # very strict conditions, datasets are dirty
    'asr_public_phone_calls_1':0.2,
    'asr_public_phone_calls_2':0.2,
    'asr_public_stories_1':0.2,
    'asr_public_stories_2':0.2,
    
    # mostly just to filter outliers
    'ru_tts':0.4,
    'ru_ru':0.4,
    'voxforge_ru':0.4,
    'russian_single':0.4
}
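
A minimal sketch of how such per-dataset thresholds could be turned into an exclude list from the benchmark file; the column names ('dataset', 'path', 'cer') and the output file name are assumptions for illustration, not the actual schema of the shared CSVs:

import pandas as pd

df = pd.read_csv('benchmark_v05_public.csv')

def build_exclude_list(df, thresholds):
    # Collect the paths of files whose CER exceeds the per-dataset threshold.
    excluded = []
    for dataset, max_cer in thresholds.items():
        bad = df[(df['dataset'] == dataset) & (df['cer'] > max_cer)]
        excluded.append(bad['path'])
    return pd.concat(excluded, ignore_index=True)

exclude = build_exclude_list(df, CLEAN_THRESHOLDS)
exclude.to_csv('exclude_list.csv', index=False)  # output name is illustrative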

snakers4 (Owner Author) commented

Also a comment: the model was not over-fitted; it was selected for optimal generalization.

vadimkantorov commented Jul 20, 2019

https://ru-open-stt.ams3.digitaloceanspaces.com/benchmark_v05_public.csv.zip is in fact a gzip-compressed file (not a zip archive), so it should be decompressed with zcat benchmark_v05_public.csv.zip > benchmark_v05_public.csv

unzip fails with:

 $ unzip benchmark_v05_public.csv.zip
Archive:  benchmark_v05_public.csv.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of benchmark_v05_public.csv.zip or
        benchmark_v05_public.csv.zip.zip, and cannot find benchmark_v05_public.csv.zip.ZIP, period.

After gzip decompression, the first line contains some weird stuff:

$ head -n 1 benchmark_v05_public.csv
data/dataset_cleaning/benchmark_v05_public.csv0000644000175000001441656463430613513563560021050 0ustar  kerasusers
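
The "0ustar" marker in that line looks like a tar header, so the download may actually be a gzipped tar rather than a bare gzipped CSV. If that guess is right, something like this would extract the CSV cleanly (an assumption based on the visible header, not verified):

import tarfile

# Assumes the file is really a .tar.gz; the member path is taken from the
# header line shown above.
with tarfile.open('benchmark_v05_public.csv.zip', mode='r:gz') as tar:
    tar.extractall('.')
# Expected result: data/dataset_cleaning/benchmark_v05_public.csv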

johnnych7027 commented
Hi! Which datasets have speaker labels?
Is there any information on which release the speaker labels will appear in?
Thanks a lot!

snakers4 (Owner Author) commented

We decided not to update and/or maintain these, for a number of reasons.
