
[egs] Add recipe for Mozilla Common Voice corpus v1 #2057

Merged: 5 commits, Dec 4, 2017

Conversation

@entn-at (Contributor) commented Dec 2, 2017

This is a basic recipe for the recently released Mozilla Common Voice corpus (v1, CC-0 licensed). See https://voice.mozilla.org/data

Some of the data preparation scripts were taken from the voxforge recipe (dict, LM). The systems and chain model setup were adapted from mini_librispeech (including speed perturbation, PCA transform for i-vector extraction, etc.).

I did not tune the setup; the chain system already achieves WERs of about 5% (see RESULTS).
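For context, a rough sketch of the mini_librispeech-style i-vector common setup the description refers to (standard Kaldi scripts; the exact paths, options and dimensions used in this PR may differ):

# 3-way speed perturbation (0.9x / 1.0x / 1.1x) of the training data
utils/data/perturb_data_dir_speed_3way.sh data/train data/train_sp
utils/copy_data_dir.sh data/train_sp data/train_sp_hires
steps/make_mfcc.sh --mfcc-config conf/mfcc_hires.conf data/train_sp_hires
steps/compute_cmvn_stats.sh data/train_sp_hires
# PCA transform (instead of LDA+MLLT) feeding the diagonal UBM / i-vector extractor
steps/online/nnet2/get_pca_transform.sh --max-utts 10000 --subsample 2 \
  data/train_sp_hires exp/nnet3/pca_transform
steps/online/nnet2/train_diag_ubm.sh data/train_sp_hires 512 \
  exp/nnet3/pca_transform exp/nnet3/diag_ubm
steps/online/nnet2/train_ivector_extractor.sh data/train_sp_hires \
  exp/nnet3/diag_ubm exp/nnet3/extractor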

@entn-at (Contributor, Author) commented Dec 2, 2017

Note that this recipe currently only uses the "valid" portion of the corpus, i.e., utterances that at least two people have listened to, with the majority of those listeners agreeing that the audio matches the text.

fixed-affine-layer name=lda input=Append(-2,-1,0,1,2,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat

# the first splicing is moved before the lda layer, so no splicing here
relu-batchnorm-layer name=tdnn1 dim=512
Contributor commented:

This system is rather small for a 500-hour dataset. You may want to try dim=768 instead of 512.

I also notice that in the RESULTS file you called this 1e (IIRC).
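For reference, the suggested change is just a wider hidden dimension in the xconfig, applied to each of the TDNN layers, e.g.:

relu-batchnorm-layer name=tdnn1 dim=768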

@@ -0,0 +1,65 @@
#!/bin/bash
Contributor commented:

Can you replace this script with a symlink to steps/score_kaldi.sh, please?
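(For reference, assuming the usual egs/<corpus>/s5 layout where steps is a symlink into wsj/s5, that replacement is just the following; the exact recipe path is assumed here:)

cd egs/commonvoice/s5/local    # recipe directory name assumed
rm score.sh
ln -s ../steps/score_kaldi.sh score.sh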

if [ $stage -le 0 ]; then
mkdir -p $data

local/download_and_untar.sh $(/usr/bin/dirname $data) $data_url
Contributor commented:

Is there any particular reason for using the absolute pathname /usr/bin/dirname?
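(The portable form would simply let the shell resolve dirname via PATH:)

local/download_and_untar.sh $(dirname $data) $data_url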

--trainer.num-epochs=4 \
--trainer.frames-per-iter=1500000 \
--trainer.optimization.num-jobs-initial=3 \
--trainer.optimization.num-jobs-final=3 \
Contributor commented:

If your setup allows it, it would be a good idea, for speed, to increase num-jobs-final to something like 12.
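For reference, this only changes the final parallelism: the chain trainer ramps the number of GPU jobs from num-jobs-initial up to num-jobs-final over training and scales the effective learning rate with the job count, so only the final value needs raising, e.g.:

--trainer.optimization.num-jobs-initial=3 \
--trainer.optimization.num-jobs-final=12 \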

Contributor Author replied:

Unfortunately, I only have 3 GPUs, but I will change it in the script to 12.

for f in phones.txt words.txt phones.txt L.fst L_disambig.fst phones; do
cp -r data/lang/$f $test
done
cat $lmdir/lm.arpa | \
Contributor commented:

I'd prefer the rest of the script to be replaced by utils/format_lm.sh.
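For reference, utils/format_lm.sh takes the lang directory, a gzipped ARPA LM, the lexicon, and an output directory, and builds G.fst there, replacing the hand-rolled FST construction above. A sketch, with paths illustrative (the lm.arpa may need gzipping first):

gzip -c $lmdir/lm.arpa > $lmdir/lm.arpa.gz
utils/format_lm.sh data/lang $lmdir/lm.arpa.gz data/local/dict/lexicon.txt $test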

@danpovey (Contributor) commented Dec 2, 2017 via email

@danpovey (Contributor) commented Dec 2, 2017 via email

@entn-at (Contributor, Author) commented Dec 2, 2017

No problem, I have GridEngine set up. I'm going to test it with num-jobs-final=12 (it's just going to take a while longer).

@danpovey (Contributor) commented Dec 2, 2017 via email

change score.sh to a symlink to steps/score_kaldi.sh; remove path to dirname; replace local/format_data.sh with call to utils/format_lm.sh; use <unk> instead of SIL in lexicon
@entn-at (Contributor, Author) commented Dec 2, 2017

I made the following changes:

  • change score.sh to a symlink to steps/score_kaldi.sh
  • remove absolute path to dirname
  • replace local/format_data.sh with a call to utils/format_lm.sh
  • use <unk> instead of SIL in lexicon

I'm currently running the whole recipe from start to finish. Once that's done I'll add another commit with the changes to run_tdnn_1a.sh and RESULTS.
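Regarding the last item in the list above, for reference: in Kaldi the OOV word is passed to utils/prepare_lang.sh and needs a pronunciation in the lexicon, typically the spoken-noise phone rather than silence. A sketch (dict paths and phone name illustrative):

echo "<unk> SPN" >> data/local/dict/lexicon.txt    # map OOVs to spoken noise, not SIL
utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang_tmp data/lang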

@danpovey (Contributor) commented Dec 4, 2017

Thanks a lot! @jtrmal, please merge when and if you're OK with it. No need to check more, necessarily.

@entn-at (Contributor, Author) commented Dec 4, 2017

Thanks for the quick review and the helpful comments!

@jtrmal (Contributor) commented Dec 4, 2017

All right, I'll merge. Thanks a lot!

@jtrmal merged commit 93ceca7 into kaldi-asr:master on Dec 4, 2017
kronos-cm added a commit to kronos-cm/kaldi that referenced this pull request Dec 18, 2017
* 'master' of https://github.com/kaldi-asr/kaldi: (58 commits)
  [src] Fix bug in nnet3 optimization, affecting Scale() operation; cosmetic fixes. (kaldi-asr#2088)
  [egs] Mac compatibility fix to SGMM+MMI: remove -T option to cp (kaldi-asr#2087)
  [egs] Copy dictionary-preparation-script fix from fisher-english(8e7793f) to fisher-swbd and ami (kaldi-asr#2084)
  [egs] Small fix to backstitch in AMI scripts (kaldi-asr#2083)
  [scripts] Fix augment_data_dir.py (relates to non-pipe case of wav.scp) (kaldi-asr#2081)
  [egs,scripts] Add OPGRU scripts and recipes (kaldi-asr#1950)
  [egs] Add an l2-regularize-based recipe for image recognition setups (kaldi-asr#2066)
  [src] Bug-fix to assertion in cu-sparse-matrix.cc (RE large matrices) (kaldi-asr#2077)
  [egs] Add a tdnn+lstm+attention+backstitch recipe for tedlium (kaldi-asr#1982)
  [src,egs] Small cosmetic fixes (kaldi-asr#2074)
  [src] Small fix RE CuSparse error code printing (kaldi-asr#2070)
  [src] Fix compilation error on MSVC: missing include. (kaldi-asr#2064)
  [egs] Update to CSJ example scripts, with chain+TDNN recipes.  Thanks: @rickychanhoyin (kaldi-asr#2035)
  [scripts,egs] Convert ". path.sh" to ". ./path.sh" (kaldi-asr#2061)
  [doc] Add documentation about matrix row and column ranges in scp files.
  [egs] Add recipe for Mozilla Common Voice corpus v1 (kaldi-asr#2057)
  [scripts] Fix bug in slurm.pl affecting log format (kaldi-asr#2063)
  [src] Fix some small typos (kaldi-asr#2060)
  [scripts] Adding --num-threads option to ivector extraction scripts; script fixes (kaldi-asr#2055)
  [src] Bug-fix to conceptual bug in Minimum Bayes Risk/sausage code.  Thanks:@jtrmal (kaldi-asr#2056)
  ...
mahsa7823 pushed a commit to mahsa7823/kaldi that referenced this pull request Feb 28, 2018
* [egs] Add recipe for Mozilla Common Voice corpus v1

* Addressing comments

change score.sh to a symlink to steps/score_kaldi.sh; remove path to dirname; replace local/format_data.sh with call to utils/format_lm.sh; use <unk> instead of SIL in lexicon

* Update chain tdnn system and results

* Add license (Apache 2.0) info line to data prep script
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018
* [egs] Add recipe for Mozilla Common Voice corpus v1

* Addressing comments

change score.sh to a symlink to steps/score_kaldi.sh; remove path to dirname; replace local/format_data.sh with call to utils/format_lm.sh; use <unk> instead of SIL in lexicon

* Update chain tdnn system and results

* Add license (Apache 2.0) info line to data prep script
@Cyrix126 commented:

It would be nice to adapt it to the updated version of the Mozilla dataset and to add an option for choosing the language.

@danpovey (Contributor) commented Nov 21, 2019 via email
