
[egs] Add recipe for Mozilla Common Voice corpus v1 #2057

Merged: 5 commits, Dec 4, 2017

Conversation

@entn-at (Contributor) commented Dec 2, 2017

This is a basic recipe for the recently released Mozilla Common Voice corpus (v1, CC-0 licensed). See https://voice.mozilla.org/data

Some of the data preparation scripts were taken from the voxforge recipe (dict, LM). The systems and chain model setup were adapted from mini_librispeech (including speed perturbation, PCA transform for i-vector extraction, etc.).

I did not tune the setup; the chain system already achieves WERs of about 5% (see RESULTS).
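For context, a rough sketch of the mini_librispeech-style i-vector common setup the description refers to (standard Kaldi scripts; the exact paths, options and dimensions used in this PR may differ):

# 3-way speed perturbation (0.9x / 1.0x / 1.1x) of the training data
utils/data/perturb_data_dir_speed_3way.sh data/train data/train_sp
utils/copy_data_dir.sh data/train_sp data/train_sp_hires
steps/make_mfcc.sh --mfcc-config conf/mfcc_hires.conf data/train_sp_hires
steps/compute_cmvn_stats.sh data/train_sp_hires
# PCA transform (instead of LDA+MLLT) feeding the diagonal UBM / i-vector extractor
steps/online/nnet2/get_pca_transform.sh --max-utts 10000 --subsample 2 \
  data/train_sp_hires exp/nnet3/pca_transform
steps/online/nnet2/train_diag_ubm.sh data/train_sp_hires 512 \
  exp/nnet3/pca_transform exp/nnet3/diag_ubm
steps/online/nnet2/train_ivector_extractor.sh data/train_sp_hires \
  exp/nnet3/diag_ubm exp/nnet3/extractor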

@entn-at (Contributor, Author) commented Dec 2, 2017

Note that this recipe currently only uses the "valid" portion of the corpus, i.e., utterances that at least two people have listened to, with the majority of those listeners agreeing that the audio matches the text.

fixed-affine-layer name=lda input=Append(-2,-1,0,1,2,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat

# the first splicing is moved before the lda layer, so no splicing here
relu-batchnorm-layer name=tdnn1 dim=512
Contributor commented:

This system is rather small for a 500-hour dataset. You may want to try dim=768 instead of 512.

I also notice that in the RESULTS file you called this 1e (IIRC).
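For reference, the suggested change is just a wider hidden dimension in the xconfig, applied to each of the TDNN layers, e.g.:

relu-batchnorm-layer name=tdnn1 dim=768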

@@ -0,0 +1,65 @@
#!/bin/bash
Contributor commented:

Can you replace this script with a symlink to steps/score_kaldi.sh, please?
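(For reference, assuming the usual egs/<corpus>/s5 layout where steps is a symlink into wsj/s5, that replacement is just the following; the exact recipe path is assumed here:)

cd egs/commonvoice/s5/local    # recipe directory name assumed
rm score.sh
ln -s ../steps/score_kaldi.sh score.sh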

if [ $stage -le 0 ]; then
mkdir -p $data

local/download_and_untar.sh $(/usr/bin/dirname $data) $data_url
Contributor commented:

Is there any particular reason for using the absolute pathname /usr/bin/dirname?
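(The portable form would simply let the shell resolve dirname via PATH:)

local/download_and_untar.sh $(dirname $data) $data_url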

--trainer.num-epochs=4 \
--trainer.frames-per-iter=1500000 \
--trainer.optimization.num-jobs-initial=3 \
--trainer.optimization.num-jobs-final=3 \
Contributor commented:

If your setup allows it, it would be a good idea, for speed, to increase num-jobs-final to something like 12.
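For reference, this only changes the final parallelism: the chain trainer ramps the number of GPU jobs from num-jobs-initial up to num-jobs-final over training and scales the effective learning rate with the job count, so only the final value needs raising, e.g.:

--trainer.optimization.num-jobs-initial=3 \
--trainer.optimization.num-jobs-final=12 \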

Contributor Author replied:

Unfortunately, I only have 3 GPUs, but I will change it in the script to 12.

for f in phones.txt words.txt phones.txt L.fst L_disambig.fst phones; do
cp -r data/lang/$f $test
done
cat $lmdir/lm.arpa | \
Contributor commented:

I'd prefer the rest of the script to be replaced by utils/format_lm.sh.
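For reference, utils/format_lm.sh takes the lang directory, a gzipped ARPA LM, the lexicon, and an output directory, and builds G.fst there, replacing the hand-rolled FST construction above. A sketch, with paths illustrative (the lm.arpa may need gzipping first):

gzip -c $lmdir/lm.arpa > $lmdir/lm.arpa.gz
utils/format_lm.sh data/lang $lmdir/lm.arpa.gz data/local/dict/lexicon.txt $test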

@danpovey (Contributor) commented Dec 2, 2017 via email

@danpovey (Contributor) commented Dec 2, 2017 via email

@entn-at (Contributor, Author) commented Dec 2, 2017

No problem, I have GridEngine set up. I'm going to test it with num-jobs-final=12 (it's just going to take a while longer).

@danpovey (Contributor) commented Dec 2, 2017 via email

change score.sh to a symlink to steps/score_kaldi.sh; remove path to dirname; replace local/format_data.sh with call to utils/format_lm.sh; use <unk> instead of SIL in lexicon
@entn-at (Contributor, Author) commented Dec 2, 2017

I made the following changes:

  • change score.sh to a symlink to steps/score_kaldi.sh
  • remove absolute path to dirname
  • replace local/format_data.sh with a call to utils/format_lm.sh
  • use <unk> instead of SIL in lexicon

I'm currently running the whole recipe from start to finish. Once that's done I'll add another commit with the changes to run_tdnn_1a.sh and RESULTS.
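Regarding the last item in the list above, for reference: in Kaldi the OOV word is passed to utils/prepare_lang.sh and needs a pronunciation in the lexicon, typically the spoken-noise phone rather than silence. A sketch (dict paths and phone name illustrative):

echo "<unk> SPN" >> data/local/dict/lexicon.txt    # map OOVs to spoken noise, not SIL
utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang_tmp data/lang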

@danpovey (Contributor) commented Dec 4, 2017

Thanks a lot! @jtrmal, please merge when and if you're OK with it. No need to check more, necessarily.

@entn-at (Contributor, Author) commented Dec 4, 2017

Thanks for the quick review and the helpful comments!

@jtrmal (Contributor) commented Dec 4, 2017

All right, I'll merge. Thanks a lot!

@jtrmal merged commit 93ceca7 into kaldi-asr:master on Dec 4, 2017
kronos-cm added a commit to kronos-cm/kaldi that referenced this pull request Dec 18, 2017
* 'master' of https://github.com/kaldi-asr/kaldi: (58 commits)
  [src] Fix bug in nnet3 optimization, affecting Scale() operation; cosmetic fixes. (kaldi-asr#2088)
  [egs] Mac compatibility fix to SGMM+MMI: remove -T option to cp (kaldi-asr#2087)
  [egs] Copy dictionary-preparation-script fix from fisher-english(8e7793f) to fisher-swbd and ami (kaldi-asr#2084)
  [egs] Small fix to backstitch in AMI scripts (kaldi-asr#2083)
  [scripts] Fix augment_data_dir.py (relates to non-pipe case of wav.scp) (kaldi-asr#2081)
  [egs,scripts] Add OPGRU scripts and recipes (kaldi-asr#1950)
  [egs] Add an l2-regularize-based recipe for image recognition setups (kaldi-asr#2066)
  [src] Bug-fix to assertion in cu-sparse-matrix.cc (RE large matrices) (kaldi-asr#2077)
  [egs] Add a tdnn+lstm+attention+backstitch recipe for tedlium (kaldi-asr#1982)
  [src,egs] Small cosmetic fixes (kaldi-asr#2074)
  [src] Small fix RE CuSparse error code printing (kaldi-asr#2070)
  [src] Fix compilation error on MSVC: missing include. (kaldi-asr#2064)
  [egs] Update to CSJ example scripts, with chain+TDNN recipes.  Thanks: @rickychanhoyin (kaldi-asr#2035)
  [scripts,egs] Convert ". path.sh" to ". ./path.sh" (kaldi-asr#2061)
  [doc] Add documentation about matrix row and column ranges in scp files.
  [egs] Add recipe for Mozilla Common Voice corpus v1 (kaldi-asr#2057)
  [scripts] Fix bug in slurm.pl affecting log format (kaldi-asr#2063)
  [src] Fix some small typos (kaldi-asr#2060)
  [scripts] Adding --num-threads option to ivector extraction scripts; script fixes (kaldi-asr#2055)
  [src] Bug-fix to conceptual bug in Minimum Bayes Risk/sausage code.  Thanks:@jtrmal (kaldi-asr#2056)
  ...
mahsa7823 pushed a commit to mahsa7823/kaldi that referenced this pull request Feb 28, 2018
* [egs] Add recipe for Mozilla Common Voice corpus v1

* Addressing comments

change score.sh to a symlink to steps/score_kaldi.sh; remove path to dirname; replace local/format_data.sh with call to utils/format_lm.sh; use <unk> instead of SIL in lexicon

* Update chain tdnn system and results

* Add license (Apache 2.0) info line to data prep script
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018
* [egs] Add recipe for Mozilla Common Voice corpus v1

* Addressing comments

change score.sh to a symlink to steps/score_kaldi.sh; remove path to dirname; replace local/format_data.sh with call to utils/format_lm.sh; use <unk> instead of SIL in lexicon

* Update chain tdnn system and results

* Add license (Apache 2.0) info line to data prep script
@Cyrix126 commented:

It would be nice to adapt it to the updated version of the Mozilla dataset and to add an option for choosing the language.

@danpovey (Contributor) commented Nov 21, 2019 via email
