[egs] Add recipe for Mozilla Common Voice corpus v1 #2057
Conversation
Note that this recipe currently only uses the "valid" portion of the corpus, that is, utterances that have had at least 2 people listen to them, with the majority of those listeners agreeing that the audio matches the text.
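For readers unfamiliar with the corpus layout: the v1 release ships per-subset CSV metadata (e.g. cv-valid-train.csv) containing up_votes and down_votes columns. The sketch below only illustrates the "valid" criterion described above; it is not the recipe's actual data preparation, and the file path, column positions, and the simplistic comma splitting (which ignores quoted commas in the transcript field) are all assumptions.

```bash
#!/bin/bash
# Hypothetical illustration: keep only clips with at least two votes where
# up-votes outnumber down-votes. Assumed header:
#   filename,text,up_votes,down_votes,age,gender,accent,duration
csv=cv_corpus_v1/cv-valid-train.csv   # assumed path inside the extracted corpus
tail -n +2 "$csv" | awk -F',' '($3 + $4) >= 2 && $3 > $4 { print $1 }' > valid_clips.list
```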
fixed-affine-layer name=lda input=Append(-2,-1,0,1,2,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat

# the first splicing is moved before the lda layer, so no splicing here
relu-batchnorm-layer name=tdnn1 dim=512
This system is rather small for a 500-hour dataset. You may want to try dim=768 instead of 512.
I also notice that in the RESULTS file you called this 1e (IIRC).
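To make the suggestion concrete, here is a truncated sketch of how the widened layers might look in the xconfig heredoc of the tuning script. The experiment directory, input dimensions, and the set of layers shown are illustrative assumptions, not the recipe's exact topology; only the dim change is the point.

```bash
# Sketch only: same lda layer as quoted above, TDNN layers widened to 768.
dir=exp/chain/tdnn1a_sp   # assumed experiment directory
mkdir -p $dir/configs
cat <<EOF > $dir/configs/network.xconfig
  input dim=100 name=ivector
  input dim=40 name=input
  fixed-affine-layer name=lda input=Append(-2,-1,0,1,2,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat
  # the first splicing is moved before the lda layer, so no splicing here
  relu-batchnorm-layer name=tdnn1 dim=768
  relu-batchnorm-layer name=tdnn2 dim=768 input=Append(-1,0,1)
  relu-batchnorm-layer name=tdnn3 dim=768 input=Append(-1,0,1)
EOF
```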
egs/commonvoice/s5/local/score.sh
Outdated
@@ -0,0 +1,65 @@
#!/bin/bash
Can you replace this script with a symlink to steps/score_kaldi.sh, please?
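A minimal sketch of what that looks like, assuming the usual recipe layout in which steps/ is already symlinked at the top of the s5 directory:

```bash
# From egs/commonvoice/s5: replace the local copy with a symlink to the
# shared scoring script (relative path goes through the steps/ symlink).
rm local/score.sh
ln -s ../steps/score_kaldi.sh local/score.sh
```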
egs/commonvoice/s5/run.sh
Outdated
if [ $stage -le 0 ]; then
  mkdir -p $data

  local/download_and_untar.sh $(/usr/bin/dirname $data) $data_url
Is there any particular reason for the absolute pathname /usr/bin/dirname?
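(The change eventually committed simply drops the absolute path and relies on dirname being found on the PATH; a one-line sketch of the resulting call:)

```bash
local/download_and_untar.sh $(dirname $data) $data_url
```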
--trainer.num-epochs=4 \
--trainer.frames-per-iter=1500000 \
--trainer.optimization.num-jobs-initial=3 \
--trainer.optimization.num-jobs-final=3 \
If your setup allows it, it would be a good idea, for speed, to increase num-jobs-final to something like 12.
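For reference, a sketch of the chain training call from the PR's tuning script with that change applied; the remaining egs/tree/directory arguments are elided behind a placeholder variable, so this is an abbreviation rather than the complete command.

```bash
# Sketch: the chain training call with num-jobs-final raised as suggested.
steps/nnet3/chain/train.py --stage=$train_stage \
  --cmd="$decode_cmd" \
  --feat.online-ivector-dir=$train_ivector_dir \
  --feat.cmvn-opts="--norm-means=false --norm-vars=false" \
  --chain.xent-regularize $xent_regularize \
  --chain.leaky-hmm-coefficient=0.1 \
  --chain.l2-regularize=0.00005 \
  --chain.apply-deriv-weights=false \
  --chain.lm-opts="--num-extra-lm-states=2000" \
  --trainer.srand=$srand \
  --trainer.max-param-change=2.0 \
  --trainer.num-epochs=4 \
  --trainer.frames-per-iter=1500000 \
  --trainer.optimization.num-jobs-initial=3 \
  --trainer.optimization.num-jobs-final=12 \
  "${remaining_opts[@]}"   # placeholder for the rest of the recipe's options
```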
Unfortunately, I only have 3 GPUs, but I will change it in the script to 12.
for f in phones.txt words.txt phones.txt L.fst L_disambig.fst phones; do
  cp -r data/lang/$f $test
done
cat $lmdir/lm.arpa | \
I'd prefer the rest of the script to be replaced by utils/format_lm.sh.
Sometimes we use the script utils/make_absolute.sh, which helps with portability.
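For reference, a sketch of how utils/format_lm.sh is typically invoked; the paths below are assumptions about this recipe's layout rather than its actual variable names, and format_lm.sh is usually given a gzipped ARPA LM, hence the compression step.

```bash
# Build data/lang_test from the existing lang directory and the ARPA LM.
# Argument order: <lang-dir> <arpa-LM> <lexicon> <out-dir>.
gzip -c $lmdir/lm.arpa > $lmdir/lm.arpa.gz   # assumed location of the ARPA LM
utils/format_lm.sh data/lang $lmdir/lm.arpa.gz \
  data/local/dict/lexicon.txt data/lang_test   # lexicon path is an assumption
```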
It would be better if you installed GridEngine and set up the 'gpu' resource, if you haven't already -- that way you could run with more jobs even though you only have 3 GPUs. I don't like scripts checked in that haven't been tested with those settings.
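As a rough sketch of what that setup usually involves on a (Son of) Grid Engine cluster; the host name is a placeholder and the Kaldi parallelization documentation is the authoritative reference here.

```bash
# 1) Declare a consumable 'gpu' complex (add this line via `qconf -mc`):
#    gpu    gpu    INT    <=    YES    YES    0    0
# 2) Tell the scheduler how many GPUs each execution host has:
qconf -aattr exechost complex_values gpu=3 gpu-host.example.com   # placeholder hostname
# 3) Jobs then request a GPU with `-l gpu=1`, which queue.pl passes through
#    when the training scripts ask for the 'gpu' resource.
```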
No problem, I have GridEngine set up. I'm going to test it with jobs-final=12 (it's just going to take a while longer).
It actually won't take any longer, because the total number of jobs is the same as before; it's just that (conceptually) more of them run in parallel. I recommend increasing the dim from 512 to 768 at the same time.
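A back-of-the-envelope sketch of why the total work is unchanged; the formula is a simplification of what steps/nnet3/chain/train.py actually computes, and the frame count below is a made-up placeholder.

```bash
# The number of iterations scales roughly with
#   epochs * total_frames / (frames_per_iter * num_jobs)
# while each iteration launches num_jobs GPU jobs, so the total number of
# GPU jobs is about epochs * total_frames / frames_per_iter, independent of
# num_jobs. Raising num-jobs-final just runs more of them at once.
epochs=4
frames_per_iter=1500000
total_frames=180000000   # placeholder: ~500 hours at 100 frames per second
echo "approx. total GPU jobs: $(( epochs * total_frames / frames_per_iter ))"
```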
change score.sh to a symlink to steps/score_kaldi.sh; remove path to dirname; replace local/format_data.sh with call to utils/format_lm.sh; use <unk> instead of SIL in lexicon
I made the following changes:
- changed score.sh to a symlink to steps/score_kaldi.sh
- removed the path to dirname
- replaced local/format_data.sh with a call to utils/format_lm.sh
- used <unk> instead of SIL in the lexicon

I'm currently running the whole recipe from start to finish. Once that's done I'll add another commit with the changes to …
Thanks a lot! @jtrmal, please merge when and if you're OK with it. No need to check more, necessarily.
Thanks for the quick review and the helpful comments!
All right, I'll merge. Thanks a lot!
* 'master' of https://github.com/kaldi-asr/kaldi: (58 commits)
  [src] Fix bug in nnet3 optimization, affecting Scale() operation; cosmetic fixes. (kaldi-asr#2088)
  [egs] Mac compatibility fix to SGMM+MMI: remove -T option to cp (kaldi-asr#2087)
  [egs] Copy dictionary-preparation-script fix from fisher-english(8e7793f) to fisher-swbd and ami (kaldi-asr#2084)
  [egs] Small fix to backstitch in AMI scripts (kaldi-asr#2083)
  [scripts] Fix augment_data_dir.py (relates to non-pipe case of wav.scp) (kaldi-asr#2081)
  [egs,scripts] Add OPGRU scripts and recipes (kaldi-asr#1950)
  [egs] Add an l2-regularize-based recipe for image recognition setups (kaldi-asr#2066)
  [src] Bug-fix to assertion in cu-sparse-matrix.cc (RE large matrices) (kaldi-asr#2077)
  [egs] Add a tdnn+lstm+attention+backstitch recipe for tedlium (kaldi-asr#1982)
  [src,egs] Small cosmetic fixes (kaldi-asr#2074)
  [src] Small fix RE CuSparse error code printing (kaldi-asr#2070)
  [src] Fix compilation error on MSVC: missing include. (kaldi-asr#2064)
  [egs] Update to CSJ example scripts, with chain+TDNN recipes. Thanks: @rickychanhoyin (kaldi-asr#2035)
  [scripts,egs] Convert ". path.sh" to ". ./path.sh" (kaldi-asr#2061)
  [doc] Add documentation about matrix row and column ranges in scp files.
  [egs] Add recipe for Mozilla Common Voice corpus v1 (kaldi-asr#2057)
  [scripts] Fix bug in slurm.pl affecting log format (kaldi-asr#2063)
  [src] Fix some small typos (kaldi-asr#2060)
  [scripts] Adding --num-threads option to ivector extraction scripts; script fixes (kaldi-asr#2055)
  [src] Bug-fix to conceptual bug in Minimum Bayes Risk/sausage code. Thanks:@jtrmal (kaldi-asr#2056)
  ...
* [egs] Add recipe for Mozilla Common Voice corpus v1
* Addressing comments: change score.sh to a symlink to steps/score_kaldi.sh; remove path to dirname; replace local/format_data.sh with call to utils/format_lm.sh; use <unk> instead of SIL in lexicon
* Update chain tdnn system and results
* Add license (Apache 2.0) info line to data prep script
It would be nice to adapt it to the updated version of the Mozilla dataset and to add an option for choosing the language.
You're right, it would be nice, and I am hoping someone will contribute it. But I have more pressing matters to attend to.
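For whoever picks that up, a minimal sketch of what a language option could look like at the top of run.sh. The variable name, default value, paths, and download URL are all assumptions (the newer multi-language releases are laid out differently from v1 and are not downloadable from a fixed public URL like this).

```bash
# Hypothetical: select a Common Voice language at the top of run.sh.
lang=en                       # e.g. en, de, fr; assumed default
data=$HOME/cv_corpus          # assumed download/extraction root
data_url=https://example.com/cv-corpus-${lang}.tar.gz   # placeholder URL

. ./utils/parse_options.sh    # lets callers override it: ./run.sh --lang de
```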
This is a basic recipe for the recently released Mozilla Common Voice corpus (v1, CC-0 licensed). See https://voice.mozilla.org/data
Some of the data preparation scripts were taken from the voxforge recipe (dict, LM). The systems and chain model setup were adapted from mini_librispeech (including speed perturbation, PCA transform for i-vector extraction, etc.).
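For orientation, a sketch of the kind of commands that adaptation implies, following the mini_librispeech pattern; the data and experiment directory names are assumptions about this recipe and have not been checked against it.

```bash
# Speed-perturb the training data, extract hires MFCCs, and estimate a PCA
# transform (instead of LDA+MLLT) for the i-vector extractor, as in
# mini_librispeech. Directory names below are illustrative.
utils/data/perturb_data_dir_speed_3way.sh data/valid_train data/valid_train_sp
utils/copy_data_dir.sh data/valid_train_sp data/valid_train_sp_hires
steps/make_mfcc.sh --mfcc-config conf/mfcc_hires.conf --nj 20 --cmd "$train_cmd" \
  data/valid_train_sp_hires
steps/online/nnet2/get_pca_transform.sh --cmd "$train_cmd" \
  --max-utts 10000 --subsample 2 \
  data/valid_train_sp_hires exp/nnet3/pca_transform
```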
I did not tune the setup; the chain system already has WERs of about 5% (see RESULTS).