[egs] Add "formosa_speech" recipe (Taiwanese Mandarin ASR) (#2474)
Showing 28 changed files with 1,702 additions and 0 deletions.
@@ -0,0 +1,22 @@
### Welcome to the demo recipe of the Formosa Speech in the Wild (FSW) project ###

The language habits of Taiwanese people differ from those of other Mandarin speakers, in both accent and culture [1]. In particular, Taiwanese users write traditional Chinese characters (i.e., 繁體中文). To address this, a Taiwanese speech corpus collection project, "Formosa Speech in the Wild (FSW)", was initiated in 2017 to improve the development of Taiwanese-specific speech recognition techniques.

The FSW corpus will be a large-scale database of real-life, multi-genre Taiwanese spontaneous speech, collected and transcribed from various sources (radio, TV, open courses, etc.). To demonstrate that this database is a reasonable data resource for Taiwanese spontaneous speech recognition research, a baseline recipe is provided here so that everybody, especially students, can develop their own systems easily and quickly.

This recipe is based on the "NER-Trs-Vol1" corpus (about 150 hours of broadcast radio speech selected from FSW). For more details, please visit:
* Formosa Speech in the Wild (FSW) project (https://sites.google.com/speech.ntut.edu.tw/fsw)

If you want to apply for the NER-Trs-Vol1 corpus, please contact Yuan-Fu Liao (廖元甫) via "yfliao@mail.ntut.edu.tw". This corpus is for non-commercial research/education use only and will be distributed via our GitLab server at https://speech.nchc.org.tw.

Any bugs, errors, comments, or suggestions are very welcome.

Yuan-Fu Liao (廖元甫)
Associate Professor
Department of Electronic Engineering,
National Taipei University of Technology
http://www.ntut.edu.tw/~yfliao
yfliao@mail.ntut.edu.tw

............
[1] The languages of Taiwan consist of several varieties under the Austronesian and Sino-Tibetan language families. Taiwanese Mandarin, Hokkien, Hakka and the Formosan languages are used by 83.5%, 81.9%, 6.6% and 1.4% of the population respectively (2010). Given the prevalent use of Taiwanese Hokkien, the Mandarin spoken in Taiwan has to a great extent been influenced by it.
@@ -0,0 +1,43 @@
#
# Reference results
#
# Experimental settings:
#
# training set: shows CS, BG, DA, QG, SR, SY and WK; 18,977 utt., 1,088,948 words in total
# test set:     shows JZ, GJ, KX and YX; 2,112 utt., 135,972 words in total
# eval set:     shows JX, TD and WJ; 2,222 utt., 104,648 words in total
#
# lexicon: 274,036 words
# phones (IPA): 196 (tonal)
#

# WER: test

%WER 61.32 [ 83373 / 135972, 5458 ins, 19156 del, 58759 sub ] exp/mono/decode_test/wer_11_0.0
%WER 41.00 [ 55742 / 135972, 6725 ins, 12763 del, 36254 sub ] exp/tri1/decode_test/wer_15_0.0
%WER 40.41 [ 54948 / 135972, 7366 ins, 11505 del, 36077 sub ] exp/tri2/decode_test/wer_14_0.0
%WER 38.67 [ 52574 / 135972, 6855 ins, 11250 del, 34469 sub ] exp/tri3a/decode_test/wer_15_0.0
%WER 35.70 [ 48546 / 135972, 7197 ins, 9717 del, 31632 sub ] exp/tri4a/decode_test/wer_17_0.0
%WER 32.11 [ 43661 / 135972, 6112 ins, 10185 del, 27364 sub ] exp/tri5a/decode_test/wer_17_0.5
%WER 31.36 [ 42639 / 135972, 6846 ins, 8860 del, 26933 sub ] exp/tri5a_cleaned/decode_test/wer_17_0.5
%WER 24.43 [ 33218 / 135972, 5524 ins, 7583 del, 20111 sub ] exp/nnet3/tdnn_sp/decode_test/wer_12_0.0
%WER 23.95 [ 32568 / 135972, 4457 ins, 10271 del, 17840 sub ] exp/chain/tdnn_1a_sp/decode_test/wer_10_0.0
%WER 23.54 [ 32006 / 135972, 4717 ins, 8644 del, 18645 sub ] exp/chain/tdnn_1b_sp/decode_test/wer_10_0.0
%WER 20.64 [ 28067 / 135972, 4434 ins, 7946 del, 15687 sub ] exp/chain/tdnn_1c_sp/decode_test/wer_11_0.0
%WER 20.98 [ 28527 / 135972, 4706 ins, 7816 del, 16005 sub ] exp/chain/tdnn_1d_sp/decode_test/wer_10_0.0

# CER: test
# (these lines come from the cer_* scoring files; Kaldi's compute-wer labels
# its output "%WER" even when the scored tokens are characters)

%WER 54.09 [ 116688 / 215718, 4747 ins, 24510 del, 87431 sub ] exp/mono/decode_test/cer_10_0.0
%WER 32.61 [ 70336 / 215718, 5866 ins, 16282 del, 48188 sub ] exp/tri1/decode_test/cer_13_0.0
%WER 32.10 [ 69238 / 215718, 6186 ins, 15772 del, 47280 sub ] exp/tri2/decode_test/cer_13_0.0
%WER 30.40 [ 65583 / 215718, 6729 ins, 13115 del, 45739 sub ] exp/tri3a/decode_test/cer_12_0.0
%WER 27.53 [ 59389 / 215718, 6311 ins, 13008 del, 40070 sub ] exp/tri4a/decode_test/cer_15_0.0
%WER 24.21 [ 52232 / 215718, 6425 ins, 11543 del, 34264 sub ] exp/tri5a/decode_test/cer_15_0.0
%WER 23.41 [ 50492 / 215718, 6645 ins, 10997 del, 32850 sub ] exp/tri5a_cleaned/decode_test/cer_17_0.0
%WER 17.07 [ 36829 / 215718, 4734 ins, 9938 del, 22157 sub ] exp/nnet3/tdnn_sp/decode_test/cer_12_0.0
%WER 16.83 [ 36305 / 215718, 4772 ins, 10810 del, 20723 sub ] exp/chain/tdnn_1a_sp/decode_test/cer_9_0.0
%WER 16.44 [ 35459 / 215718, 4216 ins, 11278 del, 19965 sub ] exp/chain/tdnn_1b_sp/decode_test/cer_10_0.0
%WER 13.72 [ 29605 / 215718, 4678 ins, 8066 del, 16861 sub ] exp/chain/tdnn_1c_sp/decode_test/cer_10_0.0
%WER 14.08 [ 30364 / 215718, 5182 ins, 7588 del, 17594 sub ] exp/chain/tdnn_1d_sp/decode_test/cer_9_0.0
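
# The tables above can be regenerated once decoding finishes; a sketch of the
# usual way (assuming the standard Kaldi exp/ layout) is to pick the best
# operating point per system with utils/best_wer.sh:
#
#   for x in exp/*/decode_test exp/*/*/decode_test; do
#     [ -d $x ] && grep WER $x/wer_* | utils/best_wer.sh   # word level
#     [ -d $x ] && grep WER $x/cer_* | utils/best_wer.sh   # character level
#   done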
@@ -0,0 +1,27 @@
# "queue.pl" uses qsub. The options to it are
# options to qsub. If you have GridEngine installed,
# change this to a queue you have access to.
# Otherwise, use "run.pl", which will run jobs locally
# (make sure your --num-jobs options are no more than
# the number of CPUs on your machine).

# Run locally:
#export train_cmd=run.pl
#export decode_cmd=run.pl

# JHU cluster (or most clusters using GridEngine, with a suitable
# conf/queue.conf).
export train_cmd="queue.pl"
export decode_cmd="queue.pl --mem 4G"

host=$(hostname -f)
if [ ${host#*.} == "fit.vutbr.cz" ]; then
  queue_conf=$HOME/queue_conf/default.conf # see example /homes/kazi/iveselyk/queue_conf/default.conf
  export train_cmd="queue.pl --config $queue_conf --mem 2G --matylda 0.2"
  export decode_cmd="queue.pl --config $queue_conf --mem 3G --matylda 0.1"
  export cuda_cmd="queue.pl --config $queue_conf --gpu 1 --mem 10G --tmp 40G"
elif [ ${host#*.} == "cm.cluster" ]; then
  # MARCC bluecrab cluster:
  export train_cmd="slurm.pl --time 4:00:00"
  export decode_cmd="slurm.pl --mem 4G --time 4:00:00"
fi
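
When queue.pl is given a --config file, as in the fit.vutbr.cz branch above, that file maps generic options such as --mem and --gpu onto the site's qsub flags. A minimal sketch of such a conf/queue.conf, modeled on the default documented in Kaldi (the GPU queue name "g.q" and the site-specific --matylda resource are assumptions you would adapt to your cluster):

    command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
    option mem=* -l mem_free=$0,ram_free=$0
    option num_threads=* -pe smp $0
    option max_jobs_run=* -tc $0
    default gpu=0
    option gpu=0
    option gpu=* -l gpu=$0 -q g.q
    option matylda=* -l matylda5=$0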
@@ -0,0 +1,5 @@
beam=11.0 # beam for decoding. Was 13.0 in the scripts.
first_beam=8.0 # beam for 1st-pass decoding in SAT.
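
These values are picked up by the GMM decoding scripts when launched with --config; a usage sketch (assuming the tri5a SAT system built earlier in this recipe):

    steps/decode_fmllr.sh --config conf/decode.config --nj 10 \
      --cmd "$decode_cmd" exp/tri5a/graph data/test exp/tri5a/decode_test

Here first_beam controls the initial speaker-independent pass inside decode_fmllr.sh, and beam the final decoding pass.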
@@ -0,0 +1,2 @@
--use-energy=false # only non-default option.
--sample-frequency=16000
@@ -0,0 +1,10 @@
# config for high-resolution MFCC features, intended for neural network training.
# Note: we keep all cepstra, so it has the same info as filterbank features,
# but MFCC is more easily compressible (because less correlated), which is why
# we prefer this method.
--use-energy=false        # use average of log energy, not energy.
--sample-frequency=16000  # this corpus is sampled at 16 kHz
--num-mel-bins=40         # similar to Google's setup.
--num-ceps=40             # there is no dimensionality reduction.
--low-freq=40             # low cutoff frequency for mel bins
--high-freq=-200          # high cutoff frequency, relative to the Nyquist of 8000 (= 7800)
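
A sketch of how this config is typically consumed (the data directory name follows the _sp_hires convention of the nnet3 scripts and is an assumption here):

    steps/make_mfcc.sh --mfcc-config conf/mfcc_hires.conf --nj 10 \
      --cmd "$train_cmd" data/train_sp_hires
    steps/compute_cmvn_stats.sh data/train_sp_hires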
@@ -0,0 +1 @@
# configuration file for apply-cmvn-online, used when invoking online2-wav-nnet3-latgen-faster.
@@ -0,0 +1 @@
--sample-frequency=16000
@@ -0,0 +1 @@
tuning/run_tdnn_1d.sh
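
This one-line file is a symlink, following the usual Kaldi convention of pointing run_tdnn.sh at the current-best tuning script; switching the default system is just re-pointing the link (paths assumed to follow the standard local/chain/ layout):

    ln -sf tuning/run_tdnn_1d.sh local/chain/run_tdnn.sh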
@@ -0,0 +1,181 @@
#!/bin/bash

# This script is based on run_tdnn_7h.sh in the swbd chain recipe.

set -e

# configs for 'chain'
affix=1a
stage=0
train_stage=-10
get_egs_stage=-10
dir=exp/chain/tdnn  # Note: _sp will get added to this
decode_iter=

# training options
num_epochs=4
initial_effective_lrate=0.001
final_effective_lrate=0.0001
max_param_change=2.0
final_layer_normalize_target=0.5
num_jobs_initial=2
num_jobs_final=12
minibatch_size=128
frames_per_eg=150,110,90
remove_egs=false
common_egs_dir=
xent_regularize=0.1

# End configuration section.
echo "$0 $@"  # Print the command line for logging

. ./cmd.sh
. ./path.sh
. ./utils/parse_options.sh

if ! cuda-compiled; then
  cat <<EOF && exit 1
This script is intended to be used with GPUs, but you have not compiled Kaldi with CUDA.
If you want to use GPUs (and have them), go to src/, and configure and make on a machine
where "nvcc" is installed.
EOF
fi

# The iVector-extraction and feature-dumping parts are the same as the standard
# nnet3 setup, and you can skip them by setting "--stage 8" if you have already
# run those things.

dir=${dir}${affix:+_$affix}_sp
train_set=train_sp
ali_dir=exp/tri5a_sp_ali
treedir=exp/chain/tri6_7d_tree_sp
lang=data/lang_chain

# If we are using the speed-perturbed data, we need to generate
# alignments for it.
local/nnet3/run_ivector_common.sh --stage $stage || exit 1;

if [ $stage -le 7 ]; then
  # Get the alignments as lattices (gives the LF-MMI training more freedom).
  # Use the same num-jobs as the alignments.
  nj=$(cat $ali_dir/num_jobs) || exit 1;
  steps/align_fmllr_lats.sh --nj $nj --cmd "$train_cmd" data/$train_set \
    data/lang exp/tri5a exp/tri5a_sp_lats
  rm exp/tri5a_sp_lats/fsts.*.gz # save space
fi

if [ $stage -le 8 ]; then
  # Create a version of the lang/ directory that has one state per phone in the
  # topo file. [Note: it really has two states; the first one is only repeated
  # once, the second one has zero or more repeats.]
  rm -rf $lang
  cp -r data/lang $lang
  silphonelist=$(cat $lang/phones/silence.csl) || exit 1;
  nonsilphonelist=$(cat $lang/phones/nonsilence.csl) || exit 1;
  # Use our special topology... note that later on we may have to tune this
  # topology.
  steps/nnet3/chain/gen_topo.py $nonsilphonelist $silphonelist >$lang/topo
fi

if [ $stage -le 9 ]; then
  # Build a tree using our new topology. This is the critically different
  # step compared with other recipes.
  steps/nnet3/chain/build_tree.sh --frame-subsampling-factor 3 \
    --context-opts "--context-width=2 --central-position=1" \
    --cmd "$train_cmd" 5000 data/$train_set $lang $ali_dir $treedir
fi
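
# A quick optional sanity check: tree-info prints the tree's size and context,
# and its num-pdfs field is exactly what stage 10 reads back below, e.g.:
#   tree-info $treedir/tree   # -> num-pdfs <N>, context-width 2, central-position 1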

if [ $stage -le 10 ]; then
  echo "$0: creating neural net configs using the xconfig parser";

  num_targets=$(tree-info $treedir/tree | grep num-pdfs | awk '{print $2}')
  # print(...) works under both python2 and python3.
  learning_rate_factor=$(python -c "print(0.5/$xent_regularize)")

  mkdir -p $dir/configs
  cat <<EOF > $dir/configs/network.xconfig
  input dim=100 name=ivector
  input dim=43 name=input  # 43 dims: presumably 40 hires MFCCs plus 3 pitch features

  # please note that it is important to have the input layer with name=input
  # as the layer immediately preceding the fixed-affine-layer, to enable
  # the use of short notation for the descriptor
  fixed-affine-layer name=lda input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat

  # the first splicing is moved before the lda layer, so no splicing here
  relu-batchnorm-layer name=tdnn1 dim=625
  relu-batchnorm-layer name=tdnn2 input=Append(-1,0,1) dim=625
  relu-batchnorm-layer name=tdnn3 input=Append(-1,0,1) dim=625
  relu-batchnorm-layer name=tdnn4 input=Append(-3,0,3) dim=625
  relu-batchnorm-layer name=tdnn5 input=Append(-3,0,3) dim=625
  relu-batchnorm-layer name=tdnn6 input=Append(-3,0,3) dim=625

  ## adding the layers for the chain branch
  relu-batchnorm-layer name=prefinal-chain input=tdnn6 dim=625 target-rms=0.5
  output-layer name=output include-log-softmax=false dim=$num_targets max-change=1.5

  # adding the layers for the xent branch
  # This block prints the configs for a separate output that will be
  # trained with a cross-entropy objective in the 'chain' models... this
  # has the effect of regularizing the hidden parts of the model. We use
  # 0.5 / args.xent_regularize as the learning rate factor; this means the
  # xent final layer learns at a rate independent of the regularization
  # constant, and the 0.5 was tuned so as to make the relative progress
  # similar in the xent and regular final layers.
  relu-batchnorm-layer name=prefinal-xent input=tdnn6 dim=625 target-rms=0.5
  output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor max-change=1.5
EOF

  steps/nnet3/xconfig_to_configs.py --xconfig-file $dir/configs/network.xconfig --config-dir $dir/configs/
fi
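
# Optional: the generated configuration can be sanity-checked before the
# (long) training run; nnet3-info summarizes the reference network that
# xconfig_to_configs.py writes out, e.g.:
#   nnet3-info $dir/configs/ref.raw | head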

if [ $stage -le 11 ]; then
  steps/nnet3/chain/train.py --stage $train_stage \
    --cmd "$decode_cmd" \
    --feat.online-ivector-dir exp/nnet3/ivectors_${train_set} \
    --feat.cmvn-opts "--norm-means=false --norm-vars=false" \
    --chain.xent-regularize $xent_regularize \
    --chain.leaky-hmm-coefficient 0.1 \
    --chain.l2-regularize 0.00005 \
    --chain.apply-deriv-weights false \
    --chain.lm-opts="--num-extra-lm-states=2000" \
    --egs.dir "$common_egs_dir" \
    --egs.stage $get_egs_stage \
    --egs.opts "--frames-overlap-per-eg 0" \
    --egs.chunk-width $frames_per_eg \
    --trainer.num-chunk-per-minibatch $minibatch_size \
    --trainer.frames-per-iter 1500000 \
    --trainer.num-epochs $num_epochs \
    --trainer.optimization.num-jobs-initial $num_jobs_initial \
    --trainer.optimization.num-jobs-final $num_jobs_final \
    --trainer.optimization.initial-effective-lrate $initial_effective_lrate \
    --trainer.optimization.final-effective-lrate $final_effective_lrate \
    --trainer.max-param-change $max_param_change \
    --cleanup.remove-egs $remove_egs \
    --feat-dir data/${train_set}_hires \
    --tree-dir $treedir \
    --lat-dir exp/tri5a_sp_lats \
    --use-gpu wait \
    --dir $dir || exit 1;
fi
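
# Optional: training can be summarized after (or during) the run with
#   steps/info/chain_dir_info.pl $dir
# which prints a one-line report (iteration count, number of parameters,
# and the final train/valid objective values).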

if [ $stage -le 12 ]; then
  # Note: it might appear that this $lang directory is mismatched, and it is as
  # far as the 'topo' is concerned, but this script doesn't read the 'topo' from
  # the lang directory.
  utils/mkgraph.sh --self-loop-scale 1.0 data/lang_test $dir $dir/graph
fi

graph_dir=$dir/graph
if [ $stage -le 13 ]; then
  for test_set in test eval; do
    steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 \
      --nj 10 --cmd "$decode_cmd" \
      --online-ivector-dir exp/nnet3/ivectors_$test_set \
      $graph_dir data/${test_set}_hires $dir/decode_${test_set} || exit 1;
  done
  wait;
fi

exit 0;