Add OCR/Handwriting Recognition examples #1984
Conversation
signal(SIGPIPE, SIG_DFL)

parser = argparse.ArgumentParser(
    description="""Generates and saves the feature vectors""")
It would be good to add some description of the types of augmentation you are doing in this script.
egs/iam/s5/local/chain/run_cnn_1a.sh
Outdated
frame_subsampling_factor=4
alignment_subsampling_factor=1
# training chunk-options
chunk_width=340,300,200,100
Is there any reason for this chunk width? Should it be a multiple of 32?
We can change this to something less unusual like 300,200,100
Again, the reason for choosing this is that the average number of frames per phone/letter is almost 2 times larger for OCR.
egs/iam/s5/local/chain/run_cnn_1a.sh
Outdated
lat_dir=exp/chain${nnet3_affix}/${gmm}_${train_set}_lats
dir=exp/chain${nnet3_affix}/cnn${affix}
train_data_dir=data/${train_set}
lores_train_data_dir=$train_data_dir  # for the start, use the same data for gmm and chain
Why did you define lores_train_data_dir? Isn't it the same as train_data_dir?
This was modified from an ASR recipe and we didn't remove this
variable so we could optionally experiment with different resolutions for
the gmm and chain systems. Anyway, the gmm system does not give good results
so I guess we can remove this and focus on the chain setup only.
egs/iam/s5/local/chain/run_cnn_1a.sh
Outdated
# chain options
train_stage=-10
xent_regularize=0.1
frame_subsampling_factor=4
Does 4 work better than 3 in HWR?
It has not been tested yet. The reason for choosing
a larger factor was that the average number of frames per
word in OCR (when the line images are scaled to have a height of 40) is almost 2 times that of ASR.
The best word error rate with frame subsampling factor (FSF) 4 is slightly better than with 3 or 5: WER with FSF=3 or 5 is close to 14.80%, while with FSF=4 it is close to 14.50%.

FSF | Best WER(%)
--- | ---
3 | 14.81
4 | 14.51
5 | 14.80
import sys
import numpy as np
from scipy import misc

parser = argparse.ArgumentParser(description="""uses phones to convert unk to word""")
It would be good to add an extended description for this function and its arguments. This script can be used in other applications.
Actually I prefer it when the chunk widths are not too regularly spaced...
then we get more combinations so we can more closely approximate the
lengths of longer utterances. That's why I sometimes use slightly
random-seeming numbers.
Oh OK. I won't change it then.
just realized I had some pending comments.
egs/iam/s5/run.sh
Outdated
# create a backup directory to store text, utt2spk and image.scp file
mkdir -p $data_dir/train/backup
mv $data_dir/train/text $data_dir/train/utt2spk $data_dir/train/images.scp $data_dir/train/backup/
local/augment_and_make_feature_vect.py $data_dir/train --scale-size 40 --vertical-shift 10 | \
I don't like "vect" for vector, should be "vec". But the name is too long anyway-- call it "augment_and_make_features.py".
egs/iam/s5/run.sh
Outdated
  ark:- ark,scp:$data_dir/test/data/images.ark,$data_dir/test/feats.scp || exit 1
steps/compute_cmvn_stats.sh $data_dir/test || exit 1;

if [ $augment = true ]; then
just do
if $augment; then
(true and false are valid statements that have return codes).
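For illustration, a minimal self-contained sketch of that idiom (the default value and the body here are placeholders, not the recipe's actual commands):

augment=true   # hypothetical default; the real script sets this via options

if $augment; then
  # bash expands $augment and runs its value as a command; the builtins
  # 'true' and 'false' exit with status 0 and 1, so this acts as a boolean test
  echo "running data augmentation"   # placeholder for the real augmentation steps
fi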
@@ -0,0 +1,85 @@
#!/usr/bin/env python
call this make_features.py.
For newly created python scripts I prefer python3. This will reduce our headaches when python2 is no longer supported.
egs/uw3/v1/local/prepare_lm.sh
Outdated
cat $arpa | utils/find_arpa_oovs.pl $lang/words.txt > $tmpdir/oovs.txt

cat $arpa | \
It looks to me like you might have copied an older example here. Right now I believe most of this is done by a single command line involving arpa2fst; look for a more recent example to copy. I think the oovs.txt is no longer needed either -- it's all done via options to arpa2fst now.
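For reference, a hedged sketch of that newer single-command style (the variable names and the #0 disambiguation symbol are assumptions based on the standard lang-directory layout; check a recent recipe before copying):

gunzip -c $arpa_gz | \
  arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang_test/words.txt - $lang_test/G.fst

The --disambig-symbol and --read-symbol-table options take care of the word-symbol mapping and the backoff disambiguation symbol directly.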
I think a preferable way would be to just call utils/format_lm.sh (as I noticed in the previous recipe).
Also, it seems you are oscillating between IRSTLM and pocolm? Is that necessary? Can't you use just one toolkit?
Yes, this is very old. I'll fix it.
Re LM toolkits, I'm not sure which toolkit to use.
Here (i.e. UW3) we just need a simple LM trained on training text.
But in IAM, we have 3 LM sources and pocolm might be more suited.
I think I'm OK with using 2 different LM toolkits since it's 2 different recipes. The only benefit in having a standard is if it's globally enforced, and that isn't practical for various reasons (there just isn't a clear leader).
egs/uw3/v1/local/process_data.py
Outdated
#!/usr/bin/env python

# (Author: Chun Chieh Chang)
A comment saying what this script does would be nice. E.g. what files does it produce and what will their contents look like? Doing this via some kind of doc-string or usage message is fine too, maybe preferable.
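As an illustration only, a hedged sketch of what such a usage message could look like (the output files listed follow the text/utt2spk/images.scp convention used elsewhere in this PR; the argument names are made up, not the actual script's):

#!/usr/bin/env python3
# Hypothetical sketch of a self-documenting data-prep script, not the real process_data.py.
import argparse

parser = argparse.ArgumentParser(
    description="""Creates Kaldi data files for the UW3 corpus. For each line image
                   it writes one entry to each of:
                     text       : <utt-id> <transcription>
                     utt2spk    : <utt-id> <speaker-id>
                     images.scp : <utt-id> <path-to-line-image>""")
parser.add_argument('database_path', type=str, help='path to the downloaded UW3 data')
parser.add_argument('out_dir', type=str, help='directory to write text, utt2spk and images.scp')
args = parser.parse_args()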
egs/iam/s5/run.sh
Outdated
local/iam_train_lm.sh
cp -R $data_dir/lang -T $data_dir/lang_test_corpus
gunzip -k -f data/local/local_lm/data/arpa/3gram_big.arpa.gz
local/prepare_lm.sh data/local/local_lm/data/arpa/3gram_big.arpa $data_dir/lang_test_corpus || exit 1;
This is wasteful of disk space: gunzipping the ARPA file and not deleting it afterwards (why gzip it in the first place if you are going to keep the unzipped one?).
I don't really like the way this is structured, with the prepare_lm.sh script that can either use an existing LM or prepare one. I'd rather have one script to build the LM and one to build the graph; but as I commented in the script, there is a much simpler way to build the graph than what you are doing, it's just a single invocation of arpa2fst now I think.
egs/iam/s5/run.sh
Outdated
local/run_unk_model.sh
fi

num_gauss=10000
Please don't have these variables, just hardcode them in the script invocations.
egs/uw3/v1/run.sh
Outdated
stage=0
nj=30
data_download=data
I think it would be better and clearer to specify something other than data/ for the download location, e.g. at least specify a subdirectory, because it makes it unclear to the user to what extent you can really change this.
Make sure that if the data is already downloaded, the contents of that directory $data_download can be used without any modification, so that a user can use another user's downloaded directory. Otherwise it would be impossible for multiple users to share the same data.
And please make clear what location on the CLSP grid can be used for this, so that if anyone runs it here, they don't have to re-download the data.
egs/uw3/v1/run.sh
Outdated
stage=0
nj=30
data_download=data
data_dir=data
It's nonstandard to have 'data' and 'exp' be variables in the scripts. Even though I can see that it might be useful, I'd rather have this script look like all the other scripts and not use these variables.
egs/iam/s5/local/chain/run_cnn_1a.sh
Outdated
chunk_width=340,300,200,100
num_leaves=500
# we don't need extra left/right context for TDNN systems.
chunk_left_context=32
You shouldn't need this left and right context for CNN systems either as long as you set them up right.
Did you try removing 'required-time-offsets=0' from the 'common' options?
Then the network would require all that context, and no extra context would be required; the needed number of
frames would be added in egs creation.
(However, in this case you should probably be careful to ensure that the first and last frame of each text line
are just the background color, as you'd get the first or last frame repeated and it would otherwise produce a strange image).
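For illustration, a hedged sketch of the kind of xconfig change being suggested (the layer name, offsets, and filter counts below are made up, not this recipe's actual values):

# current style: only the center frame is required, so the chunk needs extra left/right context
common="required-time-offsets=0 height-offsets=-2,-1,0,1,2 num-filters-out=36"
# suggested style: drop required-time-offsets so every time-offset is required,
# and the extra frames are supplied automatically during egs creation
common="height-offsets=-2,-1,0,1,2 num-filters-out=36"
conv-relu-batchnorm-layer name=cnn1 height-in=40 height-out=40 time-offsets=-3,-2,-1,0,1,2,3 $common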
egs/iam/s5/local/score.sh
Outdated
@@ -0,0 +1,154 @@
#!/bin/bash
this should be a symlink to steps/score_kaldi.sh, not a copy
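For reference, a sketch of how such a symlink is usually created (assuming the recipe root already has the standard steps -> ../../wsj/s5/steps link):

cd egs/iam/s5/local
ln -s ../steps/score_kaldi.sh score.sh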
It's because it calls local/unk_arc_post_to_transcription.py somewhere in the scoring pipeline. However, looking at the script, it seems to me that it can be done through hyp_filtering_cmd. @aarora8, could you please make it a symlink and do the "unk scoring" using hyp_filtering_cmd?
OK, it is also converting upper case to lower case. But in the mail response, Paul from RWTH Aachen mentioned that they are not converting upper case to lower case. Should I leave the conversion out of the change as well?
BTW, a lot of the files do not have authors and copyrights -- please add those.
egs/iam/s5/cmd.sh
Outdated
# conf/queue.conf in http://kaldi-asr.org/doc/queue.html for more information,
# or search for the string 'default_config' in utils/queue.pl or utils/slurm.pl.

export cmd="queue.pl"
cmd is definitely weird and non-standard. If you have any use-case for it, please name it more self-descriptively.
Actually I think I am ok with just using "$cmd". The distinction between train_cmd and decode_cmd became less necessary now that we have a common interface for those tools-- we mostly keep them around just out of inertia.
you should probably remove either $cmd, or $train_cmd and $decode_cmd.
egs/iam/s5/local/prepare_lm.sh
Outdated
@@ -0,0 +1,62 @@
#!/bin/bash
I think this file should be replaced by utils/format_lm.sh (or at least most of the things from this file should be replaced by that one)
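For reference, a hedged sketch of a typical utils/format_lm.sh invocation (the lexicon path here is a placeholder; check the script's own usage message for the exact arguments):

utils/format_lm.sh data/lang data/local/local_lm/data/arpa/3gram_big.arpa.gz \
  data/local/dict/lexicon.txt data/lang_test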
cmd=run.pl
stage=0
decode_mbr=false
stats=true
Perhaps you should add stats to scoring_opts as well.
Also, if you don't want to do it the way suggested in score_kaldi_cer.sh (I assume because you want to modify the default lmwts, even though I'm not sure if it's worth the effort), please make sure the stage parameter is processed correctly.
@ChunChiehChang, could you please change this script to follow the instructions mentioned in the header of steps/scoring/score_kaldi_cer.sh?
I agree but now we would have three instead of using just one?
y.
But in this particular example it looks like there would be just one.
The only way to start using just one is to do it example by example-- too
much hassle to upgrade existing scripts.
OK, I thought I saw three different variable definitions. But if there is only one, then I don't have any complaints.
Y
@@ -0,0 +1,105 @@
#!/bin/bash
Should we move this script to steps/nnet3?
See how it differs from steps/nnet3/align_lats.sh.
Actually, they are basically the same; however, steps/nnet3/align_lats.sh was not there at the time. @aarora8, could you please update the chainali recipes to use steps/nnet3/align_lats.sh instead and remove this script?
done, made the change for the pull request.
I'm not going to argue any more, but I don't see a reason why they _should_ be different, given that the recipes are prepared by the same team (and in the same PR) -- I think the more things are idiomatic and consistent across related recipes, the better (of course only where it makes sense).
y.
utt_word_dict = dict()
utt_phone_dict = dict()  # stores utteranceID and phoneID
unk_word_dict = dict()
count = 0
looks crowded. consider adding an empty line here
out_fh = open(args.out_ark, 'wb')

phone_dict = dict()  # stores phoneID and phone mapping
phone_data_vect = phone_fh.read().strip().split("\n")
looks crowded. consider adding an empty line here
key_val = key_val.split(" ")
phone_dict[key_val[1]] = key_val[0]
word_dict = dict()
word_data_vect = word_fh.read().strip().split("\n")
looks crowded. consider adding an empty line here
parser.add_argument('unk', type=str, default='-', help='location of unk file')
parser.add_argument('--input-ark', type=str, default='-', help='where to read the input data')
parser.add_argument('--out-ark', type=str, default='-', help='where to write the output data')
args = parser.parse_args()
looks crowded. consider adding an empty line here
utt_word_dict[uttID] = dict()
utt_phone_dict[uttID] = dict()
utt_word_dict[uttID][count] = word
utt_phone_dict[uttID][count] = phones
looks crowded. consider adding an empty line here
for phone_val in phone_val_vect:
    phone_2_word.append(phone_val.split('_')[0])
phone_2_word = ''.join(phone_2_word)
utt_word_dict[uttID][count] = phone_2_word
This part is the core of this script and definitely needs more explanation, e.g. "Since in OCR the lexicon is purely graphemic, we can just concatenate the phones from the most probable phone sequence given by the unk-model to produce the predicted word", because usually we would need a P2G model in order to map a phone sequence to a predicted word.
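As a minimal illustration of that idea (assuming position-dependent graphemic phones like "h_B e_I y_E"; the variable names are illustrative, not the script's):

# hypothetical example of turning an unk-model phone sequence into a word
phone_seq = "h_B e_I y_E".split()                    # most probable phone sequence for the <unk> region
word = ''.join(p.split('_')[0] for p in phone_seq)   # drop position suffixes, concatenate graphemes
print(word)                                          # -> "hey"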
Also please add some explanation at the header, e.g. "then it will replace the <unk> with the word predicted by <unk> model" => "then it will replace the <unk> with the word predicted by <unk> model by concatenating phones decoded from the unk-model."
Done, thanks.
I haven't merged this because it's still marked WIP. Let me know when you think it's ready.
* OCR: Add IAM corpus with unk decoding support (#3)
* Add a new English OCR database 'UW3'
* Some minor fixes re IAM corpus
* Fix an issue in IAM chain recipes + add a new recipe (#6)
* Some fixes based on the pull request review
* Various fixes + cleaning on IAM
* Fix LM estimation and add extended dictionary + other minor fixes
* Add README for IAM
* Add output filter for scoring
* Fix a bug RE switch to python3
* Add updated results + minor fixes
* Remove unk decoding -- gives almost no gain
* Add UW3 OCR database
* Fix cmd.sh in IAM + fix usages of train/decode_cmd in chain recipes
* Various minor fixes on UW3
* Rename iam/s5 to iam/v1
* Add README file for UW3
* Various cosmetic fixes on UW3 scripts
* Minor fixes in IAM