
Add OCR/Handwriting Recognition examples #1984

Merged
merged 19 commits into kaldi-asr:master on Jan 4, 2018

Conversation

Contributor

@hhadian hhadian commented Oct 30, 2017

No description provided.

signal(SIGPIPE, SIG_DFL)

parser = argparse.ArgumentParser(
description="""Generates and saves the feature vectors""")
Contributor

It would be good if you added some description of the types of augmentation you are doing in this script.

frame_subsampling_factor=4
alignment_subsampling_factor=1
# training chunk-options
chunk_width=340,300,200,100
Contributor

Is there any reason for this chunk width? Should it be a multiple of 32?

Contributor Author

We can change this to something less unusual like 300,200,100.
Again, the reason for choosing this is that the average number of frames per phone/letter is almost 2 times larger for OCR.

lat_dir=exp/chain${nnet3_affix}/${gmm}_${train_set}_lats
dir=exp/chain${nnet3_affix}/cnn${affix}
train_data_dir=data/${train_set}
lores_train_data_dir=$train_data_dir # for the start, use the same data for gmm and chain
Contributor

Why did you define lores_train_data_dir? Isn't it the same as train_data_dir?

Contributor Author

This was modified from an ASR recipe and we didn't remove this
variable so we could optionally experiment with different resolutions for
the gmm and chain systems. Anyway, the gmm system does not give good results
so I guess we can remove this and focus on the chain setup only.

# chain options
train_stage=-10
xent_regularize=0.1
frame_subsampling_factor=4
Contributor

Does 4 work better than 3 in HWR?

Contributor Author

It has not been tested yet. The reason for choosing
a larger factor was that the average number of frames per
word in OCR (when the line images are scaled to have a height of 40) is almost 2 times that of ASR.

Contributor

The best word error rate with frame subsampling factor (FSF) 4 is slightly better than with 3 or 5: WER with FSF=3 or 5 is close to 14.80%, while with FSF=4 it is close to 14.50%.

FSF   Best WER(%)
3     14.81
4     14.51
5     14.80

import sys
import numpy as np
from scipy import misc
parser = argparse.ArgumentParser(description="""uses phones to convert unk to word""")
Contributor

It would be good to add an extended description for this script and its arguments; this script can be used in other applications.

@danpovey
Contributor

danpovey commented Nov 15, 2017 via email

@hhadian
Contributor Author

hhadian commented Nov 15, 2017

> Actually I prefer it when the chunk widths are not too regularly spaced...
> then we get more combinations so we can more closely approximate the
> lengths of longer utterances. That's why I sometimes use slightly
> random-seeming numbers.

Oh OK. I won't change it then.

Contributor

@danpovey danpovey left a comment


just realized I had some pending comments.

# create a backup directory to store text, utt2spk and image.scp file
mkdir -p $data_dir/train/backup
mv $data_dir/train/text $data_dir/train/utt2spk $data_dir/train/images.scp $data_dir/train/backup/
local/augment_and_make_feature_vect.py $data_dir/train --scale-size 40 --vertical-shift 10 | \
Contributor

I don't like "vect" for vector; it should be "vec". But the name is too long anyway -- call it "augment_and_make_features.py".

ark:- ark,scp:$data_dir/test/data/images.ark,$data_dir/test/feats.scp || exit 1
steps/compute_cmvn_stats.sh $data_dir/test || exit 1;

if [ $augment = true ]; then
Contributor

just do
if $augment; then
(true and false are valid statements that have return codes).
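To illustrate the suggested idiom (a minimal sketch; the variable value and echo are placeholders):

augment=true   # or false, e.g. set from the command line via utils/parse_options.sh

# 'true' and 'false' are ordinary commands with exit codes 0 and 1,
# so the variable itself can be used as the condition:
if $augment; then
  echo "applying data augmentation"
fi
# equivalent to, but shorter than: if [ $augment = true ]; then ... fi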

@@ -0,0 +1,85 @@
#!/usr/bin/env python
Contributor

call this make_features.py.

Contributor

For newly created python scripts I prefer python3. This will reduce our headaches when python2 is no longer supported.


cat $arpa | utils/find_arpa_oovs.pl $lang/words.txt > $tmpdir/oovs.txt

cat $arpa | \
Contributor

It looks to me like you might have copied an older example here. Right now I believe most of this is done by a single command line involving arpa2fst; look for a more recent example to copy. I think the oovs.txt is no longer needed either -- it's all done via options to arpa2fst now.
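For reference, the newer single-command style looks roughly like this (a sketch reusing $arpa and $lang from the snippet above; the output path is only an example):

cat $arpa | \
  arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang/words.txt - $lang/G.fst

This is essentially what utils/format_lm.sh does internally.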

Contributor

I think a preferable way would be to just call utils/format_lm.sh (I noticed this in the previous recipe).

Contributor

Also, it seems you are oscillating between IRSTLM and pocolm? Is that necessary? Can't you just use one toolkit?

Contributor Author

Yes, this is very old. I'll fix it.
Re LM toolkits, I'm not sure which toolkit to use.
Here (i.e. UW3) we just need a simple LM trained on the training text.
But in IAM, we have 3 LM sources and pocolm might be more suited.

Contributor

I think I'm OK with using 2 different LM toolkits since it's 2 different recipes. The only benefit in having a standard is if it's globally enforced, and that isn't practical for various reasons (there just isn't a clear leader).

#!/usr/bin/env python

# (Author: Chun Chieh Chang)

Contributor

A comment saying what this script does would be nice. E.g. what files does it produce and what will their contents look like? Doing this via some kind of doc-string or usage message is fine too, maybe preferable.

local/iam_train_lm.sh
cp -R $data_dir/lang -T $data_dir/lang_test_corpus
gunzip -k -f data/local/local_lm/data/arpa/3gram_big.arpa.gz
local/prepare_lm.sh data/local/local_lm/data/arpa/3gram_big.arpa $data_dir/lang_test_corpus || exit 1;
Contributor

This is wasteful of disk space: gunzipping the ARPA file and not deleting it afterwards (why gzip it in the first place if you are going to keep the unzipped one?).
I don't really like the way this is structured, with the prepare_lm.sh script that can either use an existing LM or prepare one. I'd rather have one script to build the LM and one to build the graph; but as I commented in the script, there is a much simpler way to build the graph than what you are doing -- it's just a single invocation of arpa2fst now, I think.

local/run_unk_model.sh
fi

num_gauss=10000
Contributor

Please don't have these variables, just hardcode them in the script invocations.
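As a hypothetical illustration of the request (the script name and directory arguments are made up for the example, not taken from this recipe; the values 500 and 10000 are the ones defined in the snippets above):

# instead of defining num_leaves=500 and num_gauss=10000 at the top and writing
#   steps/train_deltas.sh ... $num_leaves $num_gauss ...
# pass the literal values at the call site:
steps/train_deltas.sh --cmd "$cmd" 500 10000 data/train data/lang exp/mono_ali exp/tri2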


stage=0
nj=30
data_download=data
Contributor

I think it would be better and clearer to specify something other than data/ for the download location, e.g. at least specify a subdirectory, because it makes it unclear to the user to what extent you can really change this.
Make sure that if the data is already downloaded, the contents of that directory $data_download can be used without any modification, so that a user can use another user's downloaded directory. Otherwise it would be impossible for multiple users to share the same data.
And please make clear what location on the CLSP grid can be used for this, so that if anyone runs it here, they don't have to re-download the data.
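A hedged sketch of a download stage that another user's directory can satisfy unchanged (the URL, subdirectory name and marker file are placeholders):

uw3_url=http://www.example.com/uw3.tgz   # placeholder, not the real URL
if [ ! -f $data_download/uw3/.complete ]; then
  mkdir -p $data_download/uw3
  wget -P $data_download/uw3 $uw3_url
  tar -xzf $data_download/uw3/uw3.tgz -C $data_download/uw3
  touch $data_download/uw3/.complete
fi
# later stages only ever read from $data_download/uw3, so pointing
# $data_download at an already-populated copy (e.g. on the CLSP grid) just works.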

stage=0
nj=30
data_download=data
data_dir=data
Contributor

It's nonstandard to have 'data' and 'exp' be variables in the scripts. Even though I can see that it might be useful, I'd rather have this script look like all the other scripts and not use these variables.

chunk_width=340,300,200,100
num_leaves=500
# we don't need extra left/right context for TDNN systems.
chunk_left_context=32
Contributor

You shouldn't need this left and right context for CNN systems either as long as you set them up right.
Did you try removing 'required-time-offsets=0' from the 'common' options?
Then the network would require all that context, and no extra context would be required; the needed number of
frames would be added in egs creation.
(However, in this case you should probably be careful to ensure that the first and last frame of each text line
are just the background color, as you'd get the first or last frame repeated and it would otherwise produce a strange image).
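A hedged sketch of the change being suggested (the offsets, filter counts and layer name are illustrative, not the recipe's actual network config):

# current style: the conv layers do not require their full context
# (required-time-offsets=0), so chunk_left_context/chunk_right_context
# must supply the extra frames explicitly:
common="required-time-offsets=0 height-offsets=-2,-1,0,1,2 num-filters-out=36"

# suggested style: drop required-time-offsets=0 so each layer requires its
# context; the needed frames are then added automatically during egs creation
# and chunk_left_context/chunk_right_context can stay at 0:
common="height-offsets=-2,-1,0,1,2 num-filters-out=36"
conv-relu-batchnorm-layer name=cnn1 height-in=40 height-out=40 time-offsets=-3,-2,-1,0,1,2,3 $common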

@@ -0,0 +1,154 @@
#!/bin/bash
Contributor

this should be a symlink to steps/score_kaldi.sh, not a copy
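For example, assuming the file under review is local/score.sh and the command is run from the recipe's top-level directory (where steps/ is already a symlink):

rm local/score.sh
ln -s ../steps/score_kaldi.sh local/score.sh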

Contributor Author

It's because it calls local/unk_arc_post_to_transcription.py somewhere
in the scoring pipeline. However, looking at the script,
it seems to me that it can be done
through hyp_filtering_cmd. @aarora8, could you please make it
a symlink and do the "unk scoring" using hyp_filtering_cmd?

Contributor

OK, it is also converting upper case to lower case. But in the email response, Paul from RWTH Aachen mentioned that they are not converting upper case to lower case. Should I leave the conversion out of this change as well?

@jtrmal
Contributor

jtrmal commented Nov 23, 2017

BTW, a lot of the files do not have authors and copyrights -- please add those.

# conf/queue.conf in http://kaldi-asr.org/doc/queue.html for more information,
# or search for the string 'default_config' in utils/queue.pl or utils/slurm.pl.

export cmd="queue.pl"
Contributor

cmd is definitely weird and non-standard. If you have any use-case for it, please name it more self-descriptively.

Contributor

Actually I think I am OK with just using "$cmd". The distinction between train_cmd and decode_cmd has become less necessary now that we have a common interface for those tools -- we mostly keep them around just out of inertia.

Contributor

you should probably remove either $cmd, or $train_cmd and $decode_cmd.
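For reference, the usual cmd.sh convention in other egs directories is just the two variables below (the option values are site-specific examples), so either keep these two or keep a single $cmd, but not both:

export train_cmd="queue.pl --mem 2G"
export decode_cmd="queue.pl --mem 4G"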

@@ -0,0 +1,62 @@
#!/bin/bash
Contributor

I think this file should be replaced by utils/format_lm.sh (or at least most of the things from this file should be replaced by that one)
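For reference, a typical utils/format_lm.sh call looks roughly like this (the lexicon path and output directory are assumptions for the example; the ARPA path is the one used earlier in this recipe):

utils/format_lm.sh data/lang data/local/local_lm/data/arpa/3gram_big.arpa.gz \
  data/local/dict/lexicon.txt data/lang_test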

cmd=run.pl
stage=0
decode_mbr=false
stats=true
Contributor

Perhaps you should add stats to scoring_opts as well.
Also, if you don't want to do it the way suggested in kaldi_score_cer.sh (I assume because you want to modify the default lmwts, even though I'm not sure it's worth the effort), please make sure the stage parameter is processed correctly.

Contributor Author

@ChunChiehChang, could you please change this script
to follow the instructions mentioned in the header of steps/scoring/score_kaldi_cer.sh?

@jtrmal
Contributor

jtrmal commented Nov 23, 2017 via email

@danpovey
Contributor

danpovey commented Nov 23, 2017 via email

@jtrmal
Contributor

jtrmal commented Nov 23, 2017 via email



@@ -0,0 +1,105 @@
#!/bin/bash
Contributor Author

Should we move this script to steps/nnet3?

Contributor

See how it differs from steps/nnet3/align_lats.sh.

Contributor Author

Actually, they are basically the same; however, steps/nnet3/align_lats.sh was not there at the time.
@aarora8, could you please update the chainali recipes to use steps/nnet3/align_lats.sh instead and remove this script?
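For reference, a hedged sketch of the replacement call (the directory names are illustrative, not the recipe's actual paths):

steps/nnet3/align_lats.sh --nj $nj --cmd "$cmd" \
  data/train data/lang exp/chain/cnn_1a exp/chain/cnn_1a_lats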

Contributor

done, made the change for the pull request.

@jtrmal
Contributor

jtrmal commented Nov 24, 2017 via email

utt_word_dict = dict()
utt_phone_dict = dict()# stores utteranceID and phoneID
unk_word_dict = dict()
count=0
Contributor

looks crowded. consider adding an empty line here

out_fh = open(args.out_ark,'wb')

phone_dict = dict()# stores phoneID and phone mapping
phone_data_vect = phone_fh.read().strip().split("\n")
Contributor

looks crowded. consider adding an empty line here

key_val = key_val.split(" ")
phone_dict[key_val[1]] = key_val[0]
word_dict = dict()
word_data_vect = word_fh.read().strip().split("\n")
Contributor

looks crowded. consider adding an empty line here

parser.add_argument('unk', type=str, default='-', help='location of unk file')
parser.add_argument('--input-ark', type=str, default='-', help='where to read the input data')
parser.add_argument('--out-ark', type=str, default='-', help='where to write the output data')
args = parser.parse_args()
Contributor

looks crowded. consider adding an empty line here

utt_word_dict[uttID] = dict()
utt_phone_dict[uttID] = dict()
utt_word_dict[uttID][count] = word
utt_phone_dict[uttID][count] = phones
Contributor

looks crowded. consider adding an empty line here

for phone_val in phone_val_vect:
phone_2_word.append(phone_val.split('_')[0])
phone_2_word = ''.join(phone_2_word)
utt_word_dict[uttID][count] = phone_2_word
Contributor

@xiaohui-zhang xiaohui-zhang Nov 27, 2017


This part is the core of this script and definitely needs more explanation, e.g. "Since in OCR the lexicon is purely graphemic, we can just concatenate the phones from the most probable phone sequence given by the unk-model to produce the predicted word," because usually we need a P2G model in order to map a phone sequence to a predicted word.

Contributor

Also please add some explanation in the header, e.g. change "then it will replace the <unk> with the word predicted by <unk> model" to "then it will replace the <unk> with the word predicted by the <unk> model by concatenating phones decoded from the unk-model."

Contributor

Done, thanks.

@danpovey
Contributor

danpovey commented Jan 4, 2018

I haven't merged this because it's still marked WIP. Let me know when you think it's ready.

@hhadian hhadian changed the title [WIP] Add OCR/Handwriting Recognition examples Add OCR/Handwriting Recognition examples Jan 4, 2018
@danpovey danpovey merged commit 8292e4c into kaldi-asr:master Jan 4, 2018
danpovey pushed a commit to danpovey/kaldi that referenced this pull request Jan 5, 2018
* OCR: Add IAM corpus with unk decoding support (#3)

* Add a new English OCR database 'UW3'

* Some minor fixes re IAM corpus

* Fix an issue in IAM chain recipes + add a new recipe (#6)

* Some fixes based on the pull request review

* Various fixes + cleaning on IAM

* Fix LM estimation and add extended dictionary + other minor fixes

* Add README for IAM

* Add output filter for scoring

* Fix a bug RE switch to pyhton3

* Add updated results + minor fixes

* Remove unk decoding -- gives almost no gain

* Add UW3 OCR database

* Fix cmd.sh in IAM + fix usages of train/decode_cmd in chain recipes

* Various minor fixes on UW3

* Rename iam/s5 to iam/v1

* Add README file for UW3

* Various cosmetic fixes on UW3 scripts

* Minor fixes in IAM
eginhard pushed a commit to eginhard/kaldi that referenced this pull request Jan 11, 2018
mahsa7823 pushed a commit to mahsa7823/kaldi that referenced this pull request Feb 28, 2018
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018