
WIP Sequence training of nnet3 models #3

Closed
wants to merge 159 commits into from

Conversation

vimalmanohar

No description provided.

In src/nnet3/nnet-discriminative-example.h:

        determine the "pinch points".
     */
    void SplitDiscriminativeExample(
        const std::string &name,
@vimalmanohar (Author)

If there are multiple supervisions, then only the supervision object named "name" will be considered when identifying the pinch points.
This is a difference from nnet2. The Excise function also has the same issue.

@danpovey (Owner)

Actually, I know I told you we should support multiple supervision objects,
but on second thoughts, I think it makes sense to allow just one (and of
course store its name). In a multilingual setup, an utterance corresponds
to just one language.

In addition, after thinking about this a bit, I think we're going to have
to make some substantial changes from the 'nnet2' way of doing the
discriminative training. In nnet2 we split the lattices on pinch points,
and I think there was some way of padding them and stitching examples
together. But if we have recurrent architectures that see infinite
context, stitching examples together won't fly, and even padding at the
ends isn't quite right. Also, there is a big cost to using variable-length
egs, because of how the compilation works. So we will need to rely more on
fixed-length egs extracted from the lattice without regard to where the
pinch points lie. This is what I do in the 'chain' models. I use
fixed-length egs (1.5 seconds by default), and discard training utterances
shorter than this. (We can append training data at the data-dir level if
we're concerned about losing too many short utterances; @tomkocse already
wrote a script for this).

So we have to extract fixed-length egs from the lattices. The edge effects
can be handled by using the 'forward' and 'backward' scores of the cut
points as the initial and final-probs. [you can of course renormalize
somehow so the best cost is zero.] Initial-probs can be simulated using
arc probabilities, since the lattice format has final-probs but no
initial-probs. In order to know which frames the acoustic scores
correspond to, the decoder will have to dump in the non-compact lattice
format (--determinize=false), and because this takes up a lot of disk, we
can eventually consider integrating the decoding with the initial phase of
egs-dumping. But for now probably best to just dump the lattices without
determinization.

The initial splitting up of lattices can decide on random fixed-length
pieces of lattice; use the 'SplitIntoRanges' function from the 'chain'
branch. The lattice splitting-up code will be similar to class
SupervisionSplitter in the 'chain' branch, except with more attention to
the initial and final costs. (To do this, in addition to computing the
lattice state times, you'll want to compute the lattice alpha and beta
scores).
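
A minimal sketch of the kind of splitting described above, not Kaldi's actual implementation: the helper name ExtractLatticeRange, and placing the alpha/beta costs in the graph-cost field of LatticeWeight, are illustrative assumptions. It assumes `times` comes from LatticeStateTimes() and that `alpha`/`beta` hold per-state Viterbi forward/backward costs of the full lattice.

    // Cut frames [t_begin, t_end) out of a lattice, turning the cut points
    // into initial/final costs taken from the full lattice's alpha/beta.
    #include <unordered_map>
    #include <vector>
    #include "lat/kaldi-lattice.h"

    void ExtractLatticeRange(const kaldi::Lattice &lat,
                             const std::vector<int32_t> &times,
                             const std::vector<double> &alpha,
                             const std::vector<double> &beta,
                             int32_t t_begin, int32_t t_end,
                             kaldi::Lattice *out) {
      using kaldi::LatticeArc;
      using kaldi::LatticeWeight;
      typedef LatticeArc::StateId StateId;
      out->DeleteStates();
      StateId start = out->AddState();  // one new start state for the segment.
      out->SetStart(start);
      std::unordered_map<StateId, StateId> new_id;
      for (StateId s = 0; s < lat.NumStates(); s++)
        if (times[s] >= t_begin && times[s] <= t_end)
          new_id[s] = out->AddState();
      for (StateId s = 0; s < lat.NumStates(); s++) {
        if (new_id.count(s) == 0) continue;
        if (times[s] == t_begin) {
          // Epsilon arc carrying the forward score as an "initial-prob"
          // (renormalize alpha elsewhere so the best cost is zero). This is
          // the extra arc that makes paths one longer than the frame count.
          out->AddArc(start, LatticeArc(0, 0, LatticeWeight(alpha[s], 0.0),
                                        new_id[s]));
        }
        if (times[s] == t_end) {
          // The backward score becomes the final cost at the cut point.
          out->SetFinal(new_id[s], LatticeWeight(beta[s], 0.0));
          continue;  // arcs crossing the boundary are not copied.
        }
        for (fst::ArcIterator<kaldi::Lattice> aiter(lat, s); !aiter.Done();
             aiter.Next()) {
          const LatticeArc &arc = aiter.Value();
          if (new_id.count(arc.nextstate))
            out->AddArc(new_id[s],
                        LatticeArc(arc.ilabel, arc.olabel, arc.weight,
                                   new_id[arc.nextstate]));
        }
      }
    }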

Obviously this is a slightly bigger project than we thought, now. If you
don't have time, feel free to reconsider.

Dan


@vimalmanohar (Author)

I have some time to work on this project.

If we are not looking at pinch points and are creating only fixed-length segments, then it should not be too difficult to support multiple supervision objects. This might be useful in some situations, like training with MMI and CE objectives together. Microsoft did this in one of their papers to fix issues like getting a large number of deletions and insertions, which we usually have.

If we use lattice forward and backward scores, would we need to update these scores during some of the training iterations since they change when the model gets updated?

Vimal

@vimalmanohar (Author)

Also, we would need different forward and backward scores for the different objectives, right? So each supervision object would be specific to a particular objective because the forward scores for MMI would be different from those for sMBR and MPE.

@danpovey (Owner)

No, we won't be updating the scores. This is a hassle to do and will
hardly change the results.

Just compute the MMI-type scores. The extra MPE-type scores will be set to
zero. The scores won't be specific to the objective.

Dan


@vimalmanohar (Author)

I moved some of the code that's common to chain and sequence training to chain/chain-utils.cc.
I have a DiscriminativeSupervision class that's similar to chain::Supervision, and it's in nnet3/discriminative-supervision.cc. The splitting code works on the Lattice type instead of fst::StdVectorFst.
I see that in the chain code, the normalization FST is constant and is read from disk. The way I am implementing it is that when the lattices are split into 1.5s segments, each of the initial and final states in the segments has its own initial and final weights, which would be the forward and backward scores, respectively, of the unsplit lattice. Is this what you had in mind?
You said the extra MPE-type scores will be set to zero. This would be only an approximation, not exactly correct, right?

@danpovey (Owner)

I moved some of the code that's common to chain and sequence training to
chain/chain-utils.cc.
I have a DiscriminativeSupervision class that's similar to
chain::Supervision, and it's in nnet3/discriminative-supervision.cc. The
splitting code works on the Lattice type instead of fst::StdVectorFst.
I see that in the chain code, the normalization FST is constant and is read
from disk. The way I am implementing it is that when the lattices are split
into 1.5s segments, each of the initial and final states in the segments
has its own initial and final weights, which would be the forward and
backward scores, respectively, of the unsplit lattice. Is this what you had
in mind?

Yes. Of course we'll have to make sure that the acoustic scale used there is
the same one as used in actual training.

You said the extra MPE-type scores will be set to zero. This would be only
an approximation, not exactly correct, right?

Yes. Actually, later on we could investigate making them nonzero, but I
doubt it will make very much difference.
Dan

@vimalmanohar (Author)

There might be an issue when splitting lattices. Since a new state is added to accommodate the initial weights for the states, the length of a path in the lattice will be one more than the number of frames. This will be a problem because we would have to add a dummy to the alignment; otherwise the functions in lattice-functions.cc would not work. At what stage must this be handled? Should there be a variable in the supervision object to identify whether it has undergone splitting, in which case a dummy can be added to the alignment when necessary?

@danpovey (Owner)


I thought the lattice format allowed epsilon arcs?
Even if it does not, fixing this by RmEpsilon is easy.
Dan
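
One way to apply the RmEpsilon fix Dan mentions, as a hedged sketch (assuming Kaldi's Lattice typedef; later in the thread it turns out not to be needed):

    #include "fst/fstlib.h"
    #include "lat/kaldi-lattice.h"

    // Remove the epsilon arcs introduced by the added initial state, then
    // re-sort; many lattice functions assume topological order.
    void RemoveEpsilonsAfterSplit(kaldi::Lattice *lat) {
      fst::RmEpsilon(lat);
      fst::TopSort(lat);
    }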



@vimalmanohar (Author)

Ok, I think it might work. I can do RmEpsilon after the lattice is split.

@danpovey (Owner)

OK, but before doing that, verify that it's even necessary, and let me know
what crashes if you have epsilons. I thought epsilons in Lattices were
supported, but I might be wrong.


@vimalmanohar (Author)

Epsilons are supported in lattices. The discriminative training functions use LatticeStateTimes to get the frame index for a state, so the path length in the lattice must match the number of frames in the alignment.

@danpovey (Owner)

Yes, but LatticeStateTimes doesn't count epsilons when measuring the path
length. If it did, that would be a bug.
Dan
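
An illustrative sketch of the semantics being discussed here; it mirrors what Dan describes rather than claiming to be the actual LatticeStateTimes implementation, and it assumes the lattice is topologically sorted, as Kaldi lattices are:

    #include <cstdint>
    #include <vector>
    #include "lat/kaldi-lattice.h"

    // Assign a frame index to each state; epsilon arcs (ilabel == 0) do not
    // consume a frame, so an epsilon arc added for initial weights does not
    // lengthen the path. Returns the number of frames on a complete path.
    int32_t StateTimesSkippingEpsilons(const kaldi::Lattice &lat,
                                       std::vector<int32_t> *times) {
      using kaldi::LatticeArc;
      times->assign(lat.NumStates(), -1);
      (*times)[lat.Start()] = 0;
      int32_t num_frames = 0;
      for (LatticeArc::StateId s = 0; s < lat.NumStates(); s++) {
        int32_t cur = (*times)[s];
        for (fst::ArcIterator<kaldi::Lattice> aiter(lat, s); !aiter.Done();
             aiter.Next()) {
          const LatticeArc &arc = aiter.Value();
          int32_t next = cur + (arc.ilabel != 0 ? 1 : 0);
          (*times)[arc.nextstate] = next;
          if (next > num_frames) num_frames = next;
        }
      }
      return num_frames;
    }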


@vimalmanohar (Author)

OK, I just checked LatticeStateTimes. It's fine; it does not count the epsilon arcs, so I don't need to do RmEpsilon.

@vimalmanohar (Author)

I added all the discriminative training code from nnet2, including the semi-supervised training stuff. I am now going to write the scripts to test it out.

@danpovey (Owner)

Great, thanks!


@vimalmanohar (Author)

I wrote the scripts and code. I have some questions about the implementation:

  1. In nnet2, the minibatch size was 512. Using this would remove a lot of short utterances. I think this would be a more reasonable number than 1.5s, since we might want more than just a couple of words in the utterance. Should combining data at the script level be made the default in the discriminative training scripts?
  2. All the lattices are kept undeterminized at all times to get forward and backward scores. Is it better to determinize them after splitting the lattices and adding the initial and final scores? This would mean we should not allow splitting again. Perhaps we can add a flag saying whether the lattice is determinized.

@danpovey (Owner)

danpovey commented Jan 2, 2016

I wrote the scripts and code. I have some questions about the
implementation:

  1. In nnet2, the minibatch size was 512. Using this would remove a lot of
    short utterances. I think this would be a more reasonable number than 1.5s,
    since we might want more than just a couple of words in the utterance.

Bear in mind that it could make a substantial difference to the objective
function. You'll have to tune this. In most databases you'll lose <1% of the
data by truncating at 1.5 secs.

Should combining data at the script level be made the default in the
discriminative training scripts?

Do some experimentation before you decide.

  2. All the lattices are kept undeterminized at all times to get forward and
    backward scores. Is it better to determinize them after splitting the
    lattices and adding the initial and final scores? This would mean we
    should not allow splitting again. Perhaps we can add a flag saying whether
    the lattice is determinized.

This probably makes sense; it would make the resulting lattice have fewer
states. But make it optional and try it both ways: determinizing will end up
removing duplicate paths and alternative alignments of the original data, so
it would make a difference. It would probably make sense to also try the same
recipe with lattices that were determinized when dumped and were then subject
to acoustic lattice rescoring to get the correct per-frame log-likelihoods
(this may require a new binary). If it turns out that it's better to
determinize at the start, we can find a way to do this without evaluating the
neural network twice.

Dan


In src/nnet3bin/nnet3-discriminative-merge-egs.cc:

    examples.back() = cur_eg;

    bool minibatch_ready =
        static_cast<int32>(examples.size()) >= minibatch_size;
@vimalmanohar (Author)

minibatch_size is measured in terms of the number of examples rather than the number of output frames. This is the same as in the chain code. Is there a reason why this is preferred? What should minibatch_size be if the examples are 1.5s long? The default in the chain code was 64.

@danpovey (Owner)

Generally we like the number of sequences in the minibatch to be a power of
two, and preferably a multiple of 64 (for reasons relating to NVidia board
architecture). This is easier to ensure if we set it absolutely, not as a
number of frames.
Dan
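
As a quick worked example (not from the thread; the usual 10 ms frame shift, i.e. 100 output frames per second, is an assumed default): a 1.5 s example is about 150 frames, so a minibatch of 64 sequences covers roughly 64 × 150 = 9600 output frames, and 128 sequences about 19200.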


vimalmanohar and others added 25 commits January 31, 2016 14:02
danpovey pushed a commit that referenced this pull request May 2, 2016
changed beam to 11, commented online decoding block and added online decoding results to 6v_sp script.
danpovey pushed a commit that referenced this pull request Dec 8, 2016
added the option trainer.deriv-truncate-margin to train_rnn.py and tr…
danpovey pushed a commit that referenced this pull request Jan 5, 2018
* OCR: Add IAM corpus with unk decoding support (#3)

* Add a new English OCR database 'UW3'

* Some minor fixes re IAM corpus

* Fix an issue in IAM chain recipes + add a new recipe (#6)

* Some fixes based on the pull request review

* Various fixes + cleaning on IAM

* Fix LM estimation and add extended dictionary + other minor fixes

* Add README for IAM

* Add output filter for scoring

* Fix a bug RE switch to pyhton3

* Add updated results + minor fixes

* Remove unk decoding -- gives almost no gain

* Add UW3 OCR database

* Fix cmd.sh in IAM + fix usages of train/decode_cmd in chain recipes

* Various minor fixes on UW3

* Rename iam/s5 to iam/v1

* Add README file for UW3

* Various cosmetic fixes on UW3 scripts

* Minor fixes in IAM