
WIP Sequence training of nnet3 models #3

Closed
wants to merge 159 commits into from

Conversation

vimalmanohar

No description provided.

In src/nnet3/nnet-discriminative-example.h:

        determine the "pinch points".
     */
    void SplitDiscriminativeExample(
        const std::string &name,
@vimalmanohar (Author)

If there are multiple supervisions, then only the supervision object named "name" will be considered when identifying the pinch points.
This is a difference from nnet2. The Excise function also has the same issue.

@danpovey (Owner)

Actually, I know I told you we should support multiple supervision objects,
but on second thoughts, I think it makes sense to allow just one (and of
course store its name). In a multilingual setup, an utterance corresponds
to just one language.

In addition, after thinking about this a bit, I think we're going to have
to make some substantial changes from the 'nnet2' way of doing the
discriminative training. In nnet2 we split the lattices on pinch points,
and I think there was some way of padding them and stitching examples
together. But if we have recurrent architectures that see infinite
context, stitching examples together won't fly, and even padding at the
ends isn't quite right. Also, there is a big cost to using variable-length
egs, because of how the compilation works. So we will need to rely more on
fixed-length egs extracted from the lattice without regard to where the
pinch points lie. This is what I do in the 'chain' models. I use
fixed-length egs (1.5 seconds by default), and discard training utterances
shorter than this. (We can append training data at the data-dir level if
we're concerned about losing too many short utterances; @tomkocse already
wrote a script for this).

So we have to extract fixed-length egs from the lattices. The edge effects
can be handled by using the 'forward' and 'backward' scores of the cut
points as the initial and final-probs. [you can of course renormalize
somehow so the best cost is zero.] Initial-probs can be simulated using
arc probabilities, since the lattice format has final-probs but no
initial-probs. In order to know which frames the acoustic scores
correspond to, the decoder will have to dump in the non-compact lattice
format (--determinize=false), and because this takes up a lot of disk, we
can eventually consider integrating the decoding with the initial phase of
egs-dumping. But for now probably best to just dump the lattices without
determinization.

The initial splitting up of lattices can decide on random fixed-length
pieces of lattice; use the 'SplitIntoRanges' function from the 'chain'
branch. The lattice splitting-up code will be similar to class
SupervisionSplitter in the 'chain' branch, except with more attention to
the initial and final costs. (To do this, in addition to computing the
lattice state times, you'll want to compute the lattice alpha and beta
scores).
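
A minimal sketch of the kind of splitting described above, not Kaldi's actual implementation: the helper name ExtractLatticeRange, and placing the alpha/beta costs in the graph-cost field of LatticeWeight, are illustrative assumptions. It assumes `times` comes from LatticeStateTimes() and that `alpha`/`beta` hold per-state Viterbi forward/backward costs of the full lattice.

    // Cut frames [t_begin, t_end) out of a lattice, turning the cut points
    // into initial/final costs taken from the full lattice's alpha/beta.
    #include <unordered_map>
    #include <vector>
    #include "lat/kaldi-lattice.h"

    void ExtractLatticeRange(const kaldi::Lattice &lat,
                             const std::vector<int32_t> &times,
                             const std::vector<double> &alpha,
                             const std::vector<double> &beta,
                             int32_t t_begin, int32_t t_end,
                             kaldi::Lattice *out) {
      using kaldi::LatticeArc;
      using kaldi::LatticeWeight;
      typedef LatticeArc::StateId StateId;
      out->DeleteStates();
      StateId start = out->AddState();  // one new start state for the segment.
      out->SetStart(start);
      std::unordered_map<StateId, StateId> new_id;
      for (StateId s = 0; s < lat.NumStates(); s++)
        if (times[s] >= t_begin && times[s] <= t_end)
          new_id[s] = out->AddState();
      for (StateId s = 0; s < lat.NumStates(); s++) {
        if (new_id.count(s) == 0) continue;
        if (times[s] == t_begin) {
          // Epsilon arc carrying the forward score as an "initial-prob"
          // (renormalize alpha elsewhere so the best cost is zero). This is
          // the extra arc that makes paths one longer than the frame count.
          out->AddArc(start, LatticeArc(0, 0, LatticeWeight(alpha[s], 0.0),
                                        new_id[s]));
        }
        if (times[s] == t_end) {
          // The backward score becomes the final cost at the cut point.
          out->SetFinal(new_id[s], LatticeWeight(beta[s], 0.0));
          continue;  // arcs crossing the boundary are not copied.
        }
        for (fst::ArcIterator<kaldi::Lattice> aiter(lat, s); !aiter.Done();
             aiter.Next()) {
          const LatticeArc &arc = aiter.Value();
          if (new_id.count(arc.nextstate))
            out->AddArc(new_id[s],
                        LatticeArc(arc.ilabel, arc.olabel, arc.weight,
                                   new_id[arc.nextstate]));
        }
      }
    }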

Obviously this is a slightly bigger project than we thought, now. If you
don't have time, feel free to reconsider.

Dan


@vimalmanohar (Author)

I have some time to work on this project.

If we are not looking at pinch points and are creating only fixed-length segments, then it should not be too difficult to support multiple supervision objects. This might be useful in some situations, like training with MMI and CE objectives together. Microsoft did this in one of their papers to fix issues like getting a large number of deletions and insertions, which we usually have.

If we use lattice forward and backward scores, would we need to update these scores during some of the training iterations since they change when the model gets updated?

Vimal

@vimalmanohar (Author)

Also, we would need different forward and backward scores for the different objectives, right? So each supervision object would be specific to a particular objective because the forward scores for MMI would be different from those for sMBR and MPE.

@danpovey (Owner)

No, we won't be updating the scores. This is a hassle to do and will
hardly change the results.

Just compute the MMI-type scores. The extra MPE-type scores will be set to
zero. The scores won't be specific to the objective.

Dan


@vimalmanohar (Author)

I moved some of the code that's common to chain and sequence training to chain/chain-utils.cc.
I have a DiscriminativeSupervision class that's similar to chain::Supervision, and it's in nnet3/discriminative-supervision.cc. The splitting code works on the Lattice type instead of fst::StdVectorFst.
I see that in the chain code, the normalization FST is constant and is read from disk. The way I am implementing it is that when the lattices are split into 1.5s segments, each of the initial and final states in the segments has its own initial and final weights, which would be the forward and backward scores, respectively, of the unsplit lattice. Is this what you had in mind?
You said the extra MPE-type scores will be set to zero. This would be only an approximation, not exactly correct, right?

@danpovey (Owner)

I moved some of the code that's common to chain and sequence training to
chain/chain-utils.cc.
I have a DiscriminativeSupervision class that's similar to
chain::Supervision, and it's in nnet3/discriminative-supervision.cc. The
splitting code works on the Lattice type instead of fst::StdVectorFst.
I see that in the chain code, the normalization FST is constant and is read
from disk. The way I am implementing it is that when the lattices are split
into 1.5s segments, each of the initial and final states in the segments
has its own initial and final weights, which would be the forward and
backward scores, respectively, of the unsplit lattice. Is this what you had
in mind?

Yes. Of course we'll have to make sure that the acoustic scale used there is
the same one as used in actual training.

You said the extra MPE-type scores will be set to zero. This would be only
an approximation, not exactly correct, right?

Yes. Actually, later on we could investigate making them nonzero, but I
doubt it will make very much difference.
Dan

@vimalmanohar (Author)

There might be an issue when splitting lattices. Since a new state is added to accommodate the initial weights for the states, the length of a path in the lattice will be one more than the number of frames. This will be a problem because we would have to add a dummy to the alignment; otherwise the functions in lattice-functions.cc would not work. At what stage must this be handled? Should there be a variable in the supervision object to identify whether it has undergone splitting, in which case a dummy can be added to the alignment when necessary?

@danpovey (Owner)


I thought the lattice format allowed epsilon arcs?
Even if it does not, fixing this by RmEpsilon is easy.
Dan
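
One way to apply the RmEpsilon fix Dan mentions, as a hedged sketch (assuming Kaldi's Lattice typedef; later in the thread it turns out not to be needed):

    #include "fst/fstlib.h"
    #include "lat/kaldi-lattice.h"

    // Remove the epsilon arcs introduced by the added initial state, then
    // re-sort; many lattice functions assume topological order.
    void RemoveEpsilonsAfterSplit(kaldi::Lattice *lat) {
      fst::RmEpsilon(lat);
      fst::TopSort(lat);
    }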



@vimalmanohar (Author)

Ok, I think it might work. I can do RmEpsilon after the lattice is split.

@danpovey (Owner)

OK, but before doing that, verify that it's even necessary, and let me know
what crashes if you have epsilons. I thought epsilons in Lattices were
supported, but I might be wrong.


@vimalmanohar (Author)

Epsilons are supported in lattices. The discriminative training functions use LatticeStateTimes to get the frame index for a state, so the path length in the lattice must match the number of frames in the alignment.

@danpovey (Owner)

Yes, but LatticeStateTimes doesn't count epsilons when measuring the path
length. If it did, that would be a bug.
Dan
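
An illustrative sketch of the semantics being discussed here; it mirrors what Dan describes rather than claiming to be the actual LatticeStateTimes implementation, and it assumes the lattice is topologically sorted, as Kaldi lattices are:

    #include <cstdint>
    #include <vector>
    #include "lat/kaldi-lattice.h"

    // Assign a frame index to each state; epsilon arcs (ilabel == 0) do not
    // consume a frame, so an epsilon arc added for initial weights does not
    // lengthen the path. Returns the number of frames on a complete path.
    int32_t StateTimesSkippingEpsilons(const kaldi::Lattice &lat,
                                       std::vector<int32_t> *times) {
      using kaldi::LatticeArc;
      times->assign(lat.NumStates(), -1);
      (*times)[lat.Start()] = 0;
      int32_t num_frames = 0;
      for (LatticeArc::StateId s = 0; s < lat.NumStates(); s++) {
        int32_t cur = (*times)[s];
        for (fst::ArcIterator<kaldi::Lattice> aiter(lat, s); !aiter.Done();
             aiter.Next()) {
          const LatticeArc &arc = aiter.Value();
          int32_t next = cur + (arc.ilabel != 0 ? 1 : 0);
          (*times)[arc.nextstate] = next;
          if (next > num_frames) num_frames = next;
        }
      }
      return num_frames;
    }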


@vimalmanohar (Author)

OK, I just checked LatticeStateTimes. It's fine; it does not count the epsilon arcs, so I don't need to do RmEpsilon.

@vimalmanohar (Author)

I added all the discriminative training code from nnet2, including the semi-supervised training stuff. I am now going to write the scripts to test it out.

@danpovey (Owner)

Great, thanks!


@vimalmanohar (Author)

I wrote the scripts and code. I have some questions about the implementation:

  1. In nnet2, the minibatch size was 512. Using this would remove a lot of short utterances. I think this would be a more reasonable number than 1.5s, since we might want more than just a couple of words in the utterance. Should combining data at the script level be made the default in the discriminative training scripts?
  2. All the lattices are kept undeterminized at all times to get forward and backward scores. Is it better to determinize them after splitting the lattices and adding the initial and final scores? This would mean we should not allow splitting again. Perhaps we can add a flag saying whether the lattice is determinized.

@danpovey (Owner)

danpovey commented Jan 2, 2016

I wrote the scripts and code. I have some questions about the
implementation:

  1. In nnet2, the minibatch size was 512. Using this would remove a lot of
    short utterances. I think this would be a more reasonable number than 1.5s,
    since we might want more than just a couple of words in the utterance.

Bear in mind that it could make a substantial difference to the objective
function. You'll have to tune this. In most databases you'll lose <1% of the
data by truncating at 1.5 secs.

Should combining data at the script level be made the default in the
discriminative training scripts?

Do some experimentation before you decide.

  2. All the lattices are kept undeterminized at all times to get forward and
    backward scores. Is it better to determinize them after splitting the
    lattices and adding the initial and final scores? This would mean we
    should not allow splitting again. Perhaps we can add a flag saying whether
    the lattice is determinized.

This probably makes sense; it would make the resulting lattice have fewer
states. But make it optional and try it both ways: determinizing will end up
removing duplicate paths and alternative alignments of the original data, so
it would make a difference. It would probably make sense to also try the same
recipe with lattices that were determinized when dumped and were then subject
to acoustic lattice rescoring to get the correct per-frame log-likelihoods
(this may require a new binary). If it turns out that it's better to
determinize at the start, we can find a way to do this without evaluating the
neural network twice.

Dan


In src/nnet3bin/nnet3-discriminative-merge-egs.cc:

    examples.back() = cur_eg;

    bool minibatch_ready =
        static_cast<int32>(examples.size()) >= minibatch_size;
@vimalmanohar (Author)

minibatch_size is measured in terms of the number of examples rather than the number of output frames. This is the same as in the chain code. Is there a reason why this is preferred? What should minibatch_size be if the examples are 1.5s long? The default in the chain code was 64.

@danpovey (Owner)

Generally we like the number of sequences in the minibatch to be a power of
two, and preferably a multiple of 64 (for reasons relating to NVidia board
architecture). This is easier to ensure if we set it absolutely, not as a
number of frames.
Dan
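
As a quick worked example (not from the thread; the usual 10 ms frame shift, i.e. 100 output frames per second, is an assumed default): a 1.5 s example is about 150 frames, so a minibatch of 64 sequences covers roughly 64 × 150 = 9600 output frames, and 128 sequences about 19200.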


vimalmanohar and others added 25 commits January 31, 2016 14:02
danpovey pushed a commit that referenced this pull request May 2, 2016
changed beam to 11, commented online decoding block and added online decoding results to 6v_sp script.
danpovey pushed a commit that referenced this pull request Dec 8, 2016
added the option trainer.deriv-truncate-margin to train_rnn.py and tr…
danpovey pushed a commit that referenced this pull request Jan 5, 2018
* OCR: Add IAM corpus with unk decoding support (#3)

* Add a new English OCR database 'UW3'

* Some minor fixes re IAM corpus

* Fix an issue in IAM chain recipes + add a new recipe (#6)

* Some fixes based on the pull request review

* Various fixes + cleaning on IAM

* Fix LM estimation and add extended dictionary + other minor fixes

* Add README for IAM

* Add output filter for scoring

* Fix a bug RE switch to pyhton3

* Add updated results + minor fixes

* Remove unk decoding -- gives almost no gain

* Add UW3 OCR database

* Fix cmd.sh in IAM + fix usages of train/decode_cmd in chain recipes

* Various minor fixes on UW3

* Rename iam/s5 to iam/v1

* Add README file for UW3

* Various cosmetic fixes on UW3 scripts

* Minor fixes in IAM