
Alignment-restricted RNNT #14

Open
desh2608 opened this issue Oct 13, 2022 · 2 comments

Comments

@desh2608

First of all, thanks for this amazing work benchmarking the various available RNNT implementations. This is more of a "discussion" than an issue.

I am sure you are aware of this, but the FB speech group uses a kind of "pruned" RNNT where the pruning is done using external alignments (paper link). The idea is that for each token u in U, you restrict the time-steps (< T) at which it can occur, using alignments obtained from, say, a hybrid ASR system. This effectively "prunes" the lattice in a way similar to the k2 pruned RNNT. I imagine that the first-pass "trivial" joiner is approximating a similar alignment between T and U, which is then used to prune the lattice.
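To illustrate the idea, here is a rough sketch of how an external alignment could be turned into a validity mask over the (T, U) lattice, with left/right context buffers around each token's aligned frame, roughly in the spirit of the Ar-RNN-T paper. The function name, signature, and buffer semantics are all hypothetical, not taken from any existing implementation:

```python
import numpy as np

def ar_rnnt_mask(T, U, align, left=1, right=1):
    """Build a boolean (T, U) mask from an external alignment.

    align[u] gives the frame index at which token u occurs according to
    the external (e.g. hybrid ASR) alignment; token u is then allowed
    only within [align[u] - left, align[u] + right].  Hypothetical sketch.
    """
    mask = np.zeros((T, U), dtype=bool)
    for u, t_u in enumerate(align):
        lo = max(0, t_u - left)
        hi = min(T - 1, t_u + right)
        mask[lo:hi + 1, u] = True
    return mask

# Example: 5 frames, 3 tokens aligned to frames 0, 2, 4.
mask = ar_rnnt_mask(5, 3, [0, 2, 4], left=1, right=1)
```

The loss computation would then only evaluate (and backpropagate through) lattice nodes where the mask is True, which is what makes the restriction a form of pruning.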

I was wondering how hard it would be to implement something like alignment-restricted RNNT in k2, given the pruned framework. From a high-level view, it would basically require using external alignments to prune the lattice, after which the second pass of the loss computation can proceed as before. I am interested because I think that, if we have access to external alignments, training a model on conditions involving background noise and babble might be easier, since the trivial joiner may have a hard time, especially at the beginning of training.

I would be happy to hear your thoughts on the matter.

@csukuangfj
Owner

http://arxiv.org/abs/2011.03072
As you commented:

The idea is that for each token u in U, you restrict the time-steps (< T) that it can occur in

In pruned RNN-T, we use:

For each time step t, we restrict the number of symbols it can emit to S, where S is a fixed parameter, e.g., 3

In Ar-RNN-T, the number of symbols that can be emitted differs at each time step t.
I am afraid we cannot simply replace the trivial joiner with external alignments.
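To make the contrast concrete, here is a small NumPy sketch with a hypothetical alignment-derived mask: in pruned RNN-T, every frame is given the same fixed symbol budget S, whereas an Ar-RNN-T mask generally allows a different number of symbols at each frame, which is why it does not fit the fixed-S pruned framework directly:

```python
import numpy as np

def symbols_per_frame(mask):
    """Count how many symbols each frame t is allowed to emit
    under a (T, U) validity mask."""
    return mask.sum(axis=1)

# Hypothetical Ar-RNN-T style mask over a (T=4, U=3) lattice.
mask = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
], dtype=bool)

print(symbols_per_frame(mask).tolist())  # [1, 2, 2, 2] -- varies per frame
```

In pruned RNN-T this per-frame count would be the same constant S (e.g., 3) for all t, so the pruned ranges can be stored as a dense (T, S) tensor; a variable per-frame budget breaks that layout.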


To implement Ar-RNN-T, I suggest using
https://github.com/csukuangfj/optimized_transducer
as a starting point.

I would like to help with it.

@desh2608
Author

Thanks for your comment. Yeah, that was my main concern --- I was not sure how easy it would be to set a variable S per time step in the pruned RNNT framework.

It seems that AR-RNNT also uses the sequence concatenation and function merging from Microsoft's paper, so you are right that optimized_transducer would be a good starting point. I will take a look later this month and try to implement it. I don't have a lot of experience with CUDA so your help would be much appreciated.
