First of all, thanks for this amazing work benchmarking the various available RNN-T implementations. This is more of a "discussion" than an issue.
I am sure you are aware of this, but the FB speech group uses a kind of "pruned" RNN-T where the pruning is done using external alignments (paper link). The idea is that for each token u in U, you restrict the time steps (< T) at which it can occur, using alignments obtained from, say, a hybrid ASR system. This effectively "prunes" the lattice in a way similar to the k2 pruned RNN-T. I imagine the first-pass "trivial" joiner approximates a similar alignment between T and U, which is then used to prune the lattice.
I was wondering how hard it would be to implement something like alignment-restricted RNN-T in k2, given the pruned framework. From a high-level view, it would basically require using external alignments to prune the lattice, after which the second pass of the loss computation can proceed as before. I am interested because, if we have access to external alignments, training a model on conditions with noise and babble in the background might be easier, since the trivial joiner may have a hard time, especially at the beginning of training.
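To make this concrete, here is a rough sketch of the kind of per-token time restriction I have in mind, in plain PyTorch. Everything here is hypothetical: `align` is an external alignment (one frame index per token), and `b_l`/`b_r` play the role of the left/right context buffers from the Ar-RNN-T paper.

```python
import torch

def ar_rnnt_valid_mask(align: torch.Tensor, T: int,
                       b_l: int = 5, b_r: int = 15) -> torch.Tensor:
    """Validity mask over the (T, U+1) RNN-T lattice for one utterance.

    align: (U,) LongTensor; align[u] is the frame at which token u occurs
        according to the external alignment (e.g. from a hybrid ASR system).
    b_l, b_r: left/right buffers in frames around the aligned frame.
    Node (t, u) means "u tokens have been emitted by frame t".
    """
    U = align.numel()
    lo = (align - b_l).clamp(min=0)      # earliest frame token u may occur
    hi = (align + b_r).clamp(max=T - 1)  # latest frame token u may occur
    t = torch.arange(T).unsqueeze(1)     # (T, 1), broadcasts against (U,)

    valid = torch.ones(T, U + 1, dtype=torch.bool)
    # To have emitted u > 0 tokens by frame t, token u-1 must already
    # have occurred, so we need t >= lo[u-1].
    valid[:, 1:] &= t >= lo
    # If only u < U tokens have been emitted so far, token u must still be
    # emittable, so we need t <= hi[u]; otherwise the node is dead.
    valid[:, :-1] &= t <= hi
    return valid

# e.g. 3 tokens aligned to frames 2, 5, 9 in a 12-frame utterance:
mask = ar_rnnt_valid_mask(torch.tensor([2, 5, 9]), T=12)
```

The surviving nodes form a band around the alignment path, which is what I mean by pruning the lattice with external alignments instead of the trivial joiner.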
I would be happy to hear your thoughts on the matter.
> The idea is that for each token u in U, you restrict the time steps (< T) at which it can occur
In pruned RNN-T, for each time step t we restrict the number of symbols it can emit to S, where S is a fixed parameter, e.g., 3.
In Ar-RNN-T, the number of symbols that can be emitted is different at each time step t.
I am afraid we cannot simply replace the trivial joiner with external alignments.
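To make the difference concrete, here is a toy illustration of the two layouts (not k2's actual API; shapes and numbers are made up):

```python
import torch

T, U, S = 6, 4, 2

# Pruned RNN-T: for each frame t we keep a contiguous window of exactly
# S symbol positions, start[t] .. start[t] + S - 1.  Because S is fixed,
# the pruned joiner output is a dense (T, S, vocab) tensor.
start = torch.tensor([0, 0, 1, 2, 3, 3])              # (T,), monotone
pruned_nodes = start.unsqueeze(1) + torch.arange(S)   # (T, S)

# Ar-RNN-T: the set of surviving u's per frame comes from the external
# alignment, so its size varies with t, e.g.:
ar_keep = [torch.tensor([0]),
           torch.tensor([0, 1]),
           torch.tensor([1, 2]),
           torch.tensor([1, 2, 3]),
           torch.tensor([3, 4]),
           torch.tensor([4])]

# This ragged structure cannot be packed into a fixed (T, S) window:
# padding every frame to max_t |keep(t)| wastes computation, and clipping
# windows to S symbols changes the loss.
```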
Thanks for your comment. Yeah, that was my main concern --- I was not sure how easy it would be to set a variable S per time step in the pruned RNN-T framework.
It seems that Ar-RNN-T also uses the sequence concatenation and function merging from Microsoft's paper, so you are right that optimized_transducer would be a good starting point. I will take a look later this month and try to implement it. I don't have much experience with CUDA, so your help would be much appreciated.
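For reference, the layout I mean by "sequence concatenation" is roughly the following. This is a toy sketch of the unpadded memory layout as I understand it from the Microsoft paper, not optimized_transducer's actual internals:

```python
import torch

# Instead of a padded (B, T_max, U_max + 1, V) joiner output, concatenate
# the per-utterance lattices along a single dimension.
T_lens = torch.tensor([95, 80, 60])   # frames per utterance (made up)
U_lens = torch.tensor([20, 17, 12])   # tokens per utterance (made up)

sizes = T_lens * (U_lens + 1)         # lattice nodes per utterance
offsets = torch.cat([torch.zeros(1, dtype=torch.long), sizes.cumsum(0)])

def node_row(i: int, t: int, u: int) -> int:
    # Row of node (t, u) of utterance i in the flat
    # (sum_i T_i * (U_i + 1), V) joiner output.
    return int(offsets[i]) + t * (int(U_lens[i]) + 1) + u

padded = int(T_lens.max()) * (int(U_lens.max()) + 1) * len(T_lens)
print(int(sizes.sum()), "rows instead of", padded)  # 4215 instead of 5985
```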