p(blank symbol) >> p(non-blank symbol) during NN-CTC training #3
Comments
Hmmm, we definitely observe character probabilities spiking above blank probabilities at many time steps. There is an imbalance issue, though: blanks are much more frequent than all other characters. Not sure why MLP+CNN wouldn't do as well w/o more details (are you providing sufficient temporal context?). That said, at convergence your negative log-likelihood cost looks too high; we get < 50 using about 20 million parameters.
I suspect this is due to underfitting. The network always learns first that ...
Thanks for your comments, @zxie and @amaas. I use a 21-frame context window (frame length: 25 ms, frame shift: 10 ms). The MLP architecture is 840 (40 FBANK x 21 CW) - 1024 - 1024 - 1024 - 31 (~3M params). I use the usual momentum (0.9) and weight decay (0.0005) during training. From your comments, it seems the trained network underfits (the average log-likelihood is not high enough). What would you suggest trying next? Should I try an MLP with more parameters, an RNN (more expressive for sequential data), or more training iterations?
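For reference, the sizes quoted above can be checked with a few lines of arithmetic. This is only a sketch: the helper name `mlp_param_count` is made up for illustration, and it assumes plain fully connected layers with one bias per output unit, which the comment does not state explicitly.

```python
def mlp_param_count(layer_sizes):
    """Count weights plus biases of a fully connected network (assumed layout)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

context_frames = 21   # context window, in frames
fbank_dims = 40       # filterbank features per frame
frame_len_ms = 25
frame_shift_ms = 10

# Input dimension of the spliced feature vector.
input_dim = context_frames * fbank_dims

# Audio covered by one context window: 20 shifts plus one full frame.
span_ms = (context_frames - 1) * frame_shift_ms + frame_len_ms

layers = [input_dim, 1024, 1024, 1024, 31]
n_params = mlp_param_count(layers)

print(input_dim, span_ms, n_params)
```

Under these assumptions the count comes out just under 3 million, consistent with the "~3M params" figure, and each framewise prediction sees roughly a quarter of a second of audio.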
Your MLP gives framewise predictions, correct? Could you detail how your cost is computed w.r.t. the desired character sequence? Are you just using (T - CW) CNN-MLPs (w/ shared parameters), where T denotes the number of input frames?
Yes, the MLP gives framewise character predictions. I am basically using MLP-CTC (MLP: 840 (40 FBANK x 21 CW) - 1024 - 1024 - 1024 - 31). I also tried a CNN instead of the MLP to produce the framewise predictions. The objective function (to be maximized) is the log-likelihood of the transcription given the input, computed per utterance.
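To make the objective concrete, here is a minimal sketch of the standard CTC forward recursion applied to framewise log-probabilities. It is not the code used in this thread; the function names and the choice of index 0 for the blank are assumptions for illustration, and it uses plain Python lists rather than any particular toolkit.

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over a few scalars."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_log_likelihood(log_probs, labels, blank=0):
    """log p(labels | input) via the CTC forward recursion.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    labels: target symbol indices, without blanks.
    """
    # Extended target: blank, l1, blank, l2, ..., blank.
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    n_states = len(ext)

    # Initialization: a path may start in the first blank or the first label.
    alpha = [NEG_INF] * n_states
    alpha[0] = log_probs[0][ext[0]]
    if n_states > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new_alpha = [NEG_INF] * n_states
        for s in range(n_states):
            terms = [alpha[s]]                 # stay in the same state
            if s >= 1:
                terms.append(alpha[s - 1])     # advance by one state
            # Skip over a blank only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new_alpha[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new_alpha

    # A valid path ends in the last label or the trailing blank.
    if n_states > 1:
        return logsumexp(alpha[-1], alpha[-2])
    return alpha[-1]

# Two frames, two symbols (0 = blank, 1 = "a"), uniform outputs.
uniform = [[math.log(0.5), math.log(0.5)]] * 2
ll = ctc_log_likelihood(uniform, [1])
print(math.exp(ll))
```

In the toy example three of the four possible two-frame paths collapse to "a", so the likelihood is 0.75. Note that the recursion only sums over alignments; each frame's distribution is still produced independently by the MLP.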
As a sanity check I would try increasing layer sizes to 2048 and training ...
If I'm understanding correctly, not having recurrent connections could also be an issue... it's a big ask to have each MLP produce the right prediction independently of the others, without any sequential reasoning.
Did you ever solve the issue? I have the same problem at the moment: the network is outputting all blanks.
Do you scale your outputs by the prior probabilities? The blank symbol occurs much more often than the other symbols.
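One possible reading of this suggestion is to divide each frame's posteriors by per-symbol priors and renormalize before decoding. The sketch below assumes that reading; the helper names are hypothetical, and estimating the priors by averaging the network's own outputs is just one common choice, not something stated in the thread.

```python
def estimate_priors(posteriors):
    """Average the framewise posteriors to get a per-symbol prior (assumed estimator)."""
    n_frames = len(posteriors)
    n_symbols = len(posteriors[0])
    return [sum(frame[k] for frame in posteriors) / n_frames
            for k in range(n_symbols)]

def scale_by_priors(posteriors, priors):
    """Divide each posterior by its symbol prior, then renormalize per frame."""
    scaled_frames = []
    for frame in posteriors:
        scaled = [p / q for p, q in zip(frame, priors)]
        total = sum(scaled)
        scaled_frames.append([s / total for s in scaled])
    return scaled_frames

# Toy outputs: index 0 is the blank, index 1 a single character.
posteriors = [[0.9, 0.1],
              [0.6, 0.4]]
priors = estimate_priors(posteriors)
scaled = scale_by_priors(posteriors, priors)
print(priors, scaled)
```

In this toy case the blank dominates both raw frames, but after scaling the character wins in the second frame, which is the kind of rebalancing the question is hinting at.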
Hi all
I want to discuss an issue with training DNN/CNN-CTC for speech recognition (Wall Street Journal corpus). I model the output units as characters.
I observed that the CTC objective function increased during training and finally converged.
But I also observed that the final NN outputs have a clear tendency: p(blank symbol) >> p(non-blank symbol) for all speech time frames, as in the following figure.
In Alex Graves' paper, a trained RNN has high p(non-blank) at some points, as in the following figure.
Do you see the same situation when you train NN-CTC for sequence labeling problems? I suspect the reason is that I use an MLP/CNN instead of an RNN, but I can't clearly explain why this would be the cause.
Any ideas about this result?
Thank you for reading my question.