State of the Art for conformer and beam decoding #106
Hi @abhigarg-iitk
Thanks
@abhigarg-iitk I'm also planning to use the batch dimension directly in the decoding; I'll find a way to do that ASAP.
Hi @abhigarg-iitk
Another improvement could come from the vocabulary. The current example uses a vocab size of 1000, but maybe a bigger vocab could help. I tried with a vocab of size 8000 but the training does not fit in memory.
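In case it helps anyone reproduce that experiment: the vocab size is fixed when the word-piece model is trained, so changing it means rebuilding the subword vocabulary. A minimal SentencePiece sketch follows, assuming a plain-text transcript file; this is not necessarily how this repo builds its subwords, and the file name and model prefix are placeholders.

```python
import sentencepiece as spm

# Train a unigram word-piece model; vocab_size is the knob discussed above.
# "librispeech_transcripts.txt" and the model prefix are placeholders.
spm.SentencePieceTrainer.train(
    input="librispeech_transcripts.txt",
    model_prefix="conformer_wp_8k",
    vocab_size=8000,        # the current example uses 1000; 8000 blew up memory
    model_type="unigram",
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="conformer_wp_8k.model")
print(sp.encode("beam search with a bigger vocab", out_type=str))
```

Note that the memory blow-up at training time mostly comes from the transducer loss needing a [batch, T, U, vocab] logits tensor, which grows linearly with the vocab, so 8000 is much heavier than 1000.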
Hi @gandroz, in my opinion:
I think the first statement might be true. Unless we have shallow fusion with an LM, beam search might not be that effective; for reference, see Table 3 in this work. Although #123 has a good point, maybe we can have a look at some of the standard beam searches of ESPnet.
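For reference, shallow fusion just adds a weighted LM log-probability to each hypothesis's transducer score when ranking candidates. A rough sketch of the rescoring step, where `lm_log_prob` and `lm_weight` are placeholders rather than anything in this repo:

```python
def fuse_lm_scores(beam_hypotheses, lm_log_prob, lm_weight=0.3):
    """Shallow fusion: score = log P_asr(y|x) + lm_weight * log P_lm(y).

    beam_hypotheses: list of (token_ids, asr_log_prob) pairs from beam search.
    lm_log_prob:     placeholder callable mapping token_ids -> LM log-probability.
    """
    fused = [
        (tokens, asr_score + lm_weight * lm_log_prob(tokens))
        for tokens, asr_score in beam_hypotheses
    ]
    # Re-rank: the best hypothesis is now the one with the highest fused score
    return sorted(fused, key=lambda h: h[1], reverse=True)
```

In full shallow fusion the LM score is usually added at every expansion step inside the beam search rather than only at the end; the sketch above is the simpler n-best rescoring variant.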
Although the conformer paper doesn't explicitly mention the vocab size, the contextnet paper mentions using a 1k word-piece model, and I assume conformer might be using the same vocab. Moreover, maybe we can infer the vocab size from the number of parameters mentioned in the conformer paper.
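On inferring the vocab from the parameter count: only a couple of pieces of an RNN-T style model scale with the vocabulary (the prediction-net embedding table and the joint output layer), so a back-of-envelope like the one below shows how much different vocab sizes would move the total count. The dimensions are illustrative, not the paper's actual values.

```python
def vocab_dependent_params(vocab_size, embed_dim=320, joint_dim=640):
    """Parameters that grow with vocab size in an RNN-T style model:
    the prediction-net embedding table and the joint network's output layer.
    embed_dim / joint_dim are illustrative, not the conformer paper's values."""
    embedding = vocab_size * embed_dim
    joint_output = joint_dim * vocab_size + vocab_size  # weights + bias
    return embedding + joint_output

for v in (1000, 4000, 8000, 16000):
    print(f"vocab {v:>6}: ~{vocab_dependent_params(v) / 1e6:.2f}M vocab-dependent params")
```

Comparing these deltas against the totals reported in the paper would at least rule out very large vocabularies.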
Hi @abhigarg-iitk I contacted the first author of the paper and here is his answer:
Hi @gandroz, thanks for this answer. I had looked earlier into the Lingvo implementation of conformer, and one strange contrast was the use of ff layers in the convolution module instead of the pointwise conv used in the original paper. Also, the class name says "Lightweight conv layer", which also has a mention in the paper. In fact, I also tried replacing pointwise conv with ff layers, but the results were somewhat worse, although I didn't check my implementation thoroughly. Even ESPnet seems to use pointwise conv and not ff (link).
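On the pointwise-conv vs feed-forward point: a Conv1D with kernel_size=1 computes the same per-frame linear map as a Dense layer applied to every time step, so with matching weights the two forms should be interchangeable. A tiny sketch with made-up sizes:

```python
import tensorflow as tf

dim, expanded = 144, 288  # made-up sizes, not the paper's

# "Pointwise conv" form: 1x1 convolution over the time axis
pointwise = tf.keras.layers.Conv1D(filters=expanded, kernel_size=1)
# "Feed-forward" form: a dense layer applied independently to every frame
feedforward = tf.keras.layers.Dense(expanded)

x = tf.random.normal([2, 50, dim])  # [batch, time, features]
print(pointwise(x).shape)           # (2, 50, 288)
print(feedforward(x).shape)         # (2, 50, 288)
```

So in principle a consistent WER gap between the two variants would come from something other than the layer type itself.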
I made a quick review of the model code and did not find any great difference from the ESPNet implementation. Maybe in the decoder... The paper refers to a single-layer LSTM whereas the transformer decoder in ESPNet seems to add MHA layers.
Also in ESPNet:
So definitively, the approach from ESPNet is not purely the one described in the conformer paper.
There are 2 things I'm not sure about in the paper: variational noise and the structure of the prediction and joint networks. I don't know if they have dense layers right after the encoder and prediction net, or only a dense layer after adding the 2 inputs, or a layernorm or projection in the prediction net.
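To make the two readings concrete, here is a minimal Keras-style sketch of a transducer joint network with both structures, purely to illustrate the ambiguity; it is not claimed to be what the conformer authors did.

```python
import tensorflow as tf

class JointNet(tf.keras.layers.Layer):
    """Sketch of an RNN-T joint network with the two structures discussed above."""

    def __init__(self, joint_dim, vocab_size, project_inputs=True, **kwargs):
        super().__init__(**kwargs)
        self.project_inputs = project_inputs
        if project_inputs:
            # Reading (a): dense projections on each branch before adding
            self.enc_proj = tf.keras.layers.Dense(joint_dim)
            self.pred_proj = tf.keras.layers.Dense(joint_dim)
        self.out = tf.keras.layers.Dense(vocab_size)

    def call(self, enc_out, pred_out):
        # enc_out: [B, T, D_enc], pred_out: [B, U, D_pred]
        if self.project_inputs:
            enc_out = self.enc_proj(enc_out)
            pred_out = self.pred_proj(pred_out)
        # Reading (b) adds the raw outputs directly (requires D_enc == D_pred)
        joint = tf.expand_dims(enc_out, 2) + tf.expand_dims(pred_out, 1)  # [B, T, U, D]
        return self.out(tf.nn.tanh(joint))  # [B, T, U, vocab_size]
```

(Variational noise, the other open point, usually just means adding Gaussian noise to the weights during training.)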
@usimarit I'm waiting for an answer about the joint network and the choices made by the conformer team. I'll let you know when I have further details |
Hi, here is my training log.
@pourfard I could not say... I'm using the whole training dataset (960h) and after 50 epochs, the losses were 22.16 on the training dataset and 6.3 on the dev ones. And yes, it is very long to train...
Hi @gandroz: have you got something back from the conformer authors?
@tund not yet, I'll try to poke him again tomorrow
Thanks @gandroz
Hi, thanks for developing this great toolkit. I had 2 questions about the conformer model:

1. For `examples/conformer`, I think almost all the parameters are similar to conformer(S) of https://arxiv.org/pdf/2005.08100.pdf. However, the performance gap between the paper and the conformer model in `examples/conformer` seems to be quite big (2.7 vs 6.44 for test-clean). What do you think might be the reason for this? One reason I can see is that 2.7 is obtained with beam search whereas 6.44 is without, but I don't think beam search alone can bring that much difference. Can you give me some pointers on how I can reduce this gap? Also, did you try decoding with beam search for `examples/conformer`?

2. I tried decoding `examples/conformer` with beam search using `test_subword_conformer.py` and the pre-trained model provided via Drive. For this I just modified the `beam-width` parameter in `config.yml`, but the decoding is taking a very long time (about 30 min per batch, and test-clean has ~650 batches) on an Nvidia P40 with 24GB memory. Is this the expected behaviour, or do I need to do something more than changing `beam-width` from 0 to 4/8? What was the decoding time for you?

Thanks,
Abhinav