The meaning of gamma in the attention model #57
Original question:

Thanks for your amazing work first. I am quite confused about the "gamma" in the attention model; can you explain the meaning of "gamma"? I also cannot find this parameter in the original paper. Thanks!

Reply:

You can find gamma in Eq. (3) of the paper: https://arxiv.org/pdf/1805.08318.pdf. The idea of introducing a gamma parameter is to give the network the chance to learn how much it should be influenced by the self-attention module. With gamma = 0, the self-attention layer simply passes the feature maps from the previous layer through untouched. With gamma > 0, the self-attention layer adds the result of self-attention to those feature maps, weighted by gamma itself. So the larger the gamma, the more the self-attention module influences the output. The idea is that early in training it is easier and more stable to ignore the self-attention layers (models are initialized with gamma = 0 for all self-attention layers), and the network should then learn to increase gamma as training progresses. For what it's worth, in my experiments, once gamma is initialized at 0 it rarely increases throughout training.
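For concreteness, here is a minimal PyTorch-style sketch of a self-attention layer with a learnable gamma, following the y = gamma * o + x form of Eq. (3). The class name, the 1x1 conv projections, and the C/8 channel reduction are illustrative assumptions, not this repository's exact code:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Sketch of a self-attention layer with a learnable gamma:
    output = gamma * attention(x) + x  (Eq. (3) of the SAGAN paper)."""

    def __init__(self, in_channels):
        super().__init__()
        self.query = nn.Conv2d(in_channels, in_channels // 8, 1)
        self.key = nn.Conv2d(in_channels, in_channels // 8, 1)
        self.value = nn.Conv2d(in_channels, in_channels, 1)
        # gamma starts at 0, so the layer is initially an identity mapping
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.size()
        q = self.query(x).view(b, -1, h * w).permute(0, 2, 1)  # B x N x C'
        k = self.key(x).view(b, -1, h * w)                     # B x C' x N
        attn = torch.softmax(torch.bmm(q, k), dim=-1)          # B x N x N
        v = self.value(x).view(b, -1, h * w)                   # B x C x N
        o = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        # gamma = 0 -> output is just x; larger gamma -> more attention influence
        return self.gamma * o + x
```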
Follow-up question:

How is gamma updated? Do I need to update it manually myself?
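The reply above already implies the answer: gamma is a learnable parameter, not something you update by hand. A minimal sketch of what that looks like, assuming the hypothetical SelfAttention module sketched earlier and a placeholder loss chosen only for illustration:

```python
# gamma is an nn.Parameter, so it appears in model.parameters() and the
# optimizer updates it together with all other weights via backprop.
model = SelfAttention(in_channels=64)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(2, 64, 32, 32)   # dummy batch of feature maps
loss = model(x).mean()           # placeholder loss for illustration
loss.backward()                  # populates model.gamma.grad
optimizer.step()                 # gamma is updated here, no manual step needed

print(model.gamma.item())        # monitor how gamma evolves during training
```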