
The meaning of Gamma in the Attention model #57

hank19960918 opened this issue Mar 30, 2021 · 3 comments

@hank19960918

Thanks for your amazing work, first of all.
I am quite confused about the "gamma" in the Attention model. Can you explain the meaning of "gamma"?
I also cannot find this parameter in the original paper.

Thanks

@philnovv

philnovv commented Mar 20, 2022

You can find gamma in eqn. (3) of the paper, y = gamma * o + x (the attention output o scaled by gamma and added back to the input x): https://arxiv.org/pdf/1805.08318.pdf

The idea of introducing a gamma parameter is to give the network the chance to learn how much to be influenced by the self-attention module. With gamma = 0, the self-attention layer will simply pass the feature maps from the previous layer, untouched. With gamma > 0, the self-attention layer will add the results of self-attention to the feature maps from the previous layer, weighted by gamma itself. So the larger the gamma, the more the self-attention modules influence the outputs.

The idea is that in early training, it is easier/more stable to ignore the self-attention layers (models are initialized with gamma = 0 for all self-attention layers), and then the network should learn to increase gamma as training progresses.

For what it's worth, in my experiments, once gamma is initialized at 0, it rarely increases throughout training.
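
To make this concrete, here is a minimal PyTorch-style sketch of how gamma is typically wired into a self-attention block. This is illustrative only: the class and layer names are my own, and the details may differ from this repository's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Self-attention block with a learnable gamma, roughly following
    eqn (3) of the SAGAN paper: y = gamma * o + x. Illustrative sketch only."""

    def __init__(self, in_channels):
        super().__init__()
        # 1x1 convolutions produce the query/key/value projections of the feature map
        self.query = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)
        self.key = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)
        self.value = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        # gamma is a learnable scalar initialized to 0, so the block starts out
        # as an identity mapping; the optimizer updates it like any other weight
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.size()
        q = self.query(x).view(b, -1, h * w).permute(0, 2, 1)     # B x N x C'
        k = self.key(x).view(b, -1, h * w)                        # B x C' x N
        attn = F.softmax(torch.bmm(q, k), dim=-1)                 # B x N x N attention map
        v = self.value(x).view(b, -1, h * w)                      # B x C x N
        o = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)  # self-attention output
        # eqn (3): attention output scaled by gamma, added back to the input
        return self.gamma * o + x
```

Because gamma is registered as a regular nn.Parameter, it receives a gradient and is updated by the optimizer along with the other weights, so nothing needs to be updated by hand; you can watch its value during training with something like `attn_block.gamma.item()` (name is hypothetical, adjust to your module).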

@weiweiecho

@philnovv So do we need to initialize gamma at a value greater than 0?

@thd-ux

thd-ux commented Sep 10, 2024


How is gamma updated? Do we need to update it manually ourselves?
