
The meaning of Gamma in the Attention model #57

hank19960918 opened this issue Mar 30, 2021 · 3 comments

@hank19960918

Thanks for your amazing work, first of all.
I am quite confused about the "gamma" in the Attention model. Can you explain the meaning of "gamma"?
I also cannot find this parameter in the original paper.

Thanks

@philnovv

philnovv commented Mar 20, 2022

You can find gamma in eqn. (3) of the paper, y = gamma * o + x (the attention output o scaled by gamma and added back to the input x): https://arxiv.org/pdf/1805.08318.pdf

The idea of introducing a gamma parameter is to give the network the chance to learn how much to be influenced by the self-attention module. With gamma = 0, the self-attention layer will simply pass the feature maps from the previous layer, untouched. With gamma > 0, the self-attention layer will add the results of self-attention to the feature maps from the previous layer, weighted by gamma itself. So the larger the gamma, the more the self-attention modules influence the outputs.

The idea is that in early training, it is easier/more stable to ignore the self-attention layers (models are initialized with gamma = 0 for all self-attention layers), and then the network should learn to increase gamma as training progresses.

For what it's worth, in my experiments, once gamma is initialized at 0, it rarely increases throughout training.
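
To make this concrete, here is a minimal PyTorch-style sketch of how gamma is typically wired into a self-attention block. This is illustrative only: the class and layer names are my own, and the details may differ from this repository's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Self-attention block with a learnable gamma, roughly following
    eqn (3) of the SAGAN paper: y = gamma * o + x. Illustrative sketch only."""

    def __init__(self, in_channels):
        super().__init__()
        # 1x1 convolutions produce the query/key/value projections of the feature map
        self.query = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)
        self.key = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)
        self.value = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        # gamma is a learnable scalar initialized to 0, so the block starts out
        # as an identity mapping; the optimizer updates it like any other weight
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.size()
        q = self.query(x).view(b, -1, h * w).permute(0, 2, 1)     # B x N x C'
        k = self.key(x).view(b, -1, h * w)                        # B x C' x N
        attn = F.softmax(torch.bmm(q, k), dim=-1)                 # B x N x N attention map
        v = self.value(x).view(b, -1, h * w)                      # B x C x N
        o = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)  # self-attention output
        # eqn (3): attention output scaled by gamma, added back to the input
        return self.gamma * o + x
```

Because gamma is registered as a regular nn.Parameter, it receives a gradient and is updated by the optimizer along with the other weights, so nothing needs to be updated by hand; you can watch its value during training with something like `attn_block.gamma.item()` (name is hypothetical, adjust to your module).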

@weiweiecho

@philnovv So do we need to initialize gamma at a value greater than 0?

@thd-ux

thd-ux commented Sep 10, 2024


How is gamma updated? Do we need to update it manually ourselves?
