🚀 Feature
Intriguing paper: it keeps softmax(QKt) and V untangled, so that each retrieval (the V_i of vanilla attention) can look at all the searches, i.e. it can be evaluated against every softmax(QKt)_j, on a per-head basis. "Heads" then become how many searches and how many retrievals you support, and the two counts can differ.
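To make the idea concrete, here is a rough NumPy sketch of what is described above, not the paper's reference implementation. The function name `decoupled_attention`, the random weight initialisation, and in particular the way each search picks among retrievals (a softmax over retrievals, scored from the input and the candidate outputs) are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def decoupled_attention(x, n_search, n_retrieval, d_head, rng):
    """Hypothetical sketch: S searches and R retrievals, every pair evaluated."""
    n, d_model = x.shape
    scale = 1.0 / np.sqrt(d_head)

    def init(*shape):
        # Random projection weights, scaled by the input dimension (assumption).
        return rng.standard_normal(shape) / np.sqrt(shape[-2])

    w_q = init(n_search, d_model, d_head)      # one query projection per search
    w_k = init(n_search, d_model, d_head)      # one key projection per search
    w_v = init(n_retrieval, d_model, d_head)   # one value projection per retrieval
    w_sel_q = init(n_search, d_model, d_head)  # selection queries (assumption)
    w_sel_k = init(d_head, d_head)             # selection keys (assumption)

    # Searches: S independent attention maps softmax(QKt), shape (S, n, n).
    q = np.einsum("nd,sdh->snh", x, w_q)
    k = np.einsum("nd,sdh->snh", x, w_k)
    attn = softmax(np.einsum("snh,smh->snm", q, k) * scale, axis=-1)

    # Retrievals: R independent value projections, shape (R, n, h).
    v = np.einsum("nd,rdh->rnh", x, w_v)

    # Every retrieval is evaluated against every search: (S, R, n, h).
    pairs = np.einsum("snm,rmh->srnh", attn, v)

    # Each search softly selects among the R retrievals, per position.
    sel_q = np.einsum("nd,sdh->snh", x, w_sel_q)
    sel_k = pairs @ w_sel_k
    sel = softmax(np.einsum("snh,srnh->snr", sel_q, sel_k) * scale, axis=-1)

    # Weighted mix of retrievals per search, then concatenate the searches.
    out = np.einsum("snr,srnh->snh", sel, pairs)
    out = np.transpose(out, (1, 0, 2)).reshape(n, n_search * d_head)
    return out, attn, sel


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))  # 5 tokens, model dimension 16
out, attn, sel = decoupled_attention(x, n_search=4, n_retrieval=2, d_head=8, rng=rng)
print(out.shape)  # (5, 32)
```

With `n_search == n_retrieval` and a selection fixed to pair search i with retrieval i, this reduces to vanilla multi-head attention, which is the sense in which the two counts are decoupled here.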
Motivation
Interesting take for some tasks. It does not seem life-changing for classical MLM, but it seems very relevant to reasoning or vision-related tasks.
Pitch
Implement this and see how it goes in something like DINO?
Alternatives
Not doing it
Additional context
Paper
Reference implementation