
Discrepancy in code and paper related to HGRNBitAttention #37

Open
loki-r opened this issue Aug 7, 2024 · 1 comment

Comments


loki-r commented Aug 7, 2024

The code and the paper disagree on the last output equation.

The figure in the paper shows the last output equation as

[figure: $o_t^{'} = RMSNORM(h_t) * \sigma(g_t)$]

But based on the current code, what is actually executed is

$o_t^{'} = RMSNORM(g_t) * \sigma(h_t)$

instead of

$o_t^{'} = RMSNORM(h_t) * \sigma(g_t)$

It seems this was already fixed in a recent commit to HGRN in the flash-linear-attention repository:

```diff
        last_state = (recurrent_state,)
        past_key_values.update(last_state, self.layer_idx, i.shape[2])

-       o = self.g_norm(self.g_proj(hidden_states), rearrange(o, 'b h l d -> b l (h d)'))
+       o = self.g_norm(rearrange(o, 'b h l d -> b l (h d)'), self.g_proj(hidden_states))
        o = self.o_proj(o)

        return o, None, past_key_values
```
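To make the argument-order bug concrete, here is a minimal NumPy sketch (no learnable norm weight; the names `output_buggy` and `output_paper` are hypothetical, not from the repository) showing that swapping the inputs to a gated RMSNorm yields a genuinely different function:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # RMSNorm without a learnable scale, for illustration only
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def output_buggy(h, g):
    # what the old call order computed: RMSNorm(g) * sigma(h)
    return rmsnorm(g) * sigmoid(h)

def output_paper(h, g):
    # the paper's equation: RMSNorm(h) * sigma(g)
    return rmsnorm(h) * sigmoid(g)

rng = np.random.default_rng(0)
h = rng.standard_normal((2, 4))
g = rng.standard_normal((2, 4))
# the two orderings do not agree in general
print(np.allclose(output_buggy(h, g), output_paper(h, g)))
```

Only the recurrent output $h_t$ is normalized in the paper's formulation; the projection $g_t$ acts purely as a sigmoid gate, which is why the argument order to `g_norm` matters.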

Existing code path in the current repository:

Were the reported results obtained with the inverted equation or with the fixed one?

@ridgerchu
Owner

Hi, I think this is a bug introduced by the HGRN API modifications. The sigmoid should be applied to $g_t$ for better performance, but it is currently applied to $h_t$, and our pre-trained model also still uses $\sigma(h_t)$. We will fix this in the arXiv version soon, and we believe that applying the sigmoid to $g_t$ would give better performance than our current version.
