
Discrepancy in code and paper related to HGRNBitAttention #37

Open
loki-r opened this issue Aug 7, 2024 · 1 comment

Comments


loki-r commented Aug 7, 2024

The code and the paper disagree on the last output equation.

The figure in the paper shows the last output equation as

[figure: $o_t^{'} = RMSNORM(h_t) * \sigma(g_t)$]

But based on the current code, what is actually executed is

$o_t^{'} = RMSNORM(g_t) * \sigma(h_t)$

instead of

$o_t^{'} = RMSNORM(h_t) * \sigma(g_t)$

It seems this was already fixed in a recent commit to HGRN in the flash-linear-attention repository:

```diff
        last_state = (recurrent_state,)
        past_key_values.update(last_state, self.layer_idx, i.shape[2])

-       o = self.g_norm(self.g_proj(hidden_states), rearrange(o, 'b h l d -> b l (h d)'))
+       o = self.g_norm(rearrange(o, 'b h l d -> b l (h d)'), self.g_proj(hidden_states))
        o = self.o_proj(o)

        return o, None, past_key_values
```
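To make the argument-order bug concrete, here is a minimal NumPy sketch (no learnable norm weight; the names `output_buggy` and `output_paper` are hypothetical, not from the repository) showing that swapping the inputs to a gated RMSNorm yields a genuinely different function:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # RMSNorm without a learnable scale, for illustration only
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def output_buggy(h, g):
    # what the old call order computed: RMSNorm(g) * sigma(h)
    return rmsnorm(g) * sigmoid(h)

def output_paper(h, g):
    # the paper's equation: RMSNorm(h) * sigma(g)
    return rmsnorm(h) * sigmoid(g)

rng = np.random.default_rng(0)
h = rng.standard_normal((2, 4))
g = rng.standard_normal((2, 4))
# the two orderings do not agree in general
print(np.allclose(output_buggy(h, g), output_paper(h, g)))
```

Only the recurrent output $h_t$ is normalized in the paper's formulation; the projection $g_t$ acts purely as a sigmoid gate, which is why the argument order to `g_norm` matters.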

Existing code path in the current repository:

Were the reported results obtained with the inverted equation or with the fixed one?

@ridgerchu
Owner

Hi, I think this is a bug introduced by the HGRN API modifications. The sigmoid should be applied to $g_t$ for better performance, but it is currently applied to $h_t$, and our pre-trained model also still uses $\sigma(h_t)$. We will fix this in the arXiv version soon, and we believe that applying the sigmoid to $g_t$ would give better performance than our current version.
