Magic number in SADelARTransformer #114

Subuday · 2024-03-04T08:19:35Z

Could you please provide some insight into the 8 in this line of code:
qk_scale = self.tunables.query_mult * 8 / math.sqrt(head_width)


jpc · 2024-03-04T15:49:06Z

Hey, good question :)

The 8 is sqrt(64), where 64 is the default head_width (so by default, the whole expression reduces to query_mult * 1). It plays the same role as the 3 in the base_width calculation.
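A minimal sketch of that arithmetic, assuming a base head width of 64. The standalone function and the `DEFAULT_HEAD_WIDTH` constant are hypothetical names used for illustration and are not taken from the repository:

```python
import math

# Assumed base head width; sqrt(64) == 8 is the "magic number" in question.
DEFAULT_HEAD_WIDTH = 64


def qk_scale(query_mult: float, head_width: int) -> float:
    """Same expression as in the issue, with the 8 written out as sqrt(64)."""
    return query_mult * math.sqrt(DEFAULT_HEAD_WIDTH) / math.sqrt(head_width)


# At the default head width the two square roots cancel.
print(qk_scale(1.0, 64))   # 1.0
# A head four times wider halves the scale.
print(qk_scale(1.0, 256))  # 0.5
```

So the 8 only has an effect when head_width differs from 64.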

The "maximal update parametrization" from the Tensor Programs V paper requires us to assume some "base" widths, which we can use to determine whether we need to adjust the learning rate, initialization stddev, etc. The choice is completely arbitrary, since the only thing that matters is the ratio between this base width and the lr/init_stddev hyperparameters (which we tune in the end).
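To illustrate why the base is arbitrary, here is a small sketch for the qk_scale expression specifically. `qk_scale_with_base` and its `base_head_width` parameter are hypothetical and exist only for this example; the point is that a different base is absorbed by a different tuned query_mult:

```python
import math


def qk_scale_with_base(query_mult: float, head_width: int, base_head_width: int) -> float:
    """Hypothetical generalization: the 8 becomes sqrt(base_head_width)."""
    return query_mult * math.sqrt(base_head_width) / math.sqrt(head_width)


head_width = 128

# Base 64 with query_mult = 1.0 ...
a = qk_scale_with_base(1.0, head_width, base_head_width=64)
# ... gives the same scale as base 16 with query_mult = 2.0,
# because sqrt(64) / sqrt(16) == 2 is folded into the multiplier.
b = qk_scale_with_base(2.0, head_width, base_head_width=16)

print(math.isclose(a, b))  # True
```

Since query_mult is tuned anyway, picking a different base just shifts the tuned value by a constant factor.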
