[hotfix] doubled bias in FusedMLP #317

blefaudeux · 2022-05-31T23:48:18Z

What does this PR do?

Fixes the FusedMLP block having a doubled bias, found by chance while working on the weight inits (#312).

Before submitting

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

blefaudeux · 2022-06-01T00:01:25Z

Perf numbers without the doubled bias:

Units: runtime in ms, lower is better. Column headers: BMK - mul.

| Type | Configuration | 8 x 256 x 512 - 4 | 8 x 512 x 1024 - 4 | 4 x 1024 x 1024 - 4 | 2 x 2048 x 2048 - 4 |
| --- | --- | --- | --- | --- | --- |
| torch.float16 | standard - gelu - no bias - 0.0 drop - fw | 0.19 | 1.26 | 1.27 | 5.97 |
| torch.float16 | fused - gelu - no bias - 0.0 drop - fw | 0.19 | 1.26 | 1.27 | 5.52 |
| torch.float32 | standard - gelu - no bias - 0.0 drop - fw | 0.36 | 2.56 | 2.54 | 10.13 |
| torch.float32 | fused - gelu - no bias - 0.0 drop - fw | 0.36 | 2.54 | 2.54 | 10.14 |
| torch.float16 | standard - gelu - no bias - 0.1 drop - fw | 0.24 | 1.44 | 1.45 | 5.98 |
| torch.float16 | fused - gelu - no bias - 0.1 drop - fw | 0.21 | 1.36 | 1.36 | 5.53 |
| torch.float32 | standard - gelu - no bias - 0.1 drop - fw | 0.44 | 2.86 | 2.85 | 10.63 |
| torch.float32 | fused - gelu - no bias - 0.1 drop - fw | 0.39 | 2.66 | 2.64 | 10.60 |
| torch.float16 | standard - gelu - bias - 0.0 drop - fw | 0.23 | 1.43 | 1.42 | 5.88 |
| torch.float16 | fused - gelu - bias - 0.0 drop - fw | 0.23 | 1.43 | 1.43 | 5.92 |
| torch.float32 | standard - gelu - bias - 0.0 drop - fw | 0.43 | 2.83 | 2.83 | 10.76 |
| torch.float32 | fused - gelu - bias - 0.0 drop - fw | 0.43 | 2.83 | 2.83 | 10.97 |
| torch.float16 | standard - gelu - bias - 0.1 drop - fw | 0.27 | 1.59 | 1.59 | 5.95 |
| torch.float16 | fused - gelu - bias - 0.1 drop - fw | 0.22 | 1.36 | 1.36 | 5.54 |
| torch.float32 | standard - gelu - bias - 0.1 drop - fw | 0.50 | 3.10 | 3.12 | 11.60 |
| torch.float32 | fused - gelu - bias - 0.1 drop - fw | 0.40 | 2.65 | 2.64 | 10.37 |
| torch.float16 | standard - gelu - no bias - 0.0 drop - fw+bw | 0.60 | 4.26 | 4.27 | 16.46 |
| torch.float16 | fused - gelu - no bias - 0.0 drop - fw+bw | 0.60 | 4.71 | 4.26 | 16.28 |
| torch.float32 | standard - gelu - no bias - 0.0 drop - fw+bw | 1.14 | 9.02 | 8.45 | 31.93 |
| torch.float32 | fused - gelu - no bias - 0.0 drop - fw+bw | 1.15 | 8.88 | 8.37 | 34.53 |
| torch.float16 | standard - gelu - no bias - 0.1 drop - fw+bw | 0.67 | 4.56 | 4.58 | 16.84 |
| torch.float16 | fused - gelu - no bias - 0.1 drop - fw+bw | 0.66 | 4.37 | 4.69 | 19.42 |
| torch.float32 | standard - gelu - no bias - 0.1 drop - fw+bw | 1.29 | 9.11 | 9.12 | 32.54 |
| torch.float32 | fused - gelu - no bias - 0.1 drop - fw+bw | 1.18 | 8.49 | 8.69 | 34.63 |
| torch.float16 | standard - gelu - bias - 0.0 drop - fw+bw | 0.67 | 4.48 | 4.71 | 17.17 |
| torch.float16 | fused - gelu - bias - 0.0 drop - fw+bw | 0.79 | 4.62 | 4.62 | 16.83 |
| torch.float32 | standard - gelu - bias - 0.0 drop - fw+bw | 1.24 | 9.81 | 8.81 | 32.42 |
| torch.float32 | fused - gelu - bias - 0.0 drop - fw+bw | 1.35 | 9.20 | 9.01 | 32.51 |
| torch.float16 | standard - gelu - bias - 0.1 drop - fw+bw | 0.75 | 5.26 | 5.16 | 17.36 |
| torch.float16 | fused - gelu - bias - 0.1 drop - fw+bw | 0.96 | 4.55 | 4.57 | 16.64 |
| torch.float32 | standard - gelu - bias - 0.1 drop - fw+bw | 1.40 | 9.47 | 9.72 | 33.54 |
| torch.float32 | fused - gelu - bias - 0.1 drop - fw+bw | 1.30 | 8.56 | 8.56 | 32.66 |

If/when I get the time to revisit the fused linear, this could probably be improved.
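To make the numbers above easier to read, here is a small sketch that turns two pairs of rows into speedup ratios (standard runtime divided by fused runtime, so values above 1 mean the fused path is faster). The runtimes are copied from the torch.float16 "gelu - bias - 0.1 drop" rows; the helper name `speedups` is only illustrative.

```python
# Column labels and runtimes (ms) copied from the torch.float16,
# "gelu - bias - 0.1 drop" rows of the benchmark output above.
shapes = [
    "8 x 256 x 512 - 4",
    "8 x 512 x 1024 - 4",
    "4 x 1024 x 1024 - 4",
    "2 x 2048 x 2048 - 4",
]
standard_fw = [0.27, 1.59, 1.59, 5.95]
fused_fw = [0.22, 1.36, 1.36, 5.54]
standard_fw_bw = [0.75, 5.26, 5.16, 17.36]
fused_fw_bw = [0.96, 4.55, 4.57, 16.64]


def speedups(baseline, candidate):
    """Return baseline / candidate per shape; > 1 means candidate is faster."""
    return [b / c for b, c in zip(baseline, candidate)]


for label, base, cand in [
    ("fw", standard_fw, fused_fw),
    ("fw+bw", standard_fw_bw, fused_fw_bw),
]:
    for shape, ratio in zip(shapes, speedups(base, cand)):
        print(f"{label:6s} {shape:22s} {ratio:.2f}x")
```

For these rows the fused path is faster on every shape in the forward pass, while in fw+bw it is slower on the smallest shape and faster on the other three.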

blefaudeux · 2022-06-01T00:02:03Z

xformers/components/feedforward/fused_mlp.py

@@ -46,16 +46,26 @@ def __init__(
                dim_mlp = hidden_layer_multiplier * dim_model

                self.mlp = nn.Sequential(
-                    nn.Linear(in_features=dim_model, out_features=dim_mlp, bias=bias),
+                    nn.Linear(
+                        in_features=dim_model, out_features=dim_mlp, bias=False


The gist of it: this was a typo, the bias is already handled in the next layer.
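A minimal sketch of the issue, in plain PyTorch. `BiasActDropout` is a hypothetical stand-in for the layer that follows the linear layer and owns the bias; it is not the xformers implementation, and the dimensions are arbitrary. It only shows that leaving `bias=bias` on the `nn.Linear` doubles the number of bias parameters.

```python
import torch
from torch import nn


class BiasActDropout(nn.Module):
    """Hypothetical stand-in: adds a bias, then applies GELU and dropout."""

    def __init__(self, dim: int, p: float, use_bias: bool = True):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(dim)) if use_bias else None
        self.act = nn.GELU()
        self.drop = nn.Dropout(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.bias is not None:
            x = x + self.bias
        return self.drop(self.act(x))


def count_bias_params(module: nn.Module) -> int:
    """Count the elements of every parameter whose name ends in 'bias'."""
    return sum(
        p.numel() for name, p in module.named_parameters() if name.endswith("bias")
    )


dim_model, dim_mlp = 8, 32  # arbitrary illustrative sizes

# Before the fix: the linear layer and the next layer both carry a bias.
before = nn.Sequential(
    nn.Linear(dim_model, dim_mlp, bias=True),
    BiasActDropout(dim_mlp, p=0.0),
)

# After the fix: the linear layer is bias-free, the next layer owns the bias.
after = nn.Sequential(
    nn.Linear(dim_model, dim_mlp, bias=False),
    BiasActDropout(dim_mlp, p=0.0),
)

print(count_bias_params(before), count_bias_params(after))
```

Two biases added back to back before the activation are mathematically equivalent to a single one, so the extra parameter only adds memory traffic and an extra tensor to initialize.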

blefaudeux · 2022-06-01T00:04:38Z

Insta-landing this, @danthe3rd / @fmassa / @dianaml0, I hope that's ok: semi-obvious typo.

blefaudeux · 2022-06-01T00:04:52Z

xformers/benchmarks/benchmark_mlp.py

@@ -88,7 +88,7 @@ def mlp_fused():
                    ),
                ]:
                    time = triton.testing.do_bench(testcase.function)[0]
-                    key = f"B={B}, M={M}, K={K}, HLM={hlm}"


minor presentation changes

blefaudeux · 2022-06-01T00:05:02Z

xformers/benchmarks/benchmark_mlp.py

@@ -19,8 +19,8 @@
    (8, 512, 1024),
    (4, 1024, 1024),
    (2, 2048, 2048),
-    (1, 2048, 12288),
-    (2, 4096, 4096),
+    (1, 2048, 4096),


trying to make this fit on a smaller GPU
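As a rough back-of-the-envelope check of why the larger shapes are heavy, the sketch below estimates the weight and hidden-activation sizes of one MLP per shape. It assumes each tuple is (B, M, K) with K the model dimension, a hidden layer multiplier of 4 (as in the benchmark column labels), and two bias-free linear layers (K to 4K, then 4K to K). The helper names are illustrative and are not part of the benchmark script.

```python
BYTES_PER_ELEMENT = {"float16": 2, "float32": 4}


def mlp_weight_elements(k: int, hlm: int = 4) -> int:
    """Weights of two linear layers: K -> hlm*K and hlm*K -> K (no bias)."""
    return 2 * k * (hlm * k)


def hidden_activation_elements(b: int, m: int, k: int, hlm: int = 4) -> int:
    """Elements of the hidden activation tensor of shape (B, M, hlm*K)."""
    return b * m * hlm * k


def to_gib(elements: int, dtype: str) -> float:
    """Convert an element count to GiB for the given dtype name."""
    return elements * BYTES_PER_ELEMENT[dtype] / 2**30


# Shapes taken from the diff above: two removed, one added.
for status, (b, m, k) in [
    ("removed", (1, 2048, 12288)),
    ("removed", (2, 4096, 4096)),
    ("added", (1, 2048, 4096)),
]:
    weights = to_gib(mlp_weight_elements(k), "float32")
    hidden = to_gib(hidden_activation_elements(b, m, k), "float32")
    print(f"{status:8s} {(b, m, k)}: weights {weights:.2f} GiB, hidden {hidden:.3f} GiB")
```

Under these assumptions, the weights alone for K = 12288 come to 4.5 GiB in float32, against 0.5 GiB for K = 4096, before counting gradients or any other buffers.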

facebook-github-bot added the CLA Signed label May 31, 2022

hotfix, dual bias in FusedMLP

e21c7ab

blefaudeux force-pushed the hotfix_fused_mlp branch from a188d5c to e21c7ab June 1, 2022 00:00

blefaudeux requested review from dianaml0 and fmassa June 1, 2022 00:01

blefaudeux changed the title ~~[hotfix] dual bias in FusedMLP~~ [hotfix] doubled bias in FusedMLP Jun 1, 2022

blefaudeux requested a review from danthe3rd June 1, 2022 00:06

blefaudeux merged commit cbf4526 into main Jun 1, 2022

blefaudeux deleted the hotfix_fused_mlp branch June 3, 2022 03:52
