[backend] 3/3 Triton 2 update #272
Conversation
To peek into the remaining issues: https://app.circleci.com/pipelines/github/facebookresearch/xformers/1395/workflows/b90368b4-eda1-4037-8cd3-a4d138b6e320/jobs/3075 - getting there.
@dianaml0 just FYI, I'm investigating the layernorm crash. It actually works fine with CUDA 11.6 / Ampere, but I can repro with 11.4.
Some writes were not masked, which could be why. I just pushed a small update, works on my machine (tm).
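For context, this is the kind of bug being described: in a Triton kernel, any store past the logical row length has to be masked, otherwise it writes out of bounds. Below is a minimal illustrative kernel (not the actual xformers layernorm) where both the load and the store are masked to the valid columns.

import torch
import triton
import triton.language as tl


@triton.jit
def _double_rows_kernel(X, Y, stride, N, BLOCK_N: tl.constexpr):
    # One program per row; BLOCK_N is rounded up to a power of two, so some
    # lanes fall outside the row and must be masked on load AND on store.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_N)
    mask = cols < N
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0)
    y = x * 2.0
    # The masked store is the important bit: without `mask=mask`, the trailing
    # lanes would write out of bounds and could corrupt neighbouring memory.
    tl.store(Y + row * stride + cols, y, mask=mask)


x = torch.randn(8, 100, device="cuda")
y = torch.empty_like(x)
_double_rows_kernel[(x.shape[0],)](
    x, y, x.stride(0), x.shape[1], BLOCK_N=triton.next_power_of_2(x.shape[1])
)
assert torch.allclose(y, 2 * x)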
Layernorm fixed, as far as I can see; it should also be a bit faster for very big sizes.
@fmassa if you have some cycles at some point, would you mind having a look at the test_sparse_softmax test, which does not pass with this branch? You have a little more context around there (and you could check the changes from this PR, which are probably a bit rough in this area).
Force-pushed from 6c97cca to 4327d39
I just improved a bit on the changes in /sparse_tensor; it should be OK by now, except that the new blocksparse does not accept a per-pixel mask anymore, so this probably breaks some of this abstraction.
Force-pushed from be72b26 to 8113277
Force-pushed from 6369798 to 9554e3f
return torch.float32, 1e-1

# Force pytorch to keep its computations as float32 (will default to tf32 with recent cuda and ampere+ GPU)
torch.backends.cuda.matmul.allow_tf32 = False
@fmassa this fixed issues that I was seeing with these unit tests on an ampere GPU, which I presume stemmed from the fact that the sparse kernels were fp32 while pytorch defaulted to tf32
oh wow, thanks for spotting this!
One more instance where tf32 is being somewhat harmful. Maybe worth commenting on pytorch/pytorch#67384 ?
It's a strange format: the range of fp32 but the precision of fp16. It's also kind of peculiar that it's really 19 bits but named TF32..
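To make the trade-off concrete, here is a small PyTorch-only sketch (assuming an Ampere or newer GPU, where fp32 matmuls went through TF32 by default at the time of this PR). The gap between the TF32 and strict-fp32 results is typically around 1e-3 relative, consistent with TF32's 10-bit mantissa.

import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

# TF32 path: fp32 inputs, but the matmul internally rounds to a 10-bit mantissa
torch.backends.cuda.matmul.allow_tf32 = True
out_tf32 = a @ b

# Strict fp32 path, as forced in the test above
torch.backends.cuda.matmul.allow_tf32 = False
out_fp32 = a @ b

rel_err = (out_tf32 - out_fp32).norm() / out_fp32.norm()
print(rel_err)  # typically around 1e-3, far above fp32 rounding error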
# Upstream GPU blocksparse (Triton op) uses TF32 by default for all internal computations
# TF32 has the precision of fp16 but the range of fp32
# See https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
torch.backends.cuda.matmul.allow_tf32 = True
@fmassa this seems to be a better fit following the switch to triton2, which internally moved all tl.dot() operations to tf32
cc @ptillet, just swapping triton 1.1 for 2.dev meant that this test would not pass anymore, as we discussed
SGTM wrt the tests!
def _get_dtype_atol(tensor_type, device: str):
    _seed()
this was to remove some reproducibility issues between CircleCI and my machine..
MODE,
trans_a=TRANS_A,
trans_b=TRANS_B,
device=torch.device("cuda"),
triton blocksparse op now requires the device to be passed in
# triton result
op = blocksparse_softmax(layout, BLOCK)
op = blocksparse_softmax(layout, BLOCK, device=torch.device("cuda"))
triton blocksparse softmax now requires the device to be passed in
ty = op(
    tx,
    scale=scale,
    key_padding_mask=kp_mask,
triton blocksparse no longer supports an attention mask or key padding mask. We can pass a "causal" flag though.
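Since per-element masks are gone, constant masking patterns are better folded into the block layout the op is built with, as noted later in this PR. Below is a hypothetical helper (not part of the xformers API) for the causal case.

import torch


def causal_block_layout(seq_len: int, block_size: int) -> torch.Tensor:
    # 1 for blocks that may contain visible positions, 0 for blocks that are
    # entirely masked out; the blocksparse kernels then skip the zeroed blocks.
    assert seq_len % block_size == 0
    n_blocks = seq_len // block_size
    return torch.tril(torch.ones(n_blocks, n_blocks, dtype=torch.long))


layout = causal_block_layout(seq_len=1024, block_size=64)  # (16, 16) block mask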
@@ -50,7 +50,7 @@ def test_layernorm_parity(shape, amp):
torch.random.manual_seed(0)
X_ = torch.normal(0, 1, size=shape, device="cuda", requires_grad=True)

eps = 1e-5
eps = 1e-4
1/1e-5 overflows in fp16..
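Spelling that out: 1/1e-5 = 1e5, which is above the fp16 maximum of about 65504, while 1/1e-4 = 1e4 still fits. A quick PyTorch check:

import torch

for eps in (1e-5, 1e-4):
    x = torch.tensor(eps, dtype=torch.float16)
    # 1/1e-5 -> ~1e5 > 65504 (fp16 max) -> inf; 1/1e-4 -> 1e4, still finite
    print(eps, torch.reciprocal(x))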
# Properties specific to this attention mechanism
self.supports_attention_mask = True
self.supports_key_padding_mask = True
# The underlying triton op does not support per element attention mask
see previous PR which introduced these flags, we can now flip them in this case
This is in principle a BC-breaking change. Should we bother, or should we just follow what Triton does?
Basically we don't have much choice, short of implementing blocksparse ourselves. Phil's arguments were that it was not typically used (short of causal, which is supported), and that the attention mask (additive -> floats) ended up taking a significant amount of memory. My take is that we have an in-house fallback since people can use sparse attention, and otherwise they can stick to the current pip release for some time. Breaking BC is not something I would do every day, but in this case it seemed OK?
# If blocks are to be constantly masked, better perf would thus be reached by signalling them out in the
# initial attention setup
# Delayed triton init, to make sure that we get the right device
if not hasattr(self, "sparse_dot_sdd"):
Triton blocksparse needs the correct device to be passed in, but that may not be known at construction time (if the module is constructed on CPU and then moved), and it's not possible to just default to cuda:0 (that would break many multi-GPU cases). So we defer the construction until the first input tensor comes in.
Sounds good to me.
One other option would be to override the .cuda() / .to() methods so that they re-create those objects.
The current approach is fine as is because those objects don't contain learnable parameters, but if that were the case it would mess up optimizers / distributed.
Yep, that's something we discussed with @colehawkins on another PR. One possible issue for me is that some sharded trainers intercept the .to() calls, so this would silently fail in that case. I think that both takes have issues (the delayed init and the .to() overload); the only clean way out that I can think of is to make this attention take a "device" as a construction argument and put it in the right place from the beginning.
@blefaudeux @fmassa I've been using this with both PyTorch Lightning and the Hugging Face trainer, and this method (initialization at first forward) is the cleanest way I found that doesn't break any standard workflows or require model initialization workarounds. If we just take the device at construction, this breaks the Hugging Face trainer's "natural approach" for single-node, multi-GPU, which is to (1) create the model, then (2) call .to().
One alternative is to initialize with a device at construction, keep that as self.device, and then check it against query.device and possibly re-initialize at the forward pass.
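For reference, a minimal sketch of the pattern discussed in this thread; the class and attribute names are illustrative, not the actual xformers code. Device-specific ops are built lazily from the first input's device and rebuilt if the input later shows up on a different device.

import torch


class LazyBlockSparseOps(torch.nn.Module):
    """Illustrative module that defers device-specific op construction."""

    def __init__(self):
        super().__init__()
        self._ops_device = None  # device the ops were built for, if any
        self._kernel = None      # stand-in for the triton blocksparse ops

    def _build(self, device: torch.device):
        # In the real code this would construct the blocksparse matmul/softmax
        # ops on `device`; here an Identity stands in for them.
        self._kernel = torch.nn.Identity().to(device)
        self._ops_device = device

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        # Build on first use, or rebuild if the input now lives elsewhere
        # (e.g. the model was constructed on CPU and then moved to cuda:1).
        if self._ops_device != q.device:
            self._build(q.device)
        return self._kernel(q)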
Codecov Report
@@ Coverage Diff @@
## label_attention_properties #272 +/- ##
=============================================================
Coverage ? 92.69%
=============================================================
Files ? 61
Lines ? 3393
Branches ? 0
=============================================================
Hits ? 3145
Misses ? 248
Partials ? 0
I think that there's some margin in terms of speed around the fused linear layers; the matmul triton op improved a lot in the last few months and fused linear could probably benefit from some of it. Better to keep that for another PR though.
@blefaudeux larger block sizes (at least up to 128) are supported by triton v2. Also, is any type of recompile option possible for the blocksparse attention? I ran into this issue using triton v2 blocksparse attention in a distributed environment (specifically the huggingface trainer). Since the ops are device-specific, they need to be recreated or there is a device mismatch. It's not too hard to work around with a delayed device-specific model initialization, but I think there are potentially smoother workarounds by inheriting the .to() method. Happy to submit both in a PR to either branch, much more timely than last time (1-2 days).
Oh sure for the block size, I didn't know, and there may be a type check or cast which should be removed also (top of head, we used to force a cast to fp16). For the device, overloading .to() is possible but there can be a lot of cases to handle (for instance if blocksparse is part of a wrapper which intercepts this call), so I'm not sure it would be much cleaner? The first forward is much slower with Triton anyway due to the JIT, so I think that this has no measurable perf impact. If .to() reads cleaner then sure, and a PR is welcome against this branch if you want. No problem for the timing, I'm not very timely myself on the topic :)
self.supports_key_padding_mask = True
# The underlying triton op does not support per element attention mask
self.supports_attention_mask = False
self.supports_key_padding_mask = False

def update_mask_type(self, mask: torch.Tensor):
Was just reading the code and realised this is not used anymore, safe to delete?
Nice catch! In general I need to give it a second look and clean things up; I was waiting for @colehawkins so that there's no conflict, but will do.
@blefaudeux Posted in #277. Pending CI, but tests passed locally so I have high hopes.
# See https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
return torch.float32, 1e-1
wow, that is quite some low precision...
Thanks for the PR!
I've left a few comments, but given the size of this PR already, I'd propose that we address them in a separate PR after this one is merged.
tests/test_triton_dropout.py (outdated)
x_ref = (x + b if bias else x).to(y.dtype)
assert not torch.allclose(x_ref, y, rtol=tol)

# Check that the drops are different for every row (could catch broken seeds per row)
y = triton_dropout(x, p=0.5)

print(y)
leftover from debugging?
xformers/__init__.py (outdated)
@@ -8,7 +8,7 @@
import torch

# Please update the doc version in docs/source/conf.py as well.
__version__ = "0.0.10"
__version__ = "0.0.11.dev"
that's a good thing to do indeed!
We should remember to remove this during releases though.
Maybe it would be better to split this off into its own PR?
# TODO triton softmax performs an in-place operation
# res = arg0.__sparse_softmax(arg0.__values)
res = arg0.__sparse_softmax(arg0.__values.clone())
res = arg0.__sparse_softmax(arg0.__values)
nice!
xformers/triton/dropout.py (outdated)
@@ -166,15 +169,22 @@ def dropout(
Optionally add a bias, the computation will be fused.
"""

assert p < 1.0, f"We don't want to drop all the values, most probably {p}"
PyTorch supports this case, so if we want our dropout to be a drop-in replacement for PyTorch's implementation it would be good to support this as well.
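For reference, PyTorch's functional dropout does accept p == 1.0; it simply zeroes everything during training and is a no-op in eval mode:

import torch
import torch.nn.functional as F

x = torch.ones(4)
# p = 1.0 is legal in PyTorch: everything is dropped during training...
print(F.dropout(x, p=1.0, training=True))   # tensor([0., 0., 0., 0.])
# ...and dropout leaves the input untouched in eval mode, regardless of p.
print(F.dropout(x, p=1.0, training=False))  # tensor([1., 1., 1., 1.])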
Fixed, incoming update, thanks for the catch!
Checking right now with a small ViT/CIFAR training, then landing the whole stack ASAP.
Move to Triton 2
Author: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@pm.me>
- Tentatively fixing layernorm - faster all around - bugfix
- Better take on sparse tensors, put layout on the correct device
- Update the pip packages, minor cleanup
…power of two constraint (#277)
* Relax device size restrictions
* Refactor device creation and run all tests
* linting
Co-authored-by: Cole Hawkins <colehawk@amazon.com>
I tried to address most comments @fmassa; we can follow up on the version numbering and delayed init in another PR. I just checked with my "classical" ViT/CIFAR test, just in case: same accuracy as before when pulling in all the triton layers.
…h combo (#271)
* testing using conda to get the pytorch nightlies and matching cuda
* [fix] Making it explicit whether the attention mechanism supports an attention mask or not (#266), check the assert
* [backend] 3/3 Triton 2 update (#272): move to Triton 2, tentatively fixing layernorm (faster all around, bugfix), better take on sparse tensors, put layout on the correct device, update the pip packages, minor cleanup
* catering for triton blocksparse being probably more reliable in fp16
* faster layernorm
* Minor blocksparse refactoring, update block size restrictions, relax power of two constraint (#277): relax device size restrictions, refactor device creation and run all tests, linting
* code review, thanks @fmassa!
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: colepshawkins <31542048+colehawkins@users.noreply.github.com>
Co-authored-by: Cole Hawkins <colehawk@amazon.com>
What does this PR do?
Push things a little forward with Triton 2. My thinking was to try to land all 3 PRs in one go once this last one is green.
Happy to update this PR on that front (and others); this is something quickly wrapped together to try to unlock some other PRs (like #263).
TODOs:
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.