5/N Ragged Inference - Move triton_v2 ragged inference code to new experimental directory #189
Conversation
Thanks for the PR @nottombrown, having a look ASAP! @fmassa @dianaml0 I'll try to put up a PR this weekend to move the memory-efficient attention to triton 2 and move it to /experimental also, and the same could be done with the FAVOR-specific kernel which was breaking triton 1.0.
Great! I'll make a separate branch for further changes so as not to collide with this one!
experimental/ragged_inference_v2/ragged_inference_v2/garbage_pad_ragged_acts.py
see nottombrown#2 for minor changes, I hope that works
Interesting, on an Ampere laptop some tests do not pass, and others segfault. How HW-ready is triton v2.0 @ptillet? Note that it could be something else, wrong gcc or CUDA version, but I've tried a few.
Oh this is strange. What is the nature of the failures? It's possible that some tests are not configured to skip configs that require too much shared memory.
Not a shared memory size issue: it segfaults at JIT time in Triton compile (no errors when installing or importing Triton). It only happens for some tests, not all of them.
GCC 9 and 10, CUDA 11.5, by the way.
[ragged attention] suggested minor changes (will update the other PR if accepted)
return scores_out.reshape((n_ctx_q, n_ctx_k))

def ragged_qk_dotprod(
Curious to get some perf numbers on that one, even if it's probably early.
bytes_in_keys_per_seq = n_key_ctx_per_seq * d_model_per_gpu * 2  # 2 from bf16
bytes_in_keys_total = bytes_in_keys_per_seq * n_seqs
hbm_bw_bytes_per_gpu = 1555e9  # 1.5TB/s
Not for this PR, but I would not do this (compare a number to a theoretical one): (a) it's HW-specific (how does this test relate to another accelerator?), and (b) it makes some assumptions about what's going on (here the data format, for instance).
For other benchmarks we compute the user-facing throughput; it's also what Phil does here, for instance: you consider the implementation as a black box and count the bytes going in and out (at best you read the seqs and write the attention matrix, the rest is history). It's mostly what happens in this code already, but I would:
- count the BW with num_elem * elem_size() (and not assume bfloat16; it would actually be nice to compare across types, it can give an idea of whether the kernels are compute- or bandwidth-bound, at least it helped me on other tasks); see the sketch after this comment
- test a bunch of sizes; from experience there are a lot of possible holes / scheduling issues, and testing with one size only is like Russian roulette (there are helpers for this in the repo if that helps)
I can do that later on if this ends up running locally, brain dump here :)
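To make the suggestion concrete, here is a minimal sketch of that kind of dtype-agnostic, black-box bandwidth accounting, assuming plain PyTorch tensors on a CUDA device; the helper name, the `(q, k)` matmul stand-in and the size sweep are illustrative, not the PR's benchmark code:

```python
import time
import torch

def io_bandwidth_gb_s(fn, inputs, n_iters=50):
    # Black-box accounting: bytes read (inputs) + bytes written (outputs),
    # using element_size() instead of hard-coding a dtype such as bf16.
    out = fn(*inputs)  # warm-up call, also gives us the outputs to count
    outputs = out if isinstance(out, (list, tuple)) else [out]
    io_bytes = sum(t.numel() * t.element_size() for t in list(inputs) + list(outputs))
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        fn(*inputs)
    torch.cuda.synchronize()
    return io_bytes * n_iters / (time.time() - start) / 1e9

# Sweep dtypes and sizes rather than a single point, to spot scheduling holes.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    for n_ctx in (256, 1024, 4096):
        q = torch.randn(n_ctx, 256, device="cuda", dtype=dtype)
        k = torch.randn(n_ctx, 256, device="cuda", dtype=dtype)
        bw = io_bandwidth_gb_s(lambda a, b: a @ b.t(), (q, k))
        print(f"{dtype} n_ctx={n_ctx}: {bw:.1f} GB/s")
```

Counting bytes via `element_size()` keeps the same harness usable for fp16, bf16 and fp32 without baking in a data format.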
# Define indices ranges, we follow the triton convention of prefixing
# with "r" to denote a range like "rq" is the range for queries below
rq = in_q_token_offset + tl.arange(0, BLOCK_Q)
nice, "all" (well, a lot of) the magic is there, looks like nothing but super well done I think
Looks very good to me, I especially like the wrapper for the ragged attentions and how they play well with the kernel. Great comments also, it will be nice for newcomers! Thanks a bunch @nottombrown
ragged_acts_offset_ptr = ragged_acts_offset_per_seq_ptr + seq_idx
ragged_acts_offset = tl.load(ragged_acts_offset_ptr)

# Create a mask to guard memory operations against out-of-bounds accesses
Is this unavoidable? How much more memory does this consume (if any)?
It's part of the Triton API (masked load), it should not take any registers really. This is all in SoC space, so nothing visible in RAM; worst case it's a throwaway mask on the actual GPU chip.
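For readers unfamiliar with the API, a minimal self-contained sketch of a masked load/store in a Triton kernel; the kernel and pointer names here are illustrative, not taken from the PR:

```python
import triton
import triton.language as tl

@triton.jit
def copy_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    # The mask guards against out-of-bounds accesses when n_elements is not a
    # multiple of BLOCK; masked lanes read `other` instead of touching memory.
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    tl.store(out_ptr + offsets, x, mask=mask)
```

This matches the comment above: the mask lives on-chip per lane and does not add visible global-memory traffic.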
# We just use one program per n_ctx position for simplicity
assert d_model >= 128, f"bad {d_model=}"
assert d_model <= 8 * 1024, f"bad {d_model=}"
assert d_model % 32 == 0, f"bad {d_model=}"
Why 32 and not 64? Where did these requirements come from?
I think it comes from BLOCK_K = 32, a scheduling/tiling constraint for the matmuls. This could be relaxed if we were to mask over that dimension in the kernel
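A rough sketch of what masking over that dimension could look like, which would relax the `% 32` requirement; this is hypothetical, the single-row kernel and the `BLOCK_D` name are not part of the PR:

```python
import triton
import triton.language as tl

@triton.jit
def qk_row_kernel(q_ptr, k_ptr, out_ptr, d_model, BLOCK_D: tl.constexpr):
    # BLOCK_D is a power of two >= d_model; the tail beyond d_model is masked
    # to zero, so d_model no longer has to be a multiple of the tile size.
    rd = tl.arange(0, BLOCK_D)
    d_mask = rd < d_model
    q = tl.load(q_ptr + rd, mask=d_mask, other=0.0)
    k = tl.load(k_ptr + rd, mask=d_mask, other=0.0)
    tl.store(out_ptr, tl.sum(q * k, axis=0))
```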
for n_ctx in n_ctx_per_kv_cache:
    for idx_into_seq in range(max_n_ctx):
        if idx_into_seq < n_ctx:
            indices_list.append(ragged_idx)
Wondering why we need O(n) append calls here...
please submit a PR :D yes this is not optimal indeed
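As a possible follow-up, a sketch of a vectorized construction that avoids the O(n) Python-level appends; it assumes `ragged_idx` is simply a running counter over the valid (non-padded) positions, which is not confirmed by the excerpt above:

```python
import torch

def build_ragged_indices(n_ctx_per_kv_cache, max_n_ctx):
    # Assumption: each sequence contributes a consecutive block of ragged
    # indices, offset by the total length of the sequences before it.
    n_ctx = torch.as_tensor(n_ctx_per_kv_cache)
    seq_offsets = torch.cumsum(n_ctx, dim=0) - n_ctx      # start index of each seq
    pos = torch.arange(max_n_ctx)                          # position within a padded seq
    valid = pos[None, :] < n_ctx[:, None]                  # (n_seqs, max_n_ctx) mask
    ragged_idx = seq_offsets[:, None] + pos[None, :]       # candidate indices
    return ragged_idx[valid]                               # keep only valid positions

# Example: two sequences of length 2 and 3, padded to max_n_ctx=3
print(build_ragged_indices([2, 3], 3))  # tensor([0, 1, 2, 3, 4])
```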
def get_all_configs():
    return [
        # basic configs for compute-bound matmuls
        triton.Config(
Where do these magic numbers / configs come from?
It comes from here: empirical good values for Ampere GPUs. Architecture-dependent, but Triton does navigate around some specifics thanks to all these scheduling options. https://github.com/openai/triton/blob/v2.0/python/triton/ops/matmul.py#L35
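For context, the general shape of such an autotuned config list in Triton; the specific block sizes, stage counts and warp counts below are illustrative placeholders, not the values used in this PR or in the linked file:

```python
import triton
import triton.language as tl

# Each config fixes a tiling (BLOCK_M/N/K), a pipelining depth (num_stages)
# and a warp count; the autotuner benchmarks them per key and caches the winner.
configs = [
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 32}, num_stages=3, num_warps=8),
    triton.Config({"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32}, num_stages=4, num_warps=4),
    triton.Config({"BLOCK_M": 32, "BLOCK_N": 32, "BLOCK_K": 32}, num_stages=5, num_warps=2),
]

@triton.autotune(configs=configs, key=["M", "N", "K"])
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    ...  # tiled matmul body elided in this sketch
```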
# In einsum notation, the tl.dot does: qd,dk->qk
# This should use tensorcores, so the inputs might be fp16, but the outputs
# and all the internal accumulators are fp32
Do internal accumulators default to fp32 or tf32 on A100s?
tl.dot() always returns fp32; it's a very good question, it must be documented somewhere on NVIDIA's side.
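To see this concretely, a tiny standalone kernel with fp16 inputs whose `tl.dot` result is stored as fp32; the kernel name and the 32x32 shapes are just for illustration, not from the PR:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def dot_fp32_acc_kernel(a_ptr, b_ptr, c_ptr, BLOCK: tl.constexpr):
    # Load fp16 tiles; tl.dot feeds the tensor cores but returns fp32.
    offs = tl.arange(0, BLOCK)
    a = tl.load(a_ptr + offs[:, None] * BLOCK + offs[None, :])
    b = tl.load(b_ptr + offs[:, None] * BLOCK + offs[None, :])
    acc = tl.dot(a, b)  # fp32 result regardless of the input dtype
    tl.store(c_ptr + offs[:, None] * BLOCK + offs[None, :], acc)

a = torch.randn(32, 32, device="cuda", dtype=torch.float16)
b = torch.randn(32, 32, device="cuda", dtype=torch.float16)
c = torch.empty(32, 32, device="cuda", dtype=torch.float32)
dot_fp32_acc_kernel[(1,)](a, b, c, BLOCK=32)
```

Storing into an fp32 tensor keeps the full-precision accumulator; any downcast would have to be explicit.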
Just asking a bunch of questions for my own learning - feel free to ignore!
No worries, very good questions I think, I tried to give some insights but @nottombrown could probably add a little. Note that the code changed a tiny bit since this PR, and some updates were planned.
Since this code uses triton_v2, it's currently incompatible with our CI pipeline. This PR moves it to a separate package that can avoid breaking CI while still letting imports work correctly. Once triton v2 is stable, we can upgrade xformers core to 2.0 and pull the experimental code into the core package.
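For illustration only, roughly how the separate package could be consumed; the module path is inferred from the file reviewed above and the imported name is a hypothetical placeholder:

```python
# Install the experimental package on its own, so the main xformers CI job
# never imports triton_v2 (path taken from this PR's layout):
#   pip install -e experimental/ragged_inference_v2
from ragged_inference_v2.garbage_pad_ragged_acts import RaggedActivations  # name assumed
```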