Memory-efficient attention - forward pass #267
Conversation
It's for now 1000x slower than the baseline
Now we are *only* 6x slower than baseline
Now we are only 50% slower than baseline
Need to fix the buffer size, which is hard-coded for now
The use of Dot makes it 2.5% faster already
Still need to make it generic w.r.t. query size, and allow values of K that go beyond the buffer limit
This is commented out for now as it brings a slowdown to the implementation
@fmassa the doc build issue should disappear after a rebase, this was fixed on main a week ago or so. Otherwise this is really great, taking a deeper look!
out = torch.ops.xformers.efficient_attention(query, key, value)
ref = ref_attention(query, key, value)

assert torch.allclose(out, ref, atol=2e-4)
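For readers of this excerpt, ref_attention is not defined here; it is presumably a plain PyTorch baseline along the lines of the sketch below (a hedged sketch only; the exact reference used in the tests, including whether it applies the 1/sqrt(K) scaling, may differ):

```python
import torch

def ref_attention(query, key, value):
    # Naive baseline: materializes the full (M, N) attention matrix,
    # which is exactly what the memory-efficient kernel avoids.
    scores = query @ key.transpose(-2, -1)
    return torch.softmax(scores, dim=-1) @ value
```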
Just to get an idea, do you have a gut feeling on where the small differences come from? The softmax renormalization being handled a little differently with the paper's method seems like an easy explanation, but is there something else?
That is a good question, and my best explanation so far is indeed that we accumulate a bit more error because we don't know the max value over a row ahead of time, so we need to renormalize on the fly (introducing a bit more rounding error).
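To make that concrete, here is a minimal sketch (plain PyTorch, not the kernel's actual code) of the running-max renormalization for a single query row; each rescaling of the partial sums is an extra source of rounding error compared with a single softmax over the full row, which is consistent with the 2e-4 tolerance used in the test above:

```python
import torch

def streaming_softmax_av(q_row, keys, values, block=32):
    # Online softmax over one query row, processing keys/values block by block.
    # m: running max of the logits, s: running sum of exp, acc: running output.
    m = torch.tensor(float("-inf"))
    s = torch.tensor(0.0)
    acc = torch.zeros(values.shape[-1])
    for start in range(0, keys.shape[0], block):
        logits = keys[start:start + block] @ q_row
        m_new = torch.maximum(m, logits.max())
        # Rescale the previous partial sums to the new running max; this is the
        # renormalization step that introduces the extra rounding error.
        correction = torch.exp(m - m_new)
        p = torch.exp(logits - m_new)
        s = s * correction + p.sum()
        acc = acc * correction + p @ values[start:start + block]
        m = m_new
    # Matches torch.softmax(keys @ q_row, dim=0) @ values up to rounding.
    return acc / s
```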
at::TensorAccessor<scalar_t, 3> buffer //,
// at::TensorAccessor<int64_t, 2> mask
) {
  constexpr int64_t BLOCK = 1; // 8;
I'm guessing that there's some speed to be gained here, by removing some reads / reusing them across a couple of rows?
Actually this is the size of the fetch over N, my bad. I'm guessing (hoping) that the compiler groups them automatically.
Exactly. Moving from 1 -> 8 brought a significant speedup. I moved it back to 1 before sending the PR because I was a bit lazy and didn't want to bother handling the remainder cases like I do in the GPU code.
That being said, now that I've unrolled both dimensions in the CUDA kernel, I could probably copy-paste the CUDA code and change a few things in the hope of making the CPU code faster.
But given that the CPU path is mostly used for prototyping, I didn't prioritize it further.
Makes sense. As it is, the memory accesses are a little too granular, but that's not a big issue. I think it's fine not to get it super optimized right now, but we could add a comment about it for anyone reading the code out of context in the future?
int64_t M = query.size(1);
int64_t N = key.size(1);
int64_t grain_size = 1;
at::parallel_for(0, B, grain_size, [&](int64_t start, int64_t end) {
Looks good to me, I kind of recognize the pattern from the Triton version I think; it's definitely a bit more complicated to follow, but that works! I'm not super familiar with at::Tensor, but it feels a little strange that one has to call .data() all the time?
The .data_ptr<scalar_t>() call is there to get the raw pointers to the tensor. I originally used the TensorAccessor as it adds extra robustness to different strides, but it also adds the overhead of stride computation, so I removed it for now.
Ah, interesting about the stride part, not a zero-cost abstraction then. It was just a free question, thanks for the context.
vec_t k_i = keys[k + K / kVecSize * k_item_idx];
#pragma unroll
for (int64_t q_item_idx = 0; q_item_idx < kBlockSizeQ; q_item_idx++) {
  sputnik::VectorCompute<vec_t>::Dot(
@tgale96 pulling you in, just in case there's something here that we can do better (it's a black box for me; does VectorCompute imply tensor cores, for instance, or normal float ops?)
To the best of my knowledge, we don't use any tensor cores explicitly here.
This was a refactoring that I did in cec04e9 to simplify a few things, and it turned out that sputnik already implemented some of the things I needed, so I just took those functions.
Alright, I think the A100 may show slightly worse perf (ping @suchenzang if you're interested in giving this a go?), but in fp32 it should be fine performance-wise up to and including the V100 (personal guess), so not a big problem. The rest of the model can still be fp16 with the torch amp guards, so all in all I think it's probably still very fast and useful.
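As a side note on the fp32-only support, a minimal sketch of how a model otherwise running under torch amp could still call this op (the wrapper below is hypothetical; only the torch.ops.xformers.efficient_attention call itself appears in this PR):

```python
import torch

def attention_fp32(query, key, value):
    # The surrounding model may run under autocast (fp16), but the
    # memory-efficient kernel currently only supports fp32, so disable
    # autocast locally and cast the inputs up before calling it.
    with torch.cuda.amp.autocast(enabled=False):
        q, k, v = (t.float() for t in (query, key, value))
        return torch.ops.xformers.efficient_attention(q, k, v)
```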
template <
    typename scalar_t,
    typename vec_t = float4,
    int kBlockSizeK = 32,
Interesting, I was wondering about these sizes for most of the code read; it makes sense now. I'm wondering how architecture-dependent the ideal values would be: depending on the shared memory and memory bandwidth, could it be that the optimal set moves a little?
yeah, those values worked well for P100s, but might not be ideal for other architectures, and should definitely be better tuned for different systems
if ((K % 4) == 0) {
  TORCH_CHECK(
      K / 4 <= BUFFER_SIZE,
      "For now only a certain number of K values are supported. Let us know if you hit this and we will fix it");
<3
Super nice error message, I meant.
Looks great to me, it's been a while since I went through this much CUDA but it kind of works :) A couple of nits here and there, curious for your feedback/opinion really; good to go as far as I'm concerned.
Here are the results on a V100: it seems like we might indeed want to do some device-dependent tuning to get better performance on V100s. The current hyperparameters are not too bad, but could probably be improved.
V100 results
Improve code comments
@blefaudeux I've added a user-facing function (which just dispatches to the kernel implementation for now), plus the benchmark script. I added some more comments and also added the division by sqrt(K).
Great work, thanks @fmassa !
What does this PR do?
This PR implements the memory-efficient attention mechanism from https://arxiv.org/pdf/2112.05682v2.pdf, with both CPU and CUDA kernels. For now, only fp32 is supported.
The CPU implementation is fairly naive and I haven't focused on optimizing it (yet), so you should expect it to be quite a bit slower than a baseline CPU implementation in PyTorch. But it is generic and should support all cases.
For the CUDA implementation, the performance is quite competitive with a baseline pytorch implementation for fp32 in terms of runtime (within 10% for most cases), while the memory savings are quite significant (10x+).
Here are some numbers (run on a P100 GPU):
Speed / memory improvements on the CUDA case
You can see up to 20x memory savings for larger configurations, while the runtime is on the order of 10% slower than the baseline (which leverages cuBLAS internally).
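For anyone who wants to reproduce the comparison locally, here is a minimal sketch of how the peak-memory numbers can be measured (the shapes are illustrative and the actual benchmark script added in this PR may differ; ref_attention refers to the baseline sketched earlier):

```python
import torch

def peak_memory_mib(fn, *args):
    # Reset the allocator stats, run the op, and report peak usage in MiB.
    torch.cuda.reset_peak_memory_stats()
    fn(*args)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

B, M, N, K = 32, 1024, 1024, 32  # K = 32 satisfies the K / 4 <= 8 constraint below
query = torch.randn(B, M, K, device="cuda")
key = torch.randn(B, N, K, device="cuda")
value = torch.randn(B, N, K, device="cuda")

mem_eff = peak_memory_mib(torch.ops.xformers.efficient_attention, query, key, value)
mem_ref = peak_memory_mib(ref_attention, query, key, value)  # materializes the (M, N) matrix
print(f"efficient: {mem_eff:.1f} MiB, baseline: {mem_ref:.1f} MiB")
```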
Next steps
This PR has some assumptions on the dimensionality of K (the feature map after splitting into heads). For now, it should satisfy one of the following, depending on the divisibility of K:
- K / 4 <= 8
- K / 2 <= 8
- K <= 8
This can be fixed in the future if needed.
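To make the constraint concrete, here is a small helper mirroring the checks visible in the kernel snippet above (BUFFER_SIZE = 8 is assumed from the K / 4 <= 8 bound, and the exact dispatch on K % 2 is an assumption; purely illustrative):

```python
BUFFER_SIZE = 8  # assumed from the K / 4 <= 8 bound above

def k_is_supported(K: int) -> bool:
    # Mirror the TORCH_CHECK dispatch: vectorized-by-4, vectorized-by-2, or scalar path.
    if K % 4 == 0:
        return K // 4 <= BUFFER_SIZE
    if K % 2 == 0:
        return K // 2 <= BUFFER_SIZE
    return K <= BUFFER_SIZE

# e.g. k_is_supported(32) -> True, k_is_supported(40) -> False
```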
Fixes #161.