Flash Attention 2 #795

Closed
patrickvonplaten opened this issue Jul 17, 2023 · 25 comments
@patrickvonplaten

🚀 Feature

Adding Flash Attention 2

Motivation

Flash Attention 2 has just been added to the original repo: https://github.com/Dao-AILab/flash-attention. It claims to be almost twice as fast as Flash Attention 1, which is a huge speed-up. How can we best add it to xformers?

Pitch

Flash Attention 2 is very fast at pretty much no extra cost

Alternatives

N/A

Additional context

Many libraries depend on xformers to run flash attention. It would be great to add it here.

@patrickvonplaten changed the title from Adding Flash Attention 2 to Flash Attention 2 on Jul 17, 2023
@Skylion007
Contributor

First step would be updating the flash_attention submodule. All the heuristics would probably need to be changed to prefer it until we get implementations in CUTLASS and Triton.

@danthe3rd
Contributor

Hey,
Thanks for opening this issue! I want to run some benchmarks and testing first, and hopefully we can have it in xFormers this week.

@Skylion007
Contributor

Clarification: FlashAttention2 actually uses CUTLASS.

@danthe3rd
Contributor

I have an initial prototype working, but I'm hitting some NaNs in Flash-Attention. I've opened an issue: Dao-AILab/flash-attention#334
We get great speedups across the board, I'll share some benchmarks soon :)

@danthe3rd
Contributor

Here are some benchmarks for the FW pass on A100:
https://pastebin.com/YEApkXBM

@lucidrains

lucidrains commented Jul 18, 2023

can we expect this to be upstreamed to pytorch 2.0's scaled_dot_product_attention? or should we open a separate issue?

@danthe3rd
Contributor

Hi @lucidrains ,
I would imagine this will be done at some point in the future, but it would be best to ask with an issue in the pytorch repo directly. Cc @drisspg

@lucidrains

ok sounds good, will do!

@Boreaso

Boreaso commented Jul 19, 2023

> Hi @lucidrains , I would imagine this will be done at some point in the future, but it would be best to ask with an issue in the pytorch repo directly. Cc @drisspg

Hi @danthe3rd, V100 GPUs are not currently supported in FlashAttention2. Any plans to support them in xformers?

@danthe3rd
Contributor

> Hi @danthe3rd, V100 GPUs are not currently supported in FlashAttention2. Any plans to support them in xformers?

V100 GPUs are already supported in xformers (we have our own reimplementation of Flash-Attention)

@danthe3rd
Contributor

As we update to Flash v2, Flash won't be available to Windows users, since Flash v2 is not available on Windows (Dao-AILab/flash-attention#345). We will fall back to our own reimplementation for Windows users.
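
For reference, here is a minimal sketch (mine, not from this thread) of explicitly requesting xFormers' own CUTLASS kernel, the fallback used on V100 and Windows; the `op=` argument and the `xops.fmha.cutlass` ops are assumptions based on the public `xformers.ops` API at the time:

```python
# Minimal sketch, assuming the public xformers.ops API with explicit op selection.
import torch
import xformers.ops as xops

# [batch, seqlen, heads, head_dim] layout expected by memory_efficient_attention.
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

# Force xFormers' own CUTLASS forward/backward kernels (the V100/Windows
# fallback) instead of letting the dispatcher pick Flash-Attention.
out = xops.memory_efficient_attention(
    q, k, v,
    op=(xops.fmha.cutlass.FwOp, xops.fmha.cutlass.BwOp),
)
```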

@lucidrains

lucidrains commented Jul 19, 2023

@danthe3rd you'll probably update your in-house kernel to flash attention 2 though? which kernel is being used to train llama?

@danthe3rd
Contributor

I don't plan to update the in-house kernel anymore. There are a few things from Flash v2 which are already in there, but further work would be needed to get the full performance. Also, some changes won't work well within the available CUTLASS v2 abstractions we are using.
CUTLASS plans to add support for Windows within months though...

@lucidrains

@danthe3rd ahh ok, so Tri's implementation will be the best available, for the right hardware

thanks for clarifying!

@bhack

bhack commented Jul 20, 2023

triton-lang/triton#1970

@danthe3rd
Contributor

We also plan to update the Triton version in xFormers at a later stage, but for now we are focusing on the CUDA one from @tridao as it provides the best performance.

@bhack

bhack commented Jul 20, 2023

Just in case you are interested, there is also a parallel effort in the official PyTorch repo:
pytorch/pytorch#105474

@danthe3rd self-assigned this Jul 20, 2023
@danthe3rd
Contributor

danthe3rd commented Jul 20, 2023

We just merged initial support for Flash-Attention v2 in xformers:
cfea89f

Wheels will be available shortly, but in the meantime you can build it from source.

Summary:

  • Update third-party/flash-attention to the new repo/version
  • It's not available on Windows (where we will fall back on our home-made CUTLASS kernel, which is mostly as fast as Flash v1): [Flashv2] Windows support Dao-AILab/flash-attention#345
  • Limited to A100+ (whereas the previous Flash v1 worked with Sm75 as well)
  • Currently the BW pass only works when seqlen % 128 == 0 ([flashv2] NaNs in bw pass for some inputs Dao-AILab/flash-attention#334). We will update the third-party module once Tri has a fix
  • NOTE: Currently the BW pass is not deterministic. A fix will come later from Tri
  • TESTING: Because we only support a subset of the test cases, I made sure that when generating random shapes we filter out the incompatible ones. This ensures we always have 20 random shapes that are tested (see shape_not_supported_reasons)
  • DISPATCH: We now always dispatch to Flash with priority 1 (see the usage sketch below)
  • SUPPORT: This adds support for head dimensions up to 256 (although performance is much worse beyond 160)
  • TRITON: This disables the fMHA Triton implementation; I'm preparing an upgrade (cc @dianaml0). That's because it's imported directly from the Flash-Attention repo that we embed. In any case, we need an implementation that supports the post-MLIR rewrite of Triton
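
As a usage sketch (not part of the commit above, and based on my reading of the public `xformers.ops` API), this is how the new dispatch looks from the caller's side; the explicit `op=` selection via `xops.fmha.flash` is an assumption:

```python
# Minimal sketch, assuming the public xformers.ops API; shapes chosen to satisfy
# the current Flash v2 constraints mentioned above (e.g. seqlen % 128 == 0).
import torch
import xformers.ops as xops

q = torch.randn(1, 4096, 16, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, 4096, 16, 128, device="cuda", dtype=torch.float16)
v = torch.randn(1, 4096, 16, 128, device="cuda", dtype=torch.float16)

# Default path: the dispatcher now prefers Flash-Attention v2 whenever the
# GPU (A100+), dtype, and shapes allow it.
out = xops.memory_efficient_attention(q, k, v)

# Explicit path: request the Flash ops directly; this errors out where Flash v2
# is unavailable (pre-A100 hardware, Windows).
out = xops.memory_efficient_attention(
    q, k, v,
    op=(xops.fmha.flash.FwOp, xops.fmha.flash.BwOp),
)
```

If I remember correctly, `python -m xformers.info` lists which fMHA ops were built and are usable on the current machine, which is a quick way to check whether the Flash v2 path is active.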

@ekagra-ranjan

> Here are some benchmarks for the FW pass on A100:
> https://pastebin.com/YEApkXBM

@danthe3rd Can you please share the schema needed to compare different rows in the table you shared?

  1. Like, what are the different numbers here: f16 1-16384-16-80, p=0.0, BiasT=NoneType?
  2. Which column represents Flash v1: is it cutlassF?
  3. Do you have any ideas as to why BiasT=Tensor takes 50% more time than BiasT=NoneType for cutlassF?

@WindowsXp-Beta

@ekagra-ranjan

  1. You can take a look at the benchmark code here. The output format is {dtype} {B}-{M}-{H}-{K}, where B, M, H, and K stand for batch size, sequence length, number of heads, and head dimension, respectively (see the sketch after this comment).
  2. cutlassF is xFormers' own reimplementation of Flash-Attention using CUTLASS (not Flash v1 itself).
  3. Great question. I'm also trying to figure it out; you can refer to Flash v2's blog and paper for more information. In fact, xFormers has already applied several of these optimizations, such as parallelizing over Q's seqlen in the FW pass and over K/V's seqlen in the BW pass. However, there are some differences. I believe their warp-partition policy plays an important role, as it can reduce shared-memory IO and warp synchronization significantly. Additionally, if you examine their code, you'll notice that they use three kernels for the BW pass while xFormers fuses these operations into one kernel. Moreover, xFormers uses CUTLASS 2.x while Flash v2 uses CUTLASS 3.x.

Since I'm also new to CUTLASS and CUDA programming, perhaps @danthe3rd can provide us with more insights?
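
To make the label concrete, here is a small sketch (my reading of the schema above, not the benchmark script itself) of the inputs that the row f16 1-16384-16-80, p=0.0, BiasT=NoneType would correspond to:

```python
# Sketch only: reconstructing the inputs behind "f16 1-16384-16-80, p=0.0,
# BiasT=NoneType" from the {dtype} {B}-{M}-{H}-{K} schema described above.
import torch
import xformers.ops as xops

B, M, H, K = 1, 16384, 16, 80  # batch, seqlen, number of heads, head dim
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)

# p is the dropout probability; BiasT=NoneType means no attn_bias is passed.
out = xops.memory_efficient_attention(q, k, v, attn_bias=None, p=0.0)
```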

@danthe3rd
Contributor

> Do you have any ideas as to why BiasT=Tensor takes 50% more time than BiasT=NoneType for cutlassF?

This is mainly for two reasons:
(1) fundamental reason: Flash is fast because it avoids memory IO (writing and reading the N^2 attention matrix multiple times). When you have a bias, you also need to read an N^2 tensor, which takes time.
(2) implementation: we have not really focused our efforts on the custom attention-bias setting. It's mainly used for prototyping, and the correct solution would be to fuse whatever attention bias you want to use directly in the kernel, to avoid the memory IO.
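
To illustrate the difference, a sketch of mine (assuming the `attn_bias` argument of `memory_efficient_attention` and the built-in `LowerTriangularMask` bias type, which is handled inside the kernel rather than materialized):

```python
# Sketch: a dense bias tensor vs. a bias that is fused in the kernel.
import torch
import xformers.ops as xops

B, M, H, K = 1, 2048, 8, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)

# Dense bias: the kernel has to read an extra M x M tensor per head from
# global memory, which is what makes the BiasT=Tensor rows slower.
dense_bias = torch.randn(B, H, M, M, device="cuda", dtype=torch.float16)
out_dense = xops.memory_efficient_attention(q, k, v, attn_bias=dense_bias)

# Fused bias (here: a causal mask): no N^2 tensor is materialized or read,
# so it does not pay the extra memory-IO cost.
out_causal = xops.memory_efficient_attention(
    q, k, v, attn_bias=xops.LowerTriangularMask()
)
```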

> However, there are some differences. I believe their warp-partition policy plays an important role.

Yes, I believe this is one of the big things that explain some of the performance gap. I haven't investigated further at this point to understand more precisely where the gap comes from.

@WindowsXp-Beta

Thanks for explaining. I just realized that I misunderstood the third question. I thought it was about the gap between cutlassF and flash v2 lol.

@ekagra-ranjan

ekagra-ranjan commented Jul 27, 2023

Thank you @WindowsXp-Beta and @danthe3rd for your replies! This is helpful!

@tmm1
Contributor

tmm1 commented Aug 3, 2023

these should be resolved with #816

@tmm1 mentioned this issue Aug 3, 2023
@killawhale2

Shouldn't this issue be re-opened until #816 is merged? cc. @danthe3rd
