Flash Attention V2 #485
I used benchmarks/benchmark_throughput.py to test Flash Attention V2, but it doesn't seem to have any effect. My test steps are as follows,
and the test times are as follows. On further performance analysis, I found that the replaced part (Flash Attention V2) takes very little time and only runs at the beginning of execution. I am confused: for Flash Attention V2, what can we do in vLLM?
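(For reference, a minimal invocation of that benchmark might look like the sketch below; the model name and flag values are placeholders, and exact flag names can differ between vLLM versions.)

```bash
# Sketch of a throughput run (placeholder model and lengths; flag names may
# differ between vLLM releases). A longer --input-len stresses the prompt phase,
# which is the only phase where FlashAttention is used.
python benchmarks/benchmark_throughput.py \
  --model facebook/opt-13b \
  --input-len 1024 \
  --output-len 128 \
  --num-prompts 200
```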
Hi, which version specifically? I don't think FlashAttention V2 support has been released yet, so you would have to install from git. Also, there are still some open PRs to bump xformers to the flash-attn v2.0.4 bugfix release (facebookresearch/xformers#816).
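(For anyone trying that route, a from-source xformers install is roughly the sketch below; it compiles the bundled flash-attn kernels, so it can take a while. This is a generic sketch, not a command confirmed in this thread.)

```bash
# Build xformers from the main branch so it picks up the newer flash-attn kernels.
# Requires a matching PyTorch + CUDA toolkit; ninja speeds up the compile, and
# MAX_JOBS limits parallel compile jobs to avoid running out of RAM.
pip install ninja
MAX_JOBS=4 pip install -v -U "git+https://github.com/facebookresearch/xformers.git@main#egg=xformers"
```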
I tried this as well, and there was no improvement in the benchmarks after switching to flash-attn v2. I will try to profile the benchmark script.
I don't think this one really works here, because another important feature of flash-attn is reducing the high GPU memory usage for very long contexts, e.g. more than 5k tokens.
Hi @nivibilla, thanks for submitting the issue. The latest version of

@tmm1 @Zhuqln To my understanding, the overall speedup depends on your workload. At inference time, FlashAttention is only used for the prompt inputs and never for the decoding inputs. For many workloads the decoding stage takes the majority of the total execution time, so changing to FlashAttention V2 may not give a notable speedup. However, for other workloads like text summarization, where the prompts are very long, I believe computing attention for the prompt inputs will take a significant portion of the execution time, and thus FlashAttention V2 will have a large impact on overall performance.
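(One way to see this in practice is to compare a prompt-heavy run with a decode-heavy run of the throughput benchmark; the sketch below uses illustrative values, and flag names may differ across vLLM versions.)

```bash
# Prompt-heavy: long inputs, short outputs -> prefill dominates, so FlashAttention V2
# should show its largest end-to-end gain here.
python benchmarks/benchmark_throughput.py --model facebook/opt-13b \
  --input-len 1024 --output-len 16 --num-prompts 100

# Decode-heavy: short inputs, long outputs -> decoding dominates, so little or no
# end-to-end gain from FlashAttention V2 is expected.
python benchmarks/benchmark_throughput.py --model facebook/opt-13b \
  --input-len 64 --output-len 512 --num-prompts 100
```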
@WoosukKwon thanks for the explanation!
Hi, this is inaccurate, since the code is still forcing
See benchmark results in facebookresearch/xformers#832.
Thanks for the details @WoosukKwon. I just have a question: why can't FlashAttention be used for the decoding phase?
Its tiling strategy is not optimized for Q with seqlen=1; see Dao-AILab/flash-attention#427 (comment).
I'm delighted to engage in this discussion. Your report has been immensely helpful, but I do have some questions. For instance, I'm curious to know if there's a performance comparison available between trtLLM and vLLM. Such information would be greatly beneficial in guiding my decision on which framework to choose.
You assume that in a summarization task most of the work comes from processing the input. In my experiments, the generation side was much bigger. If you generate only 1-5 tokens, then most of the work is processing the input, the runtime depends on input length, and FlashAttention 2 is advantageous (its memory use is linear in input length, whereas the naive implementation is quadratic). But if you generate a considerable number of tokens, that factor dominates, the input processing becomes negligible, and FlashAttention 2 has little effect here. https://github.com/matanhol/summarization_with_flash_attn_2_simulation
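(To make that scaling argument concrete, here is a rough back-of-the-envelope cost sketch of my own, with L the prompt length and T the number of generated tokens; it ignores constants and the non-attention layers.)

```latex
% Rough attention-cost sketch: prefill attends L prompt tokens to each other,
% while decode step t attends one new token to the L + t tokens seen so far.
\[
  C_{\text{prefill}} \;\propto\; L^{2},
  \qquad
  C_{\text{decode}} \;\propto\; \sum_{t=1}^{T} (L + t) \;=\; LT + \tfrac{T(T+1)}{2}.
\]
% FlashAttention V2 only accelerates the prefill term (and naive attention is
% quadratic, not exponential, in L), so the end-to-end speedup shrinks as T
% grows relative to L.
```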
I tried installing vLLM with flash-attn, but it didn't work. My attempts (install flash attention):
```bash
# my current vllm setup without flash
# pip install --upgrade pip
# pip install torch==2.2.1
# pip install vllm==0.4.1
# flash attn https://amzn-aws.slack.com/archives/C06Q26TNN8G/p1724182667464149
# flash-attn>=2.5.8
# pip install flash-attn
# Collabs's setup with flash
# vllm 0.5.4
# vllm-flash-attn 2.6.1
# flash-attn 2.6.3
# torch 2.4.0
# Python 3.10.8
# try to install flash attn in a new py env
python3.11 -m venv ~/.virtualenvs/flash_attn_test
source ~/.virtualenvs/flash_attn_test/bin/activate
pip install --upgrade pip
pip install -e ~/snap-cluster-setup
pip list | grep vllm
pip list | grep torch
pip list | grep flash-attn
pip list | grep vllm-flash-attn
# # didn't work
# pip install torch==2.2.1
# pip install vllm==0.4.1
# MAX_JOBS=4 pip install flash-attn --no-build-isolation --force
# this installed flash but vllm didn't say in it's output it was using it
pip install torch==2.4.0
pip install vllm==0.5.4
pip install flash-attn==2.6.3
pip install vllm-flash-attn==2.6.1
python ~/snap-cluster-setup/py_src/evals/boxed_acc_eval.py --model internlm/internlm2_5-1_8b --hf_gen_type vllm --path_2_eval_dataset ~/snap-cluster-setup/data/MATH/test --max_tokens 2048 --batch_size 100 --end 100 -n 1 --shuffle True --mode dryrun 2>&1 | tee $LOG_FILE && echo "Log file created at: $LOG_FILE"
# later try with py 3.10
# python3xxx -m venv ~/.virtualenvs/flash_attn_test_py10
# source ~/.virtualenvs/flash_attn_test_py10/bin/activate
# pip install --upgrade pip
# pip install -e ~/snap-cluster-setup
# pip install torch==2.4.0
# pip install vllm==0.5.4
# pip install flash-attn==2.6.3
# pip install vllm-flash-attn==2.6.1
```
My setup is Python 3.11; that is what I really want/need.
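(For what it's worth, as far as I understand, recent vLLM releases ship their FlashAttention kernels via the vllm-flash-attn package rather than the standalone flash-attn wheel, so installing flash-attn separately may not change which backend vLLM uses. A rough way to check what vLLM actually selected is sketched below; the environment variable and log wording may vary by version.)

```bash
# Ask vLLM to prefer the FlashAttention backend, then look for the backend line
# it prints at startup (small placeholder model; log wording varies by version).
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
python -c "from vllm import LLM; LLM(model='facebook/opt-125m')" 2>&1 | grep -i backend
```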
Related general vLLM issue for this vLLM version: #2747
https://github.com/Dao-AILab/flash-attention
Flash Attention V2 was released claiming 2x speedups. Making an issue to remind myself to have a look at it, and also in case anyone else wants to try implementing it.