Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference (a minimal sketch of decoding-stage attention follows the list below).
Poplar implementation of FlashAttention for IPU
Vulkan & GLSL implementation of FlashAttention-2
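To make the "decoding stage" mentioned above concrete: during decoding, a single new query token attends to the full key/value cache. Below is a minimal CPU reference sketch of single-query attention for one head, offered only as an illustration of the concept; it does not reproduce the CUDA-core or FlashAttention kernels of the repositories listed here, and all names, shapes, and memory layouts are assumptions made for the example.

```cpp
// Minimal, illustrative CPU reference of decoding-stage (single-query) attention.
// Not the optimized kernel of any repository above; layouts and names are assumed.
#include <algorithm>
#include <cmath>
#include <vector>

// q:       [head_dim]             query for the single new token
// k_cache: [seq_len * head_dim]   cached keys for one head
// v_cache: [seq_len * head_dim]   cached values for one head
// out:     [head_dim]             attention output for this head
void single_query_attention(const std::vector<float>& q,
                            const std::vector<float>& k_cache,
                            const std::vector<float>& v_cache,
                            std::vector<float>& out,
                            int seq_len, int head_dim) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));
    std::vector<float> scores(seq_len);

    // 1. Scaled dot-product scores against every cached key.
    float max_score = -INFINITY;
    for (int t = 0; t < seq_len; ++t) {
        float dot = 0.0f;
        for (int d = 0; d < head_dim; ++d)
            dot += q[d] * k_cache[t * head_dim + d];
        scores[t] = dot * scale;
        max_score = std::max(max_score, scores[t]);
    }

    // 2. Numerically stable softmax over the sequence dimension.
    float denom = 0.0f;
    for (int t = 0; t < seq_len; ++t) {
        scores[t] = std::exp(scores[t] - max_score);
        denom += scores[t];
    }

    // 3. Weighted sum of the cached values.
    std::fill(out.begin(), out.end(), 0.0f);
    for (int t = 0; t < seq_len; ++t) {
        const float w = scores[t] / denom;
        for (int d = 0; d < head_dim; ++d)
            out[d] += w * v_cache[t * head_dim + d];
    }
}
```

Because the query length is 1, this step is memory-bandwidth bound on reading the KV cache, which is why decoding-oriented kernels focus on cache-friendly memory access rather than the tiled matrix-multiply structure used for the prefill stage.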