whisper : use flash attention #2152
Merged
ggerganov force-pushed the gg/flash-attn branch 3 times, most recently from 497dbf4 to bfbfde8 (May 14, 2024 16:07)
Looking for feedback on the performance / accuracy - plan is to merge this PR and release Run the tools as usual and add |
bygreencn added a commit to bygreencn/whisper.cpp that referenced this pull request (Aug 9, 2024)
* tag 'v1.6.2':
  release : v1.6.2
  Revert "whisper : remove extra backend instance (huh?)" (ggerganov#2182)
  server : fix typo (ggerganov#2181)
  ruby : update bindings (ggerganov#2154)
  release : v1.6.1
  examples : add support for decoding input with ffmpeg (Linux) (ggerganov#2133)
  node : add flash_attn param (ggerganov#2170)
  ci: Update build.yml to suppress warnings about node.js versions (ggerganov#2166)
  release : v1.6.0
  whisper : use flash attention (ggerganov#2152)
  talk-llama : reject runs without required arguments (ggerganov#2153)
  sync : ggml
  metal : support FA without mask + add asserts (llama/7278)
  ggml : add RPC backend (llama/6829)
  rm wait() (llama/7233)
  CUDA: add FP32 FlashAttention vector kernel (llama/7188)
  scripts : sync ggml-rpc
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request (Sep 23, 2024)
* whisper : use flash attention in the encoder
* whisper : add kv_pad
* whisper : remove extra backend instance (huh?)
* whisper : use FA for cross-attention
* whisper : use FA for self-attention
* whisper : simplify encoder FA
* whisper : add flash_attn runtime parameter
* scripts : add bench log
* scripts : add M1 Pro bench log
Flash attention can now be enabled via `whisper_context.flash_attn = true`. Examples use the command-line argument `-fa` to enable the kernels (similar to llama.cpp).

Performance gains should be expected for Metal and CUDA. On the CPU, enabling FA will likely degrade the performance.
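As a rough illustration of how an example program might turn the `-fa` argument into the `flash_attn` setting, here is a minimal, self-contained sketch. It does not link against whisper.cpp: `demo_context_params` and `demo_parse_args` are hypothetical stand-ins invented for this example, not names from the PR. In the actual library the flag is, as far as I know, a field of `whisper_context_params` that is passed to `whisper_init_from_file_with_params`, so check `whisper.h` for the exact names before relying on this.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the library's context parameters.
// Only the flash_attn flag described in this PR is modelled here.
struct demo_context_params {
    bool flash_attn = false; // off unless the user asks for it
};

// Hypothetical helper: scan the command-line arguments and switch
// flash attention on when "-fa" is present. All other arguments are ignored.
demo_context_params demo_parse_args(const std::vector<std::string> & args) {
    demo_context_params params;
    for (const std::string & arg : args) {
        if (arg == "-fa") {
            params.flash_attn = true;
        }
    }
    return params;
}
```

Keeping the flag opt-in matches the note above: gains are expected on Metal and CUDA, while CPU-only runs will likely be slower with FA enabled.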
M1 Pro
M2 Ultra
Ryzen 9 5950X + RTX 2060
V100