Commit 605ec3c
Add a warning about ExLlamaV2 without flash-attn
Showing 2 changed files with 22 additions and 0 deletions.
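The diff itself is not shown in this view. Purely as an illustration, a warning like the one the title describes could be a check along these lines (the function name, message text, and importlib-based detection are assumptions, not the actual patch):

```python
# Minimal sketch with hypothetical names; the real 22-line diff is not
# visible on this page.
import importlib.util
import logging

logger = logging.getLogger(__name__)

def warn_if_flash_attn_missing():
    # ExLlamaV2 still works without flash-attn, but generation tends to use
    # much more VRAM, so it is worth telling the user up front.
    if importlib.util.find_spec("flash_attn") is None:
        logger.warning(
            "You are running ExLlamaV2 without flash-attn. This may result "
            "in much higher VRAM usage during text generation."
        )
```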
For some reason, this commit breaks the loader when trying to use it without flash attention (only on regular exllamav2, not the HF variant).
Commenting out the new lines fixes it, as does #3992.
In my case I don't have flash attention, since it is not compatible with Windows yet: see Dao-AILab/flash-attention#345 and Dao-AILab/flash-attention#553.
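A guess at the failure mode described above, with hypothetical names (this is not the actual loader code, nor the contents of #3992): if the new warning lines reference flash-attn, or any name that only exists when flash-attn imports cleanly, outside of a guard, the whole loader module fails on machines without the package. Keeping every such reference behind a try/except avoids that while still letting the warning fire:

```python
# Hypothetical illustration only; not the actual text-generation-webui code
# and not the exact fix in #3992.

# Broken pattern: an unguarded top-level import (or any name that is only
# defined when that import succeeds) raises an error while the loader
# module is being imported on machines without flash-attn:
#
#     import flash_attn
#     FLASH_ATTN_VERSION = flash_attn.__version__

# Guarded pattern: the loader keeps working, and a warning can still be
# emitted when the package is missing.
try:
    import flash_attn
    FLASH_ATTN_VERSION = flash_attn.__version__
except ImportError:
    flash_attn = None
    FLASH_ATTN_VERSION = None
```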
This is also flash attention 2; the old flash attention moved to a different repo. Some changes have to be made to the hijack or it will break:
Ph0rk0z@e082762
@Panchovix I have just fixed that missing import. I wasn't aware that flash-attention wasn't available for Windows. In my case, I could load llama 70b 2.5b with 2048 context on ExLlamaV2, but generation would OOM after around 800 tokens or so; now I can go all the way to 2048 tokens.
@Ph0rk0z the hijack is unrelated. ExLlamaV2 uses flash-attn internally by default; if the import fails, it simply proceeds without flash-attn.
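To illustrate that fallback and why it matters for memory, here is a rough sketch (the tensor shapes and the naive fallback path are assumptions; ExLlamaV2's real attention code is more involved):

```python
# Rough sketch, not ExLlamaV2's actual implementation.
import math
import torch

try:
    from flash_attn import flash_attn_func
    has_flash_attn = True
except ImportError:
    has_flash_attn = False

def attention(q, k, v):
    # q, k, v: (batch, seq_len, num_heads, head_dim)
    if has_flash_attn:
        # flash-attn never materialises the full (seq x seq) score matrix,
        # so long contexts stay within VRAM.
        return flash_attn_func(q, k, v, causal=True)
    # Naive fallback: builds the full attention matrix, so memory grows
    # quadratically with context length, which would explain running out of
    # memory partway through a long generation.
    q, k, v = (x.transpose(1, 2) for x in (q, k, v))  # -> (batch, heads, seq, dim)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    seq_len = scores.shape[-1]
    mask = torch.triu(
        torch.full((seq_len, seq_len), float("-inf"), device=q.device), diagonal=1
    )
    out = torch.softmax(scores + mask, dim=-1) @ v
    return out.transpose(1, 2)
```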
Right, I forgot that flash attention for everything never got merged here.