
Add a warning about ExLlamaV2 without flash-attn
oobabooga committed Sep 18, 2023
1 parent f0ef971 commit 605ec3c
Showing 2 changed files with 22 additions and 0 deletions.
11 changes: 11 additions & 0 deletions modules/exllamav2.py
@@ -13,6 +13,17 @@
from modules import shared
from modules.text_generation import get_max_prompt_length

try:
    import flash_attn
except ModuleNotFoundError:
    logger.warning(
        'You are running ExLlamaV2 without flash-attention. This will cause the VRAM usage '
        'to be a lot higher than it could be.\n'
        'Try installing flash-attention following the instructions here: '
        'https://github.com/Dao-AILab/flash-attention#installation-and-features'
    )
    pass


class Exllamav2Model:
    def __init__(self):
11 changes: 11 additions & 0 deletions modules/exllamav2_hf.py
@@ -11,6 +11,17 @@
from modules import shared
from modules.logging_colors import logger

try:
    import flash_attn
except ModuleNotFoundError:
    logger.warning(
        'You are running ExLlamaV2 without flash-attention. This will cause the VRAM usage '
        'to be a lot higher than it could be.\n'
        'Try installing flash-attention following the instructions here: '
        'https://github.com/Dao-AILab/flash-attention#installation-and-features'
    )
    pass


class Exllamav2HF(PreTrainedModel):
    def __init__(self, config: ExLlamaV2Config):

4 comments on commit 605ec3c

@Panchovix (Contributor) commented on 605ec3c Sep 18, 2023

This commit for some reason breaks the loader when trying to use it without flash attention (only on normal exllamav2, not hf).

[screenshot of the error]

Commenting out the new lines fixes it, as does #3992.

In my case I don't have flash attention, since it is not compatible with Windows yet (Dao-AILab/flash-attention#345 and Dao-AILab/flash-attention#553).
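
The break is consistent with a missing name rather than anything flash-attn-specific: the hunk added to modules/exllamav2.py calls logger.warning, but unlike modules/exllamav2_hf.py, the imports shown for that file do not include logger. A minimal standalone sketch of the failure mode (illustrative, not copied from the repository):

# Standalone reproduction of the suspected failure mode: the except handler
# references a name that was never imported in this module.
try:
    import flash_attn  # assume flash-attn is not installed on this machine
except ModuleNotFoundError:
    # 'logger' is undefined here, so this line itself raises
    # NameError: name 'logger' is not defined, breaking the loader at import time.
    logger.warning('You are running ExLlamaV2 without flash-attention.')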

@Ph0rk0z (Contributor) commented on 605ec3c Sep 18, 2023

This is also flash attention 2; the old flash attention moved to a different repo. Some changes have to be made to the hijack or it will break:
Ph0rk0z@e082762
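
If it ever matters to distinguish an old flash-attention 1.x install from flash-attention 2, one option is to gate on the installed version rather than on the import alone. A rough sketch, assuming flash_attn exposes __version__ and that 2.x is what ExLlamaV2 expects (both are assumptions, not taken from this commit):

# Warn only when flash-attn 2.x is unavailable (sketch).
from packaging import version

try:
    import flash_attn
    # flash-attn 1.x and 2.x share the package name, so check the version explicitly.
    has_flash_attn_2 = version.parse(flash_attn.__version__) >= version.parse('2.0.0')
except ModuleNotFoundError:
    has_flash_attn_2 = False

if not has_flash_attn_2:
    print('flash-attn 2.x not found; expect higher VRAM usage with ExLlamaV2.')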

@oobabooga (Owner, Author) commented on 605ec3c

@Panchovix I have just fixed that missing import. I wasn't aware that flash-attention wasn't available for Windows. In my case, I could load Llama 70B at 2.5bpw with 2048 context on ExLlamaV2, but generation would OOM after around 800 tokens. Now I can go all the way to 2048 tokens.

@Ph0rk0z the hijack is unrelated. ExLlamaV2 uses flash-attn internally by default; if the import fails, it proceeds without flash-attn.
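
The follow-up fix for the missing import mentioned above presumably just mirrors modules/exllamav2_hf.py. A sketch of what the top of modules/exllamav2.py would look like with that import in place (inferred from the discussion, not copied from the actual follow-up commit):

# modules/exllamav2.py (sketch): import the logger that the warning uses,
# matching what modules/exllamav2_hf.py already does.
from modules import shared
from modules.logging_colors import logger
from modules.text_generation import get_max_prompt_length

try:
    import flash_attn
except ModuleNotFoundError:
    logger.warning(
        'You are running ExLlamaV2 without flash-attention. This will cause the VRAM usage '
        'to be a lot higher than it could be.\n'
        'Try installing flash-attention following the instructions here: '
        'https://github.com/Dao-AILab/flash-attention#installation-and-features'
    )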

@Ph0rk0z (Contributor) commented on 605ec3c

Right, I forgot that flash attention for everything never got merged here.
