
Add flash-attention 2 for windows #4235

Merged
merged 19 commits into oobabooga:main on Oct 21, 2023

Conversation

bdashore3
Copy link
Contributor

@bdashore3 bdashore3 commented Oct 9, 2023

Flash Attention 2.3.2 has added support for Windows, but the caveat is that it requires CUDA 12.1 to run. This PR therefore updates the requirements to CUDA 12.1 and switches one_click.py to torch 2.1 built against CUDA 12.1.

Tested with a fresh install using the one-click installer for Windows; loaded a 70B 2.4-bit model via ExLlamaV2. Generation was successful on 1x 3090 Ti.

This commit pulls my wheel, but I won't be updating it. Please see the original issue for more information.


@bdashore3
Copy link
Contributor Author

I'm not sure if FA2 will have official wheels for Windows or if someone else will need to build them once FA2 updates.

@Ph0rk0z
Copy link
Contributor

Ph0rk0z commented Oct 9, 2023

There is no xformers build for PyTorch 2.1 yet; I had to build it myself. There are probably a ton of things like that.

@jllllll
Copy link
Contributor

jllllll commented Oct 10, 2023

One thing to consider is that FA2 only supports Ampere and newer GPUs. Anyone with older cards will be required to uninstall FA2 in order to use exllamav2. That said, those older cards run very poorly with exllamav2 anyway.

This issue should be avoidable by adding a config option that sets model.config.no_flash_attn for exllamav2.
That option should probably be added regardless, given that FA2 is already being installed by default on Linux.
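A minimal sketch of that toggle, assuming exllamav2's Python API (the model path is hypothetical, and this is not the webui's actual loader code):

from exllamav2 import ExLlamaV2, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "models/my-exl2-model"  # hypothetical path
config.prepare()
config.no_flash_attn = True  # skip FA2 kernels, e.g. on pre-Ampere cards or when FA2 isn't installed
model = ExLlamaV2(config)
model.load()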

Here are some commands for easily updating the Pytorch and CUDA installation inside the webui env:

conda install -y -k cuda==12.1.1 -c nvidia/label/cuda-12.1.1
python -m pip install torch==2.1.0+cu121 torchvision torchaudio --upgrade --index-url https://download.pytorch.org/whl/cu121

There is another, more serious caveat to updating the CUDA version to 12.x: CUDA 12 dropped support for Kepler GPUs.
With this update, some people are going to lose access to the webui, mainly those using Kepler datacenter GPUs.
At the moment, CUDA 11.8 supports pretty much every NVIDIA GPU that is in use.

It might be worth adding the CUDA 12 wheels to a separate requirements.txt file that can be installed later for those who want it.

@Ph0rk0z
Copy link
Contributor

Ph0rk0z commented Oct 10, 2023

People on windows can install their own binaries? Forcing everyone to upgrade is just something else.

@oobabooga
Copy link
Owner

Is it not possible to compile flash-attention for cuda 11.8? They have Linux wheels for 11.6, 11.7, 11.8, and 12.2. I wonder if the same wheels could be created for Windows using GitHub actions.

That would also help keep things more transparent (distributing a binary compiled outside GitHub might be perceived as a security vulnerability by some).

@Panchovix
Copy link
Contributor

Panchovix commented Oct 10, 2023

Is it not possible to compile flash-attention for cuda 11.8? They have Linux wheels for 11.6, 11.7, 11.8, and 12.2. I wonder if the same wheels could be created for Windows using GitHub actions.

That would also help keep things more transparent (distributing a binary compiled outside GitHub might be perceived as a security vulnerability by some).

Compiling with 11.8 is not possible. NVIDIA added CUTLASS support on Windows starting with CUDA 12.x: https://github.com/NVIDIA/cutlass/blob/main/media/docs/build/building_in_windows_with_visual_studio.md

Also, I tried to build on CUDA 11.8 after I successfully built it on CUDA 12.1 and CUDA 12.2, but it fails because a variable it uses in C/C++ is not available in that CUDA version for Windows.

@jllllll
Copy link
Contributor

jllllll commented Oct 10, 2023

Does the wheel built on CUDA 12.1 fail to run on 11.8?
It wouldn't surprise me, but it also wouldn't be the first time I saw something built on 12.1 run on 11.8.
Somewhat recently, I saw someone run an exllama/exllamav2 kernel (not sure which was actually used) built for 12.1 on an 11.8 installation.

@Panchovix
Copy link
Contributor

Panchovix commented Oct 10, 2023

Does the wheel built on CUDA 12.1 fail to run on 11.8? It wouldn't surprise me, but it also wouldn't be the first time I saw something built on 12.1 run on 11.8. Somewhat recently, I saw someone run an exllama/exllamav2 kernel (not sure which was actually used) built for 12.1 on an 11.8 installation.

I have created a wheel here as well, like @bdashore3 did: https://huggingface.co/Panchovix/flash-attn-2-windows-test-wheel. It can be installed in an env with torch+cu118 and CUDA 11.8, but when you try to load a model with exllamav2, it errors out with a missing DLL.

Meanwhile, if you use torch+cu121 (either stable or nightly) and CUDA 12.x, it works.

@bdashore3
Copy link
Contributor Author

bdashore3 commented Oct 10, 2023

@oobabooga @jllllll Apologies for not getting back sooner. Here's some more context for the comments/questions above.

  1. I would not have made a PR updating everything to 12.1 if FA2 built and ran on 11.8. FA2 on Windows requires CUDA 12.1 or newer because CUTLASS support on Windows requires those versions (see the parent issue in FA2's repo). CUDA 12.1 wheels will not work on 11.8 and vice versa, which is why I updated all the CUDA wheel links to 12.1; exllamav2 wasn't loading otherwise.
  2. A GitHub Action would be a great idea, and Tri Dao is looking for people who can create those actions to distribute official wheels. As for my wheel, a few others have tested it and it works fine; they're actively using flash attention on Windows with CUDA 12.x.
  3. As for the requirements file and dropping Kepler, I like the idea of adding separate requirements files. However, the one-click installer would need to be changed to support them. Perhaps a prompt at the beginning for Windows users asking "Do you want to use Flash Attention 2?" would be beneficial. I'm also not sure if there's a way to check for GPU support during the one-click install to decide whether to display that prompt in the first place (a rough sketch of one possible check follows this list). Another option would be to use CUDA 12.1 by default and fall back to 11.8 for systems that 12.1 doesn't support. There are multiple ways to approach this, and I'm happy to help get it moving.
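A hedged sketch of one possible capability check (illustrative only, not installer code; FA2 requires Ampere, i.e. compute capability 8.0 or newer):

import torch

def supports_flash_attn_2() -> bool:
    # FA2 needs an Ampere (SM 8.0) or newer GPU.
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(0)
    return major >= 8

if __name__ == "__main__":
    print("FA2-capable GPU detected" if supports_flash_attn_2() else "FA2 not supported on this GPU")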

It may also be easier to discuss this in Discord, my username is kingbri if you want to ping me in the server.

@bdashore3
Copy link
Contributor Author

Update: There were some issues regarding flash attention wheels on Windows. For some reason, the wheels are being built for a specific GPU architecture or combination of architectures.

For example, the wheel I provided was built on my 3090 Ti system, but it does not work on a 4090 or on a combined 3090/4090 system. I'm converting the PR to a draft until this is resolved.

@jllllll Since you have experience with building wheels, I'd appreciate any insight you may have regarding universal wheels for windows.

Parent issue in FA2's repo

@bdashore3 bdashore3 force-pushed the flash-attention-windows branch from 82fb209 to 722fac5 on October 14, 2023 19:51
Add a separate cuda 12.1 requirements file and update the one-click
installer to prompt the user with appropriate information about
cuda 12.1 and FA2 for Windows.

Cuda 11.8 is still used by default for broader GPU compatibility,
but 12.1 is now an option for those who want it.

Signed-off-by: kingbri <bdashore3@proton.me>
@bdashore3 bdashore3 force-pushed the flash-attention-windows branch from 722fac5 to 06ad145 on October 14, 2023 19:52
@Panchovix
Copy link
Contributor

Panchovix commented Oct 14, 2023

Just tested with a clean install of @bdashore3's branch on these systems:

Tested on 2x4090+3090 -> it works.
Tested on single 3090 -> it works.
Tested on single 4090 -> it works.
Tested on 2x4090 -> it works.
Tested on 1x3090+1x4090 -> it works.

So it now looks fine to me, and it works with multi-GPU setups.

@bombel28
Copy link

I don't really know much about this, but maybe it helps somebody: I tried this, and then tried to use it via xformers (which may have nothing to do with this PR), but now xformers can't load anymore. So the training scripts have to be extended too? I can't wait for it :-) If I understood correctly, I now have flash-attn 2 installed but can't use it, right?

@Nicoolodion2
Copy link

Nicoolodion2 commented Oct 19, 2023

conda install -y -k cuda==12.1.1 -c nvidia/label/cuda-12.1.1
python -m pip install torch==2.1.0+cu121 torchvision torchaudio --upgrade --index-url https://download.pytorch.org/whl/cu121

Tried this now, but now I get the error "ImportError: DLL load failed while importing exllamav2_ext: Das angegebene Modul wurde nicht gefunden." ("The specified module was not found.") when I try to load something with exllamav2...
It also won't go away now, even after an update :/

@jllllll
Copy link
Contributor

jllllll commented Oct 19, 2023

conda install -y -k cuda==12.1.1 -c nvidia/label/cuda-12.1.1
python -m pip install torch==2.1.0+cu121 torchvision torchaudio --upgrade --index-url https://download.pytorch.org/whl/cu121

Tried this now, but now I get some error "ImportError: DLL load failed while importing exllamav2_ext: Das angegebene Modul wurde nicht gefunden." when I try to load something with exllamav2... Also won't go away now, even after a update :/

This will put you back on CUDA 11.8:

conda install -y -k cuda==11.8.0 -c nvidia/label/cuda-11.8.0
python -m pip install torch==2.1.0+cu118 torchvision torchaudio --upgrade --index-url https://download.pytorch.org/whl/cu118

python -c"from one_click import *; update_requirements()"
-OR-
python -m pip install -r requirements.txt --upgrade

A full update to CUDA 12.1 can be done with this:

conda install -y -k cuda==12.1.1 -c nvidia/label/cuda-12.1.1
python -m pip install torch==2.1.0+cu121 torchvision torchaudio --upgrade --index-url https://download.pytorch.org/whl/cu121

python -m pip install --upgrade -r https://github.com/bdashore3/text-generation-webui/raw/flash-attention-windows/requirements_cu121.txt

Just keep in mind that, until this PR is merged, updating the webui using the update_* script will require that you use CUDA 11.8 to avoid issues.

@oobabooga
Copy link
Owner

oobabooga commented Oct 20, 2023

It seems better to just update everything. It doesn't make sense to have Windows at 12.1 and Linux at 11.8.

I have made all the necessary changes for that, and also took the opportunity to update Python to 3.11.

one_click.py will perform the wheel filename conversions automatically in case Python 3.10 or CUDA 11.8/11.7 are installed.
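As a rough illustration (not the actual one_click.py implementation), that kind of conversion amounts to simple string rewriting of the wheel URLs:

def convert_wheel_url(url: str, python_version: str = "3.10", cuda: str = "118") -> str:
    # Illustrative sketch: downgrade the Python tag and CUDA suffix in a wheel URL.
    if python_version == "3.10":
        url = url.replace("cp311", "cp310")
    if cuda in ("117", "118"):
        url = url.replace("+cu121", "+cu" + cuda)
    return url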

The remaining problem is that, for people who installed manually, the Python version is 3.10 and the requirements.txt will stop working. A solution is to use python_version conditions inside the requirements, as done in #4233. Since there are currently 30 different wheels with '-cp311' in their filenames, 30 new lines would have to be added. We could also go wild and support Python 3.8 and 3.9 as well (the latter is used on Google Colab), thus reaching 90 new lines in the requirements files...
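For reference, python_version conditions in a requirements file look roughly like this (the package name and URLs are made up):

example-package @ https://example.com/example_package-1.0+cu121-cp310-cp310-win_amd64.whl; python_version == "3.10"
example-package @ https://example.com/example_package-1.0+cu121-cp311-cp311-win_amd64.whl; python_version == "3.11"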

I gave up on Python 3.11 because the wheel here is for Python 3.10.

@bdashore3
Copy link
Contributor Author

bdashore3 commented Oct 21, 2023

I like the idea!

However, the requirements file still uses CUDA 11.8 wheels; I'm not sure if a bad commit reverted the 12.1 changes. In addition, it's probably better to use the flash_attn pip package for Linux rather than the wheels, as it adapts to whatever CUDA version the user has.
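For reference, the PyPI install described above is roughly the following (per the flash-attention README; it needs nvcc/CUDA_HOME when it has to compile, as noted further down):

pip install flash-attn --no-build-isolation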

Also, Python 3.11 is viable; I just need to install it and recompile the FA2 wheel. 3.11 apparently has a speed boost compared to 3.10, so it seems like a good idea to switch if everything works okay.

@Panchovix
Copy link
Contributor

I can confirm that Python 3.11, with a self-built wheel, works without issues with FA2.

@oobabooga
Copy link
Owner

Oops, you are right. There was a regression. I have just changed everything back to +cu121.

If you can create a 3.11 wheel, that would be great. I have installed the webui with Python 3.11 (for the claimed speed boost) and everything works.

It would be nice to have the cp310 and cp311 wheels side by side in your repository, such that both can be included in the requirements.txt.

@bdashore3
Copy link
Contributor Author

Yes, I'll put it under the same release once it's compiled.

@oobabooga
Copy link
Owner

In addition, it's probably better to use the flash_attn pip package for linux rather than using the wheels as that adapts to whatever cuda version the user has.

I have tried installing it on a fresh environment and it fails with error:

OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

I think it requires an nvcc install, which in turn is a few extra GB. It's probably better to keep the wheels.

@bdashore3
Copy link
Contributor Author

Added the flash attention wheel for cp311. This wheel works perfectly on my single 3090 Ti system, but it isn't tested on Ada or mixed Ampere + Ada setups. It should work since I built it the same way as last time, but I'd advise testing it first.

https://github.com/bdashore3/flash-attention/releases/download/2.3.2-2

@oobabooga
Copy link
Owner

In the worst case scenario, flash attention will fail to import, the exception will be caught, and a warning will be shown. So it should be fine.
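A minimal sketch of that fallback pattern (assumed for illustration, not the webui's exact code):

try:
    import flash_attn  # noqa: F401
    has_flash_attn = True
except Exception:
    has_flash_attn = False
    print("Warning: flash-attention could not be imported; continuing without it.")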

I don't see anything else pending for this PR, so I'll merge it. Thanks a lot for making this wheel @bdashore3. Flash-attention is essential to get the most out of exllamav2 and you made it accessible to Windows users.

@oobabooga oobabooga merged commit 3345da2 into oobabooga:main Oct 21, 2023