
Add flash-attention 2 for windows #4235

Merged
merged 19 commits into oobabooga:main on Oct 21, 2023

Conversation

bdashore3
Copy link
Contributor

@bdashore3 bdashore3 commented Oct 9, 2023

Flash Attention 2.3.2 has added support for Windows, but the caveat is that it requires CUDA 12.1 to run. This PR therefore updates the requirements to CUDA 12.1 and switches one_click.py to torch 2.1 built against CUDA 12.1.

Tested with a fresh install using the one-click installer for Windows; loaded a 70B 2.4-bit model via ExLlamaV2. Generation was successful on 1x 3090 Ti.

This commit pulls my wheel, but I won't be updating it. Please see the original issue for more information.


@bdashore3
Copy link
Contributor Author

I'm not sure if FA2 will have official wheels for Windows or if someone else will need to build them once FA2 updates.

@Ph0rk0z
Copy link
Contributor

Ph0rk0z commented Oct 9, 2023

There is no xformers build for PyTorch 2.1 yet; I had to build it myself. There are probably a ton of things like that.

@jllllll
Copy link
Contributor

jllllll commented Oct 10, 2023

One thing to consider is that FA2 only supports Ampere and newer GPUs. Anyone with older cards will be required to uninstall FA2 in order to use exllamav2. That said, those older cards run very poorly with exllamav2 anyway.

This issue should be avoidable by adding a config option that sets model.config.no_flash_attn for exllamav2.
That option should probably be added regardless, given that FA2 is already being installed by default on Linux.
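A minimal sketch of that toggle, assuming exllamav2's Python API (the model path is hypothetical, and this is not the webui's actual loader code):

from exllamav2 import ExLlamaV2, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "models/my-exl2-model"  # hypothetical path
config.prepare()
config.no_flash_attn = True  # skip FA2 kernels, e.g. on pre-Ampere cards or when FA2 isn't installed
model = ExLlamaV2(config)
model.load()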

Here are some commands for easily updating the Pytorch and CUDA installation inside the webui env:

conda install -y -k cuda==12.1.1 -c nvidia/label/cuda-12.1.1
python -m pip install torch==2.1.0+cu121 torchvision torchaudio --upgrade --index-url https://download.pytorch.org/whl/cu121

There is another, more serious caveat to updating the CUDA version to 12.x: CUDA 12 dropped support for Kepler GPUs.
With this update, some people are going to lose access to the webui, mainly those using Kepler datacenter GPUs.
At the moment, CUDA 11.8 supports pretty much every NVIDIA GPU that is in use.

It might be worth adding the CUDA 12 wheels to a separate requirements.txt file that can be installed later for those who want it.

@Ph0rk0z
Copy link
Contributor

Ph0rk0z commented Oct 10, 2023

People on windows can install their own binaries? Forcing everyone to upgrade is just something else.

@oobabooga
Copy link
Owner

Is it not possible to compile flash-attention for cuda 11.8? They have Linux wheels for 11.6, 11.7, 11.8, and 12.2. I wonder if the same wheels could be created for Windows using GitHub actions.

That would also help keep things more transparent (distributing a binary compiled outside GitHub might be perceived as a security vulnerability by some).

@Panchovix
Copy link
Contributor

Panchovix commented Oct 10, 2023

Is it not possible to compile flash-attention for cuda 11.8? They have Linux wheels for 11.6, 11.7, 11.8, and 12.2. I wonder if the same wheels could be created for Windows using GitHub actions.

That would also help keep things more transparent (distributing a binary compiled outside GitHub might be perceived as a security vulnerability by some).

Compiling with 11.8 is not possible. NVIDIA added CUTLASS support on Windows starting with CUDA 12.x: https://github.com/NVIDIA/cutlass/blob/main/media/docs/build/building_in_windows_with_visual_studio.md

Also, I tried to build on CUDA 11.8 after I successfully built it on CUDA 12.1 and CUDA 12.2, but it fails because a variable it uses in C/C++ is not available in that CUDA version for Windows.

@jllllll
Copy link
Contributor

jllllll commented Oct 10, 2023

Does the wheel built on CUDA 12.1 fail to run on 11.8?
It wouldn't surprise me, but it also wouldn't be the first time I saw something built on 12.1 run on 11.8.
Somewhat recently, I saw someone run an exllama/exllamav2 kernel (not sure which was actually used) built for 12.1 on an 11.8 installation.

@Panchovix
Copy link
Contributor

Panchovix commented Oct 10, 2023

Does the wheel built on CUDA 12.1 fail to run on 11.8? It wouldn't surprise me, but it also wouldn't be the first time I saw something built on 12.1 run on 11.8. Somewhat recently, I saw someone run an exllama/exllamav2 kernel (not sure which was actually used) built for 12.1 on an 11.8 installation.

I have created a wheel here as well, like @bdashore3 did: https://huggingface.co/Panchovix/flash-attn-2-windows-test-wheel. It can be installed in an env with torch+cu118 and CUDA 11.8, but when you try to load a model with exllamav2, it errors out with a missing DLL.

Meanwhile, if you use torch+cu121 (either stable or nightly) and CUDA 12.x, it works.

@bdashore3
Copy link
Contributor Author

bdashore3 commented Oct 10, 2023

@oobabooga @jllllll Apologies for not getting back sooner. Here's some more context for the comments/questions above.

  1. I would not have made a PR updating everything to 12.1 if FA2 built and ran on 11.8. FA2 on Windows requires CUDA 12.1 or newer because CUTLASS support on Windows requires those versions (see the parent issue in FA2's repo). CUDA 12.1 wheels will not work on 11.8 and vice versa, which is why I updated all the CUDA wheel links to 12.1; exllamav2 wasn't loading otherwise.
  2. A GitHub Action would be a great idea, and Tri Dao is looking for people who can create those actions to distribute official wheels. As for my wheel, a few others have tested it and it works fine; they're actively using flash attention on Windows with CUDA 12.x.
  3. As for the requirements file and dropping Kepler, I like the idea of adding separate requirements files. However, the one-click installer would need to be changed to support them. Perhaps a prompt at the beginning for Windows users asking "Do you want to use Flash Attention 2?" would be beneficial. I'm also not sure if there's a way to check for GPU support during the one-click install to decide whether to display that prompt in the first place (a rough sketch of one possible check follows this list). Another option would be to use CUDA 12.1 by default and fall back to 11.8 for systems that 12.1 doesn't support. There are multiple ways to approach this, and I'm happy to help get it moving.
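A hedged sketch of one possible capability check (illustrative only, not installer code; FA2 requires Ampere, i.e. compute capability 8.0 or newer):

import torch

def supports_flash_attn_2() -> bool:
    # FA2 needs an Ampere (SM 8.0) or newer GPU.
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(0)
    return major >= 8

if __name__ == "__main__":
    print("FA2-capable GPU detected" if supports_flash_attn_2() else "FA2 not supported on this GPU")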

It may also be easier to discuss this in Discord, my username is kingbri if you want to ping me in the server.

@bdashore3
Copy link
Contributor Author

Update: There were some issues regarding flash attention wheels on Windows. For some reason, the wheels are being built for a specific GPU architecture or combination of architectures.

For example, the wheel I provided was built on my 3090 Ti system, but it does not work on a 4090 or on a combined 3090/4090 system. I'm converting the PR to a draft until this is resolved.

@jllllll Since you have experience with building wheels, I'd appreciate any insight you may have regarding universal wheels for windows.

Parent issue in FA2's repo

@bdashore3 bdashore3 force-pushed the flash-attention-windows branch from 82fb209 to 722fac5 on October 14, 2023 19:51
Add a separate cuda 12.1 requirements file and update the one-click
installer to prompt the user with appropriate information about
cuda 12.1 and FA2 for Windows.

Cuda 11.8 is still used by default for broader GPU compatibility,
but 12.1 is now an option for those who want it.

Signed-off-by: kingbri <bdashore3@proton.me>
@bdashore3 bdashore3 force-pushed the flash-attention-windows branch from 722fac5 to 06ad145 on October 14, 2023 19:52
@Panchovix
Copy link
Contributor

Panchovix commented Oct 14, 2023

Just tested with a clean install of @bdashore3's branch on these systems:

Tested on 2x4090+3090 -> it works.
Tested on single 3090 -> it works.
Tested on single 4090 -> it works.
Tested on 2x4090 -> it works.
Tested on 1x3090+1x4090 -> it works.

So it now looks fine to me, and it works with multi-GPU setups.

@bombel28
Copy link

I don't really know much about this, but maybe it helps somebody: I tried this, and then tried to use it via xformers (which may have nothing to do with this PR), but now xformers can't load anymore. So the training scripts have to be extended too? I can't wait for it :-) If I understood correctly, I now have flash-attn 2 installed but can't use it, right?

@Nicoolodion2
Copy link

Nicoolodion2 commented Oct 19, 2023

conda install -y -k cuda==12.1.1 -c nvidia/label/cuda-12.1.1
python -m pip install torch==2.1.0+cu121 torchvision torchaudio --upgrade --index-url https://download.pytorch.org/whl/cu121

Tried this now, but now I get the error "ImportError: DLL load failed while importing exllamav2_ext: Das angegebene Modul wurde nicht gefunden." ("The specified module was not found.") when I try to load something with exllamav2...
It also won't go away now, even after an update :/

@jllllll
Copy link
Contributor

jllllll commented Oct 19, 2023

conda install -y -k cuda==12.1.1 -c nvidia/label/cuda-12.1.1
python -m pip install torch==2.1.0+cu121 torchvision torchaudio --upgrade --index-url https://download.pytorch.org/whl/cu121

Tried this now, but now I get some error "ImportError: DLL load failed while importing exllamav2_ext: Das angegebene Modul wurde nicht gefunden." when I try to load something with exllamav2... Also won't go away now, even after a update :/

This will put you back on CUDA 11.8:

conda install -y -k cuda==11.8.0 -c nvidia/label/cuda-11.8.0
python -m pip install torch==2.1.0+cu118 torchvision torchaudio --upgrade --index-url https://download.pytorch.org/whl/cu118

python -c"from one_click import *; update_requirements()"
-OR-
python -m pip install -r requirements.txt --upgrade

A full update to CUDA 12.1 can be done with this:

conda install -y -k cuda==12.1.1 -c nvidia/label/cuda-12.1.1
python -m pip install torch==2.1.0+cu121 torchvision torchaudio --upgrade --index-url https://download.pytorch.org/whl/cu121

python -m pip install --upgrade -r https://github.com/bdashore3/text-generation-webui/raw/flash-attention-windows/requirements_cu121.txt

Just keep in mind that, until this PR is merged, updating the webui using the update_* script will require that you use CUDA 11.8 to avoid issues.

@oobabooga
Copy link
Owner

oobabooga commented Oct 20, 2023

It seems better to just update everything. It doesn't make sense to have Windows at 12.1 and Linux at 11.8.

I have made all the necessary changes for that, and also took the opportunity to update Python to 3.11.

one_click.py will perform the wheel filename conversions automatically in case Python 3.10 or CUDA 11.8/11.7 are installed.
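As a rough illustration (not the actual one_click.py implementation), that kind of conversion amounts to simple string rewriting of the wheel URLs:

def convert_wheel_url(url: str, python_version: str = "3.10", cuda: str = "118") -> str:
    # Illustrative sketch: downgrade the Python tag and CUDA suffix in a wheel URL.
    if python_version == "3.10":
        url = url.replace("cp311", "cp310")
    if cuda in ("117", "118"):
        url = url.replace("+cu121", "+cu" + cuda)
    return url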

The remaining problem is that, for people who installed manually, the Python version is 3.10 and the requirements.txt will stop working. A solution is to use python_version conditions inside the requirements, as done in #4233. Since there are currently 30 different wheels with '-cp311' in their filenames, 30 new lines would have to be added. We could also go wild and support Python 3.8 and 3.9 as well (the latter is used on Google Colab), thus reaching 90 new lines in the requirements files...
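For reference, python_version conditions in a requirements file look roughly like this (the package name and URLs are made up):

example-package @ https://example.com/example_package-1.0+cu121-cp310-cp310-win_amd64.whl; python_version == "3.10"
example-package @ https://example.com/example_package-1.0+cu121-cp311-cp311-win_amd64.whl; python_version == "3.11"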

I gave up on Python 3.11 because the wheel here is for Python 3.10.

@bdashore3
Copy link
Contributor Author

bdashore3 commented Oct 21, 2023

I like the idea!

However, the requirements file still uses CUDA 11.8 wheels; I'm not sure if a bad commit reverted the 12.1 changes. In addition, it's probably better to use the flash_attn pip package for Linux rather than the wheels, as it adapts to whatever CUDA version the user has.
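For reference, the PyPI install described above is roughly the following (per the flash-attention README; it needs nvcc/CUDA_HOME when it has to compile, as noted further down):

pip install flash-attn --no-build-isolation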

Also, Python 3.11 is viable; I just need to install it and recompile the FA2 wheel. 3.11 apparently has a speed boost compared to 3.10, so it seems like a good idea to switch if everything works okay.

@Panchovix
Copy link
Contributor

I can confirm that Python 3.11, with a self-built wheel, works without issues with FA2.

@oobabooga
Copy link
Owner

Oops, you are right. There was a regression. I have just changed everything back to +cu121.

If you can create a 3.11 wheel, that would be great. I have installed the webui with Python 3.11 (for the claimed speed boost) and everything works.

It would be nice to have the cp310 and cp311 wheels side by side in your repository, such that both can be included in the requirements.txt.

@bdashore3
Copy link
Contributor Author

Yes, I'll put it under the same release once it's compiled.

@oobabooga
Copy link
Owner

In addition, it's probably better to use the flash_attn pip package for linux rather than using the wheels as that adapts to whatever cuda version the user has.

I have tried installing it on a fresh environment and it fails with error:

OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

I think it requires an nvcc install, which in turn is a few extra GB. It's probably better to keep the wheels.

@bdashore3
Copy link
Contributor Author

Added the flash attention wheel for cp311. This wheel works perfectly on my single 3090 Ti system, but it isn't tested on Ada or mixed Ampere + Ada setups. It should work since I built it the same way as last time, but I'd advise testing it first.

https://github.com/bdashore3/flash-attention/releases/download/2.3.2-2

@oobabooga
Copy link
Owner

In the worst case scenario, flash attention will fail to import, the exception will be caught, and a warning will be shown. So it should be fine.
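A minimal sketch of that fallback pattern (assumed for illustration, not the webui's exact code):

try:
    import flash_attn  # noqa: F401
    has_flash_attn = True
except Exception:
    has_flash_attn = False
    print("Warning: flash-attention could not be imported; continuing without it.")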

I don't see anything else pending for this PR, so I'll merge it. Thanks a lot for making this wheel @bdashore3. Flash-attention is essential to get the most out of exllamav2 and you made it accessible to Windows users.

@oobabooga oobabooga merged commit 3345da2 into oobabooga:main Oct 21, 2023