Add flash-attention 2 for windows #4235
Conversation
I'm not sure if FA2 will have official wheels for Windows or if someone else needs to build them once FA2 updates.
There is no xformers build for PyTorch 2.1 yet; I had to build it myself. There are probably a ton of things like that.
One thing to consider is that FA2 only supports Ampere and newer GPUs. Anyone with older cards will be required to uninstall FA2 in order to use exllamav2. That said, those older cards run very poorly with exllamav2 anyway. This issue should be avoidable by adding a config option that disables flash attention. Here are some commands for easily updating the PyTorch and CUDA installation inside the webui env:
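(The commands themselves were not preserved in this copy of the thread; the following is a plausible reconstruction, assuming the webui's conda environment, the standard PyTorch cu121 wheel index, and NVIDIA's conda channel for the CUDA runtime.)

```
# Run inside the webui environment (e.g. after cmd_windows.bat).
# Update PyTorch to a CUDA 12.1 build:
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu121
# Update the CUDA runtime libraries inside the conda env to 12.1:
conda install -y -c "nvidia/label/cuda-12.1.1" cuda-runtime
```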
There is another, more serious, caveat to updating the CUDA version to 12.x: CUDA 12 dropped support for Kepler GPUs. It might be worth adding the CUDA 12 wheels to a separate requirements.txt file that can be installed later for those who want it.
People on Windows can install their own binaries? Forcing everyone to upgrade is just something else.
Is it not possible to compile flash-attention for CUDA 11.8? They have Linux wheels for 11.6, 11.7, 11.8, and 12.2. I wonder if the same wheels could be created for Windows using GitHub Actions. That would also help keep things more transparent (distributing a binary compiled outside GitHub might be perceived as a security risk by some).
Compiling with 11.8 is not possible. NVIDIA only added Windows build support to CUTLASS starting with CUDA 12.x: https://github.com/NVIDIA/cutlass/blob/main/media/docs/build/building_in_windows_with_visual_studio.md Also, I tried to build on CUDA 11.8 after I successfully built it on CUDA 12.1 and CUDA 12.2, but it fails because a variable it uses in the C/C++ code is not available in that CUDA version for Windows.
Does the wheel built on CUDA 12.1 fail to run on 11.8? |
I have created a wheel here as well, https://huggingface.co/Panchovix/flash-attn-2-windows-test-wheel, as @bdashore3 did. It can be installed in an env with torch+cu118 and CUDA 11.8, but when you try to load a model with exllamav2, it errors out with a missing DLL. Whereas if you use torch+cu121 (stable or nightly) and CUDA 12.x, it works.
@oobabooga @jllllll Apologies for not getting back sooner. Here's some more context for the above comments/questions.
It may also be easier to discuss this in Discord, my username is
Update: There were some issues regarding flash attention wheels on Windows. For some reason, they're built for a specific GPU arch or combination. For example, the wheel I provided was built on my 3090 Ti system, but it does not work on a 4090 or a combined 3090/4090 system. Converting the PR to a draft until this is resolved. @jllllll Since you have experience with building wheels, I'd appreciate any insight you may have regarding universal wheels for Windows. Parent issue in FA2's repo.
Force-pushed from 82fb209 to 722fac5:
Add a separate CUDA 12.1 requirements file and update the one-click installer to prompt the user with appropriate information about CUDA 12.1 and FA2 for Windows. CUDA 11.8 is still used by default for broader GPU compatibility, but 12.1 is now an option for those who want it. Signed-off-by: kingbri <bdashore3@proton.me>
Force-pushed from 722fac5 to 06ad145.
Just tested with a clean install of @bdashore3's build on these systems: 2x 4090 + 3090 -> it works. So for me it now looks fine and works with a multi-GPU setup.
I don't really know much about it, but maybe it helps somebody: I tried this and then tried to use it via xformers (maybe that has nothing to do with this PR), but xformers can't load anymore. So the training scripts have to be extended too? I can't wait for it :-) If I understood correctly, I now have flash-attn 2 installed but can't use it, right?
Tried this now, but now I get the error "ImportError: DLL load failed while importing exllamav2_ext: Das angegebene Modul wurde nicht gefunden." ("The specified module was not found.") when I try to load something with exllamav2...
This will put you back on CUDA 11.8:
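(The actual commands were lost in this copy of the thread; a plausible reconstruction, assuming the standard PyTorch cu118 wheel index and NVIDIA's conda channel:)

```
# Downgrade PyTorch back to a CUDA 11.8 build:
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu118
# Downgrade the CUDA runtime inside the conda env to 11.8:
conda install -y -c "nvidia/label/cuda-11.8.0" cuda-runtime
```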
A full update to CUDA 12.1 can be done with this:
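(Again a reconstruction rather than the original commands, under the same assumptions:)

```
# Move PyTorch and the in-env CUDA runtime to 12.1:
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu121
conda install -y -c "nvidia/label/cuda-12.1.1" cuda-runtime
```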
Just keep in mind that, until this PR is merged, updating the webui using the
It seems better to just update everything. It doesn't make sense to have Windows at 12.1 and Linux at 11.8.
I gave up on Python 3.11 because the wheel here is for Python 3.10.
I like the idea! However, the requirements file still uses CUDA 11.8 wheels. I'm not sure if a bad commit reverted the 12.1 changes? In addition, it's probably better to use the flash_attn pip package for Linux rather than the wheels, as that adapts to whatever CUDA version the user has. Also, Python 3.11 is viable; I just need to install it and recompile the FA2 wheel. 3.11 apparently has a speed boost compared to 3.10, so it seems like a good idea to switch if everything works okay.
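(For reference, a minimal sketch of the PyPI route on Linux, which compiles flash-attn against the locally installed CUDA toolkit; the version pin is assumed from the 2.3.2 release discussed in this PR:)

```
# Linux only: build flash-attn from source against the local CUDA toolkit (requires nvcc).
pip install packaging ninja
pip install flash-attn==2.3.2 --no-build-isolation
```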
I can confirm that Python 3.11, with a self-built wheel, works without issues with FA2.
Oops, you are right. There was a regression. I have just changed everything back to +cu121. If you can create a 3.11 wheel, that would be great. I have installed the webui with Python 3.11 (for the claimed speed boost) and everything works. It would be nice to have the cp310 and cp311 wheels side by side in your repository, so that both can be included in the requirements.txt.
Yes, I'll put it under the same release once it's compiled.
I have tried installing it in a fresh environment and it fails with this error:
I think that it requires an nvcc install, which in turn is a few extra GB. It's probably better to keep the wheels.
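(A quick check, not from the thread, to see whether a source build is even an option; the prebuilt wheels avoid this requirement:)

```
# Building flash-attn from source needs the CUDA compiler (nvcc);
# if it is missing, stick with a prebuilt wheel.
nvcc --version || echo "nvcc not found - use a prebuilt wheel"
```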
Added the flash attention wheel for cp311. This wheel works perfectly on my single 3090 Ti system, but it isn't tested on Ada or multi Ampere + Ada setups. It should work since I built it the same way as last time, but I'd advise testing it first. https://github.com/bdashore3/flash-attention/releases/download/2.3.2-2
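(An illustrative install command; <WHEEL_FILE> is a placeholder for the actual cp311 / win_amd64 .whl asset name on that release page:)

```
# Replace <WHEEL_FILE> with the real asset name from the release above.
pip install https://github.com/bdashore3/flash-attention/releases/download/2.3.2-2/<WHEEL_FILE>
```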
In the worst-case scenario, flash attention will fail to import, the exception will be caught, and a warning will be shown. So it should be fine. I don't see anything else pending for this PR, so I'll merge it. Thanks a lot for making this wheel, @bdashore3. Flash-attention is essential to get the most out of exllamav2, and you made it accessible to Windows users.
Flash attention 2.3.2 has added support for Windows, but the caveat is that it requires CUDA 12.1 to run. This requires a requirements update to use CUDA 12.1, and torch 2.1 with CUDA 12.1 in one_click.py.
Tested with a fresh install using the one-click installer for Windows and loaded a 70B 2.4-bit model via exllamav2. Generation was successful on 1x 3090 Ti.
This commit pulls my wheel, but I won't be updating it. Please see the original issue for more information.
Checklist: