Flash Attention 2 doesn't get built/compiled on Windows. #553
Comments
Install from pip error (collapsed error log)
Is there any additional requirement, besides those mentioned, to install flash-attn on Windows? |
I've no idea since it's only been tested on Linux, and I don't have access to a Windows machine. If you figure out how to build on Windows (or what we need to change to support Windows), please lmk. |
Closing as 5a83425 fixes it. |
@Panchovix are you saying we can now compile flash-attn on Windows somehow? I couldn't with the latest pull, unless I'm missing something. |
Yes, now it is possible. Latest pull should work. You do need CUDA 12.x though, since CUDA 11.8 and lower don't support it. I've uploaded a wheel here https://huggingface.co/Panchovix/flash-attn-2-windows-test-wheel More discussion here: #595 |
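For anyone picking up a pre-built wheel like the one linked above, installation is just pip pointed at the downloaded file. The filename below is a placeholder, not the real artifact name; make sure the cuXXX / torch / cpXXX tags in the actual filename match your CUDA, PyTorch, and Python versions.

```powershell
# Placeholder filename -- substitute the actual wheel you downloaded
pip install flash_attn-2.x.x+cu122-cp310-cp310-win_amd64.whl
```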
Thanks, 11.8 was my error. Woohoo! |
The link gives a 404 now |
There are binaries here. I can't build anything beyond 2.4.2 from source myself and can't find Windows binaries beyond that anywhere. 2.4.2 works fine with current packages though. |
With some untraceable magic I've built 2.5.6 on Windows 10 with CUDA 12.4. |
For anyone looking to use Flash Attention on Windows, I got it working after some tweaking. You have to make sure that CUDA 12.4 is installed, and PyTorch should be 2.2.2+cu121. I used pip and it took about 2 hours to finish the setup. Hope this helps anyone who wants to use flash-attn on Windows. BTW I am using Windows 11 Pro; mileage may vary on Windows 10. |
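A minimal sketch of that setup, assuming a clean environment with the CUDA 12.4 toolkit already installed and the cu121 PyTorch build taken from the official wheel index (the index URL and the --no-build-isolation flag come from common flash-attn install instructions, not from this comment):

```powershell
# PyTorch 2.2.2 built against CUDA 12.1 (works alongside a CUDA 12.4 toolkit)
pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121

# Build prerequisites used by flash-attn's setup
pip install ninja packaging wheel

# Compile flash-attn from source; this is the step that can take a couple of hours
pip install flash-attn --no-build-isolation
```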
Have you seen significant improvements after using Flash Attention? How much? |
I was able to get it working. The problem seems to be that many ML frameworks don't support Flash Attention on Windows. You would have to run tests for yourself, but it seems like ctransformers does use it. Since I didn't check the performance before installing Flash Attention, I couldn't say what the improvements were. |
Got it working on Windows 10 as well on Torch 2.2.2 (with CUDA 12.4 installed). Took around 15-20 min to compile on a 64-core Threadripper with Ninja, so it does scale well with compute. |
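The parallelism is what makes the difference here: the build respects the MAX_JOBS environment variable (via ninja), so on a many-core machine you can raise it, and on a RAM-limited one you should lower it. A sketch in PowerShell; the specific job count is just an example:

```powershell
# Let ninja run many parallel compile jobs on a high-core-count machine
$env:MAX_JOBS = "32"
pip install flash-attn --no-build-isolation

# On a RAM-limited machine, use a small value instead, e.g. $env:MAX_JOBS = "4"
```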
Version 2.5.7 working on my Windows 10, building took around 2h:
|
Any luck getting it to work with cuda 11.8? |
A package that needs 2 hours to install? Sorry, but that's a no-go for me. |
Well it doesn't take that long if you have a multi-core processor (it's the compile time). In general you're right, someone should maintain pre-built wheels, and someone usually does, but it's not consistent for Windows builds right now and you have to search GitHub for someone who has uploaded a recent build. The good news is FA2 is a pretty stable product right now I think, and you can grab an older wheel and it'll probably work just as well, as long as it supports the CUDA version you're using.
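As a rough guide for picking an older wheel: check which CUDA build your torch uses, choose a wheel whose cuXXX tag is the same or older, and smoke-test the import afterwards. For example:

```powershell
# Which CUDA build is torch using? Pick a flash-attn wheel with a matching (or older) cuXXX tag
python -c "import torch; print(torch.__version__, torch.version.cuda)"

# After installing the wheel, a minimal smoke test
python -c "import flash_attn; print(flash_attn.__version__)"
```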
I tried, but it would not compile. It might be that one of the dependencies (CUTLASS?) needs 12.0. |
Are there more recent builds for Windows? I get the same error. And with the 2.4.2 binaries I get this error:
|
Thanks for the quick reply. Unfortunately the same error persists with these builds. I have CUDA 12.4, btw, and these say cu123... hmm. |
That should be fine, technically. CUDA libs are generally backwards compatible, as long as your torch also has a compatible CUDA build. Does the latest pre-built wheel work? I do get the error you're getting if I use a newer package with an older flash-attn wheel, or build an older version of flash-attn. Maybe it's some incompatible change that was never reported on Windows, but the most recent build or wheel of flash-attn removes that error for me. |
I finally found a root cause for the build failure: https://stackoverflow.com/a/78576792/13305027. I don't understand the VS 2022 version part, because that's what I have installed, but apparently it is related to some minor version. It wasn't entirely clear how to downgrade to another 2022 minor version, so perhaps installing <2022 would suffice; alternatively, upgrade to CUDA 12.4, preferably 12.5 it seems. I am now testing a different approach to fix support without reinstalling. PS: perhaps worth mentioning that on WSL it's pretty much hassle free, except there I had an error related to flash-attn, so on that other project I could simply bypass flash-attn by finding and setting |
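For context, that Stack Overflow answer points at a mismatch between a newer MSVC minor version and the nvcc release doing the build. One commonly suggested workaround, which I haven't verified for every CUDA/VS combination, is to tell nvcc to accept the unsupported host compiler for the duration of the build; treat the environment-variable route below as an assumption about your setup:

```powershell
# Check which host compiler and nvcc are actually being picked up
where.exe cl
nvcc --version

# Ask nvcc to tolerate the newer MSVC minor version for this build only
$env:NVCC_APPEND_FLAGS = "-allow-unsupported-compiler"
pip install flash-attn --no-build-isolation
```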
That makes sense; in the cases where I got that error I was probably linking against CUDA 12.1, and in my recent builds I had switched to 12.5. I also have a very early version of VS 2022 and have never updated it since it was first released. What was the issue with WSL? It seems to work fine for me. |
I had this error: [error details collapsed] I ended up using the same type of solution they proposed, which is to bypass flash-attn altogether. Perhaps this is an option for anyone reading this thread, but it's not so helpful if you actually need flash-attn. I'm not sure what the implications of this would be, but that repo seemed to work without it. Maybe you have some insights? |
I suspect that this is caused by version differences and by how absurdly easily the import paths get messed up on Windows; ultimately, unless you're using Conda, on Windows you really need to figure out for yourself which versions are compatible, and even then you need to install things in the right order. What worked for me (unintuitive things in bold):
TL;DR: Pay super close attention to which versions are installed all over your system, and consider doing a clean re-install of the CUDA stack. As for easing this going forward, I think adding some sanity checks to the build process, to verify which versions are installed and whether the include paths are sensible, would be a good step. As a 'crash early' mitigation, maybe we could do a quick build of some CUDA hello-world before kicking off the main process? As long as the program isn't too trivial, I think it's highly likely to catch build misconfigurations. |
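In that spirit, here is a cheap pre-flight check you can already run by hand before committing to a multi-hour build. Nothing in it is flash-attn-specific; it just surfaces the version and path mismatches described above:

```powershell
# Pre-flight sanity check (PowerShell) before a long flash-attn build
nvcc --version          # CUDA toolkit the build will use
where.exe cl            # MSVC compiler on PATH (run from a VS developer prompt if missing)
echo $env:CUDA_PATH     # should point at the same toolkit version as nvcc reports
python -c "import sys; print(sys.version)"
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```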
I followed steps 1-4 (made sure to remove all CUDA / cuDNN entries from Add/Remove Programs; only the GeForce drivers & GeForce Experience remained). Installed the latest Microsoft Visual C++ Redistributable after step 8 to fix "OSError: [WinError 126] error loading fbgemm.dll or dependencies" (occurred when running `import torch`). Installed CUDA 12.4.1. Windows 11: cuDNN 9.2 was installed from the tarball:
I created a new venv and installed PyTorch 2.4 by modifying Step 7:
Finally:
It's currently building (with a lot of warnings in the process, such as |
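The commands referenced above were collapsed in the original comment, so purely as a hedged reconstruction: a fresh venv, the cu124 PyTorch 2.4 build, and the usual flash-attn source build would look roughly like this (the index URL and the final install command are assumptions, not quoted from the comment):

```powershell
# Fresh virtual environment
python -m venv flashattn-env
.\flashattn-env\Scripts\Activate.ps1

# PyTorch 2.4 built against CUDA 12.4 (assumed index, matching the CUDA 12.4.1 install above)
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124

# Assumed final step: build flash-attn inside that environment
pip install ninja packaging
pip install flash-attn --no-build-isolation
```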
@sunsetcoder Thank you!, I've been trying to get flash_attn installed for days, these instructions are the first ones that worked. |
@evilalmus You're welcome. Make sure to use Python 3.10. 3.12 no bueno |
3.11.9 worked for me. |
@sunsetcoder @the-xentropy Thank you for the provided instructions. I tried to install it for several days, but nothing worked. I tried using Docker. In the end, I came across your instructions, followed them, and it worked. It installed on the latest Python 3.12.5. But my graphics card is not suitable for Flash Attention 2 😭😭😭 |
```bash
# First remove any existing flash-attention
pip uninstall flash-attn -y

# Install build requirements
pip install ninja packaging

# Try with specific compiler settings
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn==2.3.2 --no-build-isolation

# Alternative installation with explicit CUDA path
CUDA_HOME=/usr/local/cuda pip install flash-attn==2.3.2 --no-build-isolation
```
|
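A note on that last command: CUDA_HOME=/usr/local/cuda is a Linux-style path; on Windows you would point CUDA_HOME (or rely on CUDA_PATH, which the CUDA installer sets) at something like C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4. FLASH_ATTENTION_FORCE_BUILD=TRUE, as I understand it, forces a local source build instead of letting setup.py fetch a pre-built wheel.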
Hi there, impressive work. Tested it on Linux, and the VRAM usage and speeds with higher context are impressive (tested on exllamav2).
I've tried to do the same on Windows for exllamav2, but I have issues when either compiling from source or installing via pip.
I tried with:
Torch 2.0.1+cu118 and CUDA 11.8
Torch 2.2+cu121 and CUDA 12.1
Visual Studio 2022
The errors are these, depending on whether I run `python setup.py install` from source or install it via pip.
Compiling from source error (collapsed log)
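For readers landing here, the two install paths being compared are roughly the following; the repository URL is the project's, while the --no-build-isolation flag reflects the commonly recommended pip invocation rather than anything quoted in this issue:

```powershell
# Path 1: build from a clone of the repository
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
python setup.py install

# Path 2: build via pip from the source distribution
pip install flash-attn --no-build-isolation
```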