amd-go-fast slows down Comfy on RX 7900 XTX #2
What's the unet dtype? FP32 will run slower. Also, if you haven't already, run each test twice, because the first run using a new attention/resolution/model on ROCm cards triggers an auto-tune lookup that adds a few seconds.
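Not from the thread, but a minimal sketch, assuming a working ROCm build of PyTorch in the active environment, for checking roughly how large the fp32 vs fp16 penalty is on your own card (ComfyUI's startup log normally reports the unet weight dtype it picked):

```bash
# Time an fp32 vs fp16 matmul on the current GPU. Purely illustrative;
# diffusion workloads mix many ops, but the matmul gap dominates.
python - <<'EOF'
import time
import torch

def bench(dtype, n=4096, iters=20):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.time() - start) / iters

print(f"fp32: {bench(torch.float32) * 1000:.2f} ms per matmul")
print(f"fp16: {bench(torch.float16) * 1000:.2f} ms per matmul")
EOF
```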
One more thing I'll add: some ops in the flash attn fork are slower than baseline SDPA. Usually it's a net gain, but for some very specific models like PixArt it ends up being a net loss. XL on its own should be a lot faster, but maybe with other additions it slows down? Try XL alone with nothing added.
I ran each test several times, including running a few images in a row with each method; I just didn't attach all the logs here. I tried running both with an fp16 unet and in fp8 unet mode (which, as expected, slowed the generation down on its own); the model itself is ~6 GB, which I believe means it's fp16? Unless dtype is something else that I don't know about. Edit: I'm testing it again right now, will update you as soon as I'm done.
By "XL" do you mean SDXL model? Also what do you mean by "alone with nothing added"? Loras? Extensions in the pipeline? Extensions installed on my UI but not used in the pipeline? Or something else? |
SDXL, yes. By nothing added I mean just the model, with no extensions or adapters or anything. LoRAs at the very least seem to still run faster for me, but I've never tried ControlNet, IPAdapter, or any of the other stuff.
For a single 1024x1024 image using the
Is that with amd-go-fast or without?
With. Without was something like 2.9, iirc.
I haven't edited my GPU power limits, and honestly I'm not sure how to, but I guess I could check that. Before that, I ran the tests again generating a 1024x1024 image with the Euler sampler using SDXL without any extensions: I get ~2.80 it/s without amd-go-fast and ~2.40 it/s with it on, and that stays consistent after generating 5 images, so I doubt it will speed up if I keep going. btop, which doesn't show me power usage (probably because I'd have to configure it or something), shows exactly 99% GPU usage that stays consistent through the entire run of the KSampler node, both with amd-go-fast on and off. I also just remembered that back then I tested this both with and without live preview, and amd-go-fast slowed the generation down by a similar percentage either way. Looking at the comment you just sent, I'm guessing the 0.1 it/s difference between your 2.9 and my 2.8 is probably the live preview or fp8 mode, but no matter the combination of fp8/fp16 and live preview on/off, amd-go-fast keeps slowing it down.
2.4 is way too slow, yeah. What Python are you on? I might have a premade RDNA 3 wheel somewhere I can share. For now I'd try reinstalling by activating the ComfyUI environment and running:
pip uninstall flash_attn
pip install -U --no-cache-dir git+https://github.com/ROCm/flash-attention@howiejay/navi_support
You also might need the full ROCm 6 SDK installed at the system level. I don't think the hipcc and libs bundled with pytorch are enough to compile flash_attn; specifically, it needs the BLASlt lib visible for the Navi 3 stuff, iirc.
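As a sketch, the full reinstall sequence might look like the following; the directory layout and venv path are assumptions, so adjust them to wherever your ComfyUI environment actually lives:

```bash
# Assumes ComfyUI is in ./ComfyUI with a standard venv -- adjust paths as needed.
cd ComfyUI
source venv/bin/activate
pip uninstall -y flash_attn
pip install -U --no-cache-dir "git+https://github.com/ROCm/flash-attention@howiejay/navi_support"
```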
You can see power and fan usage with
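The exact command got cut off above; the following is a common way to watch this on ROCm, offered as a suggestion rather than what the commenter necessarily meant:

```bash
# Plain rocm-smi prints per-GPU power draw (W), fan speed, temperature, and
# utilization; wrapping it in watch refreshes it while a generation runs.
watch -n 1 rocm-smi
```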
My Comfy venv is on 3.11.9, and my flash_attn works properly with exllamav2 and lets my GPU run large language models faster than otherwise, so I believe that part should be fine. On the other hand, not sure if I should point it out or not, but my initial logs do show a warning from this sdp_utils thing. That warning seems to be triggered even when I don't have amd-go-fast enabled, as seen when starting it without amd-go-fast in that same log.
Yeah, that's exactly what I'm doing, but I manually cloned the git repo so that I don't need to redownload it every time I reinstall. I am on the howiejay/navi_support branch.
Hello, this is the "see below" from earlier: I tried looking up "BLASlt" but can't find anything about it anywhere. I will double-check the flash_attn documentation, though iirc the ROCm README was updated poorly, if at all. Probably worth mentioning: I do have a bunch of HIP- and BLAS-related libraries installed that I've been using for LLMs and other AI stuff, but yeah, I'll look into it more.
Yeah, I also feel like it should be alright, but I'll double-check just to be sure.
Update: seems like my GPU is limited to 327 W, and it reliably draws above 300 W even when using amd-go-fast, so that's not the issue.
The Navi 3 flash attention isn't compatible with exllamav2... it's feature-gated internally at v2.2.1 or something, far above the Navi 3 fork's v2.0.4. The only thing it changes is that the oobabooga webui won't nag you. I think I have a Python 3.11 wheel somewhere... let me try to find it.
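One quick way (a suggestion, not from the thread) to see which flash_attn build a given environment actually imports, and hence whether it clears that version gate:

```bash
# Run inside the environment in question (ComfyUI venv, exllamav2 env, ...).
# The Navi 3 fork reports ~2.0.4, well below the ~2.2.1 gate mentioned above.
python -c "import flash_attn; print(flash_attn.__version__)"
```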
Well, in that case I have no idea why it works with TabbyAPI and one other exllamav2 backend I tried in the past; I guess maybe it's not using its full potential or something.
Thank you a lot.
IIRC the memory usage isn't a whole lot different, but if you run exllamav2's benchmark, the performance degrades hard with loaded context, down to about half speed at 8k. If flash attention is working properly, it should be almost as fast as at 0 ctx. Anyway, I couldn't find the wheel, so I just made a new one since I still have precompiled Python 3.11 lying around: ROCm 6, torch 2.3.0, Python 3.11.9.
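Installing a prebuilt wheel like that is just a local pip install; the filename below is hypothetical, substitute whatever the shared wheel is actually called:

```bash
# Installing from a wheel skips the long hipcc compile entirely.
pip uninstall -y flash_attn
pip install ./flash_attn-2.0.4-cp311-cp311-linux_x86_64.whl   # hypothetical name
```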
I decided to double-check to be sure, and it looks like I have a torch 2.4.0 dev build but pytorch-triton-rocm 2.3.0; not sure if that matters. I'll test whether it works, and if not, I'll install torch 2.3.0 and test again.
Update: 3.70 it/s. Seems like my flash_attn was the problem after all. I'll double-check my compilation logs at some point in the future (and post my findings here) to see which libraries I must have been missing while compiling, to make this solution more future-proof, but for now the wheel you've posted probably closes the issue. It's not exactly the 3.8 it/s you mentioned, but I think that's within the error margin, caused by hardware or kernel version or similar. Thanks a lot for the help!
That was measured on my own diffusion program just now. Comfy used to be basically the same, but maybe it's a little different now; I haven't re-measured the perf since I first made this addon. Might also be the live preview. You're probably fine.
Oh right, I forgot I turned it on again, heh.
Marking the issue resolved, but I'd still be curious how you managed to compile a slower flash attention, if you ever figure it out.
BTW, since kernel 6.3-ish, AMD GPUs index starting at 1 instead of 0. I wonder if that applies to HIP_VISIBLE_DEVICES too? If so, HIP_VISIBLE_DEVICES=0 might mask out your actual GPU. You could try compiling without that and without the HSA override; those env vars should really only be used for runtime control anyway, while arch=gfx1100 controls the LLVM compilation target(s).
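A sketch of what "compiling without that" could look like; which variable this fork's setup.py actually reads is an assumption (GPU_ARCHS and PYTORCH_ROCM_ARCH are the usual knobs for ROCm extension builds), so check its build script first:

```bash
# Unset the runtime-only device masking variables, then pin the compile target.
unset HIP_VISIBLE_DEVICES HSA_OVERRIDE_GFX_VERSION
export GPU_ARCHS="gfx1100"          # assumption: arch variable honored by the fork
export PYTORCH_ROCM_ARCH="gfx1100"  # assumption: arch variable honored by torch ext builds
pip install -U --no-cache-dir "git+https://github.com/ROCm/flash-attention@howiejay/navi_support"
```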
4.5 it/s. TL;DR: export PYTORCH_TUNABLEOP_ENABLED=1 for faster iterations. Sorry for posting on a months-old post. Edit: I forgot to mention GPU limits.
#!/bin/bash
source venv  # activate the Python venv (path may differ, e.g. venv/bin/activate)
# PYTORCH_TUNABLEOP_ENABLED caches some matrix operations: significant improvement.
#   The first time a resolution is used there is a ~1 min cache operation.
# amd_go_fast.py (this GitHub repo): significant improvement <3
# MIOPEN_FIND_MODE=FAST: fast resolution switching.
#   There is an annoying ~10 second lag during VAE encode/decode when you switch
#   input latent resolution; this env variable fixes that. To see its effect
#   clearly, test it without PYTORCH_TUNABLEOP_ENABLED.
# PYTORCH_HIP_ALLOC_CONF: I think this is BS; I always get RAM/VRAM issues without --lowvram.
export VLLM_USE_TRITON_FLASH_ATTN=0
export PYTORCH_TUNABLEOP_ENABLED=1
export MIOPEN_FIND_MODE=FAST
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.2,max_split_size_mb:128,expandable_segments:True
python main.py --use-pytorch-cross-attention --lowvram
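One possible refinement, based on PyTorch's TunableOp documentation (worth verifying against your torch version): the tuning results can be written to a file and replayed, so the ~1 minute per-resolution warm-up only has to happen once.

```bash
# Assumed knobs from PyTorch's TunableOp docs -- verify against your torch version.
export PYTORCH_TUNABLEOP_ENABLED=1
export PYTORCH_TUNABLEOP_FILENAME=tunableop_results.csv  # where tuned results are stored
# After a warm-up run has populated the file, disable further tuning and just replay it:
export PYTORCH_TUNABLEOP_TUNING=0
```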
I see you found FIND_MODE=FAST too. The fact that kernel tuning is so slow by default in 6.2 is insane, especially for the 0.03 it/s or whatever it is you gain after waiting 5 whole minutes. I had monkeyed with TunableOp before, but it was never worth the overhead for my use. I was hoping that some recent pytorch changes, such as pytorch/pytorch#138947 and pytorch/pytorch#137317, might be more of a direct upgrade. I really don't feel like building my own torch again, though.
Oh well :) Hope those merge requests will do some good in the future.
AMD devs are active in the "GPU MODE" server on Discord. I was actually just inquiring about kernel driver hangs during contention. The server is very cringe, but it can be helpful; that's also how I learned about the fast kernel finder.
I now get 5.3 it/s with some very moderate overclocking. Thanks for the tip. Also, very much thanks for AMD Go Fast :) I think this is like... two or three times faster than stock ComfyUI now.
Here you can see me using Comfy with amd-go-fast and then renaming the file to use it without. I disabled all my extensions, but the effect is identical when all my extensions are enabled: not having amd-go-fast enabled makes it faster.
My card is a 7900 XTX and I'm on kernel version 6.8.8. Here is my pip freeze | grep torch (I'm on today's nightly), and here is pip freeze | grep flash_attn, which I compiled by hand just to be 100% sure it compiles with the right gfx (before that it wouldn't display AMD GO FAST in the terminal, but that fixed it).
I'm on ROCm 6.0.
Is there anything I could try to fix this?
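A few checks (suggestions, not part of the original report) that would confirm whether the hand-compiled flash_attn and the rest of the stack actually target gfx1100:

```bash
# Run inside the ComfyUI venv. gcnArchName should report gfx1100 on a 7900 XTX.
python -c "import torch; print(torch.__version__, torch.cuda.get_device_properties(0).gcnArchName)"
python -c "import flash_attn; print(flash_attn.__version__)"
rocminfo | grep -i gfx   # ISAs the ROCm runtime reports for the installed GPUs
```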