amd-go-fast slows down Comfy on RX 7900 XTX #2

Closed
ikcikoR opened this issue May 8, 2024 · 27 comments

@ikcikoR

ikcikoR commented May 8, 2024

Total VRAM 24560 MB, total RAM 31802 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7900 XTX : native
VAE dtype: torch.float32
Using pytorch cross attention
# # #
AMD GO FAST
# # #

Import times for custom nodes:
   0.0 seconds: /home/ikcikor/software/ComfyUI/custom_nodes/comfyui-amd-go-fast

Starting server

To see the GUI go to: http://0.0.0.0:8188
got prompt
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
clip missing: ['clip_l.logit_scale', 'clip_l.transformer.text_projection.weight']
Requested to load SDXLClipModel
Loading 1 new model
/home/ikcikor/software/ComfyUI/custom_nodes/comfyui-amd-go-fast/amd_go_fast.py:20: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:505.)
  hidden_states = sdpa(
Requested to load SDXL
Loading 1 new model
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:13<00:00,  2.25it/s]
Requested to load AutoencoderKL
Loading 1 new model
Prompt executed in 17.84 seconds
^C
Stopped server
ikcikor@v3real ~/s/ComfyUI (master)> cd amd
ikcikor@v3real ~/s/C/c/comfyui-amd-go-fast (master)> mv amd_go_fast.py amd_go_fast.py.notpy
ikcikor@v3real ~/s/C/c/comfyui-amd-go-fast (master)> cd -
ikcikor@v3real ~/s/ComfyUI (master)> ./start.sh
Total VRAM 24560 MB, total RAM 31802 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7900 XTX : native
VAE dtype: torch.float32
Using pytorch cross attention
Traceback (most recent call last):
  File "/home/ikcikor/software/ComfyUI/nodes.py", line 1876, in load_custom_node
    module_spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/ikcikor/software/ComfyUI/custom_nodes/comfyui-amd-go-fast/__init__.py", line 1, in <module>
    from .amd_go_fast import *
ModuleNotFoundError: No module named 'comfyui-amd-go-fast.amd_go_fast'

Cannot import /home/ikcikor/software/ComfyUI/custom_nodes/comfyui-amd-go-fast module for custom nodes: No module named 'comfyui-amd-go-fast.amd_go_fast'

Import times for custom nodes:
   0.0 seconds (IMPORT FAILED): /home/ikcikor/software/ComfyUI/custom_nodes/comfyui-amd-go-fast

Starting server

To see the GUI go to: http://0.0.0.0:8188
got prompt
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
clip missing: ['clip_l.logit_scale', 'clip_l.transformer.text_projection.weight']
Requested to load SDXLClipModel
Loading 1 new model
/home/ikcikor/software/ComfyUI/comfy/ldm/modules/attention.py:345: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:505.)
  out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
Requested to load SDXL
Loading 1 new model
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:11<00:00,  2.58it/s]
Requested to load AutoencoderKL
Loading 1 new model
Prompt executed in 16.20 seconds
^C
Stopped server
ikcikor@v3real ~/s/ComfyUI (master)>

Here you can see me running Comfy with amd-go-fast and then renaming the file so it runs without it. I disabled all my extensions, but the effect is identical when they are all enabled: not having amd-go-fast enabled is faster.

My card is a 7900 XTX, I'm on kernel version 6.8.8, and here is my pip freeze | grep torch (I'm on today's nightly):

pytorch-msssim==1.0.0
pytorch-triton-rocm==3.0.0+bbe6246e37
torch==2.4.0.dev20240508+rocm6.0
torchsde==0.2.6
torchtyping==0.1.4
torchvision==0.19.0.dev20240508+rocm6.0

Here is pip freeze | grep flash_attn:

flash_attn==2.0.4

Which I compiled by hand by running:

python setup.py clean
env MAX_JOBS=12 HIP_VISIBLE_DEVICES=0 HSA_OVERRIDE_GFX_VERSION=11.0.0 GPU_TARGETS=gfx1100 GPU_ARCHS="gfx1100" python setup.py install

just to be 100% sure it compiles with the right gfx (before this it wouldn't display AMD GO FAST in the terminal, but that fixed it).

I'm on ROCm 6.0

Is there anything I could try to fix this?

@Beinsezii
Owner

What's the unet dtype? FP32 will run slower.

Also if you haven't already, run each test twice because the first run using new attention/res/model on ROCm cards triggers an auto tune lookup that adds a few seconds.
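
(If it helps when comparing runs: a minimal warm-up-aware timing sketch, purely illustrative and not part of the addon, showing why the first pass should be thrown away on ROCm.)

# Hypothetical timing helper, not from this repo: discard warm-up passes so the
# one-time auto-tune / kernel-lookup cost on ROCm doesn't skew the comparison.
import time
import torch

def bench(fn, warmup=2, iters=5):
    for _ in range(warmup):      # the first pass triggers the auto-tune lookup
        fn()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()     # wait for queued GPU work before stopping the clock
    return (time.time() - start) / iters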

@Beinsezii
Owner

One more thing I'll add is some ops in the flash attn fork are slower than baseline SDPA. Usually it's a net gain but for some very specific models like Pixart it ends up being a net loss.

XL on its own should be a lot faster, but maybe with other additions it slows down? Try XL alone with nothing added.
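
(For context, this is roughly the kind of dispatch such a patch performs; a simplified sketch, not the actual amd_go_fast.py, and the head-dim/dtype limits shown are assumptions about the Navi fork.)

# Simplified sketch of an SDPA shim: prefer the flash-attention kernels where they
# apply, fall back to stock PyTorch SDPA otherwise. Not the addon's real code.
import torch
from flash_attn import flash_attn_func

_sdpa = torch.nn.functional.scaled_dot_product_attention

def patched_sdpa(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False, **kwargs):
    # flash_attn wants fp16/bf16, no explicit mask, and a small head dim
    if attn_mask is None and q.dtype in (torch.float16, torch.bfloat16) and q.shape[-1] <= 128:
        # SDPA layout is (batch, heads, seq, dim); flash_attn expects (batch, seq, heads, dim)
        out = flash_attn_func(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
            dropout_p=dropout_p, softmax_scale=kwargs.get("scale"), causal=is_causal,
        )
        return out.transpose(1, 2)
    return _sdpa(q, k, v, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal, **kwargs)

torch.nn.functional.scaled_dot_product_attention = patched_sdpa

Which ops land on the flash path versus the fallback is what makes it a net gain for SDXL but a net loss for something like Pixart.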

@ikcikoR
Author

ikcikoR commented May 9, 2024

What's the unet dtype? FP32 will run slower.

Also if you haven't already, run each test twice because the first run using new attention/res/model on ROCm cards triggers an auto tune lookup that adds a few seconds.

I ran each test several times (including generating a few images in a row with each method), I just didn't attach all the logs here. I tried both the fp16 and fp8 unet modes (fp8, as expected, slowed down generation on its own). The model is ~6 GB, which I believe means it's fp16, unless dtype refers to something else I'm not aware of.

Edit: I'm testing it again right now, will update you as soon as I'm done.

@ikcikoR
Author

ikcikoR commented May 9, 2024

One more thing I'll add is some ops in the flash attn fork are slower than baseline SDPA. Usually it's a net gain but for some very specific models like Pixart it ends up being a net loss.

XL on its own should be a lot faster, but maybe with other additions it slows down? Try XL alone with nothing added.

By "XL" do you mean SDXL model? Also what do you mean by "alone with nothing added"? Loras? Extensions in the pipeline? Extensions installed on my UI but not used in the pipeline? Or something else?

@Beinsezii
Owner

SDXL yes. With nothing added I mean just the model with no extensions or adapters or anything. Loras at the very least seem to still run faster for me but I've never tried controlnet, ipadapter, or any of the other stuff.

@Beinsezii
Owner

For a single 1024x1024 image using the euler sampler and no extensions, adapters, or anything I'm at about 3.8 it/s on a 7900 XTX limited to 300w

@ikcikoR
Author

ikcikoR commented May 9, 2024

For a single 1024x1024 image using the euler sampler and no extensions, adapters, or anything I'm at about 3.8 it/s on a 7900 XTX limited to 300w

Is that with amd-go-fast or without?

@Beinsezii
Owner

With. Without was something like 2.9 iirc

@ikcikoR
Author

ikcikoR commented May 9, 2024

For a single 1024x1024 image using the euler sampler and no extensions, adapters, or anything I'm at about 3.8 it/s on a 7900 XTX limited to 300w

I haven't edited my GPU power limits and honestly I'm not sure how to, but I guess I could check that.

Before that: I ran the tests again, generating a 1024x1024 image with the euler sampler using SDXL without any extensions. I'm getting ~2.80 it/s without amd-go-fast and ~2.40 it/s with it on, and that stays consistent after generating 5 images, so I doubt it'll speed up if I keep going.

btop (which doesn't show me power usage, probably because I'd have to configure it or something) shows exactly 99% GPU usage, and that stays consistent for the entire time the KSampler node is running, both with amd-go-fast on and off.

Also, I just remembered that back then I tested this both with and without live preview, and amd-go-fast slowed generation down by a similar percentage either way. Looking at the comment you just sent, I'm guessing the 0.1 it/s difference between your 2.9 and my 2.8 is probably the live preview or fp8 mode, but no matter the combination of fp8/fp16 and live preview on/off, amd-go-fast keeps slowing it down.

@Beinsezii
Owner

Beinsezii commented May 9, 2024

2.4 is way too slow yea. What python are you on? I might have a premade RDNA 3 wheel somewhere I can share.

For now I'd try re-installing by activating the ComfyUI environment and running

pip uninstall flash_attn
pip install -U --no-cache-dir git+https://github.com/ROCm/flash-attention@howiejay/navi_support

You also might need the full ROCm 6 SDK installed at a system level. I don't think the hipcc and libs bundled with pytorch are enough to compile flash_attn

Specifically it needs the BLASlt lib visible for the Navi 3 stuff iirc
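
Once it's (re)installed, a quick hedged check from inside the ComfyUI venv (standard torch/flash_attn attributes, nothing specific to this thread):

# Confirms which flash_attn build is importable and which HIP runtime torch was built against.
import torch
import flash_attn

print("flash_attn:", flash_attn.__version__)   # the howiejay/navi_support fork reports 2.0.4
print("torch:", torch.__version__, "HIP:", torch.version.hip)
print("device:", torch.cuda.get_device_name(0))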

@Beinsezii
Owner

You can see power and fan usage with nvtop btw, but I don't think it's a power issue.

@ikcikoR
Author

ikcikoR commented May 10, 2024

2.4 is way too slow yea. What python are you on? I might have a premade RDNA 3 wheel somewhere I can share.

My comfy venv is on 3.11.9, my flash_attn works properly with exllamav2 and lets my GPU run large language models faster than otherwise so I believe that part should be fine.

On the other hand (not sure whether it's worth pointing out), my initial logs do show the following warning:
/home/ikcikor/software/ComfyUI/custom_nodes/comfyui-amd-go-fast/amd_go_fast.py:20: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:505.)

Though that warning seems to be triggered by this sdp_utils thing even when I don't have amd-go-fast enabled, as seen when starting without amd-go-fast in that same log:
/home/ikcikor/software/ComfyUI/comfy/ldm/modules/attention.py:345: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:505.)
But I got told by the person who recommended amd-go-fast to me that this shouldn't matter, and he seemed to know what he's talking about. Then again, see below (next paragraph).
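
(As an aside, a hedged way to see what your torch build reports for the SDPA backends; these are standard torch.backends queries, and note they only show the runtime toggles, while the "not compiled" warning comes from the build itself:)

# Standard torch queries, not specific to this thread: which SDPA backends are
# currently enabled. A ROCm wheel can report a toggle as enabled and still fall
# back to the math path if the kernel wasn't compiled in.
import torch

print("flash sdp enabled:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient sdp enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math sdp enabled:         ", torch.backends.cuda.math_sdp_enabled())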

For now I'd try re-installing by activating the ComfyUI environment and running

pip uninstall flash_attn
pip install -U --no-cache-dir git+https://github.com/ROCm/flash-attention@howiejay/navi_support

Yeah, that's exactly what I'm doing, except I manually cloned the git repo so I don't need to redownload it every time I reinstall. I am on the howiejay/navi_support branch.

You also might need the full ROCm 6 SDK installed at a system level. I don't think the hipcc and libs bundled with pytorch are enough to compile flash_attn

Specifically it needs the BLASlt lib visible for the Navi 3 stuff iirc

Hello, this is the "See Below" from earlier: I tried looking up "BLASlt" but can't find anything about it anywhere. I will double-check the flash_attn documentation, though iirc the ROCm README was updated poorly, if at all. Probably worth mentioning: I do have a bunch of HIP- and BLAS-related libraries installed that I've been using for LLMs and other AI stuff, but I'll look into it more.
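
(For reference, the library in question is most likely ROCm's hipBLASLt; a hedged way to check whether it's visible to the loader, assuming the usual libhipblaslt.so name from a system-level install:)

# Hedged check: can the dynamic loader see a hipBLASLt shared library?
from ctypes.util import find_library

print("hipblaslt:", find_library("hipblaslt") or "not found on the loader path")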

You can see power and fan usage with nvtop btw, but I don't think it's a power issue.

Yeah, I also feel like it should be alright, but I'll double check just to be sure.

@ikcikoR
Author

ikcikoR commented May 10, 2024

Update: Seems like my GPU is limited to 327 W and it uses above 300 W reliably even when using amd-go-fast, so that's not the issue.

@Beinsezii
Owner

My comfy venv is on 3.11.9, my flash_attn works properly with exllamav2 and lets my GPU run large language models faster than otherwise so I believe that part should be fine.

The Navi 3 flash attention isn't compatible with exllamav2... It's feature gated internally at v2.2.1 or something, far above the Navi 3 fork's v2.0.4. The only thing it changes is the oobabooga webui won't nag you.
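
(Roughly the kind of gate meant here; an illustrative sketch, not exllamav2's actual code.)

# Illustrative version gate, not exllamav2's real logic: a downstream library only
# takes its flash-attention path when the installed version is new enough, so the
# Navi fork's 2.0.4 silently falls back to the default path.
from packaging import version

try:
    import flash_attn
    use_flash_attn = version.parse(flash_attn.__version__) >= version.parse("2.2.1")
except ImportError:
    use_flash_attn = False

print("flash-attention path used:", use_flash_attn)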

I think I have a python 3.11 wheel somewhere... Let me try to find it.

@ikcikoR
Author

ikcikoR commented May 10, 2024

The Navi 3 flash attention isn't compatible with exllamav2... It's feature gated internally at v2.2.1 or something, far above the Navi 3 fork's v2.0.4. The only thing it changes is the oobabooga webui won't nag you.

Well, in that case I have no idea why it works with TabbyAPI and one other exlv2 backend I tried in the past; I guess maybe it's not using its full potential or something.

I think I have a python 3.11 wheel somewhere... Let me try to find it.

Thank you a lot.

@Beinsezii
Owner

Well, in that case I have no idea why it works with TabbyAPI and one other exlv2 backend I tried in the past; I guess maybe it's not using its full potential or something.

IIRC the memory usage isn't a whole lot different, but if you run exllama2's benchmark the performance degrades hard with loaded context. Down to like half speed @ 8k. If flash attention is working properly it should be almost as fast as 0 ctx.

Anyways I couldn't find the wheel so I just made a new one since I still have precompiled python 3.11 laying around. ROCm 6, torch 2.3.0, python 3.11.9.
flash_attn-2.0.4-cp311-cp311-linux_x86_64.zip
Github doesn't like wheels so just rename it from .zip to .whl and it'll be fine.
Also my current python 3.12 wheel for good measure.
flash_attn-2.0.4-cp312-cp312-linux_x86_64.zip

@ikcikoR
Author

ikcikoR commented May 10, 2024

Anyways I couldn't find the wheel so I just made a new one since I still have precompiled python 3.11 laying around. ROCm 6, torch 2.3.0, python 3.11.9. flash_attn-2.0.4-cp311-cp311-linux_x86_64.zip Github doesn't like wheels so just rename it from .zip to .whl and it'll be fine. Also my current python 3.12 wheel for good measure. flash_attn-2.0.4-cp312-cp312-linux_x86_64.zip

I decided to double-check to be sure, and it looks like I have the torch 2.4.0 dev branch but pytorch-triton-rocm 2.3.0; not sure if that matters. I'll test whether it works, and if not, I'll install torch 2.3.0 and test again.

@ikcikoR
Author

ikcikoR commented May 10, 2024

Update:

3.70it/s

Seems like my flash_attn build was the problem after all. I'll double-check my compilation logs at some point in the future (and post my findings here) to see which libraries I must have been missing while compiling, to make this solution more future-proof, but for now the wheel you've posted probably closes the issue. It's not exactly 3.8it/s like you've mentioned, but I think that might be just within error margin, caused by hardware or kernel version or similar.

Thanks a lot for the help!

@Beinsezii
Owner

It's not exactly 3.8it/s like you've mentioned, but I think that might be just within error margin, caused by hardware or kernel version or similar.

That was measured on my own diffusion program just now. Comfy used to be basically the same but maybe it's a little different now, I haven't re-measured the perf since I first made this addon. Might also be the live preview. You're probably fine.

@ikcikoR
Author

ikcikoR commented May 10, 2024

Might also be the live preview. You're probably fine.

Oh right, I forgot I turned it on again heh

@Beinsezii
Owner

Beinsezii commented May 10, 2024

Marking the issue resolved, but I'd still be curious how you managed to compile a slower flash attention, if you ever figure it out.

@Beinsezii
Owner

BTW, since kernel ~6.3, AMD GPUs index starting at 1 instead of 0. I wonder if that applies to HIP_VISIBLE_DEVICES too? If that's the case, HIP_VISIBLE_DEVICES=0 might mask out your extant GPU. You could try compiling without that and without the HSA override; that env var should really only be used for runtime control anyways, since the gfx1100 arch setting controls the LLVM compilation target[s].
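
(A hedged way to see what that env var actually leaves visible to torch; run it with and without HIP_VISIBLE_DEVICES set. The script name is made up.)

# check_gpus.py (hypothetical file name): compare the output of
#   python check_gpus.py        vs.        HIP_VISIBLE_DEVICES=0 python check_gpus.py
import os
import torch

print("HIP_VISIBLE_DEVICES =", os.environ.get("HIP_VISIBLE_DEVICES"))
print("cuda available:", torch.cuda.is_available())
print("device count:  ", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))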

@enesaltinkaya

enesaltinkaya commented Oct 30, 2024

4.5 it/s

TLDR: export PYTORCH_TUNABLEOP_ENABLED=1 for faster iterations.

Sorry for posting on a months-old thread.
As a thank you, I wanted to share my config for 4.5 it/s.

Edit: I forgot to mention, the GPU limits:
maxClock="2600"
voltageOffset="-30"
watts="300"

#!/bin/bash
source venv

# PYTORCH_TUNABLEOP_ENABLED caches tuned matrix operations; significant improvement. The first time a resolution is used there is a ~1 min caching step.

# amd_go_fast.py (this GitHub repo), significant improvement <3

# MIOPEN_FIND_MODE: fast resolution switching.
#   there is an annoying ~10 second lag during vae encode/decode when you switch input latent resolution.
#   this env variable fixes that.
#   to see its effect clearly, test it without PYTORCH_TUNABLEOP_ENABLED.

# PYTORCH_HIP_ALLOC_CONF: I think this is bs. I always get RAM/VRAM issues without --lowvram

export VLLM_USE_TRITON_FLASH_ATTN=0
export PYTORCH_TUNABLEOP_ENABLED=1
export MIOPEN_FIND_MODE=FAST
export PYTORCH_HIP_ALLOC_CONF=garbage_collection_threshold:0.2,max_split_size_mb:128,expandable_segments:True

python main.py --use-pytorch-cross-attention --lowvram
2024-10-30.13-35-29.mp4
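
(The same knobs can also be set from Python instead of a wrapper script; a sketch, assuming they are exported before torch and MIOpen initialize so they actually take effect.)

# Sketch: set the environment before anything imports torch, then start ComfyUI as usual.
import os

os.environ.setdefault("PYTORCH_TUNABLEOP_ENABLED", "1")   # cache tuned GEMM selections
os.environ.setdefault("MIOPEN_FIND_MODE", "FAST")         # skip the slow kernel search
os.environ.setdefault("VLLM_USE_TRITON_FLASH_ATTN", "0")

import torch  # imported after the env is prepared, on purpose
print("HIP:", torch.version.hip, "| TunableOp env:", os.environ["PYTORCH_TUNABLEOP_ENABLED"])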

@Beinsezii
Owner

I see you found FIND_MODE=FAST too. The fact that kernel tuning is so slow by default in 6.2 is insane, especially for the 0.03 it/s or whatever it is you gain after waiting 5 whole minutes.

I had monkeyed with tunableop before but it was never worth the overhead for my use.

I was hoping that some recent pytorch changes, such as pytorch/pytorch#138947 and pytorch/pytorch#137317 might be more of a direct upgrade. I really don't feel like building my own torch again though.

@enesaltinkaya

Oh well :) Hope those merge requests will do some good in the future.
I was feeling very disappointed with this GPU on Linux, with all the crashes and high temps;
your work on amd-go-fast eased my frustrations a little, so thank you again so much :)

@Beinsezii
Owner

AMD devs are active in the "GPU MODE" server on Discord. I was actually just inquiring about kernel driver hangs during contention.

The server is very cringe but it can be helpful. That's also how I learned about the fast kernel finder.

https://discord.gg/2zhQbAk9

@FeepingCreature

FeepingCreature commented Dec 9, 2024

I now get 5.3 it/s with some very moderate overclocking. Thanks for the FIND_MODE=FAST hint! That makes TunableOp actually usable.

Also very much thanks for AMD Go Fast :) I think this is like... two or three times faster than ComfyUI stock now.
