AMD hardware support for training and inference #346
Not in the very near future. However, we could first build bindings around https://www.amd.com/en/graphics/servers-solutions-rocm in order to start working. I'm not too familiar with the workflow, though I do know there are some adapter layers for CUDA somewhere. In any case, I think we should aim for the same thing as CUDA: a bare minimum of kernels, plus enabling users to write their own. If you want to start working on bindings (or know of existing up-to-date ones), we can keep an eye on it!
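As a concrete starting point, here is what minimal hand-rolled bindings could look like from the Rust side. This is a sketch, not candle code; it assumes the HIP runtime's usual Linux library name (libamdhip64.so), and the four declared functions are real entries of the HIP runtime C API:

```rust
// Minimal hand-written HIP runtime bindings (sketch). The declarations mirror
// the C API in hip_runtime_api.h; a return value of 0 means hipSuccess.
use std::os::raw::{c_int, c_void};

#[link(name = "amdhip64")]
extern "C" {
    fn hipGetDeviceCount(count: *mut c_int) -> c_int;
    fn hipSetDevice(device_id: c_int) -> c_int;
    fn hipMalloc(ptr: *mut *mut c_void, size: usize) -> c_int;
    fn hipFree(ptr: *mut c_void) -> c_int;
}

fn main() {
    unsafe {
        let mut count = 0;
        assert_eq!(hipGetDeviceCount(&mut count), 0, "hipGetDeviceCount failed");
        println!("found {count} HIP device(s)");
        if count > 0 {
            assert_eq!(hipSetDevice(0), 0);
            // Allocate and free 1 MiB of device memory as a smoke test.
            let mut buf: *mut c_void = std::ptr::null_mut();
            assert_eq!(hipMalloc(&mut buf, 1 << 20), 0);
            assert_eq!(hipFree(buf), 0);
        }
    }
}
```

A full backend would layer memcpy, streams, and module loading / kernel launch on top of this.
|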
llama.cpp now supports ROCm (ggerganov/llama.cpp#1087). My team sees less than 25% of the cost per unit of performance using ROCm instead of CUDA, but we are stuck on the Python side. |
I am also looking into the possibility of running LLMs on ROCm-compatible AMD hardware (for potentially significant savings), and it seems like llama.cpp might be the only viable option. I have done a test integration with candle and would prefer it, but it looks like I may actually be going back to llama.cpp because of its ROCm support. Its integration with Rust is awkward, though, and I would rather stick with a Rust solution if possible. Overall, this is such an amazing engineering effort and I really appreciate your work. |
I'd love to contribute to the AMD support initiative for Candle. I'm wondering if HIP might not be a reasonable first pass. Additionally, I propose prioritizing RDNA3-architecture cards due to their advanced features, such as multi-precision capability and the AI Matrix Accelerator, which are crucial for ML; AMD/ROCm also seem to be starting with RDNA3 for serious ML/AI support. Anyway, I'm ready to contribute my time and skills, though I'd prefer not to lead the effort - count me in for support! FWIW: this would be a great first project for my new System76 setup with a Ryzen 9 7950X, 128 GB RAM, and a Radeon RX 7900 XT. I plan to swap in dual 7900 XTXs for 48 GB of GPU RAM. |
I'm also happy to contribute to the AMD support, but there are two options for starting it: the first is to translate the CUDA kernels into HIP, and the second is to write the kernels in HIP as the source language. Which one is better?
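As an illustration of the first option, the translation step can be automated in a build script. The sketch below assumes hipify-perl and hipcc from a ROCm install are on PATH; the kernel path and offload arch are placeholders, not candle's actual layout:

```rust
// build.rs sketch: translate an existing CUDA kernel to HIP with hipify-perl,
// then compile it into a loadable code object with hipcc.
use std::process::Command;

fn main() {
    let out_dir = std::env::var("OUT_DIR").unwrap();

    // hipify-perl reads a .cu file and prints the HIP translation to stdout.
    let hipified = Command::new("hipify-perl")
        .arg("src/reduce.cu")
        .output()
        .expect("failed to run hipify-perl (is ROCm on PATH?)");
    assert!(hipified.status.success(), "hipify-perl failed");
    let hip_src = format!("{out_dir}/reduce.hip");
    std::fs::write(&hip_src, &hipified.stdout).unwrap();

    // hipcc --genco emits a device-only code object (.hsaco) that can be
    // loaded at runtime, much like a CUDA .ptx/.cubin.
    let status = Command::new("hipcc")
        .args(["--genco", "--offload-arch=gfx1100", &hip_src])
        .args(["-o", &format!("{out_dir}/reduce.hsaco")])
        .status()
        .expect("failed to run hipcc");
    assert!(status.success(), "hipcc failed");

    println!("cargo:rerun-if-changed=src/reduce.cu");
}
```
|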
A very elegant migration method. We may also need to handle the migration of flash-attention, but that should not be laborious: we can copy it directly from the official AMD library. |
Thank you for the work! Which example did you try, @vberthet? I tried on gfx1102 (RX 7600):

```
HSA_OVERRIDE_GFX_VERSION='11.0.2' cargo run --example phi --features=hip --release -- --model phi-hermes --prompt "A skier slides down a frictionless slope of height 40m and length 80m. What's the skier speed at the bottom?"
```

The model loads, and rocm-smi shows the GPU busy:

```
============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  [Model : Revision]  Temp    Power   Partitions      SCLK     MCLK     Fan     Perf  PwrCap  VRAM%  GPU%
        Name (20 chars)     (Edge)  (Avg)   (Mem, Compute)
==========================================================================================================================
0       [0x240b : 0xcf]     50.0°C  145.0W  N/A, N/A        2868Mhz  1124Mhz  32.94%  auto  145.0W  79%    96%
        0x7480
```

But the example never stops and produces no results.

Edit: also tried:

```
HIP_VISIBLE_DEVICES=0 HSA_OVERRIDE_GFX_VERSION='11.0.2' cargo run --features hip --example yolo-v8 --release -- candle-examples/examples/yolo-v8/assets/bike.jpg
```
|
Same here. I tried the yolo-v8 one: it downloaded the model and then just got stuck at 100% CPU; I aborted after 15 minutes or so, and the GPU did not show any signs of usage.
So, same question from my side, @vberthet: you wrote you were able to run an example - which one did you try? It would be great to get candle to work with ROCm. I see there is quite some interest in doing that, so perhaps we should coordinate in some way, and also figure out what kind of PR with ROCm support could get accepted? |
The implementation takes the first available GPU, and some GPUs don't seem to work as expected. There is still work to do on the kernel ports to HIP; some half-precision operations don't work or don't compile (e.g. https://github.com/vberthet/candle/blob/2a0096af8013634479a3be0190286b60eb27205f/candle-hip-kernels/src/reduce.cu#L363). A better approach would be to use Orochi to dynamically load CUDA / HIP at runtime.
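For reference, the Orochi approach boils down to resolving the GPU runtime's symbols when the process starts rather than at compile time. Below is a rough Rust sketch of that idea using the libloading crate; the library names are the usual Linux ones, and error handling is kept minimal:

```rust
// Pick the HIP or CUDA runtime at startup and resolve the device-count entry
// point from whichever library is present.
use libloading::{Library, Symbol};

type DeviceCountFn = unsafe extern "C" fn(*mut i32) -> i32;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Prefer HIP; fall back to the CUDA driver API.
    let (lib, name): (Library, &[u8]) = match unsafe { Library::new("libamdhip64.so") } {
        Ok(l) => (l, b"hipGetDeviceCount"),
        // NOTE: the CUDA driver API requires cuInit(0) before any other call;
        // omitted here for brevity.
        Err(_) => (unsafe { Library::new("libcuda.so.1") }?, b"cuDeviceGetCount"),
    };
    let get_count: Symbol<DeviceCountFn> = unsafe { lib.get(name)? };
    let mut count = 0;
    let status = unsafe { get_count(&mut count) };
    println!("status {status}, {count} device(s)");
    Ok(())
}
```
|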
Unfortunately this did not help me. The code clearly reacts to the variable (specifying a non-existent index will panic), but setting it did not fix things in my case. I'll try to look into what is going on there.
This one looks interesting. Overall, unfortunately, I was not yet able to reproduce your success with the POC :( |
I have zero experience with GPU programming, so maybe someone could chime in. I attached gdb to the yolo example, and from what I can see, I think it's pretty much this issue: ROCm/ROCm#2715 |
For those who are not suffering from the ROCm-related CPU-hog bug, this project looks like a very interesting alternative. If it does what it says it does, we could simply run unmodified candle code on AMD GPUs. |
I gave it a spin. Compiling Candle this way still requires the NVIDIA libraries, both CUDA and cuDNN, and you need to add an extra env var. I attempted to run it; once it does run, you are greeted with an error. |
I've updated the build script for the HIP kernels; it should now generate kernels with better GPU arch compatibility, and I've been able to run several of the examples.
Phi seems to run into an infinite loop, @cantor-set; this error seems to also exist with CUDA, see #353.
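As a side note on arch compatibility: a build script can enumerate the installed GPUs and emit one --offload-arch flag per target. The sketch below is an illustration (not the actual patched script) and assumes ROCm's rocm_agent_enumerator tool is on PATH:

```rust
// Enumerate the gfx targets of the installed GPUs so a build script can pass
// one --offload-arch flag per target. rocm_agent_enumerator prints one target
// per line; gfx000 is the CPU agent and is filtered out.
use std::process::Command;

fn offload_arch_flags() -> Vec<String> {
    let out = Command::new("rocm_agent_enumerator")
        .output()
        .expect("rocm_agent_enumerator not found; is ROCm installed?");
    String::from_utf8_lossy(&out.stdout)
        .lines()
        .map(str::trim)
        .filter(|t| t.starts_with("gfx") && *t != "gfx000")
        .map(|t| format!("--offload-arch={t}"))
        .collect()
}

fn main() {
    // These flags would be appended to the hipcc invocation in build.rs.
    println!("{:?}", offload_arch_flags());
}
```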
|
@vberthet I had to patch build.rs. Unfortunately, I am still experiencing the ROCm 100% CPU hog from ROCm/ROCm#2715, so nothing works for me at the moment anyway :( |
Which ROCm version are you running? |
I am currently on Rawhide, which provides ROCm 6.0.0, although judging from the comments in the issue, this problem was present in 5.7 as well. I could perhaps try to rebuild 6.0.2 and see if it goes away, although I am almost inclined to downgrade to 5.7 in order to try ZLUDA, which AFAIK does not support the 6.x APIs yet. I think the last ROCm version that worked for me was 5.4, back on Fedora 38. |
I finally got past the ROCm hanging-memcpy issue; it turned out a small change was enough. So, once that worked, I got back to trying candle. @vberthet, I am getting a coredump when trying to run the yolo-v8 demo.
The printout "generating predictions" was added by me, right before the point of the crash. |
OK, so... after a longer time I finally got back to this, and I am not sure what changed - I guess I fixed my installation without realizing it - but it did not crash, and I got the yolo-v8 example to run through! @vberthet - awesome! The question is: what's next? Will you be maintaining the fork and perhaps coordinating the efforts? How could we turn the POC into a maintained version?
EDIT: I may have been celebrating too early... I do not see any indication that the GPU was actually used - no spikes in nvtop while yolo is processing...
EDIT2: all good, yolo was simply too fast to be noticeable on the GPU in nvtop; with SDXL-Turbo I can see that the GPU is being used. |
I've got 2 systems, each with 8 AMD MI300Xs, and I'm pissed I can't use Candle with them... Python is yucky. Somebody help me out?

```
======================================================= Concise Info =======================================================
Device  Node  IDs               Temp        Power     Partitions          SCLK    MCLK    Fan  Perf    PwrCap  VRAM%  GPU%
              (DID,     GUID)   (Junction)  (Socket)  (Mem, Compute, ID)
============================================================================================================================
0       26    0x74a1,   8554    37.0°C      131.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   manual  750.0W  0%     0%
1       27    0x74a1,   19011   38.0°C      130.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   manual  750.0W  0%     0%
2       25    0x74a1,   30036   39.0°C      132.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   manual  750.0W  0%     0%
3       24    0x74a1,   23964   36.0°C      132.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   manual  750.0W  0%     0%
4       30    0x74a1,   1197    37.0°C      131.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   manual  750.0W  0%     0%
5       31    0x74a1,   41351   35.0°C      130.0W    NPS1, SPX, 0        131Mhz  900Mhz  0%   manual  750.0W  0%     0%
6       29    0x74a1,   26775   40.0°C      134.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   manual  750.0W  0%     0%
7       28    0x74a1,   45536   35.0°C      133.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   manual  750.0W  0%     0%
============================================================================================================================
=================================================== End of ROCm SMI Log ====================================================
```
```
mastersplinter@turtle005:~/candle$ rocm-smi --showproductname
============================ ROCm System Management Interface ============================
====================================== Product Info ======================================
GPU[0] : Card Series: AMD Instinct MI300X OAM
GPU[0] : Card Model: 0x74a1
GPU[0] : Card Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: MI3SRIOV
GPU[0] : Subsystem ID: 0x74a1
GPU[0] : Device Rev: 0x00
GPU[0] : Node ID: 26
GPU[0] : GUID: 8554
GPU[0] : GFX Version: gfx942
GPU[1] : Card Series: AMD Instinct MI300X OAM
GPU[1] : Card Model: 0x74a1
GPU[1] : Card Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[1] : Card SKU: MI3SRIOV
GPU[1] : Subsystem ID: 0x74a1
GPU[1] : Device Rev: 0x00
GPU[1] : Node ID: 27
GPU[1] : GUID: 19011
GPU[1] : GFX Version: gfx942
GPU[2] : Card Series: AMD Instinct MI300X OAM
GPU[2] : Card Model: 0x74a1
GPU[2] : Card Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[2] : Card SKU: MI3SRIOV
GPU[2] : Subsystem ID: 0x74a1
GPU[2] : Device Rev: 0x00
GPU[2] : Node ID: 25
GPU[2] : GUID: 30036
GPU[2] : GFX Version: gfx942
GPU[3] : Card Series: AMD Instinct MI300X OAM
GPU[3] : Card Model: 0x74a1
GPU[3] : Card Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[3] : Card SKU: MI3SRIOV
GPU[3] : Subsystem ID: 0x74a1
GPU[3] : Device Rev: 0x00
GPU[3] : Node ID: 24
GPU[3] : GUID: 23964
GPU[3] : GFX Version: gfx942
GPU[4] : Card Series: AMD Instinct MI300X OAM
GPU[4] : Card Model: 0x74a1
GPU[4] : Card Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[4] : Card SKU: MI3SRIOV
GPU[4] : Subsystem ID: 0x74a1
GPU[4] : Device Rev: 0x00
GPU[4] : Node ID: 30
GPU[4] : GUID: 1197
GPU[4] : GFX Version: gfx942
GPU[5] : Card Series: AMD Instinct MI300X OAM
GPU[5] : Card Model: 0x74a1
GPU[5] : Card Vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[5] : Card SKU: MI3SRIOV
GPU[5] : Subsystem ID: 0x74a1
GPU[5] : Device Rev: 0x00
GPU[5] : Node ID: 31
GPU[5] : GUID: 41351
GPU[5] : GFX Version: gfx942
GPU[6] : Card Series: AMD Instinct
```
|
@kennethdsheridan I'll be honest: I gave up on Candle, because my goal was to learn and use an AI framework, not to spend time HIPifying CUDA code. AFAIK they are now working on WGPU support, and WGPU does work on AMD GPUs, so it should work eventually, although I am not sure WGPU is the most performant backend at this time. I switched to Burn: https://github.com/tracel-ai/burn
Nice system, btw - I wish I had so many GPUs :)
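For reference, a minimal Burn smoke test on the WGPU backend looks roughly like the sketch below; the API names follow the Burn book around v0.13 and should be treated as assumptions, not verified against your version:

```rust
// Hypothetical Burn/WGPU smoke test: build two tensors on the default WGPU
// device (Vulkan on AMD, so no ROCm needed) and add them.
use burn::backend::Wgpu;
use burn::tensor::Tensor;

fn main() {
    type B = Wgpu;
    let device = Default::default();
    let a = Tensor::<B, 2>::ones([2, 3], &device);
    let b = a.clone() + a;
    println!("{}", b); // expect a 2x3 tensor filled with 2s
}
```
|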
Also interested in ROCm support in candle, for screenpipe. |
@vberthet, where did this end up for you? I'd like to try to help out. My Rust skills are very limited, and so is my GPU knowledge, but I'd like to help where I can with AMD support for Candle. |
Hi,
This library is cool - Rust for deep learning is nice, and it's great work from Hugging Face. I am curious to understand whether there are plans for AMD hardware support for training and inference.
Thanks