
Installing with ROCM #621

Closed
baderex opened this issue Jul 30, 2023 · 44 comments

Comments

@baderex

baderex commented Jul 30, 2023

Hello,

I'm trying to install vLLM on an AMD server, but I'm unable to build the package because CUDA is not installed. Is there any way to configure it to work with ROCm instead?

!pip install vllm

Error:
RuntimeError: Cannot find CUDA_HOME. CUDA must be available in order to build the package.

ROCm is installed and verified.

PyTorch 2.0 ROCm
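
For reference, a quick sanity check that the ROCm build of PyTorch is the one being picked up (a minimal sketch; torch.version.hip is only set on ROCm builds of torch):

# A ROCm build of PyTorch reports a HIP version and can see the AMD GPUs
python3 -c "import torch; print(torch.version.hip, torch.cuda.is_available(), torch.cuda.device_count())"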

@zhuohan123
Member

AMD GPUs are not supported yet; we just added this to our roadmap.

CC @naed90

@zhuohan123 added the enhancement (New feature or request) label and removed the feature request label · Aug 7, 2023
@ccbadd

ccbadd commented Aug 14, 2023

My server is running a pair of MI100s and I would love to try this out. Please add ROCm support.

@naed90
Contributor

naed90 commented Aug 14, 2023

My server is running a pair of MI100s and I would love to try this out. Please add ROCm support.

Hey! I'm currently planning to work on this. If you can reach out, we'd appreciate it (we have some questions): Dean.leitersdorf@gmail.com

@jamestwhedbee
Contributor

Hey @naed90, I would also love to try this out on MI100s. I am going to go ahead and shoot you an email as well.

@chymian

chymian commented Aug 22, 2023

Is it possible to get CLBlast/Vulkan support as well, for a wider range of supported cards?
There are a lot of older cards that are only partially supported, or not supported at all, by the latest ROCm drivers.

@GdRottoli

The enhancement label for AMD support was added two weeks ago. Do you have any info about the current state of this issue? Thanks!

@ehartford

Very interested in this feature

@smiraldr

@GdRottoli can we have some documentation on this feature too? I don't see any docs on how to use vLLM with MI100s.

@ehartford

I also want to use vLLM on MI100s.

@smiraldr

If anyone can point me towards the code that achieves this, I can experiment, and I'd also be willing to contribute to the docs!

@ardfork

ardfork commented Sep 29, 2023

The major blocker for a ROCm HIP port is xFormers (and flash attention). Without that, it shouldn't be that hard to hipify vLLM.

Edit: After taking a closer look, vLLM also uses a lot of inline PTX assembly, which will also be annoying to port.

@fxmarty

fxmarty commented Oct 4, 2023

The inline PTX seems to be mostly related to AWQ, right?

@pcmoritz
Collaborator

Some progress on this: #1313

I believe the flash attention part can be solved with https://github.com/ROCmSoftwarePlatform/flash-attention

If somebody has bandwidth to port the AWQ kernels, that would be very much appreciated, but it is not blocking for now :)

@sabreshao

Hi, this is Sabre from AMD. We recognize the value of vLLM and are already working on both xFormers and vLLM for ROCm. Stay tuned!

@ehartford

ehartford commented Oct 16, 2023 via email

@bennmann

I upvote the main post and have the same needs here

$ python -m pip list | grep rocm5.6
torch 2.2.0.dev20231003+rocm5.6

But when I go to install vllm, the wheel does not seem up to date enough to support this version of torch.

@tjtanaa
Contributor

tjtanaa commented Oct 27, 2023

I upvote the main post and have the same needs here

$ python -m pip list | grep rocm5.6
torch 2.2.0.dev20231003+rocm5.6

But when I go to install vllm, the wheel does not seem up to date enough to support this version of torch.

You can try my fork, which is a ROCm port of vLLM v0.1.4. You can follow the setup procedure at https://github.com/EmbeddedLLM/vllm-rocm
We have successfully run Llama-2 7B/13B/70B and Vicuna 7B/13B/33B on MI210.
We are also working on v0.2.x.
Stay tuned.

@bennmann

In addition to the good community forks, here's the llama.cpp ROCm PR for inspiration on this topic: https://github.com/ggerganov/llama.cpp/pull/1087/commits

@fxmarty

fxmarty commented Oct 30, 2023

I upvote the main post and have the same needs here:

$ python -m pip list | grep rocm5.6
torch 2.2.0.dev20231003+rocm5.6

But when I go to install vllm, the wheel does not seem up to date enough to support this version of torch.

You can try my fork, which is a ROCm port of vLLM v0.1.4. You can follow the setup procedure at https://github.com/EmbeddedLLM/vllm-rocm We have successfully run Llama-2 7B/13B/70B and Vicuna 7B/13B/33B on MI210. We are also working on v0.2.x. Stay tuned.

Also, building from #1313 went just fine for me.

Also building from #1313 went just fine for me.

@tanpinsiang

With the integration of flash attention v2, we can report that vLLM v0.2.1 on ROCm achieves a speedup of >2x for the LLaMA-70B model and >3x for LLaMA-7B/13B on MI210, compared to vLLM v0.1.4. We are trying to port AWQ and contribute to #1313.
[Figure: throughput_tokens benchmark chart]

@ccbadd

ccbadd commented Nov 5, 2023

Hi, this is Sabre from AMD. We recognize the value of vLLM and are already working on both xFormers and vLLM for ROCm. Stay tuned!

So are only MI210 and newer cards supported? If so, that's pretty worthless for most of us. I started the install a little while ago, but once I got to installing flash attention it stopped me in my tracks.

@ehartford

ehartford commented Nov 5, 2023 via email

@yourbuddyconner

FWIW in case anyone else is following this, it looks like ROCm support was ported from the vLLM fork linked in this thread and merged upstream.

Here's the docs: https://docs.vllm.ai/en/latest/getting_started/amd-installation.html

@hongxiayang
Collaborator

@ehartford etc: For folks who are interested in vllm on MI100: try this fork: https://github.com/hongxiayang/vllm/tree/navi3x_rocm6. Let me know if you run into problems.

@ehartford

@ehartford etc: For folks who are interested in vllm on MI100: try this fork: https://github.com/hongxiayang/vllm/tree/navi3x_rocm6. Let me know if you run into problems.

Is this for ROCm 6 specifically? I use ROCm 5.7 because the PyTorch nightly uses ROCm 5.7.

@hongxiayang
Collaborator

hongxiayang commented Feb 6, 2024

@ehartford etc: For folks who are interested in vllm on MI100: try this fork: https://github.com/hongxiayang/vllm/tree/navi3x_rocm6. Let me know if you run into problems.

Is this for ROCm 6 specifically? I use ROCm 5.7 because the PyTorch nightly uses ROCm 5.7.

This is how you can build and run the Docker image. You can change the parameters below, but I tested with the ROCm 6.0 docker image below, using Llama-2 7B model weights.

BASE_IMAGE="rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1"
DockerImageName="vllm-${BASE_IMAGE}"
docker build --build-arg BASE_IMAGE="$BASE_IMAGE" --build-arg BUILD_FA="0"  -f Dockerfile.rocm -t "$DockerImageName" . 

sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host \
      --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 8G  -v $PATH_TO_MODEL_WEIGHTS:/app/model "$DockerImageName"
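
Once the container is up, something along these lines should serve the weights mounted at /app/model (a hedged sketch: the exact entrypoint module and flags depend on the vLLM version baked into the image, and the port is just an example):

# Inside the container: start the OpenAI-compatible API server on the mounted model
python -m vllm.entrypoints.openai.api_server --model /app/model --port 8000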

@hongxiayang
Collaborator

BASE_IMAGE for ROCm_5.7: "rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1"

@tjtanaa
Contributor

tjtanaa commented Feb 7, 2024

@ehartford I found that people have successfully compiled vLLM for MI100 on ROCm 5.7. The only change needed is to pass the AMD GPU architecture gfx908 when compiling flash-attention-rocm.
Reply by @TNT3530 in Discord:
vLLM release version v0.2.7, built from source using the docker ROCm instructions. You'll need to add "gfx908" to the flash attention valid arch array in setup.py so it'll compile as well.

@jamestwhedbee has also opened a simple PR that specifies gfx908 during vLLM compilation:
#2792

Kudos to them for verifying that flash-attention-rocm can also be compiled for MI100.
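
If you are not sure which gfx target your card reports, the ROCm tools on the host can tell you (a small aside, assuming rocminfo is installed; an MI100 should show gfx908):

# List the GPU architectures ROCm sees (MI100 -> gfx908)
rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u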

@jamestwhedbee
Contributor

No review on #2792 yet; I'm not sure of the best way to get more eyes on it.

@cocoderss

ROCm currently has many, many issues. Would it be possible to install vLLM using Vulkan or CLBlast, with some limited performance? This would be great specifically for AMD iGPUs (APUs).

@hmellor added the feature request label and removed the enhancement (New feature or request) label · Mar 15, 2024
@thebeline

ROCm currently has many, many issues. Would it be possible to install vLLM using Vulkan or CLBlast, with some limited performance? This would be great specifically for AMD iGPUs (APUs).

It looks like it was merged in, a good sign, no?

Does anyone have any benchmarks on this? How does it stack up against an A100? Lower, I am sure, but if it is even close...

@fxmarty

fxmarty commented Mar 19, 2024

@thebeline At least for TGI, which relies on vLLM's paged attention kernel, you can find some benchmarks here: https://huggingface.co/blog/huggingface-and-optimum-amd#production-solutions

@linchen111

@ehartford I found that people have successfully compiled vLLM for MI100 on ROCm 5.7. The only change needed is to pass the AMD GPU architecture gfx908 when compiling flash-attention-rocm. Reply by @TNT3530 in Discord: vLLM release version v0.2.7, built from source using the docker ROCm instructions. You'll need to add "gfx908" to the flash attention valid arch array in setup.py so it'll compile as well.

@jamestwhedbee has also opened a simple PR that specifies gfx908 during vLLM compilation: #2792

Kudos to them for verifying that flash-attention-rocm can also be compiled for MI100.

Hi, do you have a pre-built vLLM docker image for ROCm 5.7?

@lookfirst

lookfirst commented Jun 30, 2024

https://www.nscale.com/blog/nscale-benchmarks-amd-mi300x-gpus-with-gemm-tuning-improves-throughput-and-latency-by-up-to-7-2x

Docker Image

ROCm 6.1.2
Python 3.10.12
PyTorch 2.5.0
Triton 2.1.0
Flash Attention 2.0.4
rocBLAS 4.1.2
hipBLASlt 0.8.0
Rccl 2.18.6
vLLM 0.5.0

@linchen111

https://www.nscale.com/blog/nscale-benchmarks-amd-mi300x-gpus-with-gemm-tuning-improves-throughput-and-latency-by-up-to-7-2x

Docker Image

ROCm 6.1.2 Python 3.10.12 PyTorch 2.5.0 Triton 2.1.0 Flash Attention 2.0.4 rocBLAS 4.1.2 hipBLASlt 0.8.0 Rccl 2.18.6 vLLM 0.5.0

Thanks! pulling it now.

If I use MI100s with the following link topology:

=============================== Link Type between two GPUs ===============================
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 0 PCIE PCIE PCIE PCIE PCIE PCIE PCIE
GPU1 PCIE 0 PCIE PCIE PCIE PCIE PCIE PCIE
GPU2 PCIE PCIE 0 PCIE PCIE PCIE PCIE PCIE
GPU3 PCIE PCIE PCIE 0 PCIE PCIE PCIE PCIE
GPU4 PCIE PCIE PCIE PCIE 0 PCIE PCIE PCIE
GPU5 PCIE PCIE PCIE PCIE PCIE 0 PCIE PCIE
GPU6 PCIE PCIE PCIE PCIE PCIE PCIE 0 PCIE
GPU7 PCIE PCIE PCIE PCIE PCIE PCIE PCIE 0

Do I need any special settings?

@linchen111

https://www.nscale.com/blog/nscale-benchmarks-amd-mi300x-gpus-with-gemm-tuning-improves-throughput-and-latency-by-up-to-7-2x

Docker Image

ROCm 6.1.2 Python 3.10.12 PyTorch 2.5.0 Triton 2.1.0 Flash Attention 2.0.4 rocBLAS 4.1.2 hipBLASlt 0.8.0 Rccl 2.18.6 vLLM 0.5.0

And I have run into this:

INFO 06-30 11:07:12 selector.py:56] Using ROCmFlashAttention backend.
Traceback (most recent call last):
  File "/root/benchmarks/benchmark_throughput.py", line 411, in <module>
    main(args)
  File "/root/benchmarks/benchmark_throughput.py", line 223, in main
    elapsed_time = run_vllm(
  File "/root/benchmarks/benchmark_throughput.py", line 86, in run_vllm
    llm = LLM(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 144, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 363, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 223, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 23, in _init_executor
    self.driver_worker.init_device()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 110, in init_device
    self.init_gpu_memory = torch.cuda.mem_get_info()[0]
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py", line 685, in mem_get_info
    return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.
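
As the error message itself suggests, re-running with kernel serialization enabled may give a more precise failure point; pinning to a single visible GPU is an additional, purely hypothetical narrowing step, and the benchmark flags here are placeholders for whatever you were already passing:

# Serialize HIP kernels for clearer error reporting; optionally pin to one GPU while debugging
AMD_SERIALIZE_KERNEL=3 HIP_VISIBLE_DEVICES=0 python /root/benchmarks/benchmark_throughput.py --model /app/model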

@linchen111

https://www.nscale.com/blog/nscale-benchmarks-amd-mi300x-gpus-with-gemm-tuning-improves-throughput-and-latency-by-up-to-7-2x

Docker Image

ROCm 6.1.2 Python 3.10.12 PyTorch 2.5.0 Triton 2.1.0 Flash Attention 2.0.4 rocBLAS 4.1.2 hipBLASlt 0.8.0 Rccl 2.18.6 vLLM 0.5.0

It seems it's not working on MI100.

@lookfirst

Likely not, as this is focused on the MI300X. It was probably compiled for gfx942.

@linchen111

Likely not, as this is focused on the MI300X. It was probably compiled for gfx942.

I guess so; I failed to compile it for gfx908.

@bennmann

bennmann commented Jul 3, 2024 via email

@hongxiayang
Collaborator

hongxiayang commented Jul 3, 2024

For gfx908, as far as I remember:
(1) I have updated the flash-attn branch to ae7928c, as someone mentioned this branch has support for gfx908.
(2) You can build using Dockerfile.rocm by specifying FA_GFX_ARCHS="gfx908" with the --build-arg parameter when doing the docker build, as explained in that file (a rough example command is sketched after this comment).

Please build your own docker image, and let's start from there.
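
Putting the build arguments mentioned in this thread together, an MI100 build would look roughly like this (a sketch only; the base image and the image tag are example values taken from earlier comments):

# Build a ROCm vLLM image with flash-attention compiled for gfx908 (MI100)
docker build --build-arg BASE_IMAGE="rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1" \
    --build-arg FA_GFX_ARCHS="gfx908" -f Dockerfile.rocm -t vllm-rocm-gfx908 .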

@hongxiayang
Collaborator

This issue is very old and should be closed, as the initial request for ROCm support has already been addressed. cc @zhuohan123
For any new issues, please open a new one.

@hmellor
Collaborator

hmellor commented Jul 4, 2024

@hmellor hmellor closed this as completed Jul 4, 2024
@linchen111

  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 110, in init_device
    self.init_gpu_memory = torch.cuda.mem_get_info()[0]
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py", line 685, in mem_get_info
    return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.

I built my own docker image, but it failed with the same error.

pi314ever pushed a commit to pi314ever/vllm that referenced this issue Dec 12, 2024
With this patch, mp executor does not hang at the end of application out
of the box, and exits gracefully.