[Core] offload model weights to CPU conditionally #6317
Conversation
How is this possible? How does it work with cudagraph?
I have the same question about how to get the same latency with offloading. Based on the code change, the offloaded weights are transferred to the GPU synchronously when needed, without prefetching. This should introduce a significant latency overhead.
@youkaichao @comaniac Thanks for looking into my PR. As mentioned in the PR description, "When the percentage specified by the user is enough to hold the weights, the performance will be the same as of now." That is, CPU offload is not triggered when the GPU memory is enough to hold the weights, and cudagraph is applied exactly as it is today, so the performance stays the same. CPU offload kicks in only when the percentage specified by the user is insufficient to hold the weights, and cudagraph is turned off in that case. That is why I called this feature "conditional". The latency will be much higher in such cases, for sure. However, vLLM will continue to work, which will be nice for users with limited GPU resources. The following code from worker.py in the PR is about turning cudagraph on/off:
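(A minimal, hypothetical sketch of the kind of check described above, not the PR's actual worker.py snippet; only the cpu_offload_trigger_percent name is taken from the PR, everything else is illustrative.)

```python
# Hypothetical sketch only: decide whether CPU offload is needed and, if so,
# fall back to eager mode, i.e. disable CUDA graph capture.

def resolve_enforce_eager(model_weight_bytes: int,
                          gpu_total_bytes: int,
                          cpu_offload_trigger_percent: float,
                          enforce_eager: bool) -> bool:
    """Return the effective enforce_eager flag for the worker."""
    # The fraction of GPU memory the user allows for holding the weights.
    weight_budget = gpu_total_bytes * cpu_offload_trigger_percent
    if model_weight_bytes <= weight_budget:
        # Enough GPU memory: no offload, cudagraph behaves as it does today.
        return enforce_eager
    # Offload kicks in: weights migrate between CPU and GPU during execution,
    # which is incompatible with CUDA graph capture, so force eager mode.
    return True
```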
Similar logic is given in vllm/worker/model_runner.py in this PR. Please let me know if you have any other concerns. Thanks.
Oh I guess the confusion comes from this statement. Meanwhile, CPU offloading can be optimized in multiple ways. For example, we could prefetch weights with a CUDA stream to hide data transfer latency. It might be a good idea to have an RFC to document the scope, roadmap and milestones. In addition to that, we should think more about the API design; the name cpu_offload_trigger_percent could be confusing.
I revised the description as "2. When the percentage specified by the user is insufficient to hold the weights, vLLM will continue to work with higher latency. In cases of large models with very limited GPU resources, the latency could be very high. However, vLLM still works and generates outputs." Hope this explains it better.
As to "prefetching", in the doc https://docs.google.com/document/d/1qY9SJIRCLn6_p_2jHInW0E8qE2oukgOrH26v62jZy70/edit#heading=h.vrhc1mfm2tpm(it is pretty long, so I just gave the link in the description"), I discussed about it in the section "What is Next". As of now, based on my test results, I don't see the point of "prefetching" in vLLM. Based on the tests on AWS a g6.12xlarge machine(that is the most powerful GPU machine I can get), the latency to transfer weight from CPU to GPU is tens and sometimes eve hundreds comparing the tensor multiplication time (the test results are also given in the good doc above). I guess "prefetching" will be practical when the machines with much better transfer speed between CPU and GPU, such as GH200/BG200, have greater availability. As to the roadmap, could you point me to the doc? I did some search in the repo and just found some threads about cpu_offload.
Agree. How about I change the variable name to "weight_cpu_offload_trigger_percent"? Thanks.
The point is that whatever the data transfer time is, it adds directly to the forward latency without prefetching, and may become critical when inter-token latency during decoding is just a few milliseconds. Your result also shows that the e2e latency could be several times longer when offloading happens. For the roadmap, it's just your "What's Next" section, but we could make it clearer by defining follow-up features along with their scopes and dependencies. Btw, could you turn on commenting in the doc so that everyone can leave comments? Thanks.
I got your point. I listed "prefetching" as one of my follow-up work items in the shared google doc. In the doc, I said I need to investigate prefetching more. Here is one of the test results given in the doc:
The total latency can only be cut by 0.9s if prefetching is in place, which is a gain of less than 1%. Besides, the vLLM architecture is designed in such a way that each sub-module is encapsulated within its own boundary. Prefetching would break that boundary: it requires cooperation between the current layer and the next layer. I need to spend some effort to find an approach that is clean from an engineering point of view. I am more than happy to have any suggestions and discussion about that. I am a newcomer to vLLM as well as to cpu-offloading. :-)
Got it. The doc is open to comments now.
This is exactly what I've been looking/waiting for. I'm currently running a small GH200 cluster (4 x 96G) and NCCL is getting ~42GB/s across the cluster. As @chenqianfzh mentioned above, I think the GH200 will see some major benefits from the CPU<>GPU NVLink. Is there any reason this wouldn't work across a ray nccl cluster?

UPDATE: I was able to confirm this works with pipeline parallelism on multi-node:

```
python -m vllm.entrypoints.openai.api_server --model /models/Meta-Llama-3-70B-Instruct --pipeline-parallel-size 2 --cpu_offload_trigger_percent 0.50 --distributed-executor-backend ray
```

Currently getting 5.6 t/s over multi-node (x2 nodes), vs 4.6 t/s with a single node, which seems strange.
Did some initial testing of this on the GH200 with Meta-Llama-3-70B-Instruct. Inference looks like 4.6 t/s, which seems surprisingly slow, given the increased speed between CPU<>GPU on GH200. As a reference, I'm seeing ~20 t/s when distributed between 2 nodes (no cpu-offload) with v0.5.1 (400Gbps network). Is there something I'm not seeing here? Details below...

Exec flags:

```
python -m vllm.entrypoints.openai.api_server --model /models/Meta-Llama-3-70B-Instruct --cpu_offload_trigger_percent 0.80
```

Observations: GPU memory used: 86224MiB/97871MiB
Thanks for trying it on a GH200 machine. Using cpu-offload is expected to have higher latency due to the following two factors:

1. The offloaded weights have to be moved from CPU to GPU whenever they are needed in the computation.
2. Cudagraph is turned off when offloading is active, so the weight computation itself is slower than the optimized cudagraph path.
The latency caused by the second factor cannot be alleviated by the great speed between GPU & CPU. I wonder whether you have some profiling tools to identify the two latencies, in tensor moving and in the weight calculation (with the un-optimized cudagraph). With that data, we will have a better idea of how to optimize further. Thanks.
Haven't done something like that before, but will do some research. If anyone has any advice or tools for profiling vLLM and PyTorch memory usage, please reply and I'll get into it. Have been thinking about this statement:
If the prefetching is done as a sliding window on the sequential layers, with a window size of …

I'm brand new to the vllm architecture, so I'm not the person to implement this, but I'm happy to perform testing or give access to hardware (I've got A100s both with and without NVLink, as well as a few GH200s lying around to play with).
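To make the sliding-window scheme above concrete, here is a rough sketch (plain PyTorch, not vLLM code; a window size of 1, with the layer math reduced to a placeholder) that overlaps the copy of layer i+1's weights with the compute of layer i using a separate CUDA stream:

```python
# Illustrative only: prefetch the next layer's weights on a side stream while
# the current layer computes on the default stream (sliding window of size 1).
import torch

def forward_with_prefetch(cpu_weights, x):
    """cpu_weights: list of pinned CPU tensors, one (hidden, hidden) matrix per layer."""
    copy_stream = torch.cuda.Stream()

    # Start copying the first layer's weights before the loop begins.
    with torch.cuda.stream(copy_stream):
        next_w = cpu_weights[0].to("cuda", non_blocking=True)

    for i in range(len(cpu_weights)):
        # Wait until the prefetched copy for layer i has landed on the GPU.
        torch.cuda.current_stream().wait_stream(copy_stream)
        w = next_w
        w.record_stream(torch.cuda.current_stream())  # allocator bookkeeping

        # Prefetch layer i + 1 on the side stream while layer i computes.
        if i + 1 < len(cpu_weights):
            with torch.cuda.stream(copy_stream):
                next_w = cpu_weights[i + 1].to("cuda", non_blocking=True)

        x = torch.relu(x @ w)  # stand-in for the real layer computation

    return x

# Toy usage:
layers = [torch.randn(1024, 1024, pin_memory=True) for _ in range(4)]
out = forward_with_prefetch(layers, torch.randn(8, 1024, device="cuda"))
print(out.shape)
```

A larger window would simply keep several in-flight copies instead of one; whether the extra GPU memory held by the window is worth it is exactly the trade-off discussed above.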
I did some quick profiling with nsys:
Going to mess with the Nsight UI and see if I can analyze the nsys-rep a little better. Will also be testing #6496 today and will run the same benchmark profiling.
Thank you so much for your test and suggestion. Let me think through your scheme.
Closing this, as cpu-offload was implemented and merged in #6496.
We are developing the "conditional cpu-offload-weight" feature for vLLM, which is comparable to Hugging Face Accelerate's device_map='auto'. This democratizes access to vLLM, empowering a broader community of learners and researchers to engage with cutting-edge AI models.
To achieve conditional CPU offload, a new CLI parameter, cpu_offload_trigger_percent, whose default value is 0, will be added.
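For example, an OpenAI-compatible server could be launched with the flag like this (the model path is just the one used by testers in the thread above):

```
python -m vllm.entrypoints.openai.api_server \
    --model /models/Meta-Llama-3-70B-Instruct \
    --cpu_offload_trigger_percent 0.80
```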
Test results show:
A more detailed doc is given at https://docs.google.com/document/d/1qY9SJIRCLn6_p_2jHInW0E8qE2oukgOrH26v62jZy70/edit#heading=h.vrhc1mfm2tpm