[Feature] Running multi-node offline engine inference (via SLURM) #2561
@aflah02 Thanks for pointing this out. We are looking for contributors, since we have not used SLURM for a long time. 😂
@zhaochenyang20 If you have any pointers on how you might approach the problem, I can take a stab at this. The issue right now is that I have no clue how to get started with either the Runtime API or the Engine API for multi-node. They don't seem to support pipeline parallelism, so the only option seems to be tensor parallelism across all GPUs, but if I want to use, say, 16 GPUs, that can't be done directly since each node only sees 8 GPUs.
I was thinking of using the Engine API and just converting all the server args from the CLI commands, but then my question is: in the CLI version you run two commands, one per node, so how would you do that via the Engine API? Do you run two engine calls (one on each node)?
Btw, this is one of my attempts to just load the model, but nothing seems to run on the worker node, as I only see logs for the head node -
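(The snippet itself did not survive in this thread. Purely as an illustration of the kind of call being discussed, a minimal offline Engine load spanning two 8-GPU nodes might look like the sketch below; the model path and tp_size value are placeholders rather than the exact values from the attempt, and the Engine constructor arguments should be checked against the installed SGLang version.)

```python
# Rough sketch only (not the original snippet): loading a model with the
# offline Engine API and a tensor-parallel degree spanning two 8-GPU nodes.
# Model path and tp_size are placeholders.
import sglang as sgl

engine = sgl.Engine(
    model_path="meta-llama/Llama-3.1-405B-Instruct",  # placeholder model path
    tp_size=16,  # 2 nodes x 8 GPUs -- the part that is unclear for multi-node
)
```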
Good points. We do not support pipeline parallelism, but I do not think this would block the progress of running on SLURM. Our team will discuss your issue this Friday. Before that, could you try out a quantization method for Llama 405B? Or you can use Llama 3.3 70B, which is pretty good.
BTW, would you like to join our bi-weekly meeting this Saturday?
Thanks for the invite! I've already had success running the FP8 version as well as the 70B one on a single node for offline inference. So the only things left are to go multi-node for the BF16 version and to get a bigger context length for the FP8 version.
Great! How do you run the FP8 version of the 70B model? I think the best way is to first quantize it and then load it, rather than quantizing it online. @aflah02
Sorry for not being clear. I've run two models: 70B in BF16 and 405B in FP8. I'm not running 70B in FP8. My goal now is to run 405B in FP16, so I'm trying out things with SLURM configs and the server API, but that isn't looking good, so I'm thinking of using the Engine or Runtime API instead.
Yeah, I see. We will discuss this in our weekly meeting. BTW, how do you quantize the 405B model to FP8? @aflah02
Thanks, that would be awesome!
Cool. Thanks for pointing this out. @JamesSand and I are working on quantization documentation. We will note "use the official repo first". 😂
For the SLURM issue, let me post an update this week. If I haven't replied before next week, please reply to this issue and remind me. Thanks! @aflah02
Just for reference, this is the current script which works well on 1 node. Python file -
SLURM bash file -
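(The actual files are not reproduced here. As a hedged illustration of the single-node setup being described, assuming SGLang's offline Engine API, a minimal Python script might look like the following; the model path, prompts, and sampling parameters are placeholders.)

```python
# Hypothetical sketch of a single-node offline inference script (not the
# original file from this thread). Model path and sampling params are placeholders.
import sglang as sgl

def main():
    engine = sgl.Engine(
        model_path="meta-llama/Llama-3.1-70B-Instruct",  # placeholder
        tp_size=8,  # all 8 GPUs on a single node
    )
    prompts = ["What is SLURM?", "Explain tensor parallelism in one sentence."]
    sampling_params = {"temperature": 0.0, "max_new_tokens": 64}

    outputs = engine.generate(prompts, sampling_params)
    for prompt, out in zip(prompts, outputs):
        print(prompt, "->", out["text"])

    engine.shutdown()

if __name__ == "__main__":
    main()
```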
Thanks
Some more updates. I tried to run the OpenAI-compatible server on SLURM across 2 nodes. For the 8B version, it works across 2 nodes (tp=16) -
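(The working script is not reproduced above. As a rough sketch of the general shape of such a per-node launcher, assuming the multi-node flags from the SGLang docs (`--nnodes`, `--node-rank`, `--dist-init-addr`), something like the following could be invoked once per node via `srun`; `HEAD_NODE_ADDR`, the model path, and the port numbers are made-up placeholders.)

```python
# Hypothetical per-node launcher, run once on each node via `srun python launch_node.py`.
# Flag names follow the SGLang multi-node docs; verify against your installed version.
import os
import subprocess

node_rank = int(os.environ.get("SLURM_NODEID", "0"))  # 0 on the head node, 1 on the worker
head_addr = os.environ["HEAD_NODE_ADDR"]              # assumed to be exported by the sbatch script

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model path
    "--tp", "16",                                        # tensor parallel across both nodes
    "--nnodes", "2",
    "--node-rank", str(node_rank),
    "--dist-init-addr", f"{head_addr}:50000",            # placeholder rendezvous port
    "--host", "0.0.0.0",
    "--port", "30000",                                   # placeholder HTTP port
]
subprocess.run(cmd, check=True)
```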
However, for 405B-FP8 I get a timeout (I use the same script with the model path changed to 405B-FP8). Logs for the timeout from one of the nodes (both have identical logs): https://gist.github.com/aflah02/70150ed8f73f90d351cd8fe9ac049342
Update: The same code worked on 2 A100 nodes with 8 GPUs each. I am now trying the BF16 version on 2 nodes (both H100 and A100). The original issue still stands (which was running offline inference). I have now been able to run online inference, i.e. setting up an OpenAI-compatible server and hitting it with requests.
Update: The 405B model in BF16 worked on H100 and gave the timeout error in the A100 run (when setting up a server for online inference). The code is the same as the one above for 8B, with the model changed to 405B.
Update: It seems certain node pairs of mine give errors, so I just picked the pairs that work and enforce their selection in the SLURM config.
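(For context on "hitting it with requests": once the server is up on the head node, batch-style inference can be driven from any client that can reach it. A minimal sketch with the OpenAI Python client follows; the host, port, and model name are assumptions rather than values from this thread.)

```python
# Hypothetical client sketch: send a request to the OpenAI-compatible server
# started on the head node. Host, port, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://<head-node>:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="default",  # SGLang's OpenAI-compatible server typically accepts "default"
    messages=[{"role": "user", "content": "Hello from a SLURM login node"}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```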
Thanks, @aflah02! I don't know if things have worked out. If not, could you come to our meeting to discuss this?
Thanks @zhaochenyang20, however I'm travelling later today and unfortunately will not be able to make it to the meeting. Just to update: I was able to run online inference (starting a server) across 2 nodes in SLURM, but I haven't been able to figure out how to run multi-node via the SGLang Engine or Runtime for offline inference (the original question in this issue).
Yeah. As I recall, even without SLURM, we can't serve Llama 405B across multiple nodes with the offline engine. @aflah02 Also, would you like to contribute some docs?
I can share a doc on running the server via SLURM. Is that what you call SRT? I guess you can technically connect to the endpoint by setting the runtime backend, so it makes sense. It doesn't run via Python, though; it just carefully recreates how you would do it if you had complete access to both nodes, but via SLURM.
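(On "connect to the endpoint via setting the runtime backend": a rough sketch of what that could look like is below. The URL is a placeholder, and `RuntimeEndpoint` plus the frontend primitives are taken from the SGLang README, so they should be verified against the installed version.)

```python
# Hypothetical sketch: point the SGLang frontend at an already-running
# multi-node server (launched via SLURM) by setting the runtime backend.
# The URL is a placeholder.
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://<head-node>:30000"))

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

state = qa.run(question="What is tensor parallelism?")
print(state["answer"])
```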
SRT is the HTTP server. @aflah02
Ah okay, nice. It would also be great to have a way to do this via the Python API, though, instead of running commands in the terminal. Is that currently possible? That is, running multi-node by just setting the backend, model path, and tp-size in the Python API?
Sorry, I don't think that's feasible right now. We support running Llama 405B in this way: https://sgl-project.github.io/backend/backend.html#example-run-llama-3-1-405b Also, my advisor told me that Llama 405B is rarely used since its performance is worse than Qwen 2.5 72B Instruct and Llama 3.3 70B Instruct. Maybe you can bypass this. 😂
Yeah, but it's not just about the performance. If I want to, say, benchmark the model, I still need to run it, and running it via the CLI is much less convenient compared to having one Python script. It would be really useful if this could be added in future releases.
Also, sorry I couldn't join the meeting yesterday. Any updates on whether this is on the roadmap?
Hi @zhaochenyang20
@aflah02 Nice blog!
Thanks :)
@aflah02 This blog looks good. Can you submit a PR to write some of the SLURM commands into a script?
Sure
I think this is the right place. Markdown is perfectly fine, and you can refer to your blog at: https://aflah02.substack.com/p/multi-node-llm-inference-with-sglang Great job, thanks! @aflah02
Thanks! I'll raise a PR shortly.
Looking forward to it @aflah02
@aflah02 Hey, how is the PR going?
Sorry for the delay, @zhaochenyang20
Checklist
Motivation
A lot of academic institutions only allow access to larger node clusters via SLURM, and it is not immediately clear how I would reuse the code for running Llama 405B BF16 on 2 nodes (by starting a server) to perform offline inference.
Related resources
No response