Encounter NCCL error when running PyTorch example code #1504
It looks like PyTorch times out for some reason. Try rerunning the job with
Thanks for responding to my issue! I tried this option, and the output in the Linux shell is
It seems that I should compile with
It's a little hard to tell from that log whether there's an NCCL issue here or not. Is it the case that this application never fully initializes and simply times out after 1800 s? Would you happen to know when the following two lines are printed:
Specifically, I'm wondering whether they are printed at startup or around the time the 1800 s timeout expires. Does NCCL currently run on this system at all? Have you tried compiling and running our nccl-tests?
You may want to also check https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
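The nccl-tests suggestion above can be sketched roughly as follows; the GPU count (`-g 4`) matches the four ranks launched in this issue and is an assumption about the setup, as is the CUDA install path:

```shell
# Clone and build NVIDIA's nccl-tests (assumes CUDA and NCCL are installed;
# adjust CUDA_HOME to your installation path).
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/usr/local/cuda

# Run an all-reduce benchmark across 4 GPUs on a single node.
# -b/-e set the min/max message sizes, -f the size multiplier, -g the GPU count.
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
```

If this benchmark hangs or errors too, the problem is in the NCCL/driver setup rather than in the PyTorch script itself.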
Hi! When I try to run a Python script for LLM inference with pipeline parallelism on a single server with multiple GPUs, I get errors related to NCCL.
Here is my development environment:
Here is the command I use to run the Python code:
torchrun --nproc-per-node 4 pippy_llama.py
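One common first step (per the NCCL troubleshooting guide linked above) is to rerun the same launch with NCCL's debug logging enabled; this is a sketch assuming the same four-rank `torchrun` invocation:

```shell
# Enable verbose NCCL logging to see where initialization stalls.
# NCCL_DEBUG=INFO prints NCCL version and topology info at startup;
# NCCL_DEBUG_SUBSYS narrows the output to the init and network subsystems.
# TORCH_DISTRIBUTED_DEBUG=DETAIL adds PyTorch-side collective diagnostics.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET \
TORCH_DISTRIBUTED_DEBUG=DETAIL \
torchrun --nproc-per-node 4 pippy_llama.py
```

If no `NCCL INFO` lines appear at all, NCCL was likely never initialized and the hang is earlier, e.g. in rendezvous.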
Here is the error:
I have no experience with CUDA programming or the implementation of NCCL, so it's hard for me to fix this bug. Can anyone help me with this? Thanks!