ncclInternalError: Internal check failed #1499
Comments
We'd need to see the output of …
The WARN is likely … This happens when each NCCL rank within a node does not see the same intra-node topology, or when different ranks are run with different parameters. Seeing a different node topology can sometimes happen inside VMs. Having the log with …
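The truncated comments above are presumably asking for NCCL's debug log. For reference, NCCL's documented debug output can be enabled through environment variables before the process group is created; a minimal way to do that from inside the training script might look like the sketch below (the log file path is only an example).

```python
import os
import torch.distributed as dist

# Verbose NCCL logs for communicator init and topology detection; these must be
# set before the first NCCL communicator is created.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,GRAPH")
# Optional: one log file per process; %h and %p are expanded by NCCL to hostname and pid.
os.environ.setdefault("NCCL_DEBUG_FILE", "/tmp/nccl.%h.%p.log")

dist.init_process_group(backend="nccl")
```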
@AddyLaddy @sjeaugey Thank you very much for your answers.
🐛 Describe the bug
I hit an error when using torchrun for 4-GPU training with the 'nccl' backend (it runs perfectly when I use 'gloo'). The environment is Python 3.9 + PyTorch 2.3.0 + CUDA 12.1. We used uftrace to trace the DLRM code on 4 GPUs launched by torchrun; the command is as follows:
torchrun --nproc_per_node=4 ./multi-uftrace.py
The multi-uftrace.py file content is as follows:
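(The full content of multi-uftrace.py does not appear in this copy of the issue. As a rough stand-in, a minimal torchrun-launched script that exercises the NCCL backend might look like the sketch below; the Linear model and tensor sizes are placeholders, not the actual DLRM code.)

```python
# Minimal stand-in for a torchrun-launched NCCL training script; the model and
# sizes are placeholders, not the original DLRM code from this issue.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every worker process.
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    dist.init_process_group(backend="nccl")

    model = DDP(torch.nn.Linear(128, 128).to(device), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 128, device=device)
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across ranks via NCCL here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with `torchrun --nproc_per_node=4 ./multi-uftrace.py`, the NCCL communicator is created when DDP wraps the model, and every backward pass then performs an NCCL all-reduce.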
The resulting error is the ncclInternalError: Internal check failed shown in the issue title.
To capture the underlying PyTorch function calls, we compiled PyTorch as the pg version. With that build, the above error occurs on 4 GPUs but not on 2 GPUs. When we instead compile the develop version, it runs correctly. Is there any way to prevent such errors when running the pg version on 4 GPUs?
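Since the same code reportedly runs fine with the 'gloo' backend, one temporary way to keep the pg build usable while the NCCL failure is investigated is to make the backend selectable at launch time. This is only a debugging workaround sketch; DIST_BACKEND is an invented variable name for this example, not a PyTorch setting.

```python
import os
import torch.distributed as dist

# DIST_BACKEND is a local convention for this sketch, not a PyTorch/NCCL setting.
backend = os.environ.get("DIST_BACKEND", "nccl")
dist.init_process_group(backend=backend)
# e.g.  DIST_BACKEND=gloo torchrun --nproc_per_node=4 ./multi-uftrace.py
```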
Versions
GPU: 4 × A100 80GB
Driver version: 530.30.02
CUDA version: 12.1
OS: Ubuntu 22.04
Python: 3.9
PyTorch: v2.3.0
NCCL: v2.20.5