RuntimeError: NCCL error in #5

Open
cqtanzj opened this issue Aug 25, 2022 · 8 comments

Comments

@cqtanzj

cqtanzj commented Aug 25, 2022

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180594101/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3

@tech-fisher

I hit a similar error. Please help.

@bobfacer

I ran into this problem too.
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3 ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

@bobfacer

It works when using 1 GPU.

@Boese0601
Owner

This looks like an issue caused by DistributedDataParallel. Have you installed the PyTorch and CUDA versions that were provided?

@DongyangHuLi

DongyangHuLi commented Nov 11, 2022

I configured my environment exactly as described in the readme file, but it still didn't work.

@Boese0601
Owner

What are your graphics card and CUDA version?

@DongyangHuLi

DongyangHuLi commented Nov 11, 2022

RTX 3090 and CUDA 11.4.
[image]
The error is:
[image]
Could you give me some help? :)

@alexrich021

The issue is that argparse isn't parsing the --gpu argument into a list. train_rcmvsnet.py:125 then sets the world size to the length of the string passed to --gpu (e.g. 5 when using --gpu [0,1]).
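
To make the failure concrete, here is a minimal sketch of how the string length ends up as the world size. The original declaration at train_rcmvsnet.py:68 is not quoted in this thread, so the `add_argument` call below (no `type`, no `nargs`) is an assumption, as is the `world_size` variable name:

```python
import argparse

# Assumption: the original declaration had no `type` or `nargs`, so argparse
# stores whatever follows --gpu as a plain string.
parser = argparse.ArgumentParser()
parser.add_argument('--gpu', default=[0], help='gpu')

# Simulates running the script with: --gpu [0,1]
args = parser.parse_args(['--gpu', '[0,1]'])

print(repr(args.gpu))       # the raw string '[0,1]', not a list of ints
world_size = len(args.gpu)  # counts characters: [ 0 , 1 ]
print(world_size)           # 5, although only 2 GPUs were requested
```

Asking for 5 ranks when only 2 GPUs were requested would be consistent with the "invalid usage" NCCL error reported above.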

Just change train_rcmvsnet.py:68 to

parser.add_argument('--gpu', default=[0], help='gpu', nargs='+', type=int)

and pass the gpu args as --gpu 0 1 instead. That solved it for me anyway.
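
A quick standalone check of that declaration, outside the training script. The variable names `two_gpus` and `default_gpu` are only for illustration, and the argument lists are passed explicitly instead of coming from the command line:

```python
import argparse

parser = argparse.ArgumentParser()
# nargs='+' collects one or more values into a list; type=int converts each one.
parser.add_argument('--gpu', default=[0], help='gpu', nargs='+', type=int)

# Simulates running the script with: --gpu 0 1
two_gpus = parser.parse_args(['--gpu', '0', '1'])
print(two_gpus.gpu, len(two_gpus.gpu))        # [0, 1] 2

# With no --gpu flag, the list default is used as-is.
default_gpu = parser.parse_args([])
print(default_gpu.gpu, len(default_gpu.gpu))  # [0] 1
```

With this declaration, taking `len()` of the parsed value gives the number of GPUs requested, so a world size derived from it matches the devices actually listed. Note that the old bracketed form (--gpu [0,1]) is rejected by argparse here, because '[0,1]' cannot be converted to an int.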
