RuntimeError: NCCL error in #5

cqtanzj · 2022-08-25T16:22:29Z

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180594101/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3

tech-fisher · 2022-10-30T08:54:23Z

I ran into a similar error. Please help.

bobfacer · 2022-10-31T17:01:44Z

I ran into this problem too.
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3 ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

bobfacer · 2022-10-31T17:57:36Z

It works when using 1 GPU.

Boese0601 · 2022-11-09T21:17:26Z

This looks like an issue with DistributedDataParallel. Have you installed the specified versions of PyTorch and CUDA?

DongyangHuLi · 2022-11-11T01:48:19Z

I configured my environment exactly as described in the readme file, but it still didn't work.

Boese0601 · 2022-11-11T02:27:33Z

> I configured my environment exactly as described in the readme.txt file, but it still didn't work.

What's your graphics card and CUDA version?

DongyangHuLi · 2022-11-11T02:52:34Z

> > I configured my environment exactly as described in the readme.txt file, but it still didn't work.

> What's your graphics card and CUDA version?

RTX 3090 and 11.4

and the error is:

Could you give me some help? :)

alexrich021 · 2022-11-21T19:50:01Z

The issue is that argparse doesn't parse the --gpu argument into a list, so the value stays a string. train_rcmvsnet.py:125 then sets the world size to the length of that string (e.g. 5 when passing --gpu [0,1]).

Just change train_rcmvsnet.py:68 to

parser.add_argument('--gpu',default=[0],help='gpu',nargs='+',type=int)

and pass the gpu args as --gpu 0 1 instead. That solved it for me anyway.
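To illustrate why the world size comes out wrong, here is a minimal standalone sketch. It assumes the original --gpu argument was declared without nargs or type (the original line 68 isn't quoted in this thread, so that declaration is a guess), and the two helper function names are made up for the example:

```python
import argparse


def parse_gpu_as_string(argv):
    # Assumed shape of the original declaration: no nargs/type,
    # so whatever is passed on the command line stays a plain string.
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu', default=[0], help='gpu')
    return parser.parse_args(argv).gpu


def parse_gpu_as_list(argv):
    # The declaration suggested above: one or more ints collected into a list.
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu', default=[0], help='gpu', nargs='+', type=int)
    return parser.parse_args(argv).gpu


broken = parse_gpu_as_string(['--gpu', '[0,1]'])
fixed = parse_gpu_as_list(['--gpu', '0', '1'])

# len() of the string counts characters, not GPUs.
print(repr(broken), len(broken))  # '[0,1]' 5
print(fixed, len(fixed))          # [0, 1] 2
```

With the string form, anything that derives the world size from len(args.gpu) asks for 5 processes on a 2-GPU request, which would be consistent with the NCCL "invalid usage" error reported here.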
