
modify variable name #7

Conversation

zhuhaozhe (Collaborator)

Fixes #ISSUE_NUMBER

Description

Checklist

  • The issue that is being fixed is referenced in the description (see above, "Fixes #ISSUE_NUMBER")
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR is fixing are added to this pull request
  • No unnecessary issues are included in this pull request

@zhuhaozhe zhuhaozhe merged commit 0980471 into Valentine233:add_inductor_debug_doc Jun 9, 2023
svekars pushed a commit that referenced this pull request Jun 18, 2023
This should prevent crashes during NCCL initialization.

If `data_parallel_tutorial.py` is executed without this option, it segfaults in `ncclShmOpen` while executing `nn.DataParallel(model)`.
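
The mechanism is worth spelling out: `ncclShmOpen` maps a segment in `/dev/shm` (9637888 bytes per the backtrace below), and writing into an mmap that the tmpfs cannot actually back raises SIGBUS. A minimal stdlib sketch (my own illustration, not code from the tutorial or from NCCL) that checks whether a tmpfs has headroom before NCCL initialization:

```python
import os

def shm_headroom(path="/dev/shm", required=9637888):
    """Return (free_bytes, ok) for the filesystem at `path`.

    `required` defaults to the segment size ncclShmOpen tried to map
    in the backtrace below (9637888 bytes). If the tmpfs cannot back
    the mapping, the first write into it raises SIGBUS.
    """
    st = os.statvfs(path)
    free = st.f_bavail * st.f_frsize
    return free, free >= required

# Probe /tmp instead when /dev/shm does not exist on this system.
probe = "/dev/shm" if os.path.isdir("/dev/shm") else "/tmp"
free_bytes, ok = shm_headroom(probe)
print(f"{probe}: {free_bytes} bytes free, sufficient={ok}")
```

Note that this only detects the undersized-`/dev/shm` case up front; it does not change how NCCL allocates its segments.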

For posterity:
```
% nvidia-smi 
Fri Jun 16 20:46:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   41C    P0    37W / 150W |    752MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   36C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   41C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

% NCCL_DEBUG=INFO python data_parallel_tutorial.py 
Let's use 4 GPUs!
c825878acf65:32373:32373 [0] NCCL INFO cudaDriverVersion 12010
c825878acf65:32373:32373 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
c825878acf65:32373:32373 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
NCCL version 2.14.3+cuda11.7
c825878acf65:32373:32443 [0] NCCL INFO NET/IB : No device found.
c825878acf65:32373:32443 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
c825878acf65:32373:32443 [0] NCCL INFO Using network Socket
c825878acf65:32373:32445 [2] NCCL INFO Using network Socket
c825878acf65:32373:32446 [3] NCCL INFO Using network Socket
c825878acf65:32373:32444 [1] NCCL INFO Using network Socket
c825878acf65:32373:32446 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
c825878acf65:32373:32445 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
c825878acf65:32373:32443 [0] NCCL INFO Channel 00/02 :    0   1   2   3
c825878acf65:32373:32444 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
c825878acf65:32373:32443 [0] NCCL INFO Channel 01/02 :    0   1   2   3
c825878acf65:32373:32443 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
Bus error (core dumped)

(lldb) bt
* thread #1, name = 'python', stop reason = signal SIGBUS
  * frame #0: 0x00007effcd6b0ded libc.so.6`__memset_avx2_erms at memset-vec-unaligned-erms.S:145
    frame #1: 0x00007eff3985e425 libnccl.so.2`ncclShmOpen(char*, int, void**, void**, int) at shmutils.cc:52
    frame #2: 0x00007eff3985e377 libnccl.so.2`ncclShmOpen(shmPath="/dev/shm/nccl-7dX4mg", shmSize=9637888, shmPtr=0x00007efe4a59ac30, devShmPtr=0x00007efe4a59ac38, create=1) at shmutils.cc:61
    frame #3: 0x00007eff39863322 libnccl.so.2`::shmRecvSetup(comm=<unavailable>, graph=<unavailable>, myInfo=<unavailable>, peerInfo=<unavailable>, connectInfo=0x00007efe57fe3fe0, recv=0x00007efe4a05d2f0, channelId=0, connIndex=0) at shm.cc:110
    frame #4: 0x00007eff398446a4 libnccl.so.2`ncclTransportP2pSetup(ncclComm*, ncclTopoGraph*, int, int*) at transport.cc:33
    frame #5: 0x00007eff398445c0 libnccl.so.2`ncclTransportP2pSetup(comm=0x0000000062355ab0, graph=0x00007efe57fe6a40, connIndex=0, highestTransportType=0x0000000000000000) at transport.cc:89
    frame #6: 0x00007eff398367cd libnccl.so.2`::initTransportsRank(comm=0x0000000062355ab0, commId=<unavailable>) at init.cc:790
    frame #7: 0x00007eff398383fe libnccl.so.2`::ncclCommInitRankFunc(job_=<unavailable>) at init.cc:1089
    frame #8: 0x00007eff3984de07 libnccl.so.2`ncclAsyncJobMain(arg=0x000000006476e6d0) at group.cc:62
    frame #9: 0x00007effce0bf6db libpthread.so.0`start_thread + 219
    frame #10: 0x00007effcd64361f libc.so.6`__GI___clone at clone.S:95
```
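
The commit message does not name the option inline, so the following is a hedged sketch of the usual mitigations for this class of crash, not necessarily the exact change made here: enlarging the container's `/dev/shm` (NCCL's shared-memory transport maps its segments there), or disabling that transport via the documented `NCCL_SHM_DISABLE` environment variable.

```shell
# Mitigation 1 (assumption): give the container a larger /dev/shm, e.g.
#   docker run --shm-size=1g ...

# Mitigation 2 (assumption): disable NCCL's shared-memory transport so
# it falls back to another transport (e.g. sockets):
export NCCL_SHM_DISABLE=1
# python data_parallel_tutorial.py   # re-run the tutorial with SHM disabled
echo "NCCL_SHM_DISABLE=$NCCL_SHM_DISABLE"
```

Disabling the SHM transport trades some intra-node bandwidth for robustness, so enlarging `/dev/shm` is generally the preferred fix when the container runtime allows it.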