[CI] Spawn docker container with 2Gb shmem (pytorch#2475)
Should prevent crashes during NCCL initialization. If `data_parallel_tutorial.py` is executed without this option, it crashes with a bus error (SIGBUS) in `ncclShmOpen` while executing `nn.DataParallel(model)`.

For posterity:

```
% nvidia-smi
Fri Jun 16 20:46:45 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   41C    P0    37W / 150W |    752MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   36C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   41C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

% NCCL_DEBUG=INFO python data_parallel_tutorial.py
Let's use 4 GPUs!
c825878acf65:32373:32373 [0] NCCL INFO cudaDriverVersion 12010
c825878acf65:32373:32373 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
c825878acf65:32373:32373 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
NCCL version 2.14.3+cuda11.7
c825878acf65:32373:32443 [0] NCCL INFO NET/IB : No device found.
c825878acf65:32373:32443 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
c825878acf65:32373:32443 [0] NCCL INFO Using network Socket
c825878acf65:32373:32445 [2] NCCL INFO Using network Socket
c825878acf65:32373:32446 [3] NCCL INFO Using network Socket
c825878acf65:32373:32444 [1] NCCL INFO Using network Socket
c825878acf65:32373:32446 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
c825878acf65:32373:32445 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
c825878acf65:32373:32443 [0] NCCL INFO Channel 00/02 : 0 1 2 3
c825878acf65:32373:32444 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
c825878acf65:32373:32443 [0] NCCL INFO Channel 01/02 : 0 1 2 3
c825878acf65:32373:32443 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
Bus error (core dumped)

(lldb) bt
* thread #1, name = 'python', stop reason = signal SIGBUS
  * frame #0: 0x00007effcd6b0ded libc.so.6`__memset_avx2_erms at memset-vec-unaligned-erms.S:145
    frame #1: 0x00007eff3985e425 libnccl.so.2`ncclShmOpen(char*, int, void**, void**, int) at shmutils.cc:52
    frame #2: 0x00007eff3985e377 libnccl.so.2`ncclShmOpen(shmPath="/dev/shm/nccl-7dX4mg", shmSize=9637888, shmPtr=0x00007efe4a59ac30, devShmPtr=0x00007efe4a59ac38, create=1) at shmutils.cc:61
    frame #3: 0x00007eff39863322 libnccl.so.2`::shmRecvSetup(comm=<unavailable>, graph=<unavailable>, myInfo=<unavailable>, peerInfo=<unavailable>, connectInfo=0x00007efe57fe3fe0, recv=0x00007efe4a05d2f0, channelId=0, connIndex=0) at shm.cc:110
    frame #4: 0x00007eff398446a4 libnccl.so.2`ncclTransportP2pSetup(ncclComm*, ncclTopoGraph*, int, int*) at transport.cc:33
    frame #5: 0x00007eff398445c0 libnccl.so.2`ncclTransportP2pSetup(comm=0x0000000062355ab0, graph=0x00007efe57fe6a40, connIndex=0, highestTransportType=0x0000000000000000) at transport.cc:89
    frame #6: 0x00007eff398367cd libnccl.so.2`::initTransportsRank(comm=0x0000000062355ab0, commId=<unavailable>) at init.cc:790
    frame #7: 0x00007eff398383fe libnccl.so.2`::ncclCommInitRankFunc(job_=<unavailable>) at init.cc:1089
    frame #8: 0x00007eff3984de07 libnccl.so.2`ncclAsyncJobMain(arg=0x000000006476e6d0) at group.cc:62
    frame #9: 0x00007effce0bf6db libpthread.so.0`start_thread + 219
    frame #10: 0x00007effcd64361f libc.so.6`__GI___clone at clone.S:95
```
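For context: Docker gives a container only 64MB of `/dev/shm` by default, while the backtrace shows NCCL mapping a ~9.6MB segment per connection (`shmSize=9637888`), so a 4-GPU run exhausts it and `memset` faults on the truncated mapping. A minimal sketch of what the fix amounts to at the `docker run` level; `<ci-image>` is a placeholder and the actual CI scripts may pass additional flags:

```
# Default /dev/shm (64MB) is too small for NCCL's shared-memory transport;
# ncclShmOpen dies with SIGBUS while zeroing the truncated mapping:
% docker run --gpus all <ci-image> python data_parallel_tutorial.py

# Spawning the container with 2GB of shared memory avoids the crash:
% docker run --gpus all --shm-size=2g <ci-image> python data_parallel_tutorial.py
```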