[CI] Spawn docker container with 2Gb shmem (pytorch#2475)
Should prevent crashes during NCCL initialization.

Without this option, `data_parallel_tutorial.py` crashes with a bus error (SIGBUS) in `ncclShmOpen` while executing `nn.DataParallel(model)`.
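
A pre-flight check along these lines could surface an undersized `/dev/shm` before NCCL starts. This is a minimal sketch and not part of the commit: the helper name `shm_has_room` and the 1 GiB threshold are assumptions chosen for illustration (Docker's default `/dev/shm` size is 64 MB when `--shm-size` is not given).

```python
import os
import shutil


def shm_has_room(path="/dev/shm", min_free_bytes=1 << 30):
    """Return True if the filesystem holding `path` has at least
    `min_free_bytes` of free space.

    Both the name and the default threshold are hypothetical; pick a
    threshold that matches what your workload actually needs.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return usage.free >= min_free_bytes


if __name__ == "__main__":
    if os.path.isdir("/dev/shm"):
        usage = shutil.disk_usage("/dev/shm")
        print(f"/dev/shm: total={usage.total} bytes, free={usage.free} bytes")
        if not shm_has_room():
            print("warning: /dev/shm may be too small; "
                  "consider starting the container with a larger --shm-size")
```

Running this inside the container before the tutorial gives an early, readable warning instead of a core dump.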

For posterity:
```
% nvidia-smi 
Fri Jun 16 20:46:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   41C    P0    37W / 150W |    752MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   36C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   41C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

% NCCL_DEBUG=INFO python data_parallel_tutorial.py 
Let's use 4 GPUs!
c825878acf65:32373:32373 [0] NCCL INFO cudaDriverVersion 12010
c825878acf65:32373:32373 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
c825878acf65:32373:32373 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
NCCL version 2.14.3+cuda11.7
c825878acf65:32373:32443 [0] NCCL INFO NET/IB : No device found.
c825878acf65:32373:32443 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
c825878acf65:32373:32443 [0] NCCL INFO Using network Socket
c825878acf65:32373:32445 [2] NCCL INFO Using network Socket
c825878acf65:32373:32446 [3] NCCL INFO Using network Socket
c825878acf65:32373:32444 [1] NCCL INFO Using network Socket
c825878acf65:32373:32446 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
c825878acf65:32373:32445 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
c825878acf65:32373:32443 [0] NCCL INFO Channel 00/02 :    0   1   2   3
c825878acf65:32373:32444 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
c825878acf65:32373:32443 [0] NCCL INFO Channel 01/02 :    0   1   2   3
c825878acf65:32373:32443 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
Bus error (core dumped)

(lldb) bt
* thread #1, name = 'python', stop reason = signal SIGBUS
  * frame #0: 0x00007effcd6b0ded libc.so.6`__memset_avx2_erms at memset-vec-unaligned-erms.S:145
    frame #1: 0x00007eff3985e425 libnccl.so.2`ncclShmOpen(char*, int, void**, void**, int) at shmutils.cc:52
    frame #2: 0x00007eff3985e377 libnccl.so.2`ncclShmOpen(shmPath="/dev/shm/nccl-7dX4mg", shmSize=9637888, shmPtr=0x00007efe4a59ac30, devShmPtr=0x00007efe4a59ac38, create=1) at shmutils.cc:61
    frame #3: 0x00007eff39863322 libnccl.so.2`::shmRecvSetup(comm=<unavailable>, graph=<unavailable>, myInfo=<unavailable>, peerInfo=<unavailable>, connectInfo=0x00007efe57fe3fe0, recv=0x00007efe4a05d2f0, channelId=0, connIndex=0) at shm.cc:110
    frame #4: 0x00007eff398446a4 libnccl.so.2`ncclTransportP2pSetup(ncclComm*, ncclTopoGraph*, int, int*) at transport.cc:33
    frame #5: 0x00007eff398445c0 libnccl.so.2`ncclTransportP2pSetup(comm=0x0000000062355ab0, graph=0x00007efe57fe6a40, connIndex=0, highestTransportType=0x0000000000000000) at transport.cc:89
    frame #6: 0x00007eff398367cd libnccl.so.2`::initTransportsRank(comm=0x0000000062355ab0, commId=<unavailable>) at init.cc:790
    frame #7: 0x00007eff398383fe libnccl.so.2`::ncclCommInitRankFunc(job_=<unavailable>) at init.cc:1089
    frame #8: 0x00007eff3984de07 libnccl.so.2`ncclAsyncJobMain(arg=0x000000006476e6d0) at group.cc:62
    frame #9: 0x00007effce0bf6db libpthread.so.0`start_thread + 219
    frame #10: 0x00007effcd64361f libc.so.6`__GI___clone at clone.S:95
```
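
The backtrace shows the fault in `memset` inside `ncclShmOpen`, that is, while the newly created `/dev/shm/nccl-*` segment is being zeroed. One plausible reading (an assumption, not stated in the commit) is that the file was sized without its pages being reserved, so the first write past the container's shared-memory limit raised SIGBUS. The sketch below illustrates the general alternative of reserving space eagerly with `os.posix_fallocate`, so that exhaustion surfaces as an `OSError` rather than a signal. `reserve_file` is a hypothetical helper and does not describe how NCCL 2.14.3 is implemented.

```python
import os


def reserve_file(path, size):
    """Create `path` and eagerly allocate `size` bytes for it.

    Returns an open file descriptor on success. If the filesystem
    cannot back the full size, the partially created file is removed
    and the OSError (for example ENOSPC) is re-raised.
    """
    # O_EXCL makes this fail if the file already exists, so the
    # cleanup below never deletes a file this call did not create.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)  # reserve space up front (Unix only)
    except OSError:
        os.close(fd)
        os.unlink(path)
        raise
    return fd
```

A caller would then `mmap` the returned descriptor knowing the backing space already exists, and close it when done.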
malfet authored Jun 16, 2023
1 parent f0e587e commit 3eef691
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions .github/workflows/build-tutorials.yml
@@ -87,6 +87,7 @@ jobs:
--tty \
--detach \
--user jenkins \
+ --shm-size=2gb \
--name="${container_name}" \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \