Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

using in our testing for speed up numbers and other statement fix #8

Conversation

zhuhaozhe
Copy link
Collaborator

Fixes #ISSUE_NUMBER

Description

Checklist

  • The issue that is being fixed is referred in the description (see above "Fixes #ISSUE_NUMBER")
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR is fixing are added to this pull request
  • No unnecessary issues are included into this pull request.

@zhuhaozhe zhuhaozhe merged commit cebd517 into Valentine233:add_inductor_debug_doc Jun 12, 2023
svekars pushed a commit that referenced this pull request Jun 18, 2023
Should prevent crashes during NCCL initialization.

If `data_parallel_tutorial.py` is executed without this option it would segfault in `ncclShmOpen` while executing `nn.DataParallel(model)`

For posterity:
```
% nvidia-smi 
Fri Jun 16 20:46:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   41C    P0    37W / 150W |    752MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   36C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   41C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

% NCCL_DEBUG=INFO python data_parallel_tutorial.py 
Let's use 4 GPUs!
c825878acf65:32373:32373 [0] NCCL INFO cudaDriverVersion 12010
c825878acf65:32373:32373 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
c825878acf65:32373:32373 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
NCCL version 2.14.3+cuda11.7
c825878acf65:32373:32443 [0] NCCL INFO NET/IB : No device found.
c825878acf65:32373:32443 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
c825878acf65:32373:32443 [0] NCCL INFO Using network Socket
c825878acf65:32373:32445 [2] NCCL INFO Using network Socket
c825878acf65:32373:32446 [3] NCCL INFO Using network Socket
c825878acf65:32373:32444 [1] NCCL INFO Using network Socket
c825878acf65:32373:32446 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
c825878acf65:32373:32445 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
c825878acf65:32373:32443 [0] NCCL INFO Channel 00/02 :    0   1   2   3
c825878acf65:32373:32444 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
c825878acf65:32373:32443 [0] NCCL INFO Channel 01/02 :    0   1   2   3
c825878acf65:32373:32443 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
Bus error (core dumped)

(lldb) bt
* thread #1, name = 'python', stop reason = signal SIGBUS
  * frame #0: 0x00007effcd6b0ded libc.so.6`__memset_avx2_erms at memset-vec-unaligned-erms.S:145
    frame #1: 0x00007eff3985e425 libnccl.so.2`ncclShmOpen(char*, int, void**, void**, int) at shmutils.cc:52
    frame #2: 0x00007eff3985e377 libnccl.so.2`ncclShmOpen(shmPath="/dev/shm/nccl-7dX4mg", shmSize=9637888, shmPtr=0x00007efe4a59ac30, devShmPtr=0x00007efe4a59ac38, create=1) at shmutils.cc:61
    frame #3: 0x00007eff39863322 libnccl.so.2`::shmRecvSetup(comm=<unavailable>, graph=<unavailable>, myInfo=<unavailable>, peerInfo=<unavailable>, connectInfo=0x00007efe57fe3fe0, recv=0x00007efe4a05d2f0, channelId=0, connIndex=0) at shm.cc:110
    frame #4: 0x00007eff398446a4 libnccl.so.2`ncclTransportP2pSetup(ncclComm*, ncclTopoGraph*, int, int*) at transport.cc:33
    frame #5: 0x00007eff398445c0 libnccl.so.2`ncclTransportP2pSetup(comm=0x0000000062355ab0, graph=0x00007efe57fe6a40, connIndex=0, highestTransportType=0x0000000000000000) at transport.cc:89
    frame #6: 0x00007eff398367cd libnccl.so.2`::initTransportsRank(comm=0x0000000062355ab0, commId=<unavailable>) at init.cc:790
    frame #7: 0x00007eff398383fe libnccl.so.2`::ncclCommInitRankFunc(job_=<unavailable>) at init.cc:1089
    frame #8: 0x00007eff3984de07 libnccl.so.2`ncclAsyncJobMain(arg=0x000000006476e6d0) at group.cc:62
    frame #9: 0x00007effce0bf6db libpthread.so.0`start_thread + 219
    frame #10: 0x00007effcd64361f libc.so.6`__GI___clone at clone.S:95
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant