This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

test_nccl.py script causes a core dump on a P2.16xlarge instance when run against an NCCL-enabled MXNet build. #9004

Open
leleamol opened this issue Dec 8, 2017 · 9 comments

Comments

@leleamol
Contributor

leleamol commented Dec 8, 2017

Note: Providing complete information in the most concise form is the best way to get help. This issue template serves as the checklist for essential information to most of the technical issues and bug reports. For non-technical issues and feature requests, feel free to present the information in what you believe is the best form.

For Q & A and discussion, please start a discussion thread at https://discuss.mxnet.io

Description

The test_nccl.py script causes a core dump when run against an NCCL-enabled MXNet build.

Environment info (Required)

MXNet version v1.0.0 built with USE_NCCL=1 and USE_NCCL_PATH
NCCL 2.1
Instance type : p2.16xlarge

What to do:
1. Download the diagnosis script from https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
2. Run the script using `python diagnose.py` and paste its output here.

[ec2-user@ip-172-31-42-123 tools]$ python diagnose.py 
----------Python Info----------
('Version      :', '2.7.12')
('Compiler     :', 'GCC 4.8.5 20150623 (Red Hat 4.8.5-11)')
('Build        :', ('default', 'Nov  2 2017 19:20:38'))
('Arch         :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version      :', '9.0.1')
('Directory    :', '/usr/lib/python2.7/dist-packages/pip')
----------MXNet Info-----------
('Version      :', '1.0.0')
('Directory    :', '/usr/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet')
Traceback (most recent call last):
  File "diagnose.py", line 171, in <module>
    check_mxnet()
  File "diagnose.py", line 113, in check_mxnet
    except FileNotFoundError:
NameError: global name 'FileNotFoundError' is not defined
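
The traceback above shows diagnose.py stopping early because `FileNotFoundError` is not defined on Python 2.7 (the name was added in Python 3.3). A minimal sketch of a pattern that works on both versions, assuming the script only needs to tolerate a missing file; `read_if_exists` is a hypothetical helper, not a function from diagnose.py:

```python
import errno

try:
    FileNotFoundError  # built in on Python 3.3+
except NameError:
    # Python 2 has no FileNotFoundError; IOError is the closest built-in.
    FileNotFoundError = IOError


def read_if_exists(path):
    """Return the file contents, or None when the file does not exist."""
    try:
        with open(path) as handle:
            return handle.read()
    except FileNotFoundError as err:
        # On Python 2 this branch also sees other IOErrors, so re-raise
        # anything that is not "no such file or directory".
        if getattr(err, "errno", errno.ENOENT) != errno.ENOENT:
            raise
        return None
```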

Package used (Python/R/Scala/Julia):
(I'm using ...) Python

For Scala user, please provide:

  1. Java version: (java -version)
  2. Maven version: (mvn -version)
  3. Scala runtime if applicable: (scala -version)

For R user, please provide R sessionInfo():

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio): gcc

MXNet commit hash:
(Paste the output of git rev-parse HEAD here.)
2b67436

Build config:
(Paste the content of config.mk, or the build command.)
USE_CUDA=1
USE_CUDA_PATH=/usr/local/cuda
USE_CUDNN=1
USE_DIST_KVSTORE=1
USE_MKL2017=1
USE_BLAS=openblas
USE_S3=1
USE_NCCL=1
USE_NCCL_PATH=/usr/nccl/cuda-9
CUDA_ARCH := -gencode arch=compute_35,code=sm_35 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70

Error Message:

Core was generated by `/home/ec2-user/src/anaconda2/bin/python ./src/anaconda2/bin/nosetests /home/ec2'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f21e752aa6e in commFree (comm=0x5567ce4da6f0) at init.cu:100
100 init.cu: No such file or directory.
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.5.8-3.12.amzn1.x86_64 krb5-libs-1.15.1-8.43.amzn1.x86_64 libcom_err-1.42.12-4.40.amzn1.x86_64 libjpeg-turbo-1.2.90-5.14.amzn1.x86_64 libselinux-2.1.10-3.22.amzn1.x86_64 libuuid-2.23.2-33.28.amzn1.x86_64 openssl-1.0.2k-8.106.amzn1.x86_64
(gdb) where
#0 0x00007f21e752aa6e in commFree (comm=0x5567ce4da6f0) at init.cu:100
#1 0x00007f21e752edad in ncclCommInitAll (comms=, ndev=, devlist=) at init.cu:692
#2 0x00007f22294f7a50 in mxnet::kvstore::KVStoreNCCL::Reduce(std::vector<int, std::allocator<int> >, std::vector<std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >, std::allocator<std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > > > const&, int, std::vector<mxnet::NDArray const*, std::allocator<mxnet::NDArray const*> >*) () from /home/ec2-user/src/anaconda2/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#3 0x00007f222950423a in mxnet::kvstore::KVStoreNCCL::PushImpl(std::vector<int, std::allocator<int> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int) () from /home/ec2-user/src/anaconda2/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#4 0x00007f22294baba1 in mxnet::kvstore::KVStoreLocal::Push(std::vector<std::string, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int) () from /home/ec2-user/src/anaconda2/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#5 0x00007f22294377fb in MXKVStorePushEx () from /home/ec2-user/src/anaconda2/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#6 0x00007f224231aec0 in ffi_call_unix64 () from /home/ec2-user/src/anaconda2/lib/python2.7/lib-dynload/../../libffi.so.6
#7 0x00007f224231a87d in ffi_call () from /home/ec2-user/src/anaconda2/lib/python2.7/lib-dynload/../../libffi.so.6
#8 0x00007f2242530736 in _ctypes_callproc () from /home/ec2-user/src/anaconda2/lib/python2.7/lib-dynload/_ctypes.so
#9 0x00007f2242526a61 in PyCFuncPtr_call () from /home/ec2-user/src/anaconda2/lib/python2.7/lib-dynload/_ctypes.so
#10 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#11 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#12 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#13 0x00007f224e0c2482 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#14 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#15 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#16 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#17 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#18 0x00007f224e0c3dac in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#19 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#20 0x00007f224e04e0c7 in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#21 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#22 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#23 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#24 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#25 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#26 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#27 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#28 0x00007f224e082574 in slot_tp_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#29 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#30 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#31 0x00007f224e0c3dac in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#32 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#33 0x00007f224e04e0c7 in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#34 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#35 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#36 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#37 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#38 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#39 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#40 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#41 0x00007f224e082574 in slot_tp_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#42 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#43 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#44 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#45 0x00007f224e04e0c7 in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#46 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#47 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#48 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#49 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#50 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#51 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#52 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#53 0x00007f224e082574 in slot_tp_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#54 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#55 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#56 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#57 0x00007f224e04e0c7 in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#58 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#59 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#60 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#61 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#62 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#63 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#64 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#65 0x00007f224e082574 in slot_tp_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#66 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#67 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#68 0x00007f224e0c3dac in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#69 0x00007f224e0c3dac in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#70 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#71 0x00007f224e04e0c7 in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#72 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#73 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#74 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#75 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#76 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#77 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#78 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#79 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#80 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#81 0x00007f224e082254 in slot_tp_init () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#82 0x00007f224e07eb0b in type_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#83 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#84 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#85 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#86 0x00007f224e0c570a in PyEval_EvalCode () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#87 0x00007f224e0de93d in run_mod () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#88 0x00007f224e0dfab8 in PyRun_FileExFlags () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#89 0x00007f224e0e0cd8 in PyRun_SimpleFileExFlags () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#90 0x00007f224e0f2d3c in Py_Main () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
#91 0x00007f224d32fb05 in __libc_start_main (main=0x5567c5f66850, argc=3, argv=0x7ffe7888ded8, init=, fini=, rtld_fini=, stack_end=0x7ffe7888dec8) at libc-start.c:269
#92 0x00005567c5f6687f in _start ()

Minimum reproducible example

(If you are using your own code, please provide a short script that reproduces the error. Otherwise, please provide link to the existing example.)
mxnet/tests/python/gpu/test_nccl.py

Steps to reproduce

(Paste the commands you ran that produced the error.)

  1. Comment out the following line in mxnet/tests/python/gpu/test_nccl.py:
    @unittest.skip("Test requires NCCL library installed and enabled during build")
  2. Run the following command:
    python tests/python/gpu/test_nccl.py

What have you tried to solve it?

@b0noI
Contributor

b0noI commented Dec 8, 2017

NCCL is 2.1 with CUDA9

@eric-haibin-lin
Member

@ptrendx

@ptrendx
Member

ptrendx commented Dec 9, 2017

@leleamol Could you run with env variable NCCL_DEBUG=INFO and post the result?
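
A minimal sketch of one way to apply this from Python instead of the shell, assuming the variable has to be in the process environment before MXNet initializes NCCL. `nccl_debug_env` is a hypothetical helper; only the `NCCL_DEBUG=INFO` setting itself comes from this thread.

```python
import os


def nccl_debug_env(level="INFO", base=None):
    """Return a copy of the environment with NCCL_DEBUG set to `level`."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = level
    return env


# Apply to the current process; do this before importing mxnet.
os.environ.update(nccl_debug_env())
```

The returned dictionary can also be passed as `env=` to `subprocess` calls when the test is launched as a child process.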

@leleamol
Contributor Author

leleamol commented Dec 12, 2017

@ptrendx

The following is the output of test_nccl.py when run with NCCL_DEBUG=INFO.

[ec2-user@ip-172-31-46-76 gpu]$ NCCL_DEBUG=INFO python test_nccl.py
ip-172-31-46-76:8258:8258 [0] misc/ibvwrap.cu:60 WARN Failed to open libibverbs.so[.1]
ip-172-31-46-76:8258:8258 [0] INFO Using internal Network Socket
ip-172-31-46-76:8258:8258 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
NCCL version 2.1.2+cuda9.0
ip-172-31-46-76:8258:8258 [0] INFO NET : Using interface eth0:172.31.46.76<0>
ip-172-31-46-76:8258:8258 [0] INFO NET/Socket : 1 interfaces found
ip-172-31-46-76:8258:8258 [1] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [1] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [1] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [1] INFO [0] Ring 0 : 0 1
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8364 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [2] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [2] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [2] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [2] INFO [0] Ring 0 : 0 1 2
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8362 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [3] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [3] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [3] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [3] INFO [0] Ring 0 : 0 1 2 3
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8364 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [4] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [4] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [4] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [4] INFO [0] Ring 0 : 0 1 2 3 4
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8365 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [5] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [5] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [5] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [5] INFO [0] Ring 0 : 0 1 2 3 4 5
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8365 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [6] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [6] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [6] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [6] INFO [0] Ring 0 : 0 1 2 3 4 5 6
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 6 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 6 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [6] INFO 6 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [6] INFO 6 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8363 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [7] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [7] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [7] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [7] INFO [0] Ring 0 : 0 1 2 3 4 5 6 7
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 7 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 6 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [6] INFO 6 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [6] INFO 6 -> 7 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [7] INFO 7 -> 6 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [7] INFO 7 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8362 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [8] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [8] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [8] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [8] INFO [0] Ring 0 : 0 1 2 3 4 5 6 7 8
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 8 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 6 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [6] INFO 6 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [6] INFO 6 -> 7 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [7] INFO 7 -> 6 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [7] INFO 7 -> 8 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [8] INFO 8 -> 7 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [8] INFO 8 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8364 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [9] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [9] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [9] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [9] INFO [0] Ring 0 : 0 1 2 3 4 5 6 7 8 9

ip-172-31-46-76:8258:8258 [0] transport/p2p.cu:393 WARN failed to peer with device 9: 60 peer mapping resources exhausted
ip-172-31-46-76:8258:8258 [0] INFO init.cu:191 -> 3
ip-172-31-46-76:8258:8258 [0] INFO init.cu:266 -> 3
ip-172-31-46-76:8258:8258 [0] INFO init.cu:610 -> 3
ip-172-31-46-76:8258:8258 [0] INFO init.cu:678 -> 3
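
One possible reading of the log above: each communicator is a ring, device 0 peers with device 1 and with the last device in the ring, and the failure appears when the ring first includes device 9. The sketch below counts distinct peers per device across rings of growing size. It assumes two things that are not confirmed in this thread: that peer mappings accumulate across communicators within one process, and that a device supports at most 8 peer connections (the limit CUDA documents for systems without NVSwitch). `ring_peers`, `first_failing_ring`, and `MAX_PEERS` are illustrative names.

```python
MAX_PEERS = 8  # assumed per-device peer limit; not stated in this thread


def ring_peers(size):
    """Yield (device, peer) pairs for a ring over devices 0..size-1."""
    for dev in range(size):
        yield dev, (dev + 1) % size
        yield dev, (dev - 1) % size


def first_failing_ring(num_gpus, max_peers=MAX_PEERS):
    """Return (ring_size, device, peer) for the first peering that would
    exceed max_peers, or None if every ring of size 2..num_gpus fits."""
    peers = {dev: set() for dev in range(num_gpus)}
    for size in range(2, num_gpus + 1):
        for dev, peer in ring_peers(size):
            if peer in peers[dev]:
                continue  # mapping already exists from an earlier ring
            if len(peers[dev]) >= max_peers:
                return size, dev, peer
            peers[dev].add(peer)
    return None


print(first_failing_ring(16))  # (10, 0, 9)
print(first_failing_ring(8))   # None
```

Under these assumptions the model fails at the ring of 10 devices, when device 0 tries to peer with device 9, which is the point where the log reports "peer mapping resources exhausted".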

It created a core dump. The call stack is as follows:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `python test_nccl.py'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f504ef23a6e in commFree (comm=0xece8380) at init.cu:100
100 init.cu: No such file or directory.
Missing separate debuginfos, use: debuginfo-install python26-2.6.9-2.89.amzn1.x86_64 python27-2.7.12-2.121.amzn1.x86_64 python34-3.4.3-1.35.amzn1.x86_64
(gdb) bt
#0 0x00007f504ef23a6e in commFree (comm=0xece8380) at init.cu:100
#1 0x00007f504ef27dad in ncclCommInitAll (comms=, ndev=, devlist=)
at init.cu:692
#2 0x00007f508c715a50 in mxnet::kvstore::KVStoreNCCL::Reduce(std::vector<int, std::allocator<int> >, std::vector<std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >, std::allocator<std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > > > const&, int, std::vector<mxnet::NDArray const*, std::allocator<mxnet::NDArray const*> >*) () from /usr/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#3 0x00007f508c72223a in mxnet::kvstore::KVStoreNCCL::PushImpl(std::vector<int, std::allocator<int> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int) ()
from /usr/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#4 0x00007f508c6d8ba1 in mxnet::kvstore::KVStoreLocal::Push(std::vector<std::string, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int) ()
from /usr/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#5 0x00007f508c6557fb in MXKVStorePushEx ()
from /usr/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#6 0x00007f5149206cec in ffi_call_unix64 () from /usr/lib64/libffi.so.6
#7 0x00007f5149206615 in ffi_call () from /usr/lib64/libffi.so.6
#8 0x00007f514941997b in _ctypes_callproc () from /usr/lib64/python2.7/lib-dynload/_ctypes.so
#9 0x00007f5149413915 in ?? () from /usr/lib64/python2.7/lib-dynload/_ctypes.so
#10 0x00007f5150a74173 in PyObject_Call () from /usr/lib64/libpython2.7.so.1.0
#11 0x00007f5150b06f7d in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#12 0x00007f5150b0cd8d in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
#13 0x00007f5150b098cc in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#14 0x00007f5150b09972 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#15 0x00007f5150b0cd8d in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
#16 0x00007f5150b0ce92 in PyEval_EvalCode () from /usr/lib64/libpython2.7.so.1.0
#17 0x00007f5150b25d9f in ?? () from /usr/lib64/libpython2.7.so.1.0
#18 0x00007f5150b26ede in PyRun_FileExFlags () from /usr/lib64/libpython2.7.so.1.0
#19 0x00007f5150b28049 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.7.so.1.0
#20 0x00007f5150b38c8f in Py_Main () from /usr/lib64/libpython2.7.so.1.0
#21 0x00007f514fd76b05 in __libc_start_main (main=0x4006c0, argc=2, argv=0x7fff28caa7f8,
init=, fini=, rtld_fini=, stack_end=0x7fff28caa7e8)
at libc-start.c:269
#22 0x00000000004006f1 in _start ()
(gdb)

@leleamol
Contributor Author

We have the following two requests for this issue. Please provide:

1. A fixed test script that can be run on both P2.16xlarge (16 GPUs) and P3.16xlarge (8 GPUs).

2. A bug fix so that MXNet does NOT crash even if it is a configuration problem.
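
For the first request, a minimal sketch of how a test could pick its communicator sizes so that the same script runs on hosts with 8 or 16 GPUs. The cap of 8 is an assumption motivated by the "peer mapping resources exhausted" failure at the tenth device, and `gpu_counts_to_test` is a hypothetical helper, not part of test_nccl.py.

```python
import warnings


def gpu_counts_to_test(available, cap=8):
    """Return the communicator sizes a test should exercise.

    `available` is the number of GPUs on the host; `cap` is an assumed
    upper bound that keeps every device within its peer-connection limit.
    """
    usable = min(available, cap)
    if available > cap:
        warnings.warn(
            "Host has %d GPUs; testing on %d only." % (available, usable)
        )
    # A communicator needs at least two devices to exercise a reduce.
    return list(range(2, usable + 1))
```

With this helper a 16-GPU host would test sizes 2 through 8 and emit a warning, while an 8-GPU host would test the same sizes silently.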

@bhavinthaker
Contributor

Any update on the two requests above?

@bhavinthaker
Contributor

Update from Nvidia: "Issue has been fixed, and will be part of the next release 2.2.1 or later".

This issue can be kept open for verification in Nvidia NCCL 2.2.1.

@nswamy
Member

nswamy commented Mar 21, 2018

@leleamol can you verify if this is resolved?

@piyushghai
Contributor

@leleamol Bouncing this one for your feedback.
