SEGV in libfabric when using SST on Summit #2485

Closed
giltirn opened this issue Oct 8, 2020 · 8 comments

giltirn commented Oct 8, 2020

When running the unit tests I came across an issue with SST on Summit:

0x0000200001ac3018 in init_fabric.isra.0 ()
from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-9.1.0/adios2-2.5.0-yljowfohsh4tqhunychezduzbc55lsxj/lib64/../lib64/libadios2_sst.so.2
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.26-23.el7.ppc64le glibc-2.17-260.el7_6.6.ppc64le keyutils-libs-1.5.8-3.el7.ppc64le krb5-libs-1.15.1-37.el7_6.ppc64le libcom_err-1.42.9-13.el7.ppc64le libcurl-7.29.0-51.el7_6.3.ppc64le libibverbs-41mlnx1-OFED.4.7.0.0.2.47329.ppc64le libidn-1.28-4.el7.ppc64le libmlx4-41mlnx1-OFED.4.7.3.0.3.47329.ppc64le libmlx5-41mlnx1-OFED.4.7.0.3.3.47329.ppc64le libnl3-3.2.28-4.el7.ppc64le librdmacm-41mlnx1-OFED.4.7.3.0.6.47329.ppc64le librxe-41mlnx1-OFED.4.4.2.4.6.47329.ppc64le libselinux-2.5-14.1.el7.ppc64le libssh2-1.4.3-12.el7_6.3.ppc64le nspr-4.19.0-1.el7_5.ppc64le nss-3.36.0-8.el7_6.ppc64le nss-softokn-freebl-3.36.0-6.el7_6.ppc64le nss-util-3.36.0-1.1.el7_6.ppc64le numactl-libs-2.0.9-7.el7.ppc64le openldap-2.4.44-21.el7_6.ppc64le openssl-libs-1.0.2k-16.el7_6.1.ppc64le pcre-8.32-17.el7.ppc64le
(gdb) bt
#0 0x0000200001ac3018 in init_fabric.isra.0 ()
from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-9.1.0/adios2-2.5.0-yljowfohsh4tqhunychezduzbc55lsxj/lib64/../lib64/libadios2_sst.so.2
#1 0x0000200001ac3768 in RdmaInitWriter ()
from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-9.1.0/adios2-2.5.0-yljowfohsh4tqhunychezduzbc55lsxj/lib64/../lib64/libadios2_sst.so.2
#2 0x0000200001ab1d90 in SstWriterOpen ()
from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-9.1.0/adios2-2.5.0-yljowfohsh4tqhunychezduzbc55lsxj/lib64/../lib64/libadios2_sst.so.2
#3 0x0000200000758c4c in adios2::core::engine::SstWriter::SstWriter(adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm) ()
from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-9.1.0/adios2-2.5.0-yljowfohsh4tqhunychezduzbc55lsxj/lib64/libadios2.so.2
#4 0x00002000002955a4 in adios2::core::IO::Open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, ompi_communicator_t*) ()
from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-9.1.0/adios2-2.5.0-yljowfohsh4tqhunychezduzbc55lsxj/lib64/libadios2.so.2
#5 0x0000200000296070 in adios2::core::IO::Open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode) ()
from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-9.1.0/adios2-2.5.0-yljowfohsh4tqhunychezduzbc55lsxj/lib64/libadios2.so.2
#6 0x000020000080f448 in adios2::IO::Open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode) ()
from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-9.1.0/adios2-2.5.0-yljowfohsh4tqhunychezduzbc55lsxj/lib64/libadios2.so.2
#7 0x000000001001f6e0 in SSTrw::openWriter() ()
#8 0x0000000010014788 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<ADParserTestConstructor_opensCorrectlySST_Test::TestBody()::{lambda()#1}> > >::_M_run() ()
#9 0x000020000143ade0 in std::execute_native_thread_routine (
__p=)
at /tmp/belhorn/gcc-build-9.1.0-alpha+20190716/gcc-9.1.0/libstdc++-v3/src/c++11/thread.cc:80
#10 0x00002000012d8b94 in start_thread () from /lib64/libpthread.so.0
#11 0x00002000019c85f4 in clone () from /lib64/libc.so.6

The test is run on a single node interactive session as:

jsrun -n 1 ./test

Note that this uses the system-installed ADIOS2 2.5.0. The error occurs in the following code, which is executed in a separate thread:

adios2::ADIOS ad = adios2::ADIOS(MPI_COMM_SELF, adios2::DebugON);
adios2::IO io = ad.DeclareIO("tau-metrics");
io.SetEngine("SST");
io.SetParameters({{"MarshalMethod", "BP"}});
adios2::Engine wr = io.Open(filename, adios2::Mode::Write); // <-- SEGV happens here

As the module doesn't have debug symbols, I hand-built a version of 2.5.0 circa mid-March (git commit f23e72c) and built my library against it. This gives us more information:

#0 0x0000200001377f78 in fi_endpoint (domain=0x0, info=0x200018022d30,
ep=0x200018023578, context=0x0)
at /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/include/rdma/fi_endpoint.h:164
#1 0x00002000013786f8 in init_fabric (fabric=0x200018023540,
Params=0x20001800fe08)
at /ccs/home/ckelly/src/ADIOS2/source/adios2/toolkit/sst/dp/rdma_dp.c:213
#2 0x0000200001378b08 in RdmaInitWriter (Svcs=0x2000016f33c8 ,
CP_Stream=0x20001800ff30, Params=0x20001800fe08, DPAttrs=0x200018022b40)
at /ccs/home/ckelly/src/ADIOS2/source/adios2/toolkit/sst/dp/rdma_dp.c:433
#3 0x000020000135e8c4 in SstWriterOpen (Name=0x7ffffffea7f0 "commFile",
Params=0x20001800fe08, comm=0x20001800fd98)
at /ccs/home/ckelly/src/ADIOS2/source/adios2/toolkit/sst/cp/cp_writer.c:1230
#4 0x00002000010e5414 in adios2::core::engine::SstWriter::SstWriter (
this=0x20001800fd40, io=..., name=..., mode=adios2::Write, comm=...)
at /ccs/home/ckelly/src/ADIOS2/source/adios2/engine/sst/SstWriter.cpp:112
#5 0x0000200000b4b814 in __gnu_cxx::new_allocator<adios2::core::engine::SstWriter>::construct<adios2::core::engine::SstWriter, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm>(adios2::core::engine::SstWriter*, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm&&) (
this=0x200016b5dd80, __p=0x20001800fd40, __args#0=..., __args#1=...,
__args#2=@0x200016b5e068: adios2::Write,
__args#3=<unknown type in /autofs/nccs-svm1_home1/ckelly/install/ADIOS2/lib64/libadios2.so.2, CU 0x2a1337, DIE 0x3ce653>)
at /autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/include/c++/9.1.0/ext/new_allocator.h:147
#6 0x0000200000b46fec in std::allocator_traits<std::allocator<adios2::core::engine::SstWriter> >::construct<adios2::core::engine::SstWriter, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm>(std::allocator<adios2::core::engine::SstWriter>&, adios2::core::engine::SstWriter*, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm&&) (__a=..., __p=0x20001800fd40,
__args#0=..., __args#1=..., __args#2=@0x200016b5e068: adios2::Write,
__args#3=<unknown type in /autofs/nccs-svm1_home1/ckelly/install/ADIOS2/lib64/libadios2.so.2, CU 0x2a1337, DIE 0x3d3f5a>)
at /autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/include/c++/9.1.0/bits/alloc_traits.h:484
#7 0x0000200000b407d8 in std::_Sp_counted_ptr_inplace<adios2::core::engine::SstWriter, std::allocator<adios2::core::engine::SstWriter>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm>(std::allocator<adios2::core::engine::SstWriter>, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm&&) (
this=0x20001800fd30, __a=...)
at /autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/include/c++/9.1.0/bits/shared_ptr_base.h:548
#8 0x0000200000b3935c in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<adios2::core::engine::SstWriter, std::allocator<adios2::core::engine::SstWriter>, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm>(adios2::core::engine::SstWriter*&, std::_Sp_alloc_shared_tag<std::allocator<adios2::core::engine::SstWriter> >, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm&&) (this=0x200016b5e048, __p=@0x200016b5e040: 0x0,
__a=...)
at /autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/include/c++/9.1.0/bits/shared_ptr_base.h:679
#9 0x0000200000b30910 in std::__shared_ptr<adios2::core::engine::SstWriter, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<adios2::core::engine::SstWriter>, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm>(std::_Sp_alloc_shared_tag<std::allocator<adios2::core::engine::SstWriter> >, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm&&) (
this=0x200016b5e040, __tag=...)
at /autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/include/c++/9.1.0/bits/shared_ptr_base.h:1344
#10 0x0000200000b1a9dc in std::shared_ptr<adios2::core::engine::SstWriter>::shared_ptr<std::allocator<adios2::core::engine::SstWriter>, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm>(std::_Sp_alloc_shared_tag<std::allocator<adios2::core::engine::SstWriter> >, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm&&) (this=0x200016b5e040, __tag=...)
at /autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/include/c++/9.1.0/bits/shared_ptr.h:359
#11 0x0000200000b057ec in std::allocate_shared<adios2::core::engine::SstWriter, std::allocator<adios2::core::engine::SstWriter>, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm>(std::allocator<adios2::core::engine::SstWriter> const&, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm&&) (__a=..., __args#0=..., __args#1=...,
__args#2=@0x200016b5e068: adios2::Write,
__args#3=<unknown type in /autofs/nccs-svm1_home1/ckelly/install/ADIOS2/lib64/libadios2.so.2, CU 0x2a1337, DIE 0x412896>)
at /autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/include/c++/9.1.0/bits/shared_ptr.h:702
#12 0x0000200000af62d0 in std::make_shared<adios2::core::engine::SstWriter, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm>(adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode const&, adios2::helper::Comm&&) (__args#0=...,
__args#1=..., __args#2=@0x200016b5e068: adios2::Write,
__args#3=<unknown type in /autofs/nccs-svm1_home1/ckelly/install/ADIOS2/lib64/libadios2.so.2, CU 0x2a1337, DIE 0x41ed3d>)
at /autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/include/c++/9.1.0/bits/shared_ptr.h:718
#13 0x0000200000aedf98 in adios2::core::IO::MakeEngine<adios2::core::engine::SstWriter> (io=..., name=..., mode=adios2::Write, comm=...)
at /ccs/home/ckelly/src/ADIOS2/source/adios2/core/IO.h:493
#14 0x0000200000af5338 in std::_Function_handler<std::shared_ptr<adios2::core::Engine> (adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm), std::shared_ptr<adios2::core::Engine> (*)(adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm)>::_M_invoke(std::_Any_data const&, adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode&&, adios2::helper::Comm&&) (__functor=..., __args#0=...,
__args#1=...,
__args#2=<unknown type in /autofs/nccs-svm1_home1/ckelly/install/ADIOS2/lib64/libadios2.so.2, CU 0x2a1337, DIE 0x41f6f7>,
__args#3=<unknown type in /autofs/nccs-svm1_home1/ckelly/install/ADIOS2/lib64/libadios2.so.2, CU 0x2a1337, DIE 0x41f707>)
at /autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/include/c++/9.1.0/bits/std_function.h:286
#15 0x0000200000af4998 in std::function<std::shared_ptr<adios2::core::Engine> (adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm)>::operator()(adios2::core::IO&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, adios2::Mode, adios2::helper::Comm) const (
this=0x10123ab8, __args#0=..., __args#1=..., __args#2=adios2::Write,
__args#3=...)
at /autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/include/c++/9.1.0/bits/std_function.h:690
#16 0x0000200000ad5d20 in adios2::core::IO::Open (this=0x200018008150,
name=..., mode=adios2::Write, comm=...)
at /ccs/home/ckelly/src/ADIOS2/source/adios2/core/IO.cpp:663
#17 0x0000200000ad627c in adios2::core::IO::Open (this=0x200018008150,
name=..., mode=adios2::Write)
at /ccs/home/ckelly/src/ADIOS2/source/adios2/core/IO.cpp:694
#18 0x000020000121120c in adios2::IO::Open (this=0x7ffffffea810, name=...,
mode=adios2::Write)
at /ccs/home/ckelly/src/ADIOS2/bindings/CXX11/adios2/cxx11/IO.cpp:110

Thus it appears that the libfabric domain pointer is null (domain=0x0 in frame #0), which causes the SEGV when fi_endpoint tries to dereference it.
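
The backtrace places fi_endpoint in the header rdma/fi_endpoint.h, which suggests it is an inline wrapper that dispatches through the domain object it receives. The sketch below models that failure mode in C with simplified stand-in types. All names in it (`mock_domain`, `mock_domain_ops`, `fake_endpoint`, `unchecked_endpoint`, `guarded_endpoint`) are hypothetical and are not libfabric's definitions; it only illustrates why a null domain faults inside the wrapper and what a guard at the call site could look like.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for a domain object and its operations table.
 * These are illustrative only; the real libfabric structures differ. */
struct mock_domain;

struct mock_domain_ops {
    int (*endpoint)(struct mock_domain *domain, void **ep);
};

struct mock_domain {
    struct mock_domain_ops *ops;
};

/* Hypothetical provider callback: hands back the domain as the "endpoint". */
static int fake_endpoint(struct mock_domain *domain, void **ep)
{
    *ep = domain;
    return 0;
}

/* Dispatch through the domain with no checks. With domain == NULL,
 * reading domain->ops is the faulting access. */
static inline int unchecked_endpoint(struct mock_domain *domain, void **ep)
{
    return domain->ops->endpoint(domain, ep);
}

/* Guarded variant: reject a missing domain instead of dereferencing it. */
static inline int guarded_endpoint(struct mock_domain *domain, void **ep)
{
    assert(ep != NULL); /* a NULL output slot is a caller bug */
    if (domain == NULL || domain->ops == NULL || domain->ops->endpoint == NULL)
        return -EINVAL;
    return domain->ops->endpoint(domain, ep);
}
```

Calling `guarded_endpoint(NULL, &ep)` returns `-EINVAL`, whereas `unchecked_endpoint(NULL, &ep)` would crash in the same way as frame #0 above.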

The bug also appears if I run the sst_conn_tool provided with ADIOS:

#0 0x000000001002eac8 in fi_endpoint (domain=0x0, info=0x102faa80,
ep=0x102fa5c8, context=0x0)
at /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/include/rdma/fi_endpoint.h:164
#1 0x000000001002f248 in init_fabric (fabric=0x102fa590,
Params=0x7ffffffeac88)
at /ccs/home/ckelly/src/ADIOS2/source/adios2/toolkit/sst/dp/rdma_dp.c:213
#2 0x000000001002f658 in RdmaInitWriter (Svcs=0x10051ca0 ,
CP_Stream=0x1007a670, Params=0x7ffffffeac88, DPAttrs=0x102fa550)
at /ccs/home/ckelly/src/ADIOS2/source/adios2/toolkit/sst/dp/rdma_dp.c:433
#3 0x0000000010014a38 in SstWriterOpen (Name=0x10031a68 "SstConnToolTemp",
Params=0x7ffffffeac88, comm=0x10051fb8 )
at /ccs/home/ckelly/src/ADIOS2/source/adios2/toolkit/sst/cp/cp_writer.c:1230
#4 0x0000000010009bec in do_listen ()
at /ccs/home/ckelly/src/ADIOS2/source/adios2/toolkit/sst/util/sst_conn_tool.c:350
#5 0x00000000100094c8 in main (argc=1, argv=0x7ffffffeb1b8)
at /ccs/home/ckelly/src/ADIOS2/source/adios2/toolkit/sst/util/sst_conn_tool.c:184

I also tested the latest 2.6.0 git revision (e9b41b1, October 7th) and encountered the same issue with sst_conn_tool:

#0 0x0000000010030414 in fi_endpoint (domain=0x0, info=0x11943db0,
ep=0x11944338, context=0x0)
at /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/include/rdma/fi_endpoint.h:164
#1 0x0000000010030bb4 in init_fabric (fabric=0x11944300,
Params=0x7ffffffeac68)
at /ccs/home/ckelly/src/ADIOS2_latest/source/adios2/toolkit/sst/dp/rdma_dp.c:230
#2 0x0000000010030fcc in RdmaInitWriter (Svcs=0x10051c38 ,
CP_Stream=0x1088e360, Params=0x7ffffffeac68, DPAttrs=0x11944cf0,
Stats=0x1088e3a0)
at /ccs/home/ckelly/src/ADIOS2_latest/source/adios2/toolkit/sst/dp/rdma_dp.c:462
#3 0x0000000010015d4c in SstWriterOpen (Name=0x10033470 "SstConnToolTemp",
Params=0x7ffffffeac68, comm=0x10051f20 )
at /ccs/home/ckelly/src/ADIOS2_latest/source/adios2/toolkit/sst/cp/cp_writer.c:1321
#4 0x0000000010009e0c in do_listen ()
at /ccs/home/ckelly/src/ADIOS2_latest/source/adios2/toolkit/sst/util/sst_conn_tool.c:350
#5 0x00000000100096e8 in main (argc=1, argv=0x7ffffffeb1a8)
at /ccs/home/ckelly/src/ADIOS2_latest/source/adios2/toolkit/sst/util/sst_conn_tool.c:184

Summit with modules:

  1) hsi/5.0.2.p5   3) lsf-tools/2.0   5) gcc/9.1.0    7) cmake/3.18.2    9) zeromq/4.2.5   11) c-blosc/1.12.1   13) sz/2.0.2.0
  2) xalt/1.2.0     4) DefApps         6) papi/5.7.0   8) openssl/1.0.2  10) zlib/1.2.11    12) bzip2/1.0.6      14) spectrum-mpi/10.3.1.2-20200121

The system installed libfabric version is 1.11.0.

ADIOS2 built as:

CC=gcc CXX=g++ cmake -DCMAKE_INSTALL_PREFIX=/autofs/nccs-svm1_home1/ckelly/install/ADIOS2_latest \
  -DCMAKE_DISABLE_FIND_PACKAGE_BISON=TRUE \
  -DCMAKE_DISABLE_FIND_PACKAGE_FLEX=TRUE \
  -DCMAKE_DISABLE_FIND_PACKAGE_CrayDRC=TRUE \
  -DCMAKE_DISABLE_FIND_PACKAGE_NVSTREAM=TRUE \
  -DADIOS2_USE_MPI:BOOL=ON \
  -DADIOS2_USE_MGARD:BOOL=OFF \
  -DADIOS2_USE_SZ:BOOL=ON \
  -DADIOS2_USE_HDF5:BOOL=OFF \
  -DADIOS2_USE_ZeroMQ:BOOL=ON \
  -DADIOS2_USE_Fortran:BOOL=ON \
  -DADIOS2_USE_Python:BOOL=OFF \
  -DADIOS2_USE_Endian_Reverse:BOOL=OFF \
  -DADIOS2_USE_SST:BOOL=ON \
  -DADIOS2_USE_BZip2:BOOL=ON \
  -DADIOS2_USE_ZFP:BOOL=OFF \
  -DADIOS2_USE_DataMan:BOOL=ON \
  -DADIOS2_USE_Profiling:BOOL=OFF \
  -DADIOS2_USE_Blosc:BOOL=ON \
  -DADIOS2_USE_PNG:BOOL=OFF \
  -DADIOS2_USE_SSC:BOOL=ON \
  -DADIOS2_USE_DataSpaces:BOOL=OFF \
  -DBUILD_SHARED_LIBS:BOOL=ON \
  -DADIOS2_BUILD_TESTING:BOOL=ON \
  -DADIOS2_BUILD_EXAMPLES_EXPERIMENTAL:BOOL=OFF \
  -DCMAKE_BUILD_TYPE=Debug \
  /autofs/nccs-svm1_home1/ckelly/src/ADIOS2_latest

giltirn commented Oct 8, 2020

Unfortunately, ADIOS2 does not check the return values of the libfabric calls in source/adios2/toolkit/sst/dp/rdma_dp.c. I hacked the source to check for errors, and it appears that the call to fi_domain on line 228 of that file is failing with error 22 (EINVAL, invalid argument).
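
For illustration, the kind of check described above can be sketched as follows. This is not the ADIOS2 code or the actual patch: `open_fabric`, `open_domain`, `open_endpoint`, `CHECK`, and `init_fabric_checked` are hypothetical stand-ins, and `open_domain` is hard-wired to fail with `-EINVAL` to mimic the failure reported here. The assumption is the convention of returning 0 on success and a negative errno value on failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Counts how often the endpoint step is reached (for demonstration). */
static int endpoint_calls = 0;

/* Hypothetical setup steps: 0 on success, negative errno on failure. */
static int open_fabric(void **fabric)
{
    static int fabric_object;
    *fabric = &fabric_object;
    return 0;
}

static int open_domain(void *fabric, void **domain)
{
    (void)fabric;
    *domain = NULL;  /* nothing was created */
    return -EINVAL;  /* models the error 22 seen in this issue */
}

static int open_endpoint(void *domain, void **ep)
{
    assert(domain != NULL); /* must never be reached after a failed step */
    endpoint_calls++;
    *ep = domain;
    return 0;
}

/* Report the failing call and propagate its error code to the caller. */
#define CHECK(call)                                                 \
    do {                                                            \
        int rc_ = (call);                                           \
        if (rc_ < 0) {                                              \
            fprintf(stderr, "%s failed: %d (%s)\n", #call, rc_,     \
                    strerror(-rc_));                                \
            return rc_;                                             \
        }                                                           \
    } while (0)

static int init_fabric_checked(void)
{
    void *fabric = NULL, *domain = NULL, *ep = NULL;

    CHECK(open_fabric(&fabric));
    CHECK(open_domain(fabric, &domain)); /* stops here with -EINVAL */
    CHECK(open_endpoint(domain, &ep));   /* never runs with a NULL domain */
    return 0;
}
```

With this pattern the failure surfaces at the domain step with a readable message, rather than as a SEGV in the next call. In code that links against libfabric, `fi_strerror` could presumably take the place of `strerror` for the message text.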

giltirn commented Oct 8, 2020

After some digging I discovered the FI_LOG_LEVEL environment variable. With FI_LOG_LEVEL=debug set, I get the following output (and the SEGV moves to a different location!):

libfabric:109875:core:core:fi_param_get_():279<info> variable perf_cntr=<not set>
libfabric:109875:core:core:fi_param_get_():279<info> variable hook=<not set>
libfabric:109875:core:core:ofi_hmem_init():200<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:109875:core:core:ofi_hmem_init():200<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:109875:core:core:ofi_hmem_init():200<info> Hmem iface FI_HMEM_ZE not supported
libfabric:109875:core:core:fi_param_get_():279<info> variable mr_cache_max_size=<not set>
libfabric:109875:core:core:fi_param_get_():279<info> variable mr_cache_max_count=<not set>
libfabric:109875:core:core:fi_param_get_():279<info> variable mr_cache_monitor=<not set>
libfabric:109875:core:core:fi_param_get_():279<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:109875:core:core:fi_param_get_():279<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:109875:core:mr:ofi_default_cache_size():68<info> default cache size=1844921809
libfabric:109875:core:core:fi_param_get_():279<info> variable universe_size=<not set>
libfabric:109875:core:core:fi_param_get_():279<info> variable provider=<not set>
libfabric:109875:core:core:fi_param_get_():279<info> variable provider_path=<not set>
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable buffer_size=<not set>
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable msg_tx_size=<not set>
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable msg_rx_size=<not set>
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable cm_progress_interval=<not set>
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable cq_eq_fairness=<not set>
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable data_auto_progress=<not set>
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable def_wait_obj=<not set>
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable def_tcp_wait_obj=<not set>
libfabric:109875:core:core:ofi_register_provider():403<info> registering provider: ofi_rxm (111.0)
libfabric:109875:verbs:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:109875:verbs:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:109875:verbs:core:fi_param_get_():279<info> variable tx_iov_limit=<not set>
libfabric:109875:verbs:core:fi_param_get_():279<info> variable rx_iov_limit=<not set>
libfabric:109875:verbs:core:fi_param_get_():279<info> variable inline_size=<not set>
libfabric:109875:verbs:core:fi_param_get_():279<info> variable min_rnr_timer=<not set>
libfabric:109875:verbs:core:fi_param_get_():279<info> variable use_odp=<not set>
libfabric:109875:verbs:core:fi_param_get_():279<info> variable prefer_xrc=<not set>
libfabric:109875:verbs:core:fi_param_get_():279<info> variable xrcd_filename=<not set>
libfabric:109875:verbs:core:fi_param_get_():279<info> variable cqread_bunch_size=<not set>
libfabric:109875:verbs:core:fi_param_get_():279<info> variable gid_idx=<not set>
libfabric:109875:verbs:core:fi_param_get_():279<info> variable device_name=<not set>
libfabric:109875:verbs:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:109875:verbs:core:fi_param_get_():279<info> variable dgram_use_name_server=<not set>
libfabric:109875:verbs:core:fi_param_get_():279<info> variable dgram_name_server_port=<not set>
libfabric:109875:verbs:fabric:verbs_devs_print():872<info> list of verbs devices found for FI_EP_MSG:
libfabric:109875:verbs:fabric:verbs_devs_print():876<info> #1 mlx5_0 - IPoIB addresses:
libfabric:109875:verbs:fabric:verbs_devs_print():886<info> 	10.41.9.56
libfabric:109875:verbs:fabric:verbs_devs_print():886<info> 	fe80::ee0d:9a03:8f:f200
libfabric:109875:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_1: first found active port is 1
libfabric:109875:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_1: first found active port is 1
libfabric:109875:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_1: first found active port is 1
libfabric:109875:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_3: first found active port is 1
libfabric:109875:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_3: first found active port is 1
libfabric:109875:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_3: first found active port is 1
libfabric:109875:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:109875:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:109875:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:109875:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_2: first found active port is 1
libfabric:109875:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_2: first found active port is 1
libfabric:109875:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_2: first found active port is 1
libfabric:109875:core:core:ofi_register_provider():403<info> registering provider: verbs (111.0)
libfabric:109875:ofi_mrail:core:fi_param_get_():279<info> variable config=<not set>
libfabric:109875:ofi_mrail:core:fi_param_get_():279<info> variable addr=<not set>
libfabric:109875:ofi_mrail:core:fi_param_get_():279<info> variable addr_strc=<not set>
libfabric:109875:ofi_mrail:core:mrail_parse_env_vars():115<info> unable to read FI_OFI_MRAIL_ADDR env variable
libfabric:109875:core:core:ofi_register_provider():403<info> registering provider: ofi_mrail (111.0)
libfabric:109875:core:core:ofi_register_provider():403<info> registering provider: ofi_hook_perf (111.0)
libfabric:109875:core:core:ofi_register_provider():403<info> registering provider: ofi_hook_debug (111.0)
libfabric:109875:core:core:ofi_register_provider():403<info> registering provider: ofi_hook_noop (111.0)
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:core:core:fi_getinfo_():1001<info> fi_getinfo: provider verbs returned -61 (No data available)
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:109875:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_rxm
libfabric:109875:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_rxm
libfabric:109875:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:109875:ofi_mrail:fabric:mrail_get_core_info():288<info> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:109875:core:core:fi_getinfo_():1001<info> fi_getinfo: provider ofi_mrail returned -61 (No data available)
[New Thread 0x200014b6f180 (LWP 109893)]
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:109875:core:core:fi_getinfo_():1001<info> fi_getinfo: provider verbs returned -61 (No data available)
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:109875:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_rxm
libfabric:109875:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_rxm
libfabric:109875:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:109875:ofi_mrail:fabric:mrail_get_core_info():288<info> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:109875:core:core:fi_getinfo_():1001<info> fi_getinfo: provider ofi_mrail returned -61 (No data available)
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1-dgram
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3-dgram
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0-dgram
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:109875:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2-dgram
libfabric:109875:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:109875:core:core:fi_fabric_():1201<info> Opened fabric: IB-0xfe80000000000000
libfabric:109875:core:core:fi_fabric_():1201<info> Opened fabric: IB-0xfe80000000000000
libfabric:109875:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:109875:verbs:core:vrb_check_hints():262<info> skipping device mlx5_1-xrc (want mlx5_1)
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:109875:verbs:core:vrb_check_hints():262<info> skipping device mlx5_3 (want mlx5_1)
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:109875:verbs:core:vrb_check_hints():262<info> skipping device mlx5_3-xrc (want mlx5_1)
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:109875:verbs:core:vrb_check_hints():262<info> skipping device mlx5_0 (want mlx5_1)
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:109875:verbs:core:vrb_check_hints():262<info> skipping device mlx5_0-xrc (want mlx5_1)
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:109875:verbs:core:vrb_check_hints():262<info> skipping device mlx5_2 (want mlx5_1)
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:109875:verbs:core:vrb_check_hints():262<info> skipping device mlx5_2-xrc (want mlx5_1)
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_rai_id():281<info> rdma_resolve_addr: Invalid argument(22)
libfabric:109875:verbs:fabric:vrb_get_rai_id():282<info> src addr: fi_sockaddr_ib://[fe80::ec0d:9a03:8f:f201]:0xffff:0x13f:0x0

Program received signal SIGSEGV, Segmentation fault.
0x00002000019bbf00 in ofi_straddr_log_internal ()
   from /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/lib/libfabric.so.1
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.ppc64le libffi-3.0.13-18.el7.ppc64le libibverbs-41mlnx1-OFED.4.7.0.0.2.47329.ppc64le libmlx4-41mlnx1-OFED.4.7.3.0.3.47329.ppc64le libmlx5-41mlnx1-OFED.4.7.0.3.3.47329.ppc64le libnl3-3.2.28-4.el7.ppc64le librdmacm-41mlnx1-OFED.4.7.3.0.6.47329.ppc64le librxe-41mlnx1-OFED.4.4.2.4.6.47329.ppc64le numactl-libs-2.0.9-7.el7.ppc64le
(gdb) bt
#0  0x00002000019bbf00 in ofi_straddr_log_internal ()
   from /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/lib/libfabric.so.1
#1  0x00002000019e9e9c in vrb_get_rai_id ()
   from /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/lib/libfabric.so.1
#2  0x00002000019fa2a4 in vrb_getinfo ()
   from /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/lib/libfabric.so.1
#3  0x00002000019af5b8 in fi_getinfo_ ()
   from /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/lib/libfabric.so.1
#4  0x00002000019cc0e8 in ofi_get_core_info ()
   from /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/lib/libfabric.so.1
#5  0x0000200001a0725c in rxm_domain_open ()
   from /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/lib/libfabric.so.1
#6  0x0000000010030154 in fi_domain (fabric=0x11333730, info=0x112faaf0, 
    domain=0x112fb070, context=0x0)
    at /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/include/rdma/fi_domain.h:308
#7  0x0000000010030c08 in init_fabric (fabric=0x112fb040, 
    Params=0x7fffffff19f8)
    at /ccs/home/ckelly/src/ADIOS2_latest/source/adios2/toolkit/sst/dp/rdma_dp.c:233
#8  0x00000000100310a4 in RdmaInitWriter (Svcs=0x10051c40 <Svcs>, 
    CP_Stream=0x10939ef0, Params=0x7fffffff19f8, DPAttrs=0x112fba30, 
    Stats=0x10939f30)
    at /ccs/home/ckelly/src/ADIOS2_latest/source/adios2/toolkit/sst/dp/rdma_dp.c:473
#9  0x0000000010015d9c in SstWriterOpen (Name=0x10033550 "SstConnToolTemp", 
    Params=0x7fffffff19f8, comm=0x10051f28 <CommWorld>)
    at /ccs/home/ckelly/src/ADIOS2_latest/source/adios2/toolkit/sst/cp/cp_writer.c:1321
#10 0x0000000010009e5c in do_listen ()
    at /ccs/home/ckelly/src/ADIOS2_latest/source/adios2/toolkit/sst/util/sst_conn_tool.c:350
#11 0x0000000010009738 in main (argc=1, argv=0x7fffffff1f38)
    at /ccs/home/ckelly/src/ADIOS2_latest/source/adios2/toolkit/sst/util/sst_conn_tool.c:184

The message "unsupported endpoint type" appears regularly in this output; possibly this is the smoking gun?
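One way to check whether that message is really the last thing logged before the crash is to filter out the repetitive endpoint-matching lines and look at what remains. A minimal sketch, assuming the libfabric output has been captured to a file; the temporary file and the five sample lines (copied from the output above) are only there to make the example self-contained:

```shell
# Hypothetical capture file; the sample lines are copied from the log above.
log=$(mktemp)
cat > "$log" <<'EOF'
libfabric:109875:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:109875:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:109875:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:109875:verbs:fabric:vrb_get_rai_id():281<info> rdma_resolve_addr: Invalid argument(22)
EOF

# Count how often the endpoint-type mismatch is reported.
mismatches=$(grep -c 'unsupported endpoint type' "$log")

# Drop the per-domain matching chatter and keep the last remaining message.
last_line=$(grep -v -e 'ofi_check_ep_type' -e 'vrb_get_matching_info' "$log" | tail -n 1)

echo "endpoint-type mismatches: $mismatches"
echo "last other message: $last_line"

rm -f "$log"
```

In the full output above, the last messages before the SIGSEGV come from vrb_get_rai_id() (the rdma_resolve_addr failure and the src addr line), which is also frame #1 of the backtrace. That may point at the address handling after the failed resolve rather than at the info-level endpoint mismatches, but this is only a reading of the log, not a confirmed diagnosis.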

@philip-davis
Collaborator

The lack of error checking is an issue that I will address. Given the version of libfabric, it's possible that this is an MR_CACHE issue: there seem to be some problems caused by the default MR_CACHE being used with the rxm;verbs provider. I am working to verify this now, but I am having some trouble accessing Summit.

If this is an MR_CACHE issue, it should be resolvable by either using libfabric 1.9.0, or setting the environment variable FI_MR_CACHE_MAX_COUNT to 0.
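For reference, the environment-variable workaround could be applied as follows before launching the test. This is a sketch only: FI_MR_CACHE_MAX_COUNT is the variable named above, FI_LOG_LEVEL=info is an assumption about how the info-level output is being produced, and the launch line is left as a comment because the exact command depends on the job setup.

```shell
# Disable the libfabric memory-registration cache (variable named above).
export FI_MR_CACHE_MAX_COUNT=0

# Assumption: info-level logging, so the log shows whether the setting was read.
export FI_LOG_LEVEL=info

# Show the libfabric-related variables the launched process will inherit.
env | grep '^FI_'

# Then launch the failing test from this same shell, for example:
# <your launch command>
```

If the setting is picked up, the info log should contain a line like "read long var mr_cache_max_count=0".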

@giltirn
Author

giltirn commented Oct 8, 2020

Hi Philip, thanks for the reply. Unfortunately, export FI_MR_CACHE_MAX_COUNT=0 did not seem to solve the issue:

libfabric:110264:core:core:fi_param_get_():279<info> variable perf_cntr=<not set>
libfabric:110264:core:core:fi_param_get_():279<info> variable hook=<not set>
libfabric:110264:core:core:ofi_hmem_init():200<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:110264:core:core:ofi_hmem_init():200<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:110264:core:core:ofi_hmem_init():200<info> Hmem iface FI_HMEM_ZE not supported
libfabric:110264:core:core:fi_param_get_():279<info> variable mr_cache_max_size=<not set>
libfabric:110264:core:core:fi_param_get_():305<info> read long var mr_cache_max_count=0
libfabric:110264:core:core:fi_param_get_():279<info> variable mr_cache_monitor=<not set>
libfabric:110264:core:core:fi_param_get_():279<info> variable mr_cuda_cache_monitor_enabled=<not set>
libfabric:110264:core:core:fi_param_get_():279<info> variable mr_rocr_cache_monitor_enabled=<not set>
libfabric:110264:core:mr:ofi_default_cache_size():68<info> default cache size=1844921809
libfabric:110264:core:core:fi_param_get_():279<info> variable universe_size=<not set>
libfabric:110264:core:core:fi_param_get_():279<info> variable provider=<not set>
libfabric:110264:core:core:fi_param_get_():279<info> variable provider_path=<not set>
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable buffer_size=<not set>
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable msg_tx_size=<not set>
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable msg_rx_size=<not set>
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable cm_progress_interval=<not set>
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable cq_eq_fairness=<not set>
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable data_auto_progress=<not set>
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable def_wait_obj=<not set>
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable def_tcp_wait_obj=<not set>
libfabric:110264:core:core:ofi_register_provider():403<info> registering provider: ofi_rxm (111.0)
libfabric:110264:verbs:core:fi_param_get_():279<info> variable tx_size=<not set>
libfabric:110264:verbs:core:fi_param_get_():279<info> variable rx_size=<not set>
libfabric:110264:verbs:core:fi_param_get_():279<info> variable tx_iov_limit=<not set>
libfabric:110264:verbs:core:fi_param_get_():279<info> variable rx_iov_limit=<not set>
libfabric:110264:verbs:core:fi_param_get_():279<info> variable inline_size=<not set>
libfabric:110264:verbs:core:fi_param_get_():279<info> variable min_rnr_timer=<not set>
libfabric:110264:verbs:core:fi_param_get_():279<info> variable use_odp=<not set>
libfabric:110264:verbs:core:fi_param_get_():279<info> variable prefer_xrc=<not set>
libfabric:110264:verbs:core:fi_param_get_():279<info> variable xrcd_filename=<not set>
libfabric:110264:verbs:core:fi_param_get_():279<info> variable cqread_bunch_size=<not set>
libfabric:110264:verbs:core:fi_param_get_():279<info> variable gid_idx=<not set>
libfabric:110264:verbs:core:fi_param_get_():279<info> variable device_name=<not set>
libfabric:110264:verbs:core:fi_param_get_():279<info> variable iface=<not set>
libfabric:110264:verbs:core:fi_param_get_():279<info> variable dgram_use_name_server=<not set>
libfabric:110264:verbs:core:fi_param_get_():279<info> variable dgram_name_server_port=<not set>
libfabric:110264:verbs:fabric:verbs_devs_print():872<info> list of verbs devices found for FI_EP_MSG:
libfabric:110264:verbs:fabric:verbs_devs_print():876<info> #1 mlx5_0 - IPoIB addresses:
libfabric:110264:verbs:fabric:verbs_devs_print():886<info> 	10.41.9.56
libfabric:110264:verbs:fabric:verbs_devs_print():886<info> 	fe80::ee0d:9a03:8f:f200
libfabric:110264:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_1: first found active port is 1
libfabric:110264:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_1: first found active port is 1
libfabric:110264:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_1: first found active port is 1
libfabric:110264:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_3: first found active port is 1
libfabric:110264:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_3: first found active port is 1
libfabric:110264:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_3: first found active port is 1
libfabric:110264:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:110264:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:110264:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_0: first found active port is 1
libfabric:110264:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_2: first found active port is 1
libfabric:110264:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_2: first found active port is 1
libfabric:110264:verbs:fabric:vrb_get_device_attrs():618<info> device mlx5_2: first found active port is 1
libfabric:110264:core:core:ofi_register_provider():403<info> registering provider: verbs (111.0)
libfabric:110264:ofi_mrail:core:fi_param_get_():279<info> variable config=<not set>
libfabric:110264:ofi_mrail:core:fi_param_get_():279<info> variable addr=<not set>
libfabric:110264:ofi_mrail:core:fi_param_get_():279<info> variable addr_strc=<not set>
libfabric:110264:ofi_mrail:core:mrail_parse_env_vars():115<info> unable to read FI_OFI_MRAIL_ADDR env variable
libfabric:110264:core:core:ofi_register_provider():403<info> registering provider: ofi_mrail (111.0)
libfabric:110264:core:core:ofi_register_provider():403<info> registering provider: ofi_hook_perf (111.0)
libfabric:110264:core:core:ofi_register_provider():403<info> registering provider: ofi_hook_debug (111.0)
libfabric:110264:core:core:ofi_register_provider():403<info> registering provider: ofi_hook_noop (111.0)
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:core:core:fi_getinfo_():1001<info> fi_getinfo: provider verbs returned -61 (No data available)
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:110264:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_rxm
libfabric:110264:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_rxm
libfabric:110264:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:110264:ofi_mrail:fabric:mrail_get_core_info():288<info> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:110264:core:core:fi_getinfo_():1001<info> fi_getinfo: provider ofi_mrail returned -61 (No data available)
[New Thread 0x200014b6f180 (LWP 110281)]
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_MSG
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_RDM
libfabric:110264:core:core:fi_getinfo_():1001<info> fi_getinfo: provider verbs returned -61 (No data available)
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:110264:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_rxm
libfabric:110264:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_rxm
libfabric:110264:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:110264:ofi_mrail:fabric:mrail_get_core_info():288<info> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:110264:core:core:fi_getinfo_():1001<info> fi_getinfo: provider ofi_mrail returned -61 (No data available)
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1-dgram
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_3-dgram
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_0-dgram
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:110264:verbs:fabric:vrb_get_matching_info():1490<info> hints->ep_attr->rx_ctx_cnt != FI_SHARED_CONTEXT. Skipping XRC FI_EP_MSG endpoints
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_2-dgram
libfabric:110264:core:core:ofi_layering_ok():893<info> Need core provider, skipping ofi_mrail
libfabric:110264:core:core:fi_fabric_():1201<info> Opened fabric: IB-0xfe80000000000000
libfabric:110264:core:core:fi_fabric_():1201<info> Opened fabric: IB-0xfe80000000000000
libfabric:110264:ofi_rxm:core:fi_param_get_():279<info> variable use_srx=<not set>
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #1 mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1515<info> adding fi_info for domain: mlx5_1
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #2 mlx5_1-xrc
libfabric:110264:verbs:core:vrb_check_hints():262<info> skipping device mlx5_1-xrc (want mlx5_1)
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #3 mlx5_1-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #4 mlx5_3
libfabric:110264:verbs:core:vrb_check_hints():262<info> skipping device mlx5_3 (want mlx5_1)
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #5 mlx5_3-xrc
libfabric:110264:verbs:core:vrb_check_hints():262<info> skipping device mlx5_3-xrc (want mlx5_1)
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #6 mlx5_3-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #7 mlx5_0
libfabric:110264:verbs:core:vrb_check_hints():262<info> skipping device mlx5_0 (want mlx5_1)
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #8 mlx5_0-xrc
libfabric:110264:verbs:core:vrb_check_hints():262<info> skipping device mlx5_0-xrc (want mlx5_1)
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #9 mlx5_0-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #10 mlx5_2
libfabric:110264:verbs:core:vrb_check_hints():262<info> skipping device mlx5_2 (want mlx5_1)
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #11 mlx5_2-xrc
libfabric:110264:verbs:core:vrb_check_hints():262<info> skipping device mlx5_2-xrc (want mlx5_1)
libfabric:110264:verbs:fabric:vrb_get_matching_info():1469<info> checking domain: #12 mlx5_2-dgram
libfabric:110264:verbs:core:ofi_check_ep_type():657<info> unsupported endpoint type
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Supported: FI_EP_DGRAM
libfabric:110264:verbs:core:ofi_check_ep_type():658<info> Requested: FI_EP_MSG
libfabric:110264:verbs:fabric:vrb_get_rai_id():281<info> rdma_resolve_addr: Invalid argument(22)
libfabric:110264:verbs:fabric:vrb_get_rai_id():282<info> src addr: fi_sockaddr_ib://[fe80::ec0d:9a03:8f:f201]:0xffff:0x13f:0x0

Program received signal SIGSEGV, Segmentation fault.
0x00002000019bbf00 in ofi_straddr_log_internal ()
   from /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/lib/libfabric.so.1
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.ppc64le libffi-3.0.13-18.el7.ppc64le libibverbs-41mlnx1-OFED.4.7.0.0.2.47329.ppc64le libmlx4-41mlnx1-OFED.4.7.3.0.3.47329.ppc64le libmlx5-41mlnx1-OFED.4.7.0.3.3.47329.ppc64le libnl3-3.2.28-4.el7.ppc64le librdmacm-41mlnx1-OFED.4.7.3.0.6.47329.ppc64le librxe-41mlnx1-OFED.4.4.2.4.6.47329.ppc64le numactl-libs-2.0.9-7.el7.ppc64le
(gdb) bt
#0  0x00002000019bbf00 in ofi_straddr_log_internal ()
   from /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/lib/libfabric.so.1
#1  0x00002000019e9e9c in vrb_get_rai_id ()
   from /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/lib/libfabric.so.1
#2  0x00002000019fa2a4 in vrb_getinfo ()
   from /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/lib/libfabric.so.1
#3  0x00002000019af5b8 in fi_getinfo_ ()
   from /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/lib/libfabric.so.1
#4  0x00002000019cc0e8 in ofi_get_core_info ()
   from /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/lib/libfabric.so.1
#5  0x0000200001a0725c in rxm_domain_open ()
   from /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/lib/libfabric.so.1
#6  0x0000000010030154 in fi_domain (fabric=0x113b3b30, info=0x1137aee0, 
    domain=0x1137b460, context=0x0)
    at /autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/include/rdma/fi_domain.h:308
#7  0x0000000010030c08 in init_fabric (fabric=0x1137b430, 
    Params=0x7fffffff19e8)
    at /ccs/home/ckelly/src/ADIOS2_latest/source/adios2/toolkit/sst/dp/rdma_dp.c:233
#8  0x00000000100310a4 in RdmaInitWriter (Svcs=0x10051c40 <Svcs>, 
    CP_Stream=0x10939ef0, Params=0x7fffffff19e8, DPAttrs=0x1137be20, 
    Stats=0x10939f30)
    at /ccs/home/ckelly/src/ADIOS2_latest/source/adios2/toolkit/sst/dp/rdma_dp.c:473
#9  0x0000000010015d9c in SstWriterOpen (Name=0x10033550 "SstConnToolTemp", 
    Params=0x7fffffff19e8, comm=0x10051f28 <CommWorld>)
    at /ccs/home/ckelly/src/ADIOS2_latest/source/adios2/toolkit/sst/cp/cp_writer.c:1321
#10 0x0000000010009e5c in do_listen ()
    at /ccs/home/ckelly/src/ADIOS2_latest/source/adios2/toolkit/sst/util/sst_conn_tool.c:350
#11 0x0000000010009738 in main (argc=1, argv=0x7fffffff1f28)
    at /ccs/home/ckelly/src/ADIOS2_latest/source/adios2/toolkit/sst/util/sst_conn_tool.c:184

@giltirn
Author

giltirn commented Oct 8, 2020

I notice that my install of libfabric 1.11 was built by spack for one of our dependencies. The system has a libfabric 1.7 module. If I build ADIOS against that module instead, I no longer get the error, suggesting it is an issue either with the more recent libfabric or with the way it was built.

@giltirn
Author

giltirn commented Oct 8, 2020

Is it possible that the install of 1.11 simply was not built with the correct interfaces? Comparing the configure command lines for the Summit 1.7 module and the spack 1.11 install:

==> [2020-05-12-09:50:35.552247] '/autofs/nccs-svm1_sw/summit/.swci/1-compute/var/spack/stage/libfabric-1.7.0-q2ncswaqkeg3upfwddwevwzsqkxpos4p/libfabric-1.7.0/configure' '--prefix=/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-9.1.0/libfabric-1.7.0-q2ncswaqkeg3upfwddwevwzsqkxpos4p' '--enable-psm=no' '--enable-psm2=no' '--enable-sockets=yes' '--enable-verbs=no' '--enable-usnic=no' '--enable-gni=no' '--enable-xpmem=no' '--enable-udp=no' '--enable-rxm=no' '--enable-rxd=no' '--enable-mlx=no'

==> [2020-10-07-10:03:29.216041] '/tmp/ckelly/spack-stage/spack-stage-libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y/spack-src/configure' '--prefix=/autofs/nccs-svm1_home1/ckelly/install/spack/spack/opt/spack/linux-rhel7-power9le/gcc-9.1.0/libfabric-1.11.0-rby5hc3zlikqxmrbl3toj7lgyb3bpw2y' '--with-kdreg=no' '--enable-psm=no' '--enable-psm2=no' '--enable-sockets=no' '--enable-verbs=yes' '--enable-usnic=no' '--enable-gni=no' '--enable-xpmem=no' '--enable-udp=no' '--enable-rxm=yes' '--enable-rxd=no' '--enable-mlx=no' '--enable-tcp=no' '--enable-efa=no' '--enable-mrail=yes' '--enable-shm=no'

we observe that 1.7 was built only with "sockets", whereas 1.11 was built with "rxm", "mrail", and "verbs" but not "sockets".
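
For anyone who wants to repeat this comparison on their own builds, here is a minimal sketch that extracts the enabled providers from two configure command lines and diffs them. The flag lists are copied from the two invocations above; the helper name `enabled_providers` and the variable names are hypothetical and not part of spack or libfabric.

```python
import re

def enabled_providers(configure_args):
    """Return the set of providers passed as --enable-<name>=yes."""
    return set(re.findall(r"--enable-([A-Za-z0-9_]+)=yes", configure_args))

# Flags copied from the libfabric 1.7.0 (Summit module) configure line.
flags_1_7 = (
    "--enable-psm=no --enable-psm2=no --enable-sockets=yes --enable-verbs=no "
    "--enable-usnic=no --enable-gni=no --enable-xpmem=no --enable-udp=no "
    "--enable-rxm=no --enable-rxd=no --enable-mlx=no"
)

# Flags copied from the libfabric 1.11.0 (spack) configure line.
flags_1_11 = (
    "--with-kdreg=no --enable-psm=no --enable-psm2=no --enable-sockets=no "
    "--enable-verbs=yes --enable-usnic=no --enable-gni=no --enable-xpmem=no "
    "--enable-udp=no --enable-rxm=yes --enable-rxd=no --enable-mlx=no "
    "--enable-tcp=no --enable-efa=no --enable-mrail=yes --enable-shm=no"
)

old = enabled_providers(flags_1_7)
new = enabled_providers(flags_1_11)

print("only in 1.7: ", sorted(old - new))   # ['sockets']
print("only in 1.11:", sorted(new - old))   # ['mrail', 'rxm', 'verbs']
```

The two builds share no enabled provider, so the 1.7 run and the 1.11 run exercise entirely different code paths in libfabric.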

@philip-davis
Collaborator

I would recommend building libfabric 1.9.0; there seems to be some incompatibility with 1.11.0 that I am investigating. The system install of libfabric 1.7.0 does not offer RDMA support, so SST is falling back to sockets support instead.

@giltirn
Copy link
Author

giltirn commented Oct 8, 2020

I rebuilt with 1.9 and indeed the problem seems to be fixed. Thank you for your help!

@giltirn giltirn closed this as completed Oct 8, 2020