Now you should run one of the following depending on your shell
source /share/apps/python/miniconda24.4.0/etc/profile.d/conda.sh
source /share/apps/python/miniconda24.4.0/etc/profile.d/conda.csh
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: a100-02
Local device: mlx5_2
--------------------------------------------------------------------------
[a100-02:261152] 7 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[a100-02:261152] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
terminate called after throwing an instance of 'raft::cuda_error'
what(): CUDA error encountered at: file=/people/ghos167/.conda/envs/cugraph-ldgpu2/include/raft/util/cudart_utils.hpp line=148:
terminate called after throwing an instance of 'raft::cuda_error'
what(): CUDA error encountered at: file=/people/ghos167/.conda/envs/cugraph-ldgpu2/include/raft/util/cudart_utils.hpp line=148:
[a100-02:261159] *** Process received signal ***
[a100-02:261159] Signal: Aborted (6)
[a100-02:261159] Signal code: (-6)
[a100-02:261159] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2ab05f648630]
[a100-02:261159] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2ab05f88b387]
[a100-02:261159] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2ab05f88ca78]
[a100-02:261159] [ 3] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xc0)[0x2ab05f20bf9e]
[a100-02:261159] [ 4] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(+0xb64e2)[0x2ab05f20a4e2]
[a100-02:261159] [ 5] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(_ZSt10unexpectedv+0x0)[0x2ab05f2042e3]
[a100-02:261159] [ 6] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(__cxa_rethrow+0x0)[0x2ab05f20a702]
[a100-02:261159] [ 7]
[a100-02:261162] *** Process received signal ***
[a100-02:261162] Signal: Aborted (6)
[a100-02:261162] Signal code: (-6)
[a100-02:261162] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2acc07d81630]
[a100-02:261162] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2acc07fc4387]
[a100-02:261162] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2acc07fc5a78]
[a100-02:261162] [ 3] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xc0)[0x2acc07944f9e]
[a100-02:261162] [ 4] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(+0xb64e2)[0x2acc079434e2]
[a100-02:261162] [ 5] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(_ZSt10unexpectedv+0x0)[0x2acc0793d2e3]
[a100-02:261162] [ 6] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libstdc++.so.6(__cxa_rethrow+0x0)[0x2acc07943702]
[a100-02:261162] [ 7] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN4raft4copyImEEvPT_PKS1_mN3rmm16cuda_stream_viewE+0x1cc)[0x2acbbe092d5c]
[a100-02:261162] [ 8] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph21host_scalar_allgatherImEENSt9enable_ifIXsrSt13is_arithmeticIT_E5valueESt6vectorIS3_SaIS3_EEE4typeERKN4raft5comms7comms_tES3_P11CUstream_st+0x36e)[0x2acbbe10083e]
[a100-02:261162] [ 9] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN4raft4copyImEEvPT_PKS1_mN3rmm16cuda_stream_viewE+0x1cc)[0x2ab015959d5c]
[a100-02:261159] [ 8] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph21host_scalar_allgatherImEENSt9enable_ifIXsrSt13is_arithmeticIT_E5valueESt6vectorIS3_SaIS3_EEE4typeERKN4raft5comms7comms_tES3_P11CUstream_st+0x36e)[0x2ab0159c783e]
[a100-02:261159] [ 9] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph24update_edge_dst_propertyINS_12graph_view_tIiiLb0ELb1EvEEPiN6thrust17constant_iteratorIbNS4_11use_defaultES6_EEEEvRKN4raft8handle_tERKT_T0_SF_T1_RNS_19edge_dst_property_tISC_NSt15iterator_traitsISG_E10value_typeEEEb+0x151)[0x2acbc0924c01]
[a100-02:261162] [10]
/people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph24update_edge_dst_propertyINS_12graph_view_tIiiLb0ELb1EvEEPiN6thrust17constant_iteratorIbNS4_11use_defaultES6_EEEEvRKN4raft8handle_tERKT_T0_SF_T1_RNS_19edge_dst_property_tISC_NSt15iterator_traitsISG_E10value_typeEEEb+0x151)[0x2ab0181ebc01]
[a100-02:261159] [10] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph6detail3bfsINS_12graph_view_tIiiLb0ELb1EvEEN6thrust16discard_iteratorINS4_11use_defaultEEEEEvRKN4raft8handle_tERKT_PNSC_11vertex_typeET0_PKSF_mbSF_b+0x545)[0x2acbc0928585]
[a100-02:261162] [11] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph6detail3bfsINS_12graph_view_tIiiLb0ELb1EvEEN6thrust16discard_iteratorINS4_11use_defaultEEEEEvRKN4raft8handle_tERKT_PNSC_11vertex_typeET0_PKSF_mbSF_b+0x545)[0x2ab0181ef585]
[a100-02:261159] [11] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph3bfsIiiLb1EEEvRKN4raft8handle_tERKNS_12graph_view_tIT_T0_Lb0EXT1_EvEEPS6_SB_PKS6_mbS6_b+0x49)[0x2acbc092c149]
[a100-02:261162] [12] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x443bca]
[a100-02:261162] [13] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x41b7e6]
[a100-02:261162] [14] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2acc07fb0555]
[a100-02:261162] [15] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x418be9]
[a100-02:261162] *** End of error message ***
/people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph3bfsIiiLb1EEEvRKN4raft8handle_tERKNS_12graph_view_tIT_T0_Lb0EXT1_EvEEPS6_SB_PKS6_mbS6_b+0x49)[0x2ab0181f3149]
[a100-02:261159] [12] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x443bca]
[a100-02:261159] [13] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x41b7e6]
[a100-02:261159] [14] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab05f877555]
[a100-02:261159] [15]
/people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x418be9]
[a100-02:261159] *** End of error message ***
[a100-02:261163:0:261163] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[a100-02:261164:0:261164] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 261163) ====
0 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libucs.so.0(ucs_handle_error+0x2fd) [0x2ac3f65e9fed]
1 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libucs.so.0(+0x2a1e1) [0x2ac3f65ea1e1]
2 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libucs.so.0(+0x2a3aa) [0x2ac3f65ea3aa]
3 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c245) [0x2ac362662245]
4 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c996) [0x2ac362662996]
5 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3db73) [0x2ac362663b73]
6 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x37cd9) [0x2ac36265dcd9]
7 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(pncclAllGather+0x1cc) [0x2ac362653a1c]
8 /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x42f6e0]
9 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph21host_scalar_allgatherImEENSt9enable_ifIXsrSt13is_arithmeticIT_E5valueESt6vectorIS3_SaIS3_EEE4typeERKN4raft5comms7comms_tES3_P11CUstream_st+0x2c2) [0x2ac32645f792]
10 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph24update_edge_dst_propertyINS_12graph_view_tIiiLb0ELb1EvEEPiN6thrust17constant_iteratorIbNS4_11use_defaultES6_EEEEvRKN4raft8handle_tERKT_T0_SF_T1_RNS_19edge_dst_property_tISC_NSt15iterator_traitsISG_E10value_typeEEEb+0x151) [0x2ac328c83c01]
11 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph6detail3bfsINS_12graph_view_tIiiLb0ELb1EvEEN6thrust16discard_iteratorINS4_11use_defaultEEEEEvRKN4raft8handle_tERKT_PNSC_11vertex_typeET0_PKSF_mbSF_b+0x545) [0x2ac328c87585]
12
/people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph3bfsIiiLb1EEEvRKN4raft8handle_tERKNS_12graph_view_tIT_T0_Lb0EXT1_EvEEPS6_SB_PKS6_mbS6_b+0x49) [0x2ac328c8b149]
13 /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x443bca]
14 /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x41b7e6]
15 /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x2ac37030f555]
16 /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x418be9]
=================================
[a100-02:261163] *** Process received signal ***
[a100-02:261163] Signal: Segmentation fault (11)
[a100-02:261163] Signal code: (-6)
[a100-02:261163] Failing at address: 0x32bec0003fc2b
[a100-02:261163] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2ac3700e0630]
[a100-02:261163] [ 1] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c245)[0x2ac362662245]
[a100-02:261163] [ 2] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c996)[0x2ac362662996]
[a100-02:261163] [ 3] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3db73)[0x2ac362663b73]
[a100-02:261163] [ 4] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x37cd9)[0x2ac36265dcd9]
[a100-02:261163] [ 5] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(pncclAllGather+0x1cc)[0x2ac362653a1c]
[a100-02:261163] [ 6] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x42f6e0]
[a100-02:261163] [ 7] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph21host_scalar_allgatherImEENSt9enable_ifIXsrSt13is_arithmeticIT_E5valueESt6vectorIS3_SaIS3_EEE4typeERKN4raft5comms7comms_tES3_P11CUstream_st+0x2c2)[0x2ac32645f792]
[a100-02:261163] [ 8]
/people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph24update_edge_dst_propertyINS_12graph_view_tIiiLb0ELb1EvEEPiN6thrust17constant_iteratorIbNS4_11use_defaultES6_EEEEvRKN4raft8handle_tERKT_T0_SF_T1_RNS_19edge_dst_property_tISC_NSt15iterator_traitsISG_E10value_typeEEEb+0x151)[0x2ac328c83c01]
[a100-02:261163] [ 9] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph6detail3bfsINS_12graph_view_tIiiLb0ELb1EvEEN6thrust16discard_iteratorINS4_11use_defaultEEEEEvRKN4raft8handle_tERKT_PNSC_11vertex_typeET0_PKSF_mbSF_b+0x545)[0x2ac328c87585]
[a100-02:261163] [10]
==== backtrace (tid: 261164) ====
0 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libucs.so.0(ucs_handle_error+0x2fd) [0x2b8816494fed]
1 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libucs.so.0(+0x2a1e1) [0x2b88164951e1]
2 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libucs.so.0(+0x2a3aa) [0x2b88164953aa]
3 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c245) [0x2b878250d245]
4 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c996) [0x2b878250d996]
5 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3db73) [0x2b878250eb73]
6 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x37cd9) [0x2b8782508cd9]
7 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(pncclAllGather+0x1cc) [0x2b87824fea1c]
8 /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x42f6e0]
9 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph21host_scalar_allgatherImEENSt9enable_ifIXsrSt13is_arithmeticIT_E5valueESt6vectorIS3_SaIS3_EEE4typeERKN4raft5comms7comms_tES3_P11CUstream_st+0x2c2) [0x2b874630a792]
10 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph24update_edge_dst_propertyINS_12graph_view_tIiiLb0ELb1EvEEPiN6thrust17constant_iteratorIbNS4_11use_defaultES6_EEEEvRKN4raft8handle_tERKT_T0_SF_T1_RNS_19edge_dst_property_tISC_NSt15iterator_traitsISG_E10value_typeEEEb+0x151)
[0x2b8748b2ec01]
11 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph6detail3bfsINS_12graph_view_tIiiLb0ELb1EvEEN6thrust16discard_iteratorINS4_11use_defaultEEEEEvRKN4raft8handle_tERKT_PNSC_11vertex_typeET0_PKSF_mbSF_b+0x545) [0x2b8748b32585]
12 /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph3bfsIiiLb1EEEvRKN4raft8handle_tERKNS_12graph_view_tIT_T0_Lb0EXT1_EvEEPS6_SB_PKS6_mbS6_b+0x49) [0x2b8748b36149]
13 /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x443bca]
14 /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x41b7e6]
15 /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x2b87901ba555]
16 /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test() [0x418be9]
=================================
/people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph3bfsIiiLb1EEEvRKN4raft8handle_tERKNS_12graph_view_tIT_T0_Lb0EXT1_EvEEPS6_SB_PKS6_mbS6_b+0x49)[0x2ac328c8b149]
[a100-02:261163] [11]
[a100-02:261164] *** Process received signal ***
/people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x443bca]
[a100-02:261163] [12]
[a100-02:261164] Signal: Segmentation fault (11)
[a100-02:261164] Signal code: (-6)
[a100-02:261164] Failing at address: 0x32bec0003fc2c
/people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x41b7e6]
[a100-02:261163] [13] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac37030f555]
[a100-02:261163] [14] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x418be9]
[a100-02:261163] *** End of error message ***
[a100-02:261164] [ 0] /usr/lib64/libpthread.so.0(+0xf630)[0x2b878ff8b630]
[a100-02:261164] [ 1] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c245)[0x2b878250d245]
[a100-02:261164] [ 2] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3c996)[0x2b878250d996]
[a100-02:261164] [ 3] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x3db73)[0x2b878250eb73]
[a100-02:261164] [
4] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(+0x37cd9)[0x2b8782508cd9]
[a100-02:261164] [ 5] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libnccl.so.2(pncclAllGather+0x1cc)[0x2b87824fea1c]
[a100-02:261164] [ 6] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x42f6e0]
[a100-02:261164] [ 7] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph21host_scalar_allgatherImEENSt9enable_ifIXsrSt13is_arithmeticIT_E5valueESt6vectorIS3_SaIS3_EEE4typeERKN4raft5comms7comms_tES3_P11CUstream_st+0x2c2)[0x2b874630a792]
[a100-02:261164] [ 8] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph24update_edge_dst_propertyINS_12graph_view_tIiiLb0ELb1EvEEPiN6thrust17constant_iteratorIbNS4_11use_defaultES6_EEEEvRKN4raft8handle_tERKT_T0_SF_T1_RNS_19edge_dst_property_tISC_NSt15iterator_traitsISG_E10value_typeEEEb+0x151)[0x2b8748b2ec01]
[a100-02:261164] [ 9] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph6detail3bfsINS_12graph_view_tIiiLb0ELb1EvEEN6thrust16discard_iteratorINS4_11use_defaultEEEEEvRKN4raft8handle_tERKT_PNSC_11vertex_typeET0_PKSF_mbSF_b+0x545)[0x2b8748b32585]
[a100-02:261164] [10] /people/ghos167/.conda/envs/cugraph-ldgpu2/lib/libcugraph.so(_ZN7cugraph3bfsIiiLb1EEEvRKN4raft8handle_tERKNS_12graph_view_tIT_T0_Lb0EXT1_EvEEPS6_SB_PKS6_mbS6_b+0x49)[0x2b8748b36149]
[a100-02:261164] [11] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x443bca]
[a100-02:261164] [12] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x41b7e6]
[a100-02:261164] [13] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b87901ba555]
[a100-02:261164] [14] /people/ghos167/proj/cugraph-maximal-matching/mg-graph/./mg_test[0x418be9]
[a100-02:261164] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 261163 on node a100-02 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------