cuIpcGetMemHandle failure resulting in CUDA-aware MPI to fail #1398

Closed
luraess opened this issue Feb 23, 2022 · 5 comments
Labels
bug Something isn't working

luraess commented Feb 23, 2022

Starting with CUDA v3.8.2, CUDA-aware MPI fails with the following error, resulting in a segfault:

--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0xa15822400
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------

This failure was known in the past and required setting export JULIA_CUDA_MEMORY_POOL=none.

With CUDA v3.8.1, everything works fine when exporting JULIA_CUDA_MEMORY_POOL=none. This suggests that a commit between CUDA v3.8.1 and v3.8.2 introduced the bug. The (only) suspect may be #1383

To reproduce, you can run the ImplicitGlobalGrid test_update_halo.jl using MPI:

mpirun -np 2 julia ~/.julia/packages/ImplicitGlobalGrid/<hash>/test/test_update_halo.jl

The test produces warnings rather than a segfault; in the application, it segfaults.
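
For reference, below is a minimal stand-alone sketch of the failing pattern (device buffers passed directly to MPI), assuming MPI.jl's positional point-to-point API (as in v0.19) and a CUDA-aware MPI build; the file name and buffer size are illustrative and not taken from the ImplicitGlobalGrid test:

# cuda_aware_mpi_repro.jl -- hypothetical minimal reproducer, not the IGG test.
# Passing CuArrays directly to MPI exercises the GPU IPC path
# (cuIpcGetMemHandle), which is sensitive to how CUDA.jl allocates memory
# (hence the JULIA_CUDA_MEMORY_POOL=none workaround mentioned above).
using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# One GPU per rank on a multi-GPU node (0-based device ordinals).
CUDA.device!(rank % length(CUDA.devices()))

send = CUDA.fill(Float64(rank), 1024)   # device buffer handed straight to MPI
recv = CUDA.zeros(Float64, 1024)

if rank == 0
    MPI.Send(send, 1, 0, comm)          # positional form: (buf, dest, tag, comm)
elseif rank == 1
    MPI.Recv!(recv, 0, 0, comm)
end

MPI.Finalize()

Something along these lines, launched with e.g. mpirun -np 2 julia cuda_aware_mpi_repro.jl, should exercise the same code path as the halo update in the test above.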

This occurs on Julia 1.7.1 and

julia> CUDA.versioninfo()
CUDA toolkit 11.4, local installation
NVIDIA driver 470.103.1, for CUDA 11.4
CUDA driver 11.6

Libraries: 
- CUBLAS: 11.6.5
- CURAND: 10.2.5
- CUFFT: 10.5.2
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+470.103.1
- CUDNN: missing
- CUTENSOR: missing

Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_MEMORY_POOL: none
- JULIA_CUDA_USE_BINARYBUILDER: false

8 devices:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  1: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  2: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  3: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  4: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  5: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  6: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  7: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
@luraess luraess added the bug Something isn't working label Feb 23, 2022
@luraess luraess changed the title Call to cuIpcGetMemHandle failure resulting in CUDA-aware MPI to fail cuIpcGetMemHandle failure resulting in CUDA-aware MPI to fail Feb 23, 2022

maleadt commented Feb 23, 2022

Dup: #1053

@maleadt maleadt closed this as completed Feb 23, 2022

maleadt commented Feb 23, 2022

> With CUDA v3.8.1, everything works fine when exporting JULIA_CUDA_MEMORY_POOL=none. This suggests that a commit between CUDA v3.8.1 and v3.8.2 introduced the bug. The (only) suspect may be #1383

Sorry, missed that bit. Fixed that in #1397

luraess commented Feb 23, 2022

Thanks for your rapid reply and the fix! Would there be a way to integrate some of the low-level ImplicitGlobalGrid tests into the CUDA.jl test suite? This may help catch low-level issues before they turn into bugs.

maleadt commented Feb 23, 2022

We could do reverse CI, but that doesn't scale well. Wouldn't it be possible to do CI on the ImplicitGlobalGrid repo using CUDA.jl#master, and do that automatically every day/week or so?
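
For what it's worth, a minimal sketch of what such a downstream CI step could look like on the ImplicitGlobalGrid side (hypothetical, assuming the test suite itself needs no changes); the daily/weekly scheduling would live in the repository's CI configuration:

# Hypothetical CI step: test ImplicitGlobalGrid against the development
# version of CUDA.jl so regressions on CUDA.jl#master surface downstream.
using Pkg
Pkg.add(PackageSpec(name="CUDA", rev="master"))  # track CUDA.jl#master
Pkg.test("ImplicitGlobalGrid")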

luraess commented Feb 23, 2022

Yeah, we could try running some weekly CI on IGG using CUDA#master. We plan to set up (multi-)GPU CI at CSCS for IGG and PS in order to have access to various GPU and MPI archs/builds. I guess your suggestion could then fit within that frame.
