cuIpcGetMemHandle failure resulting in CUDA-aware MPI to fail #1398

Closed
luraess opened this issue Feb 23, 2022 · 5 comments
Labels
bug Something isn't working

luraess commented Feb 23, 2022

Starting with CUDA v3.8.2, CUDA-aware MPI fails with the following error, resulting in a segfault:

--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0xa15822400
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------

This failure was known in the past and required setting export JULIA_CUDA_MEMORY_POOL=none.

With CUDA v3.8.1, everything works fine when exporting JULIA_CUDA_MEMORY_POOL=none. This suggests that a commit between CUDA v3.8.1 and v3.8.2 introduced the bug. The (only) suspect may be #1383

To reproduce, you can run the ImplicitGlobalGrid test_update_halo.jl using MPI:

mpirun -np 2 julia ~/.julia/packages/ImplicitGlobalGrid/<hash>/test/test_update_halo.jl

The test produces warnings rather than a segfault; in the application, it segfaults.
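
For reference, below is a minimal stand-alone sketch of the failing pattern (device buffers passed directly to MPI), assuming MPI.jl's positional point-to-point API (as in v0.19) and a CUDA-aware MPI build; the file name and buffer size are illustrative and not taken from the ImplicitGlobalGrid test:

# cuda_aware_mpi_repro.jl -- hypothetical minimal reproducer, not the IGG test.
# Passing CuArrays directly to MPI exercises the GPU IPC path
# (cuIpcGetMemHandle), which is sensitive to how CUDA.jl allocates memory
# (hence the JULIA_CUDA_MEMORY_POOL=none workaround mentioned above).
using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# One GPU per rank on a multi-GPU node (0-based device ordinals).
CUDA.device!(rank % length(CUDA.devices()))

send = CUDA.fill(Float64(rank), 1024)   # device buffer handed straight to MPI
recv = CUDA.zeros(Float64, 1024)

if rank == 0
    MPI.Send(send, 1, 0, comm)          # positional form: (buf, dest, tag, comm)
elseif rank == 1
    MPI.Recv!(recv, 0, 0, comm)
end

MPI.Finalize()

Something along these lines, launched with e.g. mpirun -np 2 julia cuda_aware_mpi_repro.jl, should exercise the same code path as the halo update in the test above.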

This occurs on Julia 1.7.1 and

julia> CUDA.versioninfo()
CUDA toolkit 11.4, local installation
NVIDIA driver 470.103.1, for CUDA 11.4
CUDA driver 11.6

Libraries: 
- CUBLAS: 11.6.5
- CURAND: 10.2.5
- CUFFT: 10.5.2
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+470.103.1
- CUDNN: missing
- CUTENSOR: missing

Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_MEMORY_POOL: none
- JULIA_CUDA_USE_BINARYBUILDER: false

8 devices:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  1: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  2: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  3: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  4: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  5: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  6: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  7: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
@luraess luraess added the bug Something isn't working label Feb 23, 2022
@luraess luraess changed the title Call to cuIpcGetMemHandle failure resulting in CUDA-aware MPI to fail cuIpcGetMemHandle failure resulting in CUDA-aware MPI to fail Feb 23, 2022

maleadt commented Feb 23, 2022

Dup: #1053

@maleadt maleadt closed this as completed Feb 23, 2022

maleadt commented Feb 23, 2022

> With CUDA v3.8.1, everything works fine when exporting JULIA_CUDA_MEMORY_POOL=none. This suggests that a commit between CUDA v3.8.1 and v3.8.2 introduced the bug. The (only) suspect may be #1383

Sorry, missed that bit. Fixed that in #1397

luraess commented Feb 23, 2022

Thanks for your rapid reply and the fix! Would there be a way to integrate some of the low-level ImplicitGlobalGrid tests into the CUDA.jl test suite? This may help catch low-level issues before they turn into bugs.

maleadt commented Feb 23, 2022

We could do reverse CI, but that doesn't scale well. Wouldn't it be possible to do CI on the ImplicitGlobalGrid repo using CUDA.jl#master, and do that automatically every day/week or so?
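
For what it's worth, a minimal sketch of what such a downstream CI step could look like on the ImplicitGlobalGrid side (hypothetical, assuming the test suite itself needs no changes); the daily/weekly scheduling would live in the repository's CI configuration:

# Hypothetical CI step: test ImplicitGlobalGrid against the development
# version of CUDA.jl so regressions on CUDA.jl#master surface downstream.
using Pkg
Pkg.add(PackageSpec(name="CUDA", rev="master"))  # track CUDA.jl#master
Pkg.test("ImplicitGlobalGrid")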

luraess commented Feb 23, 2022

Yeah, we could try running some weekly CI on IGG using CUDA#master. We plan to set up (multi-)GPU CI at CSCS for IGG and PS in order to have access to various GPU and MPI archs/builds. I guess your suggestion could then fit within that frame.
