How to query whether CUDA awareness is actually turned on or not at runtime? #7963

Closed
leofang opened this issue Jul 24, 2020 · 4 comments · Fixed by #7970

Comments

leofang commented Jul 24, 2020

Hi, I am aware of the build-time and runtime checks outlined in https://www.open-mpi.org/faq/?category=runcuda#mpi-cuda-aware-support, so please bear with me until the end.

We are looking for a runtime way to check whether CUDA awareness is actually turned on in Open MPI, see the original discussion in the mpi4py repo. It turns out that the existing API MPIX_Query_cuda_support() is useless. I quote @jsquyres:

Honestly, I don’t think we thought anyone was using it. 🙂

The reason is that it only tells us whether --with-cuda was set at build time. The smcuda btl could still be ejected (as is the default in conda-forge's openmpi package: conda-forge/openmpi-feedstock#56), in which case there is no CUDA awareness.

The situation becomes even more complicated when UCX is in use. As pointed out by @jsquyres, Open MPI could be built without CUDA support while UCX is built with it, and in this situation we still get CUDA awareness (but MPIX_Query_cuda_support() would return false)! For this, I am requesting that the UCX side support a runtime query (openucx/ucx#5471), which should be incorporated here if UCX is in use by Open MPI.

In short, it would be great if Open MPI could set up a mechanism (hopefully another public API, leaving the existing MPIX_Query_cuda_support() intact to avoid confusion) for us to query, at runtime:

  • Whether Open MPI's CUDA support is actually in effect. This needs to take into account:
    1. When Open MPI is not built against UCX: whether the smcuda btl is ejected or not
    2. When UCX is used: whether UCX has CUDA support (How to query CUDA support at runtime? openucx/ucx#5471)
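To make the two cases concrete, here is a minimal sketch of the decision I have in mind. Everything in it is hypothetical: `struct cuda_probe`, its fields, and `cuda_support_in_effect()` are illustration names only, not existing Open MPI or UCX API, and how each field would actually be populated is exactly the open question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the facts the requested query would need.
 * None of these fields are Open MPI or UCX API; they only name the inputs. */
struct cuda_probe {
    bool ompi_built_with_cuda;  /* what MPIX_Query_cuda_support() reports today */
    bool smcuda_btl_available;  /* false if the smcuda btl was ejected */
    bool ucx_in_use;            /* true if Open MPI is running on top of UCX */
    bool ucx_has_cuda;          /* UCX's own answer (see openucx/ucx#5471) */
};

/* Returns true if CUDA awareness is actually in effect for this run. */
bool cuda_support_in_effect(const struct cuda_probe *p)
{
    assert(p != NULL);
    if (p->ucx_in_use) {
        /* Case 2: UCX is used, so UCX's CUDA support decides. */
        return p->ucx_has_cuda;
    }
    /* Case 1: no UCX, so the build flag alone is not enough;
     * the smcuda btl must also still be present. */
    return p->ompi_built_with_cuda && p->smcuda_btl_available;
}
```

With the conda-forge default described above (built with CUDA, smcuda ejected, no UCX), this sketch returns false even though the build-time flag is true, which is the mismatch this issue is about.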

I hope I do not misunderstand the situation or miss any critical pieces of information. Thanks.

cc: @dalcinl

leofang commented Jul 24, 2020

2. When UCX is used: whether UCX has CUDA support

I didn't expect I could follow up here so quickly 😁 @yosefe and @bureddy kindly pointed out in openucx/ucx#5471 that ucp_context_query() can be used to query the CUDA support from UCX, and that Open MPI already uses this API during initialization (#7898). I think we can just take the information recorded there when UCX is in use, so half of the problem is resolved!
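As a rough illustration of what "taking the information recorded there" could look like: my understanding (an assumption on my part, not confirmed in this thread) is that the attributes filled in by ucp_context_query() report supported memory types as a bitmask with one bit per type. The sketch below models only that bit test. `demo_mem_type` and `mask_has_mem_type()` are hypothetical stand-ins and are not UCX's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins: these are NOT UCX's definitions, just a model
 * of "one bit per memory type" so the check can be shown in isolation. */
enum demo_mem_type { DEMO_MEM_HOST = 0, DEMO_MEM_CUDA = 1, DEMO_MEM_LAST };

/* True if the bit for `type` is set in the reported mask. */
bool mask_has_mem_type(uint64_t memory_types, enum demo_mem_type type)
{
    assert(type < DEMO_MEM_LAST);
    return (memory_types & (UINT64_C(1) << type)) != 0;
}
```

Open MPI would run a check of this shape once at initialization and cache the boolean, so a later runtime query would not need to touch UCX again.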

bureddy commented Jul 24, 2020

The situation becomes even more complicated when UCX is in use. As pointed out by @jsquyres, Open MPI could be built without CUDA support while UCX is built with it, and in this situation we still get CUDA awareness (but MPIX_Query_cuda_support() would return false)!

In this case CUDA awareness is only partial: UCX still depends on OMPI's CUDA support for collectives and CUDA datatype pack/unpack.
Ideally, both OMPI and UCX should be built with CUDA support.
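A small sketch of that distinction for the UCX case, in case it helps the API discussion. The enum and `classify_ucx_cuda()` are hypothetical names, not anything that exists in OMPI today. Treating "UCX without CUDA" as no awareness is an assumption made for the sketch, since only the other two combinations are described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical three-way answer instead of a plain yes/no. */
enum cuda_awareness {
    CUDA_AWARE_NONE,     /* no CUDA buffers should be passed to MPI */
    CUDA_AWARE_PARTIAL,  /* transfers may work, but not collectives or
                            CUDA datatype pack/unpack */
    CUDA_AWARE_FULL      /* both OMPI and UCX built with CUDA support */
};

/* Classify CUDA awareness when UCX is the transport in use. */
enum cuda_awareness classify_ucx_cuda(bool ompi_built_with_cuda,
                                      bool ucx_built_with_cuda)
{
    if (!ucx_built_with_cuda) {
        /* Assumption for this sketch: without CUDA in UCX, report none. */
        return CUDA_AWARE_NONE;
    }
    /* UCX has CUDA; OMPI's own support decides partial versus full. */
    return ompi_built_with_cuda ? CUDA_AWARE_FULL : CUDA_AWARE_PARTIAL;
}
```

A caller that only needs point-to-point transfers could accept `CUDA_AWARE_PARTIAL`, while one that uses collectives on device buffers would require `CUDA_AWARE_FULL`.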

@jsquyres
Member

@bureddy I don't know if you want to address this here or over at the UCX issue, but I think the customer ask is that MPIX_Query_cuda_support() be improved (or replaced with something better?). In the original discussion, two needs were identified:

  1. Is the library compiled with CUDA support?
  2. Is CUDA support enabled "right now" (for some definition of "right now")?

Both of those can be narrowed down further, I'm sure. But the point is that the current MPIX_Query_cuda_support() really isn't very useful, because all it does is return a configure-time constant that indicates whether Open MPI -- not even UCX -- is compiled with CUDA support.

bosilca commented Jul 24, 2020

You have to keep in mind that when we added that function, all CUDA support went through the OB1 PML, so the answer was always correct. Now we will need to interrogate the selected PML to check whether the capability is supported.
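One possible shape for that interrogation, purely as a sketch: each PML exposes an optional callback and the query forwards to whichever PML was selected. `struct demo_pml`, `selected_pml_supports_cuda()`, and the two demo callbacks are hypothetical and do not correspond to the actual PML component interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a PML that can report its own CUDA capability. */
struct demo_pml {
    const char *name;
    bool (*cuda_supported)(void);  /* NULL means the PML cannot answer */
};

/* Demo callbacks standing in for real per-PML checks. */
bool demo_cuda_yes(void) { return true; }
bool demo_cuda_no(void)  { return false; }

/* Ask the selected PML; a PML without a callback is treated as "no". */
bool selected_pml_supports_cuda(const struct demo_pml *selected)
{
    assert(selected != NULL);
    if (selected->cuda_supported == NULL) {
        return false;
    }
    return selected->cuda_supported();
}
```

Defaulting to "no" when a PML provides no callback keeps the answer conservative: a false negative costs a staging copy through host memory, while a false positive could crash the application.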
