Removing requirement of CUDA availability at build time #7790
Comments
Hi, the feature you requested already exists in libfabric; you just need to configure libfabric with …
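A minimal sketch of what that configure step might look like, assuming the dlopen-based CUDA option is spelled `--enable-cuda-dlopen` (worth confirming against `./configure --help` in your libfabric tree):

```sh
# Build against the CUDA headers, but defer loading of the CUDA
# libraries to runtime via dlopen instead of linking them in.
./configure --prefix=$HOME/opt/libfabric \
            --with-cuda=/usr/local/cuda \
            --enable-cuda-dlopen
make -j$(nproc) && make install
```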
Please clarify what OFI providers you are using. Note that the PSM3 provider, when built with CUDA support, allows jobs both with and without CUDA via the PSM3_CUDA environment variable, and uses dlopen to load the CUDA libraries when requested.
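As a hedged illustration (the exact variable names and accepted values should be checked against the PSM3 provider documentation), selecting CUDA per job could look like:

```sh
# Same CUDA-enabled PSM3 build used for both jobs:
# PSM3_CUDA=1 makes PSM3 dlopen the CUDA libraries at runtime,
# PSM3_CUDA=0 runs without touching CUDA at all.
FI_PROVIDER=psm3 PSM3_CUDA=1 mpirun -n 4 ./gpu_app
FI_PROVIDER=psm3 PSM3_CUDA=0 mpirun -n 4 ./cpu_app
```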
Ok, that's great, I will try it out. From what I can see in the repos, EFA also supports CUDA; do you know if it will work in a similar way?
Yes. libfabric core defines a set of interfaces for this; the referenced code is at line 46 in a28c5f8.
That's excellent, thanks a lot!
So the CUDA runtime is still required at compile time because its header files are needed.
I'm doubtful you can have runtime CUDA without using the headers at build time. For example, you need the CUDA headers to define the data structures and constants applicable to the CUDA functions that will be called. We certainly would not want to replicate portions of the CUDA headers into the various OFI providers.
Well, you can have it, but it would require replicating the necessary parts of those header files (e.g., https://github.com/gcc-mirror/gcc/blob/master/include/cuda/cuda.h).
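To make the GCC/OpenMPI-style trick concrete, here is a hedged sketch (the header directory and its contents are assumptions, not anything that exists in libfabric today): a minimal, self-maintained header provides only the types and constants the build needs, and the real driver library is still resolved at runtime via dlopen.

```sh
# Hypothetical: build without any CUDA toolkit installed by pointing the
# preprocessor at a stub cuda.h kept in the source tree; libcuda.so.1 is
# still located at runtime via dlopen on machines that have it.
./configure CPPFLAGS="-I$PWD/contrib/internal-cuda-headers"
```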
Just wanted to clarify our intent. Our plan would not be to force this on the providers, but only within libfabric itself. We'd build … We use environment modules, so the base …
A few important considerations: …
I'm with Todd on not copying in cuda*.h definitions for the core to use. Also, CUDA is just one of the many accelerators libfabric supports (see the list here). Whatever solution we come up with, let's make sure it is uniformly handled across all supported HMEM types.
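As a quick, non-authoritative way to see what that covers on a given build, libfabric's `fi_info` utility can dump the runtime knobs it exposes for the various HMEM back-ends:

```sh
# Show the HMEM-related environment variables this libfabric build knows
# about (CUDA, ROCm, Level Zero, ... each come with their own settings).
fi_info -e | grep -i hmem
```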
One possible issue with this approach is that the EFA provider you build with CUDA support might not have shm support, so it will work but not efficiently. This is because the EFA provider uses the shm provider to implement shared-memory support, and the shm provider might not be available when you build EFA as a standalone library.
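A simple sanity check for such a standalone build (a sketch; provider names are as libfabric normally reports them) is to confirm that both efa and shm actually show up:

```sh
# If "shm" is missing from the list, EFA falls back to a slower
# intra-node path even though the CUDA-enabled build otherwise works.
fi_info -l       # one-line summary of available providers
fi_info -p efa   # detailed attributes of the EFA provider
```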
Anything left to be done here? Note that if a DL provider is built with CUDA support, it doesn't require the libfabric core to be built in the same way --- the HMEM support code (e.g. hmem.c, hmem_cuda.c) is compiled into the DL provider directly. I will close this issue if no objection is heard by the end of this week. |
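For anyone finding this later, a hedged sketch of that split (the `--enable-<provider>=dl` form builds a provider as a dynamically loadable plugin; exact option names are worth confirming with `./configure --help`):

```sh
# EFA built as a DL provider with CUDA support compiled into the plugin;
# the libfabric core it plugs into does not need to be CUDA-enabled.
./configure --prefix=$HOME/opt/libfabric-cuda-providers \
            --enable-efa=dl \
            --with-cuda=/usr/local/cuda
```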
Thanks for addressing this!
Is your feature request related to a problem? Please describe.
In EasyBuild we've been able to split out CUDA support in UCX into a separate (additional) plugin installation, and have tweaked our OpenMPI installation to essentially defer CUDA detection to runtime (by using an internal CUDA header for the configuration step, similar to what GCC does for their GPU offloading; see easybuilders/easybuild-easyconfigs#15528).
Describe the solution you'd like
How hard would it be to do something similar with `libfabric`? Can we patch it to configure CUDA support with such an internal header file? Is there any cost to always configuring CUDA (there is in OpenMPI, but we have minimised this with an additional patch)? Can we leverage `FI_PROVIDER_PATH` to shadow the original providers of the main installation with CUDA-enabled alternates? Are there any obvious issues you see with this approach?
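A hedged sketch of the shadowing idea (paths are placeholders; `FI_PROVIDER_PATH` is the standard libfabric variable for locating externally installed/DL providers):

```sh
# Point libfabric at a directory containing CUDA-enabled DL providers;
# these are picked up alongside the base, non-CUDA installation.
export FI_PROVIDER_PATH=/apps/libfabric-CUDA/1.15.1/lib/libfabric
mpirun -n 4 ./gpu_app
```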
Additional context
We don't want to maintain CUDA-enabled and non-CUDA-enabled MPI toolchains; what we want is that when CUDA is required as a dependency we automatically load `UCX-CUDA` and `libfabric-CUDA` as well, which triggers all available CUDA support in the MPI layer.
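For illustration, with an environment-modules/Lmod setup this could look like the following (module names mirror the naming above; versions are placeholders):

```sh
# Base MPI stack stays CUDA-agnostic; the *-CUDA modules add the
# CUDA-enabled UCX/libfabric pieces only when CUDA is actually needed.
module load OpenMPI/4.1.4
module load UCX-CUDA/1.12.1 libfabric-CUDA/1.15.1   # hypothetical versions
```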