
Removing requirement of CUDA availability at build time #7790

Closed
ocaisa opened this issue May 26, 2022 · 14 comments
@ocaisa

ocaisa commented May 26, 2022

Is your feature request related to a problem? Please describe.

In EasyBuild we've been able to split out CUDA support in UCX into a separate (additional) plugin installation, and have tweaked our OpenMPI installation to essentially defer CUDA detection to runtime (by using an internal CUDA header for the configuration step, similar to what GCC does for their GPU offloading; see easybuilders/easybuild-easyconfigs#15528).

Describe the solution you'd like
How hard would it be to do something similar with libfabric? Can we patch it to configure CUDA support with such an internal header file? Is there any cost to always configuring CUDA (there is in OpenMPI, but we have minimised this with an additional patch)? Can we leverage FI_PROVIDER_PATH to shadow the original providers of the main installation with CUDA-enabled alternates?

Are there any obvious issues you see with this approach?

Additional context
We don't want to maintain CUDA-enabled and non-CUDA enabled MPI toolchains, what we want is that when CUDA is required as a dependency we automatically load UCX-CUDA and libfabric-CUDA as well which triggers all available CUDA support in the MPI layer.

@wzamazon
Contributor

Hi, the feature you requested already exists in libfabric; you just need to configure libfabric with --enable-cuda-dlopen (along with --with-cuda).

--enable-cuda-dlopen makes libfabric use dlopen to open the CUDA library at runtime and load symbols from it.
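As a concrete sketch of that configuration (prefix and CUDA paths are placeholders; adjust them to your installation):

```shell
# Hypothetical paths. --with-cuda points configure at the CUDA headers
# needed at build time; --enable-cuda-dlopen defers linking, so libcuda
# and libcudart are resolved via dlopen at runtime instead of ld-time.
./configure --prefix=/opt/libfabric \
            --with-cuda=/usr/local/cuda \
            --enable-cuda-dlopen
make -j && make install
```

The resulting library has no hard link-time dependency on the CUDA shared libraries, so it can load on nodes where CUDA is absent.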

@ToddRimmer
Contributor

Please clarify which OFI providers you are using. Note that the PSM3 provider, when built with CUDA support, allows jobs with and without CUDA via the PSM3_CUDA environment variable, and uses dlopen to load the CUDA libraries when requested.
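For illustration, toggling that per job might look like the following (application names are placeholders; PSM3_CUDA and FI_PROVIDER are the real variable names):

```shell
# Hypothetical launches: one PSM3 build, CUDA toggled per job.
FI_PROVIDER=psm3 PSM3_CUDA=1 mpirun -n 2 ./gpu_app   # CUDA path enabled
FI_PROVIDER=psm3 PSM3_CUDA=0 mpirun -n 2 ./cpu_app   # CUDA path disabled
```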

@ocaisa
Author

ocaisa commented May 26, 2022

OK, that's great, I will try it out. From what I can see in the repos, EFA also supports CUDA; do you know if it will work in a similar way?

@wzamazon
Contributor

EFA also supports CUDA, do you know if it will work in a similar way?

Yes.

The libfabric core defines a set of cuda_ops that other providers use for CUDA.

The code is:

struct cuda_ops {

@ocaisa
Author

ocaisa commented May 26, 2022

That's excellent, thanks a lot!

@ocaisa ocaisa closed this as completed May 26, 2022
@ocaisa
Author

ocaisa commented May 26, 2022

So the CUDA toolkit is still required at compile time, since the header files cuda.h and cuda_runtime.h are needed; we are trying to work from a scenario where these are not available. I will see if I can come up with a patch.

@ocaisa ocaisa reopened this May 26, 2022
@ToddRimmer
Contributor

I'm doubtful you can have runtime CUDA without using the headers at build time. For example, you need the CUDA headers to define the data structures and constants applicable to the CUDA functions which will be called. We certainly would not want to replicate portions of the CUDA headers into various OFI providers.

@ocaisa
Author

ocaisa commented May 26, 2022

Well, you can have it, but it would require replicating the necessary parts of those header files (e.g., https://github.com/gcc-mirror/gcc/blob/master/include/cuda/cuda.h).

@ocaisa
Author

ocaisa commented May 27, 2022

Just wanted to clarify our intent.

Our plan would not be to force this on the providers, but only within libfabric itself. We'd build efa and psm3 without CUDA support and with --enable-<provider>=dl, that way we can then have another build of the providers with CUDA support and use FI_PROVIDER_PATH to have those picked up.

We use environment modules, so the base libfabric module would act as a non-CUDA build (but libfabric itself would be "CUDA-ready"), and then we would have a libfabric-CUDA module that can be loaded on top (which loads a dependent CUDA module and sets FI_PROVIDER_PATH) to enable CUDA-aware capabilities.
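A rough sketch of that module layout (module names and paths are hypothetical):

```shell
# Base module: libfabric built with --enable-cuda-dlopen and the
# providers built as loadable plugins (--enable-efa=dl, --enable-psm3=dl),
# but with no CUDA available at build time.
module load libfabric/1.15.1

# CUDA module stacked on top: pulls in a CUDA dependency and points
# libfabric at a second set of provider .so files built with CUDA.
module load libfabric-CUDA/1.15.1
# which effectively does:
export FI_PROVIDER_PATH=/opt/libfabric-cuda/1.15.1/lib/libfabric
```

Because FI_PROVIDER_PATH is consulted at runtime, the CUDA-enabled provider builds shadow the base ones without rebuilding or reinstalling the core library.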

@ToddRimmer
Contributor

A few important considerations:

  1. Header files such as cuda_runtime.h have a proprietary license. We cannot copy any code or concepts from such headers without introducing legal issues in the OFI code, which has an open license now. And of course such copies introduce maintenance challenges, as we can expect NVIDIA to evolve these headers over time.
  2. Note that some providers, such as psm3, do not use or depend on the CUDA code in libfabric. These providers intentionally need direct CUDA awareness so features like GPUDirect can be implemented efficiently. The CUDA builds for psm3 are separated for a few reasons:
    2a. Headers may not be available (and we can't legally copy such code).
    2b. There are additional runtime overheads, even when CUDA is not enabled. We suspect those overheads are insignificant, but it will take effort to confirm those suspicions.

@rajachan
Member

Our plan would not be to force this on the providers, but only within libfabric itself. We'd build efa and psm3 without CUDA support and with --enable-<provider>=dl, that way we can then have another build of the providers with CUDA support and use FI_PROVIDER_PATH to have those picked up.

I'm with Todd on not copying in cuda*.h definitions for the core to use. Also, CUDA is just one of the many accelerators libfabric supports (see the list here). Whatever solution we come up with, let's make sure it is uniformly handled across all supported HMEM types.

@wzamazon
Contributor

Our plan would not be to force this on the providers, but only within libfabric itself. We'd build efa and psm3 without CUDA support and with --enable-<provider>=dl, that way we can then have another build of the providers with CUDA support and use FI_PROVIDER_PATH to have those picked up.

One possible issue with this approach is that the EFA provider you build with CUDA support might not have shm support, so it will work but not efficiently.

This is because the EFA provider uses the shm provider to implement shared-memory support, and the shm provider might not be available when you build EFA as a standalone library.

@ocaisa ocaisa changed the title Runtime detection of CUDA Removing requirement of CUDA availability at build time Jun 15, 2022
@j-xiong
Contributor

j-xiong commented Mar 25, 2024

Anything left to be done here? Note that if a DL provider is built with CUDA support, it doesn't require the libfabric core to be built in the same way: the HMEM support code (e.g. hmem.c, hmem_cuda.c) is compiled into the DL provider directly.
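Concretely, a second provider-only build carrying its own HMEM/CUDA code might look something like this (prefix, CUDA path, and provider choice are placeholders):

```shell
# Hypothetical second build: only the EFA provider, as a loadable .so,
# with CUDA enabled; the base libfabric install stays CUDA-free.
./configure --prefix=/opt/libfabric-cuda \
            --enable-efa=dl \
            --with-cuda=/usr/local/cuda \
            --enable-cuda-dlopen
make -j && make install
# Point the CUDA-free core at the CUDA-enabled provider directory:
export FI_PROVIDER_PATH=/opt/libfabric-cuda/lib/libfabric
```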

I will close this issue if no objection is heard by the end of this week.

@ocaisa
Author

ocaisa commented Mar 27, 2024

Thanks for addressing this!

@ocaisa ocaisa closed this as completed Mar 27, 2024