Removing requirement of CUDA availability at build time #7790
Comments
Hi, the feature you requested already exists in libfabric; you just need to configure libfabric with …
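A minimal sketch of what that configure step might look like, assuming the dlopen-based CUDA option is spelled `--enable-cuda-dlopen` (worth confirming against `./configure --help` in your libfabric tree):

```sh
# Build against the CUDA headers, but defer loading of the CUDA
# libraries to runtime via dlopen instead of linking them in.
./configure --prefix=$HOME/opt/libfabric \
            --with-cuda=/usr/local/cuda \
            --enable-cuda-dlopen
make -j$(nproc) && make install
```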
Please clarify what OFI providers you are using. Note that the PSM3 provider, when built with CUDA support, allows jobs both with and without CUDA via the PSM3_CUDA environment variable, and uses dlopen to load the CUDA libraries when requested.
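As a hedged illustration (the exact variable names and accepted values should be checked against the PSM3 provider documentation), selecting CUDA per job could look like:

```sh
# Same CUDA-enabled PSM3 build used for both jobs:
# PSM3_CUDA=1 makes PSM3 dlopen the CUDA libraries at runtime,
# PSM3_CUDA=0 runs without touching CUDA at all.
FI_PROVIDER=psm3 PSM3_CUDA=1 mpirun -n 4 ./gpu_app
FI_PROVIDER=psm3 PSM3_CUDA=0 mpirun -n 4 ./cpu_app
```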
Ok, that's great, I will try it out. From what I can see in the repos, EFA also supports CUDA; do you know if it will work in a similar way?
Yes. libfabric core defines a set of interfaces for this; the referenced code is at line 46 in a28c5f8.
That's excellent, thanks a lot!
So the CUDA runtime is still required at compile time because its header files are needed.
I'm doubtful you can have runtime CUDA without using the headers at build time. For example, you need the CUDA headers to define the data structures and constants applicable to the CUDA functions that will be called. We certainly would not want to replicate portions of the CUDA headers into the various OFI providers.
Well, you can have it, but it would require replicating the necessary parts of those header files (e.g., https://github.com/gcc-mirror/gcc/blob/master/include/cuda/cuda.h).
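To make the GCC/OpenMPI-style trick concrete, here is a hedged sketch (the header directory and its contents are assumptions, not anything that exists in libfabric today): a minimal, self-maintained header provides only the types and constants the build needs, and the real driver library is still resolved at runtime via dlopen.

```sh
# Hypothetical: build without any CUDA toolkit installed by pointing the
# preprocessor at a stub cuda.h kept in the source tree; libcuda.so.1 is
# still located at runtime via dlopen on machines that have it.
./configure CPPFLAGS="-I$PWD/contrib/internal-cuda-headers"
```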
Just wanted to clarify our intent. Our plan would not be to force this on the providers, but only within libfabric itself. We'd build … We use environment modules, so the base …
A few important considerations: …
I'm with Todd on not copying in cuda*.h definitions for the core to use. Also, CUDA is just one of the many accelerators libfabric supports (see the list here). Whatever solution we come up with, let's make sure it is uniformly handled across all supported HMEM types.
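As a quick, non-authoritative way to see what that covers on a given build, libfabric's `fi_info` utility can dump the runtime knobs it exposes for the various HMEM back-ends:

```sh
# Show the HMEM-related environment variables this libfabric build knows
# about (CUDA, ROCm, Level Zero, ... each come with their own settings).
fi_info -e | grep -i hmem
```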
One possible issue with this approach is that the EFA provider you build with CUDA support might not have shm support, so it will work but not efficiently. This is because the EFA provider uses the shm provider to implement shared-memory support, and the shm provider might not be available when you build EFA as a standalone library.
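A simple sanity check for such a standalone build (a sketch; provider names are as libfabric normally reports them) is to confirm that both efa and shm actually show up:

```sh
# If "shm" is missing from the list, EFA falls back to a slower
# intra-node path even though the CUDA-enabled build otherwise works.
fi_info -l       # one-line summary of available providers
fi_info -p efa   # detailed attributes of the EFA provider
```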
Anything left to be done here? Note that if a DL provider is built with CUDA support, it doesn't require the libfabric core to be built in the same way --- the HMEM support code (e.g. hmem.c, hmem_cuda.c) is compiled into the DL provider directly. I will close this issue if no objection is heard by the end of this week. |
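For anyone finding this later, a hedged sketch of that split (the `--enable-<provider>=dl` form builds a provider as a dynamically loadable plugin; exact option names are worth confirming with `./configure --help`):

```sh
# EFA built as a DL provider with CUDA support compiled into the plugin;
# the libfabric core it plugs into does not need to be CUDA-enabled.
./configure --prefix=$HOME/opt/libfabric-cuda-providers \
            --enable-efa=dl \
            --with-cuda=/usr/local/cuda
```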
Thanks for addressing this!
Is your feature request related to a problem? Please describe.
In EasyBuild we've been able to split out CUDA support in UCX into a separate (additional) plugin installation, and have tweaked our OpenMPI installation to essentially defer CUDA detection to runtime (by using an internal CUDA header for the configuration step, similar to what GCC does for their GPU offloading; see easybuilders/easybuild-easyconfigs#15528).
Describe the solution you'd like
How hard would it be to do something similar with `libfabric`? Can we patch it to configure CUDA support with such an internal header file? Is there any cost to always configuring CUDA (there is in OpenMPI, but we have minimised this with an additional patch)? Can we leverage `FI_PROVIDER_PATH` to shadow the original providers of the main installation with CUDA-enabled alternates? Are there any obvious issues you see with this approach?
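A hedged sketch of the shadowing idea (paths are placeholders; `FI_PROVIDER_PATH` is the standard libfabric variable for locating externally installed/DL providers):

```sh
# Point libfabric at a directory containing CUDA-enabled DL providers;
# these are picked up alongside the base, non-CUDA installation.
export FI_PROVIDER_PATH=/apps/libfabric-CUDA/1.15.1/lib/libfabric
mpirun -n 4 ./gpu_app
```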
Additional context
We don't want to maintain CUDA-enabled and non-CUDA-enabled MPI toolchains; what we want is that when CUDA is required as a dependency we automatically load `UCX-CUDA` and `libfabric-CUDA` as well, which triggers all available CUDA support in the MPI layer.
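For illustration, with an environment-modules/Lmod setup this could look like the following (module names mirror the naming above; versions are placeholders):

```sh
# Base MPI stack stays CUDA-agnostic; the *-CUDA modules add the
# CUDA-enabled UCX/libfabric pieces only when CUDA is actually needed.
module load OpenMPI/4.1.4
module load UCX-CUDA/1.12.1 libfabric-CUDA/1.15.1   # hypothetical versions
```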