[CUDA_Runtime] only add include_dependency if version not specified #7523

Merged: 4 commits merged into JuliaPackaging:master on Oct 18, 2023
Conversation

simonbyrne (Contributor)

We were hitting an issue where precompiling on a node with a GPU driver installed would then retrigger precompilation when used on a non-GPU node.

This should avoid the problem when the CUDA runtime version is concretely specified.
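
For reference, pinning the runtime to a concrete version goes through a CUDA_Runtime_jll preference; a minimal sketch using Preferences.jl (the preference key is assumed to be `version` here):

```julia
using Preferences, CUDA_Runtime_jll

# Pin CUDA_Runtime_jll to a concrete toolkit version so that runtime
# selection no longer depends on whichever driver is visible at
# precompile time (preference key assumed to be "version").
set_preferences!(CUDA_Runtime_jll, "version" => "12.2"; force=true)
```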

cc @maleadt @vchuravy

@maleadt (Contributor) commented Oct 13, 2023

Thanks. The approach is not entirely correct though, as there's a reason I put this logic before loading CUDA_Driver_jll: cuda_driver may point to the forwards-compatible driver that got picked up by that JLL.

On the other hand, maybe we should have CUDA_Driver_jll invalidate itself separately if the actual system driver got updated, which should then automatically invalidate CUDA_Runtime_jll. For that to work, we should validate that invalidating CUDA_Driver_jll results in CUDA_Runtime_jll (and thus CUDA.jl) getting precompiled again.
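
A rough sketch of what that self-invalidation could look like inside CUDA_Driver_jll (paths and structure are illustrative, not the JLL's actual code):

```julia
using Libdl

# Illustrative only: register the system driver library as a precompile
# dependency, so that replacing or updating libcuda on the system
# invalidates CUDA_Driver_jll and, transitively, CUDA_Runtime_jll and CUDA.jl.
const system_libcuda = Libdl.find_library(["libcuda.so.1", "libcuda"])
if !isempty(system_libcuda)
    include_dependency(Libdl.dlpath(system_libcuda))
end
```

Note that `include_dependency` only takes effect while a package is being precompiled, so this would have to live in the JLL's top-level module code.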

@simonbyrne (Contributor, Author)

What determines whether the system driver or the JLL-provided driver is used?

@simonbyrne (Contributor, Author) commented Oct 13, 2023

I have spent a few hours thinking about this, and I don't think there is a coherent way we can make cache invalidation work based on system files, while also hoping to support shared file systems.

There are two issues with the current approach:

  1. If I precompile on a system without the CUDA driver (e.g. the login node), the system driver won't be installed at all, and so won't get include_dependency-ed. Thus when I load it on a compute node (which has the driver), it won't get invalidated no matter what the driver version is (possibly ending up in a broken state).

  2. The CUDA toolkit version won't be part of the precompile slug, and every time I switch CUDA drivers (e.g. if nodes are configured slightly differently), I will retrigger precompilation, overwriting the file.

The only way to fix 2 is to explicitly set the version in the preferences: i.e. if you want to (or need to) use an older toolkit than the current one, you are required to set the CUDA_Runtime_jll version preference. This would then avoid the need to include_dependency any system files.

@maleadt (Contributor) commented Oct 16, 2023

> What determines whether the system driver or the JLL-provided driver is used?

Platform and hardware compatibility. There's a host driver / forward-compatible driver compatibility chart in the NVIDIA docs, and forward compatibility generally only supports enterprise/datacenter hardware.

> 1. If I precompile on a system without the CUDA driver (e.g. the login node), the system driver won't be installed at all, and so won't get include_dependency-ed. Thus when I load it on a compute node (which has the driver), it won't get invalidated no matter what the driver version is (possibly ending up in a broken state).
>
> 2. The CUDA toolkit version won't be part of the precompile slug, and every time I switch CUDA drivers (e.g. if nodes are configured slightly differently), I will retrigger precompilation, overwriting the file.
>
> The only way to fix 2 is to explicitly set the version in the preferences: i.e. if you want to (or need to) use an older toolkit than the current one, you are required to set the CUDA_Runtime_jll version preference. This would then avoid the need to include_dependency any system files.

This, however, only solves the runtime selection problem, where the selected runtime depends on the driver's version. In the case where no driver was available at runtime, we essentially require either use of the local toolkit, or that the user provide a runtime version to use. In both those cases, we don't need to invalidate the precompilation image when the version changes, because we won't be selecting a different toolkit anyway, right?

The other problem is that we also use the driver's version number for determining which CUDA driver APIs to use. I was hoping we could perform such decisions at top level, but it looks like that won't be possible if we want to support precompiling on a system without the CUDA driver...

@simonbyrne (Contributor, Author)

> In both those cases, we don't need to invalidate the precompilation image when the version changes, because we won't be selecting a different toolkit anyway, right?

What I meant is that you could have a case where you precompile on a node without a GPU, and so it assumes that it will use the latest CUDA toolkit (12.2): as no driver is available, the driver won't be part of the include_dependency. If I then load it on a compute node with an incompatible driver, it won't trigger recompilation.

> The other problem is that we also use the driver's version number for determining which CUDA driver APIs to use. I was hoping we could perform such decisions at top level, but it looks like that won't be possible if we want to support precompiling on a system without the CUDA driver...

In that case, I think it would be reasonable to require the user to specify some sort of minimum CUDA driver version: perhaps we could also make this a preference (this would also help in the case where I'm using a cluster with a mixture of CUDA driver versions).
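
Something like the following could express that; the `min_driver_version` preference name here is hypothetical, purely to illustrate the idea:

```julia
using Preferences, CUDA_Driver_jll

# Hypothetical preference: precompile against a declared driver lower bound
# instead of whatever driver happens to be visible on the build node.
min_driver = VersionNumber(load_preference(CUDA_Driver_jll, "min_driver_version", "12.0"))

# Driver-API decisions could then be made against this bound at top level,
# e.g. only enabling newer driver entry points when the bound allows it.
supports_newer_api = min_driver >= v"12.0"
```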

@simonbyrne (Contributor, Author)

@maleadt In the meantime, how about this change? It doesn't include the driver library as an include_dependency if the CUDA_Runtime version preference is set.
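
Roughly, the gating amounts to something like this sketch (not the literal diff; the `find_library` lookup stands in for the driver path the JLL actually uses):

```julia
using Preferences, Libdl, CUDA_Runtime_jll

# Sketch: only register the driver library as a precompile dependency when
# no concrete runtime version preference has been set.
version_pref = load_preference(CUDA_Runtime_jll, "version", nothing)
cuda_driver = Libdl.find_library(["libcuda.so.1", "libcuda"])  # stand-in
if version_pref === nothing && !isempty(cuda_driver)
    include_dependency(Libdl.dlpath(cuda_driver))
end
```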

@maleadt merged commit 786f33e into JuliaPackaging:master on Oct 18, 2023
@simonbyrne deleted the patch-17 branch on October 18, 2023 17:35
@simonbyrne (Contributor, Author)

thank you!
