[CUDA_Runtime] only add include_dependency if version not specified #7523

Merged: 4 commits merged into JuliaPackaging:master on Oct 18, 2023
Conversation

simonbyrne (Contributor)

We were hitting an issue where precompiling on a node with a GPU driver installed would then retrigger precompilation when used on a non-GPU node.

This should avoid the problem when the CUDA runtime version is concretely specified.
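
For reference, pinning the runtime to a concrete version goes through a CUDA_Runtime_jll preference; a minimal sketch using Preferences.jl (the preference key is assumed to be `version` here):

```julia
using Preferences, CUDA_Runtime_jll

# Pin CUDA_Runtime_jll to a concrete toolkit version so that runtime
# selection no longer depends on whichever driver is visible at
# precompile time (preference key assumed to be "version").
set_preferences!(CUDA_Runtime_jll, "version" => "12.2"; force=true)
```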

cc @maleadt @vchuravy

@maleadt (Contributor) commented Oct 13, 2023

Thanks. The approach is not entirely correct though, as there's a reason I put this logic before loading CUDA_Driver_jll: cuda_driver may point to the forwards-compatible driver that got picked up by that JLL.

On the other hand, maybe we should have CUDA_Driver_jll invalidate itself separately if the actual system driver got updated, which should then automatically invalidate CUDA_Runtime_jll. For that to work, we should validate that invalidating CUDA_Driver_jll results in CUDA_Runtime_jll (and thus CUDA.jl) getting precompiled again.
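
A rough sketch of what that self-invalidation could look like inside CUDA_Driver_jll (paths and structure are illustrative, not the JLL's actual code):

```julia
using Libdl

# Illustrative only: register the system driver library as a precompile
# dependency, so that replacing or updating libcuda on the system
# invalidates CUDA_Driver_jll and, transitively, CUDA_Runtime_jll and CUDA.jl.
const system_libcuda = Libdl.find_library(["libcuda.so.1", "libcuda"])
if !isempty(system_libcuda)
    include_dependency(Libdl.dlpath(system_libcuda))
end
```

Note that `include_dependency` only takes effect while a package is being precompiled, so this would have to live in the JLL's top-level module code.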

@simonbyrne (Contributor, Author)

What determines whether the system driver or the JLL-provided driver is used?

@simonbyrne (Contributor, Author) commented Oct 13, 2023

I have spent a few hours thinking about this, and I don't think there is a coherent way we can make cache invalidation work based on system files, while also hoping to support shared file systems.

There are two issues with the current approach:

  1. If I precompile on a system without the CUDA driver (e.g. the login node), the system driver won't be installed at all, and so won't get include_dependency-ed. Thus when I load it on a compute node (which has the driver), it won't get invalidated no matter what the driver version is (possibly ending up in a broken state).

  2. The CUDA toolkit version won't be part of the precompile slug, and every time I switch CUDA drivers (e.g. if nodes are configured slightly differently), I will retrigger precompilation, overwriting the file.

The only way to fix 2 is to explicitly set the version in the preferences: i.e. if you want to (or need to) use an older toolkit than the current one, you are required to set the CUDA_Runtime_jll version preference. This would then avoid the need to include_dependency any system files.

@maleadt (Contributor) commented Oct 16, 2023

> What determines whether the system driver or the JLL-provided driver is used?

Platform and hardware compatibility. There's a host driver / forward-compatible driver compatibility chart in the NVIDIA docs, and forward compatibility generally only supports enterprise/datacenter hardware.

> 1. If I precompile on a system without the CUDA driver (e.g. the login node), the system driver won't be installed at all, and so won't get include_dependency-ed. Thus when I load it on a compute node (which has the driver), it won't get invalidated no matter what the driver version is (possibly ending up in a broken state).
>
> 2. The CUDA toolkit version won't be part of the precompile slug, and every time I switch CUDA drivers (e.g. if nodes are configured slightly differently), I will retrigger precompilation, overwriting the file.
>
> The only way to fix 2 is to explicitly set the version in the preferences: i.e. if you want to (or need to) use an older toolkit than the current one, you are required to set the CUDA_Runtime_jll version preference. This would then avoid the need to include_dependency any system files.

This, however, only solves the runtime selection problem, where the selected runtime depends on the driver's version. In the case where no driver was available at runtime, we essentially require either use of the local toolkit, or that the user provide a runtime version to use. In both those cases, we don't need to invalidate the precompilation image when the version changes, because we won't be selecting a different toolkit anyway, right?

The other problem is that we also use the driver's version number for determining which CUDA driver APIs to use. I was hoping we could perform such decisions at top level, but it looks like that won't be possible if we want to support precompiling on a system without the CUDA driver...

@simonbyrne (Contributor, Author)

> In both those cases, we don't need to invalidate the precompilation image when the version changes, because we won't be selecting a different toolkit anyway, right?

What I meant is that you could have a case where you precompile on a node without a GPU, and so it assumes that it will use the latest CUDA toolkit (12.2): as no driver is available, the driver won't be part of the include_dependency. If I then load it on a compute node with an incompatible driver, it won't trigger recompilation.

> The other problem is that we also use the driver's version number for determining which CUDA driver APIs to use. I was hoping we could perform such decisions at top level, but it looks like that won't be possible if we want to support precompiling on a system without the CUDA driver...

In that case, I think it would be reasonable to require the user to specify some sort of minimum CUDA driver version: perhaps we could also make this a preference (this would also help in the case where I'm using a cluster with a mixture of CUDA driver versions).
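
Something like the following could express that; the `min_driver_version` preference name here is hypothetical, purely to illustrate the idea:

```julia
using Preferences, CUDA_Driver_jll

# Hypothetical preference: precompile against a declared driver lower bound
# instead of whatever driver happens to be visible on the build node.
min_driver = VersionNumber(load_preference(CUDA_Driver_jll, "min_driver_version", "12.0"))

# Driver-API decisions could then be made against this bound at top level,
# e.g. only enabling newer driver entry points when the bound allows it.
supports_newer_api = min_driver >= v"12.0"
```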

@simonbyrne (Contributor, Author)

@maleadt In the meantime, how about this change? It doesn't include the driver library as an include_dependency if the CUDA_Runtime version preference is set.
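
Roughly, the gating amounts to something like this sketch (not the literal diff; the `find_library` lookup stands in for the driver path the JLL actually uses):

```julia
using Preferences, Libdl, CUDA_Runtime_jll

# Sketch: only register the driver library as a precompile dependency when
# no concrete runtime version preference has been set.
version_pref = load_preference(CUDA_Runtime_jll, "version", nothing)
cuda_driver = Libdl.find_library(["libcuda.so.1", "libcuda"])  # stand-in
if version_pref === nothing && !isempty(cuda_driver)
    include_dependency(Libdl.dlpath(cuda_driver))
end
```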

@maleadt merged commit 786f33e into JuliaPackaging:master on Oct 18, 2023
@simonbyrne deleted the patch-17 branch on October 18, 2023 17:35
@simonbyrne (Contributor, Author)

thank you!
