Error: Invalid bitcode signature when loading CUDA.jl after precompilation #293
Comments
The runtime normally doesn't get compiled during precompilation. Or do you do anything special (like explicitly building the runtime)? |
We don't do anything special, just:
for our project dependencies, we are also using the system's CUDA binaries with |
This error is happening somewhat infrequently, maybe 1 out of 20-50 jobs. So I suspect it has to do more with the cluster and less with CUDA.jl, but I've opened an issue just to make it easier to track. |
Assumed fixed with #294, let's reopen if we see it again. |
This now occurs a lot less frequently, but unfortunately it just showed up again: |
I believe I'm seeing similar behavior in AMDGPU.jl, where we have approximately the same code pattern as CUDA.jl for bitcode loading. The exact error I get is slightly different than the one reported here, but is similarly stochastic in nature. Usually I won't get them for a long while (approx. 1 out of 100 runs), but once it occurs the first time, I will keep getting them at a frequency of about 30-40% in that specific Julia project folder. |
Haven't seen it in a while, but another error was noticed in the wild with the CUDA 10.0 runtime and Julia 1.4.2: |
And a new related error today about the bitcode file being too small:
|
Maybe JuliaGPU/GPUCompiler.jl#82 will fix this; is there an easy way to try? I could always tag a release that includes the 'fix'. |
@maleadt I think this is a good change, however I don't know if it fully addresses the cluster / parallel file systems use case.
|
For example, if the depot is on a shared NFS / GPFS mount point and tmp is mapped to a local ext4 filesystem, I don't think there is any guarantee that mv will be atomic when crossing the boundary between ext4 and the parallel file system. |
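To illustrate the concern, here is a minimal sketch of the usual workaround: create the temporary file in the same directory as the destination, so the final rename never crosses a filesystem boundary. `write_atomically` is a hypothetical helper, not something from CUDA.jl or GPUCompiler.jl, and `Base.Filesystem.rename` is an internal function in older Julia releases, so its availability is an assumption. Whether a same-directory rename is atomic on a given NFS / GPFS setup is also an assumption that would need checking on the cluster.

```julia
# Hypothetical helper, not part of CUDA.jl or GPUCompiler.jl.
function write_atomically(path::AbstractString, data::Vector{UInt8})
    dir = dirname(abspath(path))
    # Create the temporary file next to the destination so both paths are on
    # the same filesystem and the rename does not turn into copy + delete.
    tmp, io = mktemp(dir)
    try
        write(io, data)
        close(io)
        # Assumption: Base.Filesystem.rename maps to a plain rename here.
        Base.Filesystem.rename(tmp, path)
    catch
        isopen(io) && close(io)
        rm(tmp; force=true)  # do not leave a partial file behind
        rethrow()
    end
    return path
end
```

Readers then see either the old file or the complete new one, never a partially written file, provided the underlying filesystem honors rename semantics.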
I see. #293 should be a better fix then, if you don't mind a warning. EDIT: all this assumes the errors come from that code path. Hard to tell without a backtrace. |
Would it be possible to have a manual |
You should be able to call Line 10 in 9ee41f6
v"7.0" is for sm_70
|
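As a small illustration of the version-to-architecture correspondence mentioned above, a compute capability given as a `VersionNumber` can be turned into the matching `sm_XY` name like this. `sm_name` is a hypothetical helper for illustration only, not a CUDA.jl function.

```julia
# Hypothetical helper: map a compute capability to its "sm_XY" name.
sm_name(cap::VersionNumber) = "sm_$(cap.major)$(cap.minor)"

sm_name(v"7.0")  # "sm_70"
```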
Would an overload like:
be a reasonable addition? It seems like this would be a common way to pre-initialize the runtime for an HPC batch job with a shared depot. |
Sure, yes. I do something similar here: https://github.com/maleadt/julia-ngc/blob/8e6117fc5d4b2672c1a8d90a78fc79623b5b7692/Dockerfile#L31-L37 That said, I'd also be curious to see if recent changes have fixed this issue for you. |
@jakebolewski can you open a PR for that? |
Thanks! Using |
Describe the bug
Occasionally our group is seeing the above error raised from the CUDA runtime when running CI that first precompiles the package (stored on the cluster's shared file system) before running. The error is infrequent but somewhat reproducible:
CliMA/ClimateMachine.jl#1345
The above error is raised in LLVM's bitcode reader, which @vchuravy points out is exercised here:
CUDA.jl/src/device/runtime.jl
Lines 65 to 93 in 463a412
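When chasing this kind of failure, it can help to check whether a cached bitcode file on the shared file system is truncated or has a wrong header before LLVM tries to parse it. The sketch below is a diagnostic aid under stated assumptions, not the check that CUDA.jl or LLVM performs: `bitcode_status` is a hypothetical helper, and the path you pass it depends on where your setup caches the runtime. Raw LLVM bitcode begins with the bytes `'B' 'C' 0xC0 0xDE`, and the optional wrapper header begins with `0x0B17C0DE` stored little-endian.

```julia
# Magic bytes at the start of an LLVM bitcode file.
const RAW_MAGIC     = UInt8[0x42, 0x43, 0xc0, 0xde]  # 'B' 'C' 0xC0 0xDE
const WRAPPER_MAGIC = UInt8[0xde, 0xc0, 0x17, 0x0b]  # 0x0B17C0DE, little-endian

# Hypothetical diagnostic helper: classify a file by size and header only.
function bitcode_status(path::AbstractString)
    isfile(path) || return :missing
    filesize(path) < 4 && return :too_small
    header = open(io -> read(io, 4), path)
    return (header == RAW_MAGIC || header == WRAPPER_MAGIC) ? :ok : :bad_signature
end
```

A result of `:too_small` or `:bad_signature` on a file that another job is still writing would be consistent with the race suspected in this thread, although it does not prove it.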
Manifest
Julia v1.4.1
CUDA v1.0.2
GPUCompiler v0.4.0
LLVM v1.7.0