
Error: Invalid bitcode signature when loading CUDA.jl after precompilation #293

Closed
jakebolewski opened this issue Jul 14, 2020 · 19 comments · Fixed by JuliaGPU/GPUCompiler.jl#83
Labels
bug Something isn't working

Comments

@jakebolewski (Member) commented Jul 14, 2020

Describe the bug

 Error: Invalid bitcode signature

Occasionally our group sees the above error raised from the CUDA runtime when running CI jobs that first precompile the package (stored on the cluster's shared file system) and then run it. The error is infrequent but somewhat reproducible:

CliMA/ClimateMachine.jl#1345

The above error is raised in LLVM's bitcode reader, which @vchuravy points out is exercised here:

function load_libdevice(cap)
    path = libdevice()
    get!(libcache, path) do
        open(path) do io
            parse(LLVM.Module, read(path), JuliaContext())
        end
    end
end

function link_libdevice!(mod::LLVM.Module, cap::VersionNumber, undefined_fns)
    # only link if there's undefined __nv_ functions
    if !any(fn->startswith(fn, "__nv_"), undefined_fns)
        return
    end

    lib::LLVM.Module = load_libdevice(cap)

    # override libdevice's triple and datalayout to avoid warnings
    triple!(lib, triple(mod))
    datalayout!(lib, datalayout(mod))

    GPUCompiler.link_library!(mod, lib)

    ModulePassManager() do pm
        push!(metadata(mod), "nvvm-reflect-ftz",
              MDNode([ConstantInt(Int32(1), JuliaContext())]))
        run!(pm, mod)
    end
end

Manifest
Julia v1.4.1
CUDA v1.0.2
GPUCompiler v0.4.0
LLVM v1.7.0

@jakebolewski added the bug label Jul 14, 2020
bors bot added a commit that referenced this issue Jul 14, 2020
294: do not open the file twice when reading the libdevice bitcode r=maleadt a=jakebolewski

Might help with #293, given the shared parallel file system (GPFS) we are using on the cluster.

Co-authored-by: jakebolewski <jakebolewski@gmail.com>
@maleadt (Member) commented Jul 14, 2020

The runtime normally doesn't get compiled during precompilation. Or do you do anything special (like explicitly building the runtime)?

@jakebolewski (Member, Author) commented Jul 14, 2020

We don't do anything special, just:

julia --color=no --project -e 'using Pkg; Pkg.instantiate(); Pkg.build(;verbose=true)'
julia --color=no --project -e 'using Pkg; Pkg.precompile()'

for our project dependencies. We are also using the system's CUDA binaries with JULIA_CUDA_USE_BINARYBUILDER=false (CUDA 10.0).

@jakebolewski (Member, Author)

This error happens somewhat infrequently, maybe 1 out of 20-50 jobs, so I suspect it has more to do with the cluster than with CUDA.jl, but I've opened an issue just to make it easier to track.

@vchuravy (Member)

Assumed fixed with #294, let's reopen if we see it again.

@jakebolewski (Member, Author)

This now occurs a lot less frequently, but unfortunately it just showed up again:
https://gist.github.com/climabot/2f31ad168ac2e4aca3aaeb59ce775f24#file-out_10699693

@jakebolewski jakebolewski reopened this Jul 20, 2020
@jpsamaroo (Member)

I believe I'm seeing similar behavior in AMDGPU.jl, where we have approximately the same code pattern as CUDA.jl for bitcode loading. The exact error I get is slightly different from the one reported here, but it is similarly stochastic in nature. Usually I won't see it for a long while (approx. 1 out of 100 runs), but once it occurs the first time, I keep hitting it at a frequency of about 30-40% in that specific Julia project folder.

@jakebolewski (Member, Author) commented Sep 21, 2020

Haven't seen it in a while, but another error was noticed in the wild with the CUDA 10.0 runtime:
https://buildkite.com/clima/climatemachine-ci/builds/361#25bd08e1-0b9e-4369-be8c-f84985c872d0/373-380

Julia 1.4.2
CUDA 1.3.3
GPUCompiler 1.4.1
LLVM 1.7.0

@jakebolewski (Member, Author) commented Sep 21, 2020

And a new related error today about the bitcode file being too small:

error: file too small to contain bitcode header

https://buildkite.com/clima/climatemachine-ci/builds/366#df448b5a-73e2-4599-b766-b62a2e7dabae/373-375

@maleadt (Member) commented Sep 22, 2020

Maybe JuliaGPU/GPUCompiler.jl#82 will fix this; is there an easy way to try? I could always tag a release that includes the 'fix'.

@jakebolewski (Member, Author) commented Sep 22, 2020

@maleadt I think this is a good change; however, I don't know if it fully addresses the cluster / parallel file system use case.

The tmp and depot paths might be on different mount points, in which case mv does not guarantee an atomic update. I think for the guarantee to hold, the temporary file must in general be created in the same directory as the final runtime file (which would ensure they are on the same filesystem).
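
For illustration, a minimal sketch of that approach (the atomic_write helper below is hypothetical, not CUDA.jl's actual code): stage the temporary file in the destination directory so the final rename never crosses a filesystem boundary.

# Hypothetical sketch, not the actual CUDA.jl implementation: create the temporary
# file next to the destination so the final rename stays on a single filesystem.
function atomic_write(path::AbstractString, data::Vector{UInt8})
    tmppath, io = mktemp(dirname(path))   # temp file on the same mount as `path`
    try
        write(io, data)
    finally
        close(io)
    end
    mv(tmppath, path; force=true)         # same-filesystem rename, atomic on POSIX
    return path
end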

@jakebolewski (Member, Author) commented Sep 22, 2020

E.g. if the depot is on a shared NFS / GPFS mount point and tmp is mapped to a local ext4 filesystem, there is no guarantee, I think, that mv will be atomic when crossing the ext4 <-> parallel file system boundary.

@maleadt (Member) commented Sep 22, 2020

I see. #293 should be a better fix then, if you don't mind a warning.

EDIT: all this assuming the error comes from that code path. Hard to tell without a backtrace.

@simonbyrne (Contributor)

Would it be possible to have a manual init() function that we could use to trigger this?

@vchuravy (Member)

You should be able to call CUDA.load_runtime(v"7.0") (see function load_runtime(cap::VersionNumber) in CUDA.jl), where v"7.0" is for sm_70.

@jakebolewski (Member, Author) commented Sep 25, 2020

Would an overload like:

load/init_runtime() = foreach(d -> load_runtime(capability(d)), devices())

be a reasonable addition? It seems like this would be a common way to pre-initialize the runtime for an HPC batch job with a shared depot.
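
As a rough sketch of how such a helper might be used from a batch job's setup step (init_runtime is a hypothetical name; load_runtime, capability, and devices are assumed to be the CUDA.jl functions referenced above):

# Hypothetical warm-up script for an HPC batch job with a shared depot:
# compile and cache the device runtime for every visible GPU up front,
# rather than having many concurrent jobs populate the cache at once.
using CUDA

init_runtime() = foreach(dev -> CUDA.load_runtime(CUDA.capability(dev)), CUDA.devices())

init_runtime()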

@maleadt (Member) commented Sep 26, 2020

Sure, yes. I do something similar here: https://github.com/maleadt/julia-ngc/blob/8e6117fc5d4b2672c1a8d90a78fc79623b5b7692/Dockerfile#L31-L37

That said, I'd also be curious to see if recent changes have fixed this issue for you.

@simonbyrne (Contributor)

@jakebolewski can you open a PR for that?

@maleadt (Member) commented Oct 2, 2020

#465

@jakebolewski (Member, Author)

Thanks! Using llvm_cap_support seems cleaner and the method name is better IMO.
