
Error: Invalid bitcode signature when loading CUDA.jl after precompilation #293

Closed
jakebolewski opened this issue Jul 14, 2020 · 19 comments · Fixed by JuliaGPU/GPUCompiler.jl#83
Labels
bug Something isn't working

Comments

@jakebolewski (Member) commented Jul 14, 2020

Describe the bug

 Error: Invalid bitcode signature

Occasionally our group sees the above error raised from the CUDA runtime when running CI jobs that first precompile the package (stored on the cluster's shared file system) and then run it. The error is infrequent but somewhat reproducible:

CliMA/ClimateMachine.jl#1345

The above error is raised in LLVM's bitcode reader, which @vchuravy points out is exercised here:

function load_libdevice(cap)
    path = libdevice()
    get!(libcache, path) do
        open(path) do io
            parse(LLVM.Module, read(path), JuliaContext())
        end
    end
end

function link_libdevice!(mod::LLVM.Module, cap::VersionNumber, undefined_fns)
    # only link if there's undefined __nv_ functions
    if !any(fn->startswith(fn, "__nv_"), undefined_fns)
        return
    end

    lib::LLVM.Module = load_libdevice(cap)

    # override libdevice's triple and datalayout to avoid warnings
    triple!(lib, triple(mod))
    datalayout!(lib, datalayout(mod))

    GPUCompiler.link_library!(mod, lib)

    ModulePassManager() do pm
        push!(metadata(mod), "nvvm-reflect-ftz",
              MDNode([ConstantInt(Int32(1), JuliaContext())]))
        run!(pm, mod)
    end
end

Manifest
Julia v1.4.1
CUDA v1.0.2
GPUCompiler v0.4.0
LLVM v1.7.0

@jakebolewski added the bug label Jul 14, 2020
bors bot added a commit that referenced this issue Jul 14, 2020
294: do not open the file twice when reading the libdevice bitcode r=maleadt a=jakebolewski

Might help with #293, given the shared parallel file system (GPFS) we are using on the cluster.

Co-authored-by: jakebolewski <jakebolewski@gmail.com>
@maleadt (Member) commented Jul 14, 2020

The runtime normally doesn't get compiled during precompilation. Or do you do anything special (like explicitly building the runtime)?

@jakebolewski (Member, Author) commented Jul 14, 2020

We don't do anything special, just:

julia --color=no --project -e 'using Pkg; Pkg.instantiate(); Pkg.build(;verbose=true)'
julia --color=no --project -e 'using Pkg; Pkg.precompile()'

for our project dependencies. We are also using the system's CUDA binaries with JULIA_CUDA_USE_BINARYBUILDER=false (CUDA 10.0).

@jakebolewski (Member, Author)

This error happens somewhat infrequently, maybe 1 out of 20-50 jobs, so I suspect it has more to do with the cluster than with CUDA.jl, but I've opened an issue just to make it easier to track.

@vchuravy (Member)

Assumed fixed with #294, let's reopen if we see it again.

@jakebolewski (Member, Author)

This now occurs a lot less frequently, but unfortunately it just showed up again:
https://gist.github.com/climabot/2f31ad168ac2e4aca3aaeb59ce775f24#file-out_10699693

@jakebolewski jakebolewski reopened this Jul 20, 2020
@jpsamaroo (Member)

I believe I'm seeing similar behavior in AMDGPU.jl, where we have approximately the same code pattern as CUDA.jl for bitcode loading. The exact error I get is slightly different from the one reported here, but it is similarly stochastic in nature. Usually I won't see it for a long while (approx. 1 out of 100 runs), but once it occurs the first time, I keep hitting it at a frequency of about 30-40% in that specific Julia project folder.

@jakebolewski (Member, Author) commented Sep 21, 2020

Haven't seen it in a while, but another error was noticed in the wild with the CUDA 10.0 runtime:
https://buildkite.com/clima/climatemachine-ci/builds/361#25bd08e1-0b9e-4369-be8c-f84985c872d0/373-380

Julia 1.4.2
CUDA 1.3.3
GPUCompiler 1.4.1
LLVM 1.7.0

@jakebolewski (Member, Author) commented Sep 21, 2020

And a new related error today about the bitcode file being too small:

error: file too small to contain bitcode header

https://buildkite.com/clima/climatemachine-ci/builds/366#df448b5a-73e2-4599-b766-b62a2e7dabae/373-375

@maleadt (Member) commented Sep 22, 2020

Maybe JuliaGPU/GPUCompiler.jl#82 will fix this; is there an easy way to try? I could always tag a release that includes the 'fix'.

@jakebolewski (Member, Author) commented Sep 22, 2020

@maleadt I think this is a good change; however, I don't know if it fully addresses the cluster / parallel file system use case.

The tmp and depot paths might be on different mount points, in which case mv does not guarantee an atomic update. I think for the guarantee to hold, the temporary file must in general be created in the same directory as the final runtime file (which would ensure they are on the same filesystem).
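
For illustration, a minimal sketch of that approach (the atomic_write helper below is hypothetical, not CUDA.jl's actual code): stage the temporary file in the destination directory so the final rename never crosses a filesystem boundary.

# Hypothetical sketch, not the actual CUDA.jl implementation: create the temporary
# file next to the destination so the final rename stays on a single filesystem.
function atomic_write(path::AbstractString, data::Vector{UInt8})
    tmppath, io = mktemp(dirname(path))   # temp file on the same mount as `path`
    try
        write(io, data)
    finally
        close(io)
    end
    mv(tmppath, path; force=true)         # same-filesystem rename, atomic on POSIX
    return path
end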

@jakebolewski (Member, Author) commented Sep 22, 2020

E.g. if the depot is on a shared NFS / GPFS mount point and tmp is mapped to a local ext4 filesystem, there is no guarantee, I think, that mv will be atomic when crossing the ext4 <-> parallel file system boundary.

@maleadt (Member) commented Sep 22, 2020

I see. #293 should be a better fix then, if you don't mind a warning.

EDIT: all this assuming the error comes from that code path. Hard to tell without a backtrace.

@simonbyrne (Contributor)

Would it be possible to have a manual init() function that we could use to trigger this?

@vchuravy (Member)

You should be able to call CUDA.load_runtime(v"7.0") (see function load_runtime(cap::VersionNumber) in CUDA.jl), where v"7.0" is for sm_70.

@jakebolewski (Member, Author) commented Sep 25, 2020

Would an overload like:

load/init_runtime() = foreach(d -> load_runtime(capability(d)), devices())

be a reasonable addition? It seems like this would be a common way to pre-initialize the runtime for an HPC batch job with a shared depot.
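
As a rough sketch of how such a helper might be used from a batch job's setup step (init_runtime is a hypothetical name; load_runtime, capability, and devices are assumed to be the CUDA.jl functions referenced above):

# Hypothetical warm-up script for an HPC batch job with a shared depot:
# compile and cache the device runtime for every visible GPU up front,
# rather than having many concurrent jobs populate the cache at once.
using CUDA

init_runtime() = foreach(dev -> CUDA.load_runtime(CUDA.capability(dev)), CUDA.devices())

init_runtime()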

@maleadt (Member) commented Sep 26, 2020

Sure, yes. I do something similar here: https://github.com/maleadt/julia-ngc/blob/8e6117fc5d4b2672c1a8d90a78fc79623b5b7692/Dockerfile#L31-L37

That said, I'd also be curious to see if recent changes have fixed this issue for you.

@simonbyrne (Contributor)

@jakebolewski can you open a PR for that?

@maleadt (Member) commented Oct 2, 2020

#465

@jakebolewski (Member, Author)

Thanks! Using llvm_cap_support seems cleaner and the method name is better IMO.
