CUDA throws OOM error when initializing API on multiple devices #398

Closed
andevellicus opened this issue Aug 27, 2020 · 7 comments

andevellicus commented Aug 27, 2020

I have a K80, which has two GPUs, and usually I end up training different models simultaneously and independently, one on each GPU. Since I deal with 3D images, I usually run pretty close to the memory limit on each one. Recently, while a model has been running on one GPU, I'll get an OOM error when trying to select the second device with CUDA.device!(1), even though that device has plenty of free memory. I eventually get my model training, but it usually takes restarting the code a few times. Stacktrace below:

ERROR: LoadError: LoadError: InitError: CUDA error (code 2, CUDA_ERROR_OUT_OF_MEMORY)
Stacktrace:
 [1] throw_api_error(::CUDA.cudaError_enum) at /home/andevellicus/.julia/packages/CUDA/dZvbp/lib/cudadrv/error.jl:103
 [2] macro expansion at /home/andevellicus/.julia/packages/CUDA/dZvbp/lib/cudadrv/error.jl:110 [inlined]
 [3] cuDevicePrimaryCtxRetain(::Base.RefValue{Ptr{Nothing}}, ::CUDA.CuDevice) at /home/andevellicus/.julia/packages/CUDA/dZvbp/lib/utils/call.jl:93
 [4] CuContext at /home/andevellicus/.julia/packages/CUDA/dZvbp/lib/cudadrv/context/primary.jl:31 [inlined]
 [5] context at /home/andevellicus/.julia/packages/CUDA/dZvbp/src/state.jl:249 [inlined]
 [6] device!(::CUDA.CuDevice, ::Nothing) at /home/andevellicus/.julia/packages/CUDA/dZvbp/src/state.jl:286
 [7] device! at /home/andevellicus/.julia/packages/CUDA/dZvbp/src/state.jl:265 [inlined]
 [8] mem at /home/andevellicus/.julia/packages/Knet/Mfd6L/src/Knet.jl:18 [inlined]
 [9] #1 at ./none:0 [inlined]
 [10] iterate at ./generator.jl:47 [inlined]
 [11] _all(::Base.var"#256#258", ::Base.Generator{CUDA.DeviceSet,Knet.var"#1#3"{Knet.var"#mem#2"}}, ::Colon) at ./reduce.jl:827
 [12] all at ./reduce.jl:823 [inlined]
 [13] Dict(::Base.Generator{CUDA.DeviceSet,Knet.var"#1#3"{Knet.var"#mem#2"}}) at ./dict.jl:130
 [14] __init__() at /home/andevellicus/.julia/packages/Knet/Mfd6L/src/Knet.jl:19
 [15] _include_from_serialized(::String, ::Array{Any,1}) at ./loading.jl:697
 [16] _require_search_from_serialized(::Base.PkgId, ::String) at ./loading.jl:782
 [17] _require(::Base.PkgId) at ./loading.jl:1007
 [18] require(::Base.PkgId) at ./loading.jl:928
 [19] require(::Module, ::Symbol) at ./loading.jl:923
 [20] include(::String) at ./client.jl:457
 [21] top-level scope at /home/andevellicus/Programming/ML/julia/knet-sdh-seg/train.jl:1
 [22] include(::Function, ::Module, ::String) at ./Base.jl:380
 [23] include(::Module, ::String) at ./Base.jl:368
 [24] exec_options(::Base.JLOptions) at ./client.jl:296
 [25] _start() at ./client.jl:506
during initialization of module Knet
in expression starting at /home/andevellicus/Programming/ML/julia/knet-sdh-seg/SDHSeg.jl:4
in expression starting at /home/andevellicus/Programming/ML/julia/knet-sdh-seg/train.jl:1
caused by [exception 1]
CUDA error (code 2, CUDA_ERROR_OUT_OF_MEMORY)
Stacktrace:
 [1] throw_api_error(::CUDA.cudaError_enum) at /home/andevellicus/.julia/packages/CUDA/dZvbp/lib/cudadrv/error.jl:103
 [2] macro expansion at /home/andevellicus/.julia/packages/CUDA/dZvbp/lib/cudadrv/error.jl:110 [inlined]
 [3] cuDevicePrimaryCtxRetain(::Base.RefValue{Ptr{Nothing}}, ::CUDA.CuDevice) at /home/andevellicus/.julia/packages/CUDA/dZvbp/lib/utils/call.jl:93
 [4] CuContext at /home/andevellicus/.julia/packages/CUDA/dZvbp/lib/cudadrv/context/primary.jl:31 [inlined]
 [5] context at /home/andevellicus/.julia/packages/CUDA/dZvbp/src/state.jl:249 [inlined]
 [6] device!(::CUDA.CuDevice, ::Nothing) at /home/andevellicus/.julia/packages/CUDA/dZvbp/src/state.jl:286
 [7] device! at /home/andevellicus/.julia/packages/CUDA/dZvbp/src/state.jl:265 [inlined]
 [8] mem at /home/andevellicus/.julia/packages/Knet/Mfd6L/src/Knet.jl:18 [inlined]
 [9] #1 at ./none:0 [inlined]
 [10] iterate at ./generator.jl:47 [inlined]
 [11] Dict{CUDA.CuDevice,Int64}(::Base.Generator{CUDA.DeviceSet,Knet.var"#1#3"{Knet.var"#mem#2"}}) at ./dict.jl:102
 [12] dict_with_eltype at ./abstractdict.jl:531 [inlined]
 [13] dict_with_eltype at ./abstractdict.jl:538 [inlined]
 [14] Dict(::Base.Generator{CUDA.DeviceSet,Knet.var"#1#3"{Knet.var"#mem#2"}}) at ./dict.jl:128
 [15] __init__() at /home/andevellicus/.julia/packages/Knet/Mfd6L/src/Knet.jl:19
 [16] _include_from_serialized(::String, ::Array{Any,1}) at ./loading.jl:697
 [17] _require_search_from_serialized(::Base.PkgId, ::String) at ./loading.jl:782
 [18] _require(::Base.PkgId) at ./loading.jl:1007
 [19] require(::Base.PkgId) at ./loading.jl:928
 [20] require(::Module, ::Symbol) at ./loading.jl:923
 [21] include(::String) at ./client.jl:457
 [22] top-level scope at /home/andevellicus/Programming/ML/julia/knet-sdh-seg/train.jl:1
 [23] include(::Function, ::Module, ::String) at ./Base.jl:380
 [24] include(::Module, ::String) at ./Base.jl:368
 [25] exec_options(::Base.JLOptions) at ./client.jl:296
 [26] _start() at ./client.jl:506

Usually device 0 (CUDA.device!(0)) is the one that is full, and device 1 (CUDA.device!(1)) is ready to go.
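For reference, the usage pattern is roughly the following minimal sketch (the model and training code themselves are omitted):

```julia
using CUDA, Knet   # loading Knet is where the error above is thrown

CUDA.device!(1)    # select the second (idle) K80 GPU
# ... build the model and train it on that device ...
```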

My system is:

  • Julia 1.5
  • Knet 1.4
  • CUDA.jl 1.3.3
andevellicus added the bug label Aug 27, 2020
andevellicus changed the title from "CUDA throws OOM error when initializing API on multi-core devices" to "CUDA throws OOM error when initializing API on multiple devices" Aug 28, 2020
maleadt (Member) commented Aug 28, 2020

You're not using the other device exclusively; the Knet code in your stack trace initializes a context on every device. This is bad behavior on Knet's side: it allocates a context and initializes memory on every device at package load time, which (as observed here) easily consumes 100-200 MB of device memory per device. cc @denizyuret

maleadt closed this as completed Aug 28, 2020
maleadt (Member) commented Aug 28, 2020

Specifically, @denizyuret, don't do this: https://github.com/denizyuret/Knet.jl/blob/2754cd6d4f61d5e509810461d8dcafb5efa2f79e/src/Knet.jl#L18-L19

Doing a device! during __init__ is bad because of the memory it uses (i.e. this issue). Resetting the device there is even worse: it kills any existing allocations, making it impossible to import Knet later in the program, or to use it alongside other CUDA software. The available memory recorded in mem is also not going to be correct because, well, you reset the device afterwards.
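For illustration, here is a minimal sketch of the kind of deferral meant here (the names devmem and device_memory are hypothetical; this is not a proposed patch): keep __init__ free of CUDA calls and only touch a device when the user actually asks about it.

```julia
using CUDA

# Hypothetical lazy cache: device => free memory at the time it was first queried.
const devmem = Dict{CUDA.CuDevice,Int}()

function __init__()
    # nothing CUDA-related here: no contexts are created, no devices are reset
end

"Free memory on `dev`, creating a context on that device only when first asked."
function device_memory(dev::CUDA.CuDevice)
    get!(devmem, dev) do
        CUDA.device!(dev)           # the first (and only) point where `dev` is touched
        CUDA.available_memory()
    end
end
```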

maleadt removed the bug label Aug 28, 2020
denizyuret (Contributor) commented Aug 28, 2020 via email

maleadt (Member) commented Aug 28, 2020

> @maleadt What is the workaround?

There is no workaround. Don't initialize the device if you don't need it.

NVML is not available on all platforms, and many users don't have it installed; since it comes with the driver, we can't provision it using artifacts either. So I'd try NVML (CUDA.jl has wrappers for it), but if it isn't available, just trust what CUDA selected as the primary device, which is already heuristic-driven.
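For example, free memory can be inspected without creating any contexts along these lines (a hedged sketch using the NVML wrappers; it only does anything where NVML is available):

```julia
using CUDA

if CUDA.has_nvml()
    for dev in CUDA.NVML.devices()
        free_mb = CUDA.NVML.memory_info(dev).free ÷ 2^20
        println(CUDA.NVML.name(dev), ": ", free_mb, " MiB free")
    end
else
    # no NVML: don't probe the devices at all, just use whatever CUDA picked as primary
end
```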

maleadt (Member) commented Aug 28, 2020

e.g. our tests do the following (which is a more complicated heuristic because it also looks at compute capability):

CUDA.jl/test/runtests.jl

Lines 119 to 159 in 892e649

```julia
# find suitable devices
@info "System information:\n" * sprint(io->CUDA.versioninfo(io))
candidates, driver_version, cuda_driver_version = if has_nvml()
    [(index=i,
      uuid=NVML.uuid(dev),
      name=NVML.name(dev),
      cap=NVML.compute_capability(dev),
      mem=NVML.memory_info(dev).free)
     for (i,dev) in enumerate(NVML.devices())],
    NVML.driver_version(),
    NVML.cuda_driver_version()
else
    # using CUDA to query this information requires initializing a context,
    # which might fail if the device is heavily loaded.
    [(device!(dev);
      (index=i,
       uuid=uuid(dev),
       name=CUDA.name(dev),
       cap=capability(dev),
       mem=CUDA.available_memory()))
     for (i,dev) in enumerate(devices())],
    "(unknown)",
    CUDA.version()
end

## only consider devices that are fully supported by our CUDA toolkit, or tools can fail.
## NOTE: we don't reuse target_support which is also bounded by LLVM support,
##       and is used to pick a codegen target regardless of the actual device.
cuda_support = CUDA.cuda_compat()
filter!(x->x.cap in cuda_support.cap, candidates)

## only consider recent devices if we want testing to be thorough
thorough = parse(Bool, get(ENV, "CI_THOROUGH", "false"))
if thorough
    filter!(x->x.cap >= v"7.0", candidates)
end
isempty(candidates) && error("Could not find any suitable device for this configuration")

## order by available memory, but also by capability if testing needs to be thorough
sort!(candidates, by=x->x.mem)

## apply
picks = reverse(candidates[end-gpus+1:end]) # best GPU first
ENV["CUDA_VISIBLE_DEVICES"] = join(map(pick->"GPU-$(pick.uuid)", picks), ",")
@info "Testing using $(length(picks)) device(s): " * join(map(pick->"$(pick.index). $(pick.name) (UUID $(pick.uuid))", picks), ", ")
```

But I'd leave out the !has_nvml() fallback to avoid this issue here.
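Distilled down for the two-GPU case in this issue, and heavily hedged (this mirrors the test-suite logic above rather than any Knet API, and freest_device is just an illustrative name), NVML-based device selection could look roughly like this:

```julia
using CUDA

# Pick the CUDA device with the most free memory, without initializing any contexts.
# If NVML is unavailable, fall back to whatever CUDA selects as the primary device.
function freest_device()
    CUDA.has_nvml() || return CUDA.device()          # fallback: trust the primary device
    nvml_devs = collect(CUDA.NVML.devices())
    free = [CUDA.NVML.memory_info(dev).free for dev in nvml_devs]
    best_uuid = CUDA.NVML.uuid(nvml_devs[argmax(free)])
    for dev in CUDA.devices()
        CUDA.uuid(dev) == best_uuid && return dev    # match NVML device to CuDevice by UUID
    end
    return CUDA.device()
end

CUDA.device!(freest_device())
```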

andevellicus (Author) commented
Thanks for looking at it; sorry to have misplaced it.

maleadt (Member) commented Aug 28, 2020

No problem. Come to think of it, there is a workaround (though it isn't usable by Knet): define CUDA_VISIBLE_DEVICES=... in your environment before launching Julia. That way, Knet won't be able to initialize the other devices and trigger the OOM.
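Concretely, for the setup in this issue (device 0 busy, device 1 free) that would look something like the sketch below; the in-Julia variant is an untested assumption and only works if it runs before anything initializes CUDA:

```julia
# From the shell, before launching Julia:
#   CUDA_VISIBLE_DEVICES=1 julia train.jl
#
# Or at the very top of the script, before `using Knet` / `using CUDA`:
ENV["CUDA_VISIBLE_DEVICES"] = "1"   # only GPU 1 is visible, so Knet can't touch GPU 0
using Knet
```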
