Calls to `has_cudnn` running on wrong CuDevice? #978
Comments
The error message:
Here is the PR I am working on: FluxML/Metalhead.jl#70. An alternate MWE is to check out the repo, switch to the PR branch, and run the REPL steps shown below.
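The REPL commands from the report, reformatted as a block (assuming the REPL is started in the repo root):

```julia
# Activate the repo's `training` environment, then run the script:
julia> ] activate training

julia> include("training/run.jl")
```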
I can't reproduce this. Could it be that the OOM happened in the precompilation process (when importing Flux), since that also triggers initialization? Regardless, Flux really shouldn't be calling this from its module initializer. Also, this bug won't happen on master anymore, because (1) after #992 the log configuration won't trigger initialization, and (2) with #987 the loggers aren't installed unconditionally.
Yes, actually, upon further investigation I discovered last night that I had one process on which that was the case.
Describe the bug
I am developing a package that depends on Flux, and I need to run some benchmarks on the GPU. There are other users on the system (cyclops) taking up all the memory on several devices. Before doing anything else, I call `CUDA.device!` to select a free device, and I call `CUDA.memory_status()` to ensure that the device I selected is not in use (it isn't). When `using` my package, it triggers this line in Flux, which OOMs. I suspect `has_cudnn` is running some code on a device other than the one I selected.

To reproduce
You need two GPUs with one GPU maxed out on memory consumption. The following should trigger the bug:
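A minimal sketch of a reproduction, assuming device 0 is the memory-saturated GPU and device 1 is the free one (the device numbers and the `using Flux` trigger are assumptions, not taken from the report):

```julia
using CUDA

CUDA.device!(1)        # select the free GPU (device 0 is assumed to be full)
CUDA.memory_status()   # confirm the selected device has plenty of free memory

using Flux             # loading Flux checks for cuDNN; the OOM suggests that
                       # check touches a device other than the selected one
```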
Manifest.toml
Expected behavior
No OOMs, since the device I selected is not being used by anyone else.
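For context, a hedged sketch (not part of the original report) of one way to confirm which devices are busy before selecting one; note that iterating like this initializes a small context on each device:

```julia
using CUDA

# Print free/total memory for every visible device so an idle one can be chosen.
for dev in CUDA.devices()
    CUDA.device!(dev)                 # make `dev` the current device
    free, total = CUDA.Mem.info()     # free and total bytes on that device
    println(dev, ": ", Base.format_bytes(free), " free of ", Base.format_bytes(total))
end
```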
Version info
Details on Julia:
Details on CUDA: