CUDA throws OOM error when initializing API on multiple devices #398
Comments
You're not using the other device exclusively; the Knet code in your stack trace initializes a context for all devices. This is bad behavior on Knet's side, because it allocates a context and initializes memory for every device at package load time, which (as observed here) easily consumes 100-200 MB of device memory per device. cc @denizyuret
Specifically, @denizyuret, don't do this: https://github.com/denizyuret/Knet.jl/blob/2754cd6d4f61d5e509810461d8dcafb5efa2f79e/src/Knet.jl#L18-L19 Doing a device! during __init__ is bad, because of the memory it uses (i.e. this issue). Resetting the device there is even worse: it kills any existing allocations, making it impossible to import Knet later in the program, or to use it in collaboration with other CUDA software. The available memory represented by mem is also not going to be correct because, well, you reset the device afterwards.
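To make the anti-pattern concrete, here is a minimal sketch (hypothetical code, not Knet's actual implementation; it assumes the 2020-era CUDA.jl API, where device_reset! was still available):

```julia
using CUDA

# Anti-pattern: eagerly touching a device when the package is loaded.
function __init__()
    device!(0)                     # creates a context, costing ~100-200 MB of device memory
    mem = CUDA.available_memory()  # measured on the context we are about to destroy...
    device_reset!()                # ...and resetting kills any existing allocations,
                                   # breaking other CUDA users in the same process
end

# Better: defer device selection and memory queries to the first call that
# actually needs a GPU, so that merely loading the package touches no device.
```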
@maleadt What is the workaround? (This is important on clusters where an 8-GPU machine may have 6 of its GPUs busy and you want to find the 7th one.) To avoid the problems you mention I had reverted to using NVML calls. NVML is not supported on all OSes, but it is on Linux, which is what most clusters use. Maybe I can go back to NVML?
There is no workaround: don't initialize the device if you don't need it. NVML isn't available on all platforms, and even where it is, many users don't have it installed; since it ships with the driver, we can't provision it using artifacts. So I'd try NVML (CUDA.jl has wrappers for it), but if it isn't available, just trust what CUDA selected as the primary device, which is already heuristic-driven.
e.g. our tests do the following (a more complicated heuristic, because it also looks at compute capability): see lines 119 to 159 at commit 892e649.
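A minimal sketch of such a selection heuristic, assuming CUDA.jl's NVML wrappers (this is not the test suite's exact code: it skips the compute-capability check, and robust code would match NVML and CUDA devices by UUID rather than by index):

```julia
using CUDA
using CUDA: NVML

# Pick the device with the most free memory, querying through NVML so that
# no CUDA context gets created on any device during the search.
function pick_freest_device()
    CUDA.has_nvml() || return CUDA.device()  # no NVML: trust CUDA's primary device
    n = length(CUDA.devices())               # counting devices creates no contexts
    free = [NVML.memory_info(NVML.Device(i)).free for i in 0:n-1]
    best = argmax(free) - 1                  # NVML device indices are 0-based
    CUDA.device!(best)                       # the only call that initializes a context
    return CUDA.device()
end
```

In the cluster scenario described above, something like this finds the idle GPU without paying the 100-200 MB initialization cost on every busy one.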
But I'd leave out the compute capability check.
Thanks for looking at it, sorry to have misplaced it.
No problem. Come to think of it, there is a workaround (which isn't usable by Knet though): define CUDA_VISIBLE_DEVICES.
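A quick sketch of that workaround (the device index here is just an example; the variable must be set before anything initializes the CUDA driver):

```julia
# Hide all but one physical GPU from the process; the surviving device is
# renumbered, so inside Julia it appears as device 0.
ENV["CUDA_VISIBLE_DEVICES"] = "1"   # e.g. expose only the K80's second core

using CUDA
CUDA.device()   # CuDevice(0), backed by the physical device selected above
```

This works when launching a script, but a package like Knet can't rely on it, since the variable has to be set before CUDA is initialized, which is under the user's control rather than the package's.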
I have a K80 with two cores, and usually what I'll end up doing is training different models simultaneously and independently, one on each core. Since I deal with 3D images, I usually run pretty close to the memory limit on each core. Recently, while I've been running a model on one core, I'll get an OOM error when trying to select the second device using CUDA.device!(1), despite the fact that the second device has plenty of memory. I'll eventually get my model training, but it usually takes restarting the code a few times. Stacktrace below:
Usually I'll have CUDA.device!(0) full, and CUDA.device!(1) ready to go.
My system is: