Spurious RNN failure with CUDNN #923
I'm getting this error consistently, although I found something odd:
This is showing up in CI too. Any idea why it might be happening? It seems like it should be possible to bisect the CuArrays change that led to this.
There were a bunch of changes to the memory allocator, and I also limited the amount of memory a process can use (which increases memory pressure, and can thus cause a memory-reuse problem to surface).
For me, it happens in "cudnnBackwardData", not the forward pass, so it may not only be a workspace issue. I'm on 7eb6a0. In my case, I copied https://github.com/FluxML/Flux.jl/blob/master/test/cuda/curnn.jl to test_curnn.jl and limited memory to 40MB as follows, then tried to debug that code.
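(For reference, a minimal sketch of how one can impose such a limit, assuming CuArrays honors the `CUARRAYS_MEMORY_LIMIT` environment variable, specified in bytes and read when the package initializes:)

```julia
# Assumption: CUARRAYS_MEMORY_LIMIT caps the memory pool; it must be set
# before CuArrays is loaded, since the limit is read at package init time.
ENV["CUARRAYS_MEMORY_LIMIT"] = string(40 * 1024 * 1024)  # 40MB

using CuArrays, Flux
include("test_curnn.jl")  # the copied test file mentioned above
```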
The error points at the first test case's code. Could it be a Zygote issue?
Can you try running with GC disabled and see if you can still reproduce it? We never actually added all the
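(For reference, disabling the collector for a session uses the standard `GC.enable`:)

```julia
GC.enable(false)  # turn off garbage collection, so no early frees can happen
# ... run the failing RNN test here ...
GC.enable(true)   # re-enable collection afterwards
```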
We don't need any. I'll have a closer look at this, but I couldn't reproduce it last time I looked.
IIUC, we could still get an early free if the

I have no idea if early frees would even cause this error – you'd expect a segfault-like issue instead – but hopefully GC will tell us that. Of course, it was misleading last time, so ¯\_(ツ)_/¯

FWIW, this seems to be showing up in about 1/3 to 1/2 of bors runs.
No, that's not correct. The unsafe conversions (i.e. the ones that get an untracked reference to the data) only happen as part of the lowered `ccall` itself:

```julia
julia> a = [1]
1-element Array{Int64,1}:
1
julia> Meta.@lower ccall(:whatever, Nothing, (Ptr{Int},), a)
:($(Expr(:thunk, CodeInfo(
@ none within `top-level scope'
1 ─ %1 = Core.apply_type(Ptr, Int)
│ %2 = Base.cconvert(%1, a)
│ %3 = Core.apply_type(Ptr, Int)
│ %4 = Base.unsafe_convert(%3, %2)
│ %5 = $(Expr(:foreigncall, :(:whatever), :Nothing, :(Core.svec(Core.apply_type(Ptr, Int))), :(:ccall), 1, :(%4), :(%2)))
└── return %5
))))
```

Note how the result of `cconvert` (`%2`) is passed to the `foreigncall` alongside the converted pointer (`%4`); those trailing arguments are kept as GC roots, so the array stays alive for the duration of the call.
Ah, I wasn't aware of that.
I still cannot reproduce. @appleparan, could you send a Manifest? Which versions of Julia, CUDA and CUDNN are you using?
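(A quick way to gather that information, using standard tooling:)

```julia
import Pkg
using InteractiveUtils

versioninfo()  # Julia version, platform details, JULIA_* environment variables
Pkg.status()   # versions of Flux, CuArrays, etc. in the active project
```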
@maleadt Here it is: https://gist.github.com/appleparan/69289887e446b3ec57f1f42c6a375588. I was trying to reproduce it the whole day; however, if I didn't put

The following are the commands that I used. I didn't use any GC-related options.
EDIT
CUDA Info
I made a Singularity image and ran julia inside it.
I added https://gist.github.com/appleparan/676d03e9de15092b0e7d3d6501c52517. I thought the pullback from these lines, and this, should be called; however, it never is. Can anyone explain this? It makes no sense for it not to be called. Moreover, this error only happens on the first run: if I want to reproduce it, I need to restart Julia. That's why I think this relates to code generation or Zygote.
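(As an aside, a minimal sketch of how one can check whether the backward rule runs, by calling the pullback explicitly; `m` and `x` here are placeholders, not the code from the gist:)

```julia
using Zygote

# Forward pass: returns the value and a pullback closure.
y, back = Zygote.pullback(m -> sum(m(x)), m)

# Backward pass: calling the pullback is what triggers the adjoint,
# e.g. the cudnnRNNBackwardData path for CUDNN RNNs.
grads = back(1f0)
```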
You're taking the gradient of
Which doesn't make any sense, since the workspace is allocated right before the call to
Is it possible that the alloc causes memory to get freed by the pool, affecting CUDNN's heuristic? If so, it seems pretty hard to work around this. Perhaps we can just allocate-and-check until CUDNN is happy, at least as a temporary fix.
ding, ding, ding
Dammit CUDNN, we talked about this |
Finally! I had given up on this issue because, after updating Flux#master, I couldn't reproduce it, and in my case the error didn't appear when I inspected variables. I saw the draft PR from @maleadt; it's a great solution. However, the real problem remains: how about keeping the workspace around between calls?
That wouldn't work, because we may happen to free data in between, in which case CUDNN would expect a larger workspace, due to how its heuristics appear to work. In the sample code there are no frees between those calls; they only happen at the very end of the sample.
I see. You are right.
So the heuristic is to check until CUDNN is happy with the amount of memory we allocate?
We can't really check whether CUDNN is happy, because it returns INVALID_PARAM rather than INSUFFICIENT_WORKSPACE or similar. So we allocate until it doesn't ask us to allocate more, and hope that we don't suddenly free memory before calling into the library (which would change the heuristic).
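(In other words, something like the following sketch; `workspace_size` stands in for a cudnnGetRNNWorkspaceSize-style query and is not the actual CuArrays API:)

```julia
# Hypothetical sketch: re-query the required workspace size after allocating,
# and grow the buffer until CUDNN stops asking for more.
sz = workspace_size(handle, rnndesc, seqlen, xdescs)
workspace = CuArray{UInt8}(undef, sz)
while (sz = workspace_size(handle, rnndesc, seqlen, xdescs)) > length(workspace)
    workspace = CuArray{UInt8}(undef, sz)
end
```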
Should be fixed |
I've been getting this, which seems related to what's been discussed here:
If this isn't the same, I can open a new issue. |
Apologies for the noise; it seems it is not the same issue.
What was the issue? |
I still have no idea, but the workspace and reserve sizes appear to be correct, so it seems not to be the same issue as this one. |
I also encountered this. Here is my reproduction:

```julia
using Flux, Statistics

rnn = Chain(GRU(16, 8),
            Dense(8, 1, σ),
            x -> reshape(x, :))

X = [rand(16, 10) for i in 1:20]  # 20 time steps of 16 features × 10 samples
Y = rand(10, 20) ./ 10

rnn = rnn |> gpu
X = gpu(X)
Y = gpu(Y)

θ = Flux.params(rnn)
# note: uses the `x` argument rather than the global `X`
loss(x, y) = mean((Flux.stack(rnn.(x), 2) .- y) .^ 2f0)
opt = ADAM(1e-3)

size(rnn[1].state)
Flux.reset!(rnn)
Flux.train!(loss, θ, [(X, Y)], opt)
size(rnn[1].state)
```

It can be observed that both prior to and after
Is that reproducible? If so, could you put it in a new issue with some details on package and CUDA versions?
On 8a0745f, which uses CuArrays 1.4.2:
Guess we haven't fixed all of them. It happens very rarely though, so it's less of an issue than #267.