Incorrect gradients of batchnorm in testmode #548

Open
phaim opened this issue Oct 27, 2023 · 3 comments

@phaim commented Oct 27, 2023

I have been trying to take the derivative of a Flux model in test mode, and noticed that the BatchNorm layer behaves incorrectly for 4D and 5D CUDA arrays.
Here is a minimal working example of this behaviour, computing the gradient of the BatchNorm layer for inputs reshaped to different numbers of dimensions:

using Flux, CUDA, Zygote

# Move the model to the given device, put it in test mode, reshape the input
# to `n_dims` singleton dimensions, and return the gradient w.r.t. the input.
function gradient_varying_shape(m, x, n_dims, device)
    m = m |> device
    Flux.testmode!(m)

    x = reshape(x, ntuple(i -> 1, n_dims)) |> device
    return gradient(input -> sum(m(input).^2), x)[1] |> cpu
end

model = BatchNorm(1)
x = [1f0]

for i=2:7
    cpu_gradient = gradient_varying_shape(model, x, i, cpu) 
    gpu_gradient = gradient_varying_shape(model, x, i, gpu) 
    println("n_dim=$i, cpu: $(cpu_gradient[1]), gpu: $(gpu_gradient[1])")
end

This gives the following output for me:

n_dim=2, cpu: 1.99998, gpu: 0.0
n_dim=3, cpu: 1.99998, gpu: 1.99998
n_dim=4, cpu: 1.99998, gpu: 0.0
n_dim=5, cpu: 1.99998, gpu: 0.0
n_dim=6, cpu: 1.99998, gpu: 1.99998
n_dim=7, cpu: 1.99998, gpu: 1.99998
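
For reference, the CPU value is what I would expect: in test mode a freshly constructed BatchNorm(1) normalises with its running statistics (mean 0, variance 1 by default) and the default affine parameters γ = 1, β = 0, so the layer is just an affine map of the input. A quick hand check, assuming Flux's default ϵ = 1f-5 (a sketch, not part of the reproduction):

# Test-mode BatchNorm(1) with default parameters reduces to
# y = γ * (x - μ) / sqrt(σ² + ϵ) + β with μ = 0, σ² = 1, γ = 1, β = 0.
x  = 1f0
ϵ  = 1f-5
y  = (x - 0f0) / sqrt(1f0 + ϵ)   # ≈ 0.999995
dx = 2f0 * y / sqrt(1f0 + ϵ)     # gradient of sum(y^2) w.r.t. x ≈ 1.99998

The GPU value of 0.0 would be consistent with the training-mode computation being used instead: with a single element per channel, the batch mean equals x, so the normalised output, and hence the gradient with respect to x, collapses to zero.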

Looking through the code, I found that the implementation of the CUDA backward batchnorm here ignores the training argument. Could this be the origin of this behaviour?

I'm using Julia 1.9.3 with NNlib version 0.9.7 and this environment:

[052768ef] CUDA v5.0.0
[587475ba] Flux v0.14.6
[e88e6eb3] Zygote v0.6.66
[02a925ec] cuDNN v1.2.0
@ToucheSir
Member

It's quite possible. That whole method is a bit of a kludge and hasn't been changed in years, so I'm not surprised it has edge cases.

@phaim
Author

phaim commented Nov 14, 2023

I have been looking into fixing this issue, but I have a hard time understanding the function signature of cudnnBNBackward! and which variable it is supposed to act on. Is there any additional documentation or information on that?

@ToucheSir
Member

No, but the good news is that it's merely a thin wrapper over cudnnBatchNormalizationBackward, which is documented by Nvidia. If you have any questions during your effort, I'd be happy to answer them.
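
For orientation, and as far as I understand it, cudnnBatchNormalizationBackward computes the gradient for the training-mode forward pass, i.e. it assumes the statistics came from the current batch. In test mode the layer normalises with fixed running statistics, so it is affine in x and the pullback with respect to x is just a per-channel rescaling that doesn't need cuDNN at all. A rough sketch of that maths, with purely illustrative names (not NNlib's API):

# Hypothetical test-mode pullback: with fixed running mean/variance the layer
# is affine in x, so dx = dy .* γ ./ sqrt.(σ² .+ ϵ), broadcast over channels.
# Flux convention: channels live in the second-to-last dimension.
function testmode_batchnorm_pullback(dy, γ, σ², ϵ; channel_dim = ndims(dy) - 1)
    shape = ntuple(i -> i == channel_dim ? length(γ) : 1, ndims(dy))
    scale = reshape(γ ./ sqrt.(σ² .+ ϵ), shape)
    return dy .* scale
end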
