Strange result with gradient #1547
I want to bump this because I feel pretty sure
Here,
We have recently switched backends for RNNs, so if you could run the same tests on the previous tag of Flux, it should help isolate what is happening.
Doesn't look like that applies here because no GPU acceleration is in play, though I second the recommendation to try on a prior version.
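If it helps, here is a minimal sketch of pinning an earlier Flux release in a throwaway environment for comparison; the version number "0.11.6" is just a placeholder for whichever prior tag you want to test.

```julia
using Pkg

# Use a temporary environment so the downgrade doesn't disturb your main one.
Pkg.activate(temp = true)

# "0.11.6" is a placeholder; substitute the prior Flux tag you want to compare against.
Pkg.add(name = "Flux", version = "0.11.6")

Pkg.status("Flux")   # confirm which version actually resolved
```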
Well, this actually turned up an interesting result. I tried this:
Using Flux v0.9 I get the result I expect, but using Flux v0.11 I get the wrong result (the true and false outputs are reversed).
Can you test on v0.10? 0.9 -> 0.10 was a complete overhaul of Flux's internals, so it's not surprising to see that the results are different. Also, can you verify that e.g. PyTorch gives the results you expect? I don't think it's possible to rule out that the RNN formulation used by these libraries is different from what you'd expect.
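For what it's worth, here is a sketch of the Elman-style update I would expect a vanilla RNN cell to implement; whether a given library matches this convention exactly (activation, bias handling, initial state) is precisely the kind of thing to verify.

```julia
# Hand-rolled vanilla (Elman) RNN update, written out for clarity.
# σ is the activation (identity in the simplified tests later in this thread).
function rnn_step(Wi, Wh, b, h, x; σ = identity)
    h′ = σ.(Wi * x .+ Wh * h .+ b)
    return h′, h′   # the new hidden state is also the output
end
```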
I want to update that: with v0.10, I get the incorrect result. Here's a comparison of v0.9 and v0.10:
As soon as I figure out how to use PyTorch I'll report those results.
Hi! After rewriting the loss in different ways I found that the problem is related to the broadcasting:

```julia
using Flux
h = [0.0]
function m(x)
y = Wx .* x + Wh .* h .+ b
global h = y
return y
end
x = [[0.3], [2.5]]
y = [0.5, 1.0]
Wx = [0.5]
Wh = [0.001]
b = [0.85]
# Theoretic derivative
x[1] + x[2] + Wh.*x[1] # = 2.8003
loss(x, y) = sum((Flux.stack(m.(x), 1) .- y))
gs = gradient(() -> loss(x, y), params(Wx, Wh, b))
gs[Wx]  # = 2.8025

loss2(x, y) = sum((Flux.stack([m(xᵢ) for xᵢ in x], 1) .- y))
gs2 = gradient(() -> loss2(x, y), params(Wx, Wh, b))
gs2[Wx] # = 2.8003
```

I tested it with an RNN layer and in this case the gradients match:

```julia
using Random
Random.seed!(1)
m2 = RNN(1, 1, x -> x)
x = [rand(Float32, 1) for i = 1:2]
y = rand(Float32, 2)
p = params(m2)
loss(x, y) = sum((Flux.stack(m2.(x), 1) .- y))
gs = gradient(() -> loss(x, y), p)
Flux.reset!(m2)
loss2(x, y) = sum((Flux.stack([m2(xᵢ) for xᵢ in x], 1) .- y))
gs2 = gradient(() -> loss2(x, y), p)
[gs[pᵢ] .== gs2[pᵢ] for pᵢ in p]
# 4-element Vector{BitArray}:
# [1]
# [1]
# [1]
# [1]
```

However, if I add a trivial Chain, the gradients don't match anymore:

```julia
Random.seed!(1)
m2 = Chain(RNN(1, 1, x -> x))
x = [rand(Float32, 1) for i = 1:2]
y = rand(Float32, 2)
p = params(m2)
loss(x, y) = sum((Flux.stack(m2.(x), 1) .- y))
gs = gradient(() -> loss(x, y), p)
Flux.reset!(m2)
loss2(x, y) = sum((Flux.stack([m2(xᵢ) for xᵢ in x], 1) .- y))
gs2 = gradient(() -> loss2(x, y), p)
[gs[pᵢ] .== gs2[pᵢ] for pᵢ in p]
# 4-element Vector{BitArray}:
# [0]
# [1]
# [0]
# [0]
```

For Dense layers, the gradients match in both cases. Maybe I'm doing something wrong, but if not, this could point to a serious bug, since most recurrent models are trained with broadcasting. I hope that this can be helpful, and I take the chance to thank you for this amazing (and elegant) package!!

```
julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin19.6.0)
CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS = 4
(Flux_gradient_test) pkg> st
Status `~/Documents/github issues/Flux_gradient_test/Project.toml`
  [587475ba] Flux v0.12.3
```
@gabrevaya the reason wrapping with a Chain doesn't work is #1209. This was fixed for broadcasting. An easy workaround that doesn't require an array comprehension is to use

```julia
loss2(x, y) = sum((Flux.stack(map(m2, x), 1) .- y))
```

For a proper fix, I think it's worth revisiting FluxML/Zygote.jl#807.
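For illustration, a minimal sketch (reusing the toy Chain setup from the previous comment) checking that the `map`-based loss gives the same gradients as the comprehension-based one; this is only a usage sketch, not part of the original report.

```julia
using Flux, Random

Random.seed!(1)
m2 = Chain(RNN(1, 1, x -> x))
x = [rand(Float32, 1) for _ in 1:2]
y = rand(Float32, 2)
p = params(m2)

# Comprehension-based loss (gives the expected gradients)
loss_comp(x, y) = sum(Flux.stack([m2(xᵢ) for xᵢ in x], 1) .- y)
gs_comp = gradient(() -> loss_comp(x, y), p)

Flux.reset!(m2)

# map-based workaround: avoids broadcasting the Chain
loss_map(x, y) = sum(Flux.stack(map(m2, x), 1) .- y)
gs_map = gradient(() -> loss_map(x, y), p)

# These should agree parameter by parameter
[gs_comp[pᵢ] == gs_map[pᵢ] for pᵢ in p]
```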
I was still confused after my last issue, so I dove a little deeper. Specifically, I wanted to understand how RNN training works with the loss function used here:
```julia
loss(x, y) = sum((Flux.stack(m.(x), 1) .- y) .^ 2)
```
I tested a much simplified version of this, using a 1 -> 1 RNN cell without an activation function, and the same loss function without the square:
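A setup along these lines would match the description; the exact definitions here are an assumption (a 1 -> 1 cell with identity activation and the unsquared loss), not the original code.

```julia
using Flux

# 1 -> 1 RNN cell without an activation function (identity)
m = RNN(1, 1, x -> x)

# Same loss as above, but without the square
L(x, y) = sum(Flux.stack(m.(x), 1) .- y)
```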
Here's where I had a problem: when there's more than one input/output sample, what's the derivative of `L` with respect to `Wi` (`m.cell.Wi`)? Using some semi-random values: if you evaluate `m.(x)` or `L(x, y)`, you get the result you expect (`[[1.06], [2.60106]]` and `2.16106`, respectively). For dL/dWi, it's easy to derive by inspection that it should be x1 + x2 + Wh*x1 = 2.8003. You could also get it by finite difference. But when you use `gradient`, the result is equal to x1 + x2 + Wh*x2 instead of x1 + x2 + Wh*x1. Am I overlooking something, or is something weird happening here?
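As a sanity check, here is a minimal sketch of that by-inspection/finite-difference calculation. The toy values (Wi = 0.7, Wh = 0.001, b = 0.85, x = (0.3, 2.5), y = (0.5, 1.0)) are an assumption chosen to reproduce the quoted outputs, not copied from the original code.

```julia
# Unrolled 1 -> 1 recurrent model with identity activation:
#   h_t = Wi*x_t + Wh*h_{t-1} + b,   L = Σ_t (h_t - y_t)
Wi, Wh, b = 0.7, 0.001, 0.85
xs = [0.3, 2.5]
ys = [0.5, 1.0]

function unrolled_loss(Wi, Wh, b, xs, ys)
    h = 0.0
    total = 0.0
    for (x, y) in zip(xs, ys)
        h = Wi * x + Wh * h + b
        total += h - y
    end
    return total
end

unrolled_loss(Wi, Wh, b, xs, ys)               # ≈ 2.16106

# Finite-difference estimate of dL/dWi
ϵ = 1e-6
(unrolled_loss(Wi + ϵ, Wh, b, xs, ys) -
 unrolled_loss(Wi - ϵ, Wh, b, xs, ys)) / 2ϵ    # ≈ 2.8003

# By inspection (chain rule through the hidden state):
#   dL/dWi = x1 + (x2 + Wh*x1) = 2.8003
xs[1] + xs[2] + Wh * xs[1]                     # = 2.8003
```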