Flux.Zeros conflicts with Flux.loadparams! #1277

Closed
pxl-th opened this issue Jul 11, 2020 · 11 comments
@pxl-th
Member

pxl-th commented Jul 11, 2020

Consider a model where all convolutions have their biases disabled:

m = Chain(
    Conv((3, 3), 1 => 2, bias=Flux.Zeros()),
    Conv((3, 3), 2 => 3, bias=Flux.Zeros()),
    Conv((3, 3), 3 => 4, bias=Flux.Zeros()),
)

If I collect its parameters (e.g. for saving them later) and print:

mp = m |> params .|> cpu
for p in mp
    println("$(typeof(p)) $(size(p))")
end

the output is:

Array{Float32,4} (3, 3, 1, 2)
Array{Bool,0} () <--- disabled bias from 1st conv
Array{Float32,4} (3, 3, 2, 3)
Array{Float32,4} (3, 3, 3, 4)

So the model's parameters contain the disabled bias, but only from the 1st layer (since parameters are stored in a Zygote.IdSet, the identical Zeros from the other layers are deduplicated).
Moreover, if I want to load these parameters into the model:

Flux.loadparams!(m, mp)

I get the error ERROR: setindex! not defined for Flux.Zeros{Bool,0}.

An easy fix would be to define params! for Flux.Zeros so that it does not get added to the parameters:

params!(p::Flux.Params, x::Flux.Zeros{<:Number}, seen = Flux.IdSet()) = nothing

I wanted to create a PR with this fix, but this solution conflicts with this test, which expects Flux.Zeros to be in the parameters.
Maybe change that test to instead check that the bias is not in the gradients?
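
For illustration, the kind of check I have in mind looks roughly like this (a sketch only, assuming the params! method above is in place so Zeros is never collected; sizes are arbitrary):

using Flux, Test

# sketch of the adjusted test: a disabled bias is simply never collected,
# and taking gradients over the remaining parameters still works
m  = Conv((3, 3), 1 => 2, bias=Flux.Zeros())
ps = Flux.params(m)
@test length(ps) == 1                 # only the weight; no Zeros entry

gs = Flux.gradient(() -> sum(m(rand(Float32, 5, 5, 1, 1))), ps)
@test gs[m.weight] !== nothing        # the weight still gets a gradient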

@CarloLucibello
Member

I don't see any reason why Zeros should be in parameters. @DhairyaLGandhi what is that test for?

@DhairyaLGandhi
Member

It's to make sure we don't train the bias by accident.

Functor automatically collects all the params currently; a better solution might be to teach loadparams! how to handle Zeros.
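
Roughly what I mean, as a sketch rather than the actual implementation (loadparams_skipping_zeros! is a made-up name, and it assumes the saved params were collected from the same structure so the placeholders line up):

# hypothetical sketch: like loadparams!, but skip any entry whose destination
# is a Zeros marker, since there is nothing to load into a disabled bias
function loadparams_skipping_zeros!(m, xs)
  for (p, x) in zip(Flux.params(m), xs)
    p isa Flux.Zeros && continue
    size(p) == size(x) ||
      error("Expected param size $(size(p)), got $(size(x))")
    copyto!(p, x)
  end
end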

@pxl-th
Member Author

pxl-th commented Jul 12, 2020

Also, I've found out that if you move the model to any device, Flux.Zeros gets converted to an actual Array representation and the gradient w.r.t. the bias is no longer nothing.

MWE to reproduce

device = Flux.cpu
model = Conv((3, 3), 1 => 2, bias=Flux.Zeros()) |> device
xin = randn(Float32, 4, 4, 1, 1) |> device
you = randn(Float32, 2, 2, 1, 1) |> device

model_params = Flux.params(model)
println(">>> Model parameters:")
for p in model_params
    println("$(typeof(p)) $(size(p))")
end

loss(x, y) = Flux.mse(model(x), y)
grads = Flux.gradient(model_params) do
    loss(xin, you)
end

println(">>> Gradients:")
for (k, v) in grads.grads
    println("$(typeof(k)) $(size(k)) => $(typeof(v))")
end

MWE output without moving model to any device

>>> Model parameters:
Array{Float32,4} (3, 3, 1, 2)
Flux.Zeros{Bool,0} ()
>>> Gradients:
Array{Float32,4} (3, 3, 1, 2) => Array{Float32,4}
Flux.Zeros{Bool,0} () => Nothing

MWE output when `cpu` is device

>>> Model parameters:
Array{Float32,4} (3, 3, 1, 2)
Array{Bool,0} ()
>>> Gradients:
Array{Bool,0} () => Array{Float32,0}
Array{Float32,4} (3, 3, 1, 2) => Array{Float32,4}

MWE output when `gpu` is device

>>> Model parameters:
CUDA.CuArray{Float32,4,Nothing} (3, 3, 1, 2)
CUDA.CuArray{Bool,0,Nothing} ()
>>> Gradients:
CUDA.CuArray{Float32,4,Nothing} (3, 3, 1, 2) => CUDA.CuArray{Float32,4,Nothing}
CUDA.CuArray{Bool,0,Nothing} () => CUDA.CuArray{Float32,0,CUDA.CuArray{Float32,4,CUDA.CuArray{Float32,4,Nothing}}}

@MacKenzieHnC

Would solving #1294 and #1284 solve this? It seems like not separating the different "trainable" parameters is becoming a problem.

@DhairyaLGandhi
Member

I think having a solid notion of how to

  1. Load Zeros
  2. Not convert it when it's moved to a device (a rough sketch of this is below)

would be valuable. #1284 would be great to have too.
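
For point 2, one possible direction, assuming device movement keeps going through Adapt (a sketch, untested):

import Adapt

# hypothetical sketch: let the Zeros marker pass through cpu/gpu movement
# untouched instead of being materialised as a plain array
Adapt.adapt_structure(to, z::Flux.Zeros) = z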

@MacKenzieHnC

If you give someone this Julia file:

using Flux
using BSON

model = BSON.load("arbitrary_saved_model.bson")

do they have enough information to reverse-engineer that model in a different language?

Maybe I'm way off-base here, but it feels like using Zeros to turn the bias off is covering up for an insufficient definition somewhere.

@DhairyaLGandhi
Member

DhairyaLGandhi commented Sep 8, 2020

I agree that there is missing information in the parameters. By the way, if we make Zeros a mutable struct, I think we get a lot of the same benefits while keeping the flexibility of not defining a singleton Flux.Zeros, so it would show up as distinct parameters in the params object. The fact that the bias is turned off is valuable information for the reproducibility of the model; it can't be ignored and should therefore be reflected in the params.
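
For concreteness, the kind of change meant here is roughly the following (a sketch only, not the exact Flux definition, which carries more structure):

# a *mutable* struct gives every Zeros() its own identity (objectid),
# so an IdSet/IdDict keeps one entry per layer instead of collapsing them
mutable struct Zeros{T,N} <: AbstractArray{T,N}
  size::NTuple{N,Int}
end
Zeros(::Type{T}, sz::Integer...) where T = Zeros{T,length(sz)}(sz)
Zeros(sz::Integer...) = Zeros(Bool, sz...)
Base.size(z::Zeros) = z.size
Base.getindex(z::Zeros{T}, i::Integer...) where T = zero(T)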

For the record, making Zeros mutable would give you:

julia> m = Chain(
           Conv((3, 3), 1 => 2, bias=Flux.Zeros()),
           Conv((3, 3), 2 => 3, bias=Flux.Zeros()),
           Conv((3, 3), 3 => 4, bias=Flux.Zeros()),
       )
Chain(Conv((3, 3), 1=>2), Conv((3, 3), 2=>3), Conv((3, 3), 3=>4))

julia> length(Flux.params(m))
6

julia> Flux.params(m)[2]
0-dimensional Flux.Zeros{Bool,0}:
0

julia> gs = gradient(() -> sum(m(rand(Float32, 28,28,1,1))), Flux.params(m))
Grads(...)

julia> gs.grads
IdDict{Any,Any} with 6 entries:
  Float32[0.0169228 - => Float32[387.49 390.719 392.534; 385.96
  fill(false)          => nothing
  fill(false)          => nothing
  fill(false)          => nothing
  Float32[0.092979 -0 => Float32[-432.106 -433.08 -435.981; -43
  Float32[-0.166749 0 => Float32[-0.000347137 -0.42526 0.50606;

@DhairyaLGandhi
Member

An easy fix would be to define params! for Flux.Zeros so that it does not get added to the parameters:

If we don't add it to the parameters, how do we encode that the bias is turned off when pulling out the weights from the model? I think it would work if the entire model was saved, but loadparams!(model, weights_with_zeros) would fail.

@MacKenzieHnC

MacKenzieHnC commented Sep 9, 2020

If we don't add it to the parameters, how do we encode that the bias is turned off when pulling out the weights from the model?

I am kind of new to Julia. Are there no polymorphism tricks we can use? Is there any way to just define a Conv struct that doesn't have a bias property in the first place?

edit: Reading further into the documentation, it looks like that's already kind of what Zeros is doing. But what problem is solved by making bias Union{Zeros, AbstractVector{T}} instead of Union{Nothing, AbstractVector{T}}?

Per the documentation:

A particularly useful case of a Union type is Union{T, Nothing}, where T can be any type and Nothing is the singleton type whose only instance is the object nothing. This pattern is the Julia equivalent of Nullable, Option or Maybe types in other languages. Declaring a function argument or a field as Union{T, Nothing} allows setting it either to a value of type T, or to nothing to indicate that there is no value.

edit: I got Union{Nothing, AbstractVector{T}} to work, but it required putting an if statement in the function that evaluates Conv:

function (c::Conv)(x::AbstractArray)
  # TODO: breaks gpu broadcast :(
  # ndims(x) == ndims(c.weight)-1 && return squeezebatch(c(reshape(x, size(x)..., 1)))
  
  ### My changes
  σ = c.σ
  if c.bias !== nothing
      b = reshape(c.bias, ntuple(_->1, length(c.stride))..., :, 1)
  else
      b = 0
  end
  ###
  cdims = DenseConvDims(x, c.weight; stride=c.stride, padding=c.pad, dilation=c.dilation)
  σ.(conv(x, c.weight, cdims) .+ b)
end

Is that what Zeros is supposed to solve? Is it a performance optimization? Here are the results of the example after my changes:

julia> m = Chain(
           Conv((3, 3), 1 => 2, bias=nothing),
           Conv((3, 3), 2 => 3, bias=nothing),
           Conv((3, 3), 3 => 4, bias=nothing),
       )
Chain(Conv((3, 3), 1=>2), Conv((3, 3), 2=>3), Conv((3, 3), 3=>4))

julia> length(Flux.params(m))
3

julia> Flux.params(m)[2]
3×3×2×3 Array{Float32,4}:
[:, :, 1, 1] =
  0.089666   0.28373   -0.300706
 -0.272113   0.252896  -0.100829
  0.171586  -0.324309   0.00790166
...

julia> gs = gradient(() -> sum(m(rand(Float32, 28,28,1,1))), Flux.params(m))
Grads(...)

julia> gs.grads
IdDict{Any,Any} with 3 entries:
  Float32[0.089666 0.28373 -0.300706; -0.272113 0.252896 -0.100829; 0.1715 => Float32[-189.062 -189.535 -189.128; -188.437 -188.879 -189.078; -187.731 -189.049 -189.331]
  Float32[-0.10269 -0.0312617 -0.184054; -0.0267541 0.172398 0.170193; 0.1 => Float32[-38.5559 -38.4327 -38.9365; -38.2541 -38.1558 -39.0549; -38.5386 -37.9995 -38.0654]
  Float32[0.177636 0.394305 -0.407948; 0.34289 -0.426031 0.319731; 0.00487 => Float32[245.675 244.423 244.569; 244.205 243.673 244.096; 243.72 243.263 243.177]

julia> Flux.loadparams!(m, Flux.params(m))

julia> length(params(m))
3

The fact that the bias is turned off is valuable information for the reproducibility of the model; it can't be ignored and should therefore be reflected in the params.

For a given i it seems like we don't know what Flux.params(model)[i] is anyway, unless we already know what kinds of params we are expecting.

This might be way more ambitious than I realize, but I was hoping there was some way to have Flux.params return a dict, kind of like tf.compat.v1.trainable_variables does in Python. We already have names for all the parameters (since we get them from the struct fields), but we throw that information away at some step.
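
Just to illustrate the idea, something like the sketch below is what I mean (namedparams is not a Flux function; it walks Flux.trainable recursively and assumes the field names survive that traversal):

using Flux

# hypothetical sketch of a dict-returning params that keeps field names;
# numeric arrays are leaves, everything else is recursed via Flux.trainable
function namedparams!(dict, prefix, x)
  if x isa AbstractArray{<:Number}
    dict[prefix] = x
  else
    for (name, child) in pairs(Flux.trainable(x))
      namedparams!(dict, string(prefix, ".", name), child)
    end
  end
  return dict
end

namedparams(m) = namedparams!(Dict{String,Any}(), "model", m)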

edit: I promise I'm doing my homework. I read the thread where Flux.Zeros() got added; I just can't see what it is solving that the original nothing solution didn't.

@MacKenzieHnC

Another way of looking at it is, if you see an array where you try loading it with nothing (instead of zeros) you break the assumption that it might be populated with legitimate values if you train it further.

@DhairyaLGandhi what did you mean by this comment in #1281?

bors bot added a commit that referenced this issue Nov 9, 2020
1379: Fix some issues with Zeros option 2 r=CarloLucibello a=DrChainsaw

Fixes #1332, #1277

This is take two on fixing the above issues and is intended to be mutually exclusive to #1374.

In this version Zeros is no longer an AbstractArray and is thus more like a there-is-no-such-parameter marker instead of this-parameter-is-an-immutable-array-of-zeros type. 

I made the change so that the advertised way of disabling bias is through `bias=false` rather than `bias=Zeros()`. 

`Zeros` itself is a lot less capable here compared to the previous attempt, but that also keeps the number of moving parts down, i.e. no need to test that both the 0-dim version and the full-size version work the same. All in all there are fewer specializations compared to #1374.

I think that a rename could definitely be in order. Better names I can think of are `bias=Off`, `bias=None` or, as suggested by @mcabbott, `bias=False()`.

The best argument against renaming I can think of is to be a little bit less breaking.

### PR Checklist

- [X] Tests are added
- [X] Entry in NEWS.md
- [ ] Documentation, if applicable
- [ ] Final review from `@dhairyagandhi96` (for API changes).


Co-authored-by: DrChainsaw <Christian.kyril.skarby@gmail.com>
@DrChainsaw
Contributor

This should have been closed by #1379; I must have messed up the 'fixes' syntax.
