inconsistency between params and destructure #1733

Closed
CarloLucibello opened this issue Oct 3, 2021 · 7 comments · Fixed by #1901

Comments

@CarloLucibello
Member

CarloLucibello commented Oct 3, 2021

In my understanding, destructure is supposed to behave similarly to params and collect the trainable parameters of a struct into a vector. Instead, it collects all arrays, as one can see in this example using BatchNorm:

julia> b = BatchNorm(2)
BatchNorm(2)        # 4 parameters, plus 4 non-trainable

julia> ps = params(b)
Params([Float32[0.0, 0.0], Float32[1.0, 1.0]])

julia> p, re = Flux.destructure(b)
(Float32[0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0], Flux.var"#60#62"{BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}(BatchNorm(2)))

julia> length(p)
8

Should we modify destructure to act like params, i.e. recurse over trainable(m) instead of just applying fmap as it currently does?

Also, maybe we should add to both params and destructure a keyword argument with possible values :trainable, :buffer, and :all, defaulting to :trainable.

Related to #1727
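
A rough sketch of what a trainable-aware destructure with such a keyword could look like. The function name and the which keyword are hypothetical, not actual Flux API, and the reconstruction half is left out:

using Flux, Functors

function destructure_sketch(m; which = :trainable)
    parts(x) = which === :trainable ? Flux.trainable(x) : Functors.functor(x)[1]
    ps = Float32[]   # assumes Float32 parameters, as in the BatchNorm example above
    flatten!(x) = x isa AbstractArray{<:Number} ? append!(ps, vec(x)) :
                  foreach(flatten!, parts(x))
    flatten!(m)
    return ps   # the re reconstructor is omitted from this sketch
end

length(destructure_sketch(BatchNorm(2)))                 # 4, matching params
length(destructure_sketch(BatchNorm(2); which = :all))   # 8, the current destructure behaviour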

@DhairyaLGandhi
Member

They're pretty different functions, meant for different purposes. destructured outputs can be used by external libraries, which conservatively need all the parameters from a training routine to be available downstream.

@CarloLucibello
Member Author

It seems like exactly the same use to me (training), which means that both functions should output what is meant to be trained. I think we need some input from the SciML people, since they are prominent consumers of destructure. @ChrisRackauckas

@DhairyaLGandhi
Member

Not just training; some libraries use it for initial guesses as well. I'd prefer not to make trainable a "special" requirement for defining layers.

@ToucheSir
Member

Differing use cases aside, making destructure call trainable won't make trainable any more of a requirement for defining layers than it already is (i.e. not at all), because it forwards to functor by default.
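
For reference, a minimal illustration of that fallback (MyLayer is a made-up struct, not part of Flux):

using Flux

struct MyLayer
    w
    σ
end
Flux.@functor MyLayer   # no trainable method defined for MyLayer

layer = MyLayer(rand(Float32, 2, 2), tanh)
Flux.trainable(layer)   # (w = ..., σ = tanh): the same children functor exposes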

@ChrisRackauckas
Member

I thought it only grabbed the trainable params 😅. The use case is, for example, to get the vector and use BFGS to train the parameters of a Chain. If it's not giving the same vector out, then training with Optim wouldn't be the same as training with Flux's optimizers, which is not what we intend. So it should probably exclude the values that are meant to be constant.
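
For context, a sketch of that workflow; the model, data, and loss below are made up for illustration:

using Flux, Optim

m = Chain(Dense(2, 8, tanh), Dense(8, 1))
p0, re = Flux.destructure(m)

x, y = rand(Float32, 2, 32), rand(Float32, 1, 32)
loss(p) = sum(abs2, re(p)(x) .- y)

res = Optim.optimize(loss, p0, BFGS())   # BFGS updates every entry of p0,
m_opt = re(Optim.minimizer(res))         # so any non-trainable entries would move too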

@darsnack
Member

darsnack commented Oct 3, 2021

We can switch to using the new walk keyword for fmap to walk only the trainable parameters, with an option to pass an alternate walk into destructure.
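
Something like the following, assuming the walk(f, x) calling convention of the fmap keyword at the time; the name trainable_walk is illustrative:

using Flux, Functors

function trainable_walk(f, x)
    func, re = Functors.functor(x)
    ts = Flux.trainable(x)
    istrainable(c) = any(t -> t === c, ts)   # identity check, since e.g. a fresh β and μ are ==
    re(map(c -> istrainable(c) ? f(c) : c, func))
end

# Example: zero only the trainable arrays of a BatchNorm; μ and σ² are left alone.
fmap(x -> x isa AbstractArray ? zero(x) : x, BatchNorm(2); walk = trainable_walk)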

@DhairyaLGandhi
Member

It currently excludes some values that we would want included in higher-order cases, and on our end it's actually easier to ignore the objects that don't have gradients (which can differ depending on the layer configuration) by grabbing the gradients from the NamedTuple directly. That is a nice and generic approach.
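
For context, a small sketch of the NamedTuple-of-gradients approach described here; the layer and loss are arbitrary:

using Flux

d = Dense(3, 2)
x = rand(Float32, 3, 5)
gs = Flux.gradient(m -> sum(m(x)), d)[1]   # NamedTuple mirroring the struct's fields

# Fields that receive no gradient (e.g. the activation function) come back as nothing,
# so downstream code can simply skip them:
for k in keys(gs)
    gs[k] === nothing || println(k, " => gradient of size ", size(gs[k]))
end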
