Fix some issues with Zeros option 2 #1379
Conversation
What about the
@CarloLucibello I guess that would be an option 3 then :) The reason why I did not go for it is that it reintroduces the ambiguity as to whether the layer has a fixed constant bias or whether it just doesn't have one at all. For instance, allowing it would almost certainly mean that someone will put
Allowing for scalar biases would increase the flexibility of course, but it would imo also increase the number of moving parts (a.k.a. the error surface). The question is whether non-zero scalar biases are useful enough to allow for this. If we only allow for scalar zeros then
One example I ran into where it could be problematic is the
As an (at least to me) not insignificant detail, having an own type means we can specialize arithmetic ops to be (essentially) zero cost. Broadcasted addition with
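For concreteness, a minimal sketch of the kind of specialization I mean, using a made-up marker type (not the name used in this PR):
struct NoBias end  # hypothetical marker meaning "this layer has no bias parameter"

# Broadcasting `+` with the marker just returns the other operand,
# so "adding the bias" is (essentially) free when there is none:
Broadcast.broadcasted(::typeof(+), a::AbstractArray, ::NoBias) = a
Broadcast.broadcasted(::typeof(+), ::NoBias, a::AbstractArray) = a

# A layer body like σ.(W * x .+ b) then works unchanged whether b is a vector or NoBias().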
Just an addendum to the "there is no bias" story from above. I left the implementations of
If we rename this version of
#873 tried a bunch of these approaches, and it was expressed that
Sure, I guess it's a trade-off: pretending to be an AbstractArray saves effort in some places, but adds it elsewhere. For most layers I would guess it's enough just not to constrain the type of the bias, and let broadcasting sort it out. That seems pretty Julian. Only if you will do something more complicated (like perhaps Conv does) will you need to care (and probably enforce in the constructor that it's either
Here's a quick sketch to see how much works in the simplest
using Flux, Zygote, BenchmarkTools  # BenchmarkTools provides the @btime macro used below
struct Dense2{T,S,F}
W::T
b::S
f::F
end
Flux.@functor Dense2
(l::Dense2)(x) = (l.f).(l.W * x .+ l.b)
Dcon = Dense2(rand(2,3), randn(2), tanh)
Dsin = Dense2(rand(2,3), false, tanh)
Dcon(rand(3))
Dsin(rand(3)) # ok
params(Dcon) # two parameter arrays, ignores tanh
params(Dsin) # one array
params(Dense2(rand(2,3), fill(10.0), tanh)) # trainable scalar as a 0-array
par1, re1 = Flux.destructure(Dcon); length(par1) == 8
par2, re2 = Flux.destructure(Dsin); length(par2) == 6
re2((1:6)/100)([1,2,3])
re1(vcat(1:6, 0,0)/100)([1,2,3]) # same output, but has trainable bias
fmap(f32, Dcon)
fmap(f32, Dsin) # ok, retains false & tanh
Flux.loadparams!(Dcon, [ones(2,3), ones(2)])
Flux.loadparams!(Dsin, [ones(2,3)])
Flux.loadparams!(Dsin, [ones(2,3), ones(2)]) # no error, because it just calls zip
Broadcast.broadcasted(::typeof(+), a::AbstractArray{T}, b::Bool) where {T<:AbstractFloat} =
b===false ? a : Broadcast.broadcasted(+, a, one(T)) # many BLAS things have a branch like this, low cost:
@btime sum(r .+ b) setup=(r=randn(10,10); b=0.0) # 104.829 ns (1 allocation: 896 bytes)
@btime sum(r .+ b) setup=(r=randn(10,10); b=false) # 13.044 ns (0 allocations: 0 bytes)
Zygote.@adjoint Broadcast.broadcasted(::typeof(+), a::AbstractArray{<:Number}, b::Bool) =
(a .+ b), da -> (nothing, da, nothing) # elsewhere Zygote does assume Bools can't have gradients
gradient((x,b) -> sum(x .+ b), rand(2), rand(2))[2] == [1,1] # actually Fill
gradient((x,b) -> sum(x .+ b), rand(2), 0.0)[2] == 2
gradient((x,b) -> sum(x .+ b), rand(2), false)[2] === nothing
@btime gradient((x,b) -> sum(x .+ b), r, 0.0) setup=(r=randn(100,100)); # 6.142 μs (2 allocations: 78.20 KiB)
@btime gradient((x,b) -> sum(x .+ b), r, false) setup=(r=randn(100,100)); # 815.515 ns (0 allocations: 0 bytes)
For calling Conv layers, the change needed is super-simple. Is that the only place which
function (c::Conv)(x::AbstractArray)
cdims = DenseConvDims(x, c.weight; stride=c.stride, padding=c.pad, dilation=c.dilation)
b = c.bias isa AbstractVector ? reshape(c.bias, map(_->1, c.stride)..., :, 1) : c.bias
(c.σ).(conv(x, c.weight, cdims) .+ b)
end
But what advantage does using
I think the (in my eyes surprising) behaviour of
I also think there is something strange and undefined about the behaviour of destructure if
Pretty much everything done above can also be done for an own type and it will work the same, no?
I feel it's just simpler that there's no type you have to know about. It could all be done with some scalar Zero type, but it's just one more thing which might not be necessary... we already have
Agree that constructors which take a keyword argument should certainly not take
I guess my broadcast optimisation above is indeed piracy. If it's only done within
Can you explain a bit more what you mean about destructure? Is the concern that some number other than
Dtri = Dense2(rand(2,3), 3.0, tanh)
params(Dtri) # one array
par3, re3 = Flux.destructure(Dtri);
re3(ones(6)) # restores 3.0
Edit: here's a sketch of how activations could be applied: FluxML/NNlib.jl@master...mcabbott:activate
I see your point here. To me the focal point of this argument is that the type must have a name and that name is basically going to have to be a synonym for something which already exists (e.g. False, Nothing, Zero, None etc.) which in turn could make it a bit risky to export.
That works ofc from an API p.o.v., but then you'd end up having to remember to put this logic in every layer or else there will be (typically silent) unintended results. Maybe not the end of the world, but it also seems unnecessary given the rich set of options the language provides. One more (admittedly somewhat hyperbolic) argument for an own type is that it could provide a decent escape hatch through dispatch in a way which would require type piracy if a bool was used.
I'm also thinking that the Zygote case is still piracy as it will redefine what an adjoint is for a Bool. If some other library needed to make another definition there would be a clash, no? Not asking rhetorically btw, since the intricacies of things like this are not something I feel fully confident in.
I guess a fused broadcast will remove a lot of the cost from not specializing, assuming the cost is for allocating a new array. Is there a reason why it is not used today in Flux (I'm here assuming that fusing only happens when using the
I agree! I guess whatever point I was trying to make about destructure is not relevant if we agree that the only scalar value to be allowed for bias is the one which puts the bias out of play (i.e. 0, false, nothing, FluxsOwnNothing). It might not even have been relevant even if non-zero scalars were allowed, so let's not dwell on it.
@DhairyaLGandhi I like the array style from some kind of conceptual purity perspective for sure. Is there a way to signal that an array is immutable so that it could be made clear that params are all arrays which are mutable?
Anyways, I don't mean to sound dug in here. As a user I'm happy with whatever solution solves the issues in the end.
It's my understanding that Zygote regards most integers as being real numbers that happen to be human-friendly round numbers, but
Julia normally fuses all broadcasts, independent of whether
But I would say it should be OK to test
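For concreteness, a sketch of such a check inside the toy Dense2 defined above, with `false` standing for "no bias" (not code from Flux itself):
(l::Dense2)(x) = l.b === false ? (l.f).(l.W * x) : (l.f).(l.W * x .+ l.b)  # skip the add entirely when there is no bias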
Would be good perhaps to find examples of layers defined elsewhere, to see how big a deal this will be. If my custom layer never had a way to turn off bias then it doesn't care. If I want to add one, is it much more complicated than this?
Dense2(in::Int, out::Int, act::Function=identity; bias::Bool=true, init::Function=randn) =
Dense2(init(out,in), bias ? init(out) : false, act)
If we keep the Zeros type, whether an array or not, I think we should not expose it to users but use a boolean flag as suggested above. This way we will be able to change the implementation without further breakage. As for the
Broadcast.broadcasted(::typeof(+), a::AbstractArray{T}, b::Bool) where {T<:AbstractFloat} =
b===false ? a : Broadcast.broadcasted(+, a, one(T)) # many BLAS things have a branch like this, low cost:
If I understand @mcabbott's comment, the piracy could be avoided by just specializing the Zygote adjoint:
Zygote.@adjoint function Broadcast.broadcasted(::typeof(+), a::AbstractArray{<:Number}, b::Bool)
  y = b === false ? a : a .+ b
  y, da -> (nothing, da, nothing)
end
This is good for the training phase, but we lose in performance at inference time. Maybe having a slim type as in this PR is ok if we want to avoid this and other possible pitfalls of the
Yes, I wasn't thinking clearly initially about piracy, that isn't safe at all. But the broadcast adjoint you write ought to be fine. I think the cost of not removing it during inference is really tiny:
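A sketch of how that cost could be checked, with assumed sizes and no claimed timings:
using BenchmarkTools
W, x, b = randn(100,100), randn(100), randn(100)
@btime tanh.($W * $x .+ $b);     # trainable bias vector
@btime tanh.($W * $x .+ false);  # bias disabled via `false`, still one fused broadcast
@btime tanh.($W * $x);           # no bias term at all, for comparison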
It might make sense to replace these broadcasts with
Keeping the Zeros type would be good, since it's fairly low overhead in terms of API, and retroactively also adds the functionality to most custom layers outside of the standard Flux layers. I agree that we need the adjoint for +/-, and having the Zeros type also allows for neat dispatching.
Cool, are we somewhat converged then? Should the name change? There are a bunch of tests added in
Have you got a link to some such layers? I'm not sure I follow how needing to allow a special type is more flexible than allowing Base types. Maybe you can say which variant of Zeros you mean here?
Here too, do you mean the Zeros <: AbstractArray (whose overhead includes things like Adapt.jl) or Zeros <: Any (of this PR)?
Could you clarify whether my understanding is wrong, but would this proposal involve that a layer
Also, although these are questions of personal preference, given that common frameworks such as Pytorch or Mxnet all use bias=false as a flag to indicate whether or not a bias is applied, I feel using a different convention is more likely to add than remove potential interpretation conflicts.
@jeremiedb : This is fully correct. This is the same as how it is done on current master btw, as I didn't change the name, just the implementation. In other words, this PR just fixes the existing implementation, namely #1332 and #1277. The original PR which added
I also agree that a name change to make it clearer that it means "there is no bias parameter" would be helpful and perhaps worth the breaking change. Do you like any of the options mentioned in the first post?
Good point about
Would the change have to be breaking? If you leave behind a dummy function like
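For example, something along these lines (just a guess at the dummy-function idea, assuming the Zeros type itself were dropped in favour of `false`):
function Zeros(args...)
    Base.depwarn("`bias=Zeros()` is deprecated, use `bias=false` instead", :Zeros)
    return false
end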
@mcabbott : I'm thinking that a breaking change maybe isn't so bad and it might be better to just rip off the bandaid, especially since there are issues with the functionality already. It seems there are many in favour of
The main drawback imo is that maintainers need to stay alert in code reviews so that new layers implement handling for it. This is however to some extent the same for both current master's
While it is often good to stick to conventions, I hardly think that older frameworks' tendency to have about 10 (sometimes mutually exclusive) arguments for creating a layer, and about ten times as much if-else-ing in both the constructor and in the actual forward pass code, is something to strive for. What got me hooked on Flux in the first place was definitely the absence of all that.
Just to clarify, I think I'm still a bit puzzled by the approach taken with the Zeros. Is it normal that currently on master, the utility builder wouldn't be working? For example
I agree that conforming to the existing framework shouldn't be taken as a guideline. Nonetheless, I'm wondering if relying on type unions involving custom types just to handle whether a bias is included doesn't risk opening a can of worms. Thinking of it from a plumber's perspective, what downside would come from handling the no_bias as distinct structs? For example, a DenseNoBias:
struct Dense{F,S <: AbstractArray,T <: AbstractArray}
W::S
b::T
σ::F
end
struct DenseNoBias{F,S <: AbstractArray}
W::S
σ::F
end
Dense(W, b) = Dense(W, b, identity)
Dense(W) = DenseNoBias(W, identity)
function Dense(in::Integer, out::Integer, σ=identity;
               bias=true, initW=glorot_uniform, initb=zeros)
  if bias
    return Dense(initW(out, in), initb(out), σ)
  else
    return DenseNoBias(initW(out, in), σ)
  end
end
@functor Dense
@functor DenseNoBias
function (a::Dense)(x::AbstractArray)
W, b, σ = a.W, a.b, a.σ
σ.(W * x .+ b)
end
function (a::DenseNoBias)(x::AbstractArray)
W, σ = a.W, a.σ
σ.(W * x)
end
My only pending concern would be about the behavior for Dense, for which I'm not sure how the constructor of the layer should be modified so as not to create confusion by specifying
It is normal unfortunately and one of the issues fixed in this PR. Zygote defines adjoints for broadcasting with numerical arrays, so even if the
It might, but the attempt at non-unions is basically what's in #1374. I think other frameworks leave the bias as
The distinct structs approach sure is clean cut, but it forces duplication. Coming from an OO background, the thought of something like
struct Biased{T, B<:AbstractArray}
layer::T
bias::B
end
(l::Biased)(x) = l.layer(x) .+ l.bias
has certainly struck me, but I reckon it's impractical for many reasons, specialized CuDNN operations being one and API smoothness being another.
Not sure I follow the concern here. Is this related to the distinct structs approach?
Would have been a clever way to handle it! It reminds me a bit of the SkipConnection approach.
Are you referring for example to:
Sorry for the confusion, I was referring to the current PR status where the
Isn't the problem with
Having
Yes, absolutely. Although I think it won't be so hard to juggle things to make all previous-version user-level syntax give a friendly warning (at construction), at least for a while. This isn't (IMO) so important that it should prevent us from ending up in the right place, but one of the big criticisms of Flux is that the model zoo is permanently broken, etc., so if possible it would be nice to print some warnings. (That said,
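To illustrate, a construction-time warning could look roughly like this, reusing the toy Dense2 from earlier in the thread (the marker type name here is invented for the sketch):
struct OldZeros end  # stand-in for the previous bias marker, name made up

function Dense2(W::AbstractMatrix, ::OldZeros, f = identity)
    @warn "passing a Zeros-style marker as the bias is deprecated, use `bias=false` instead" maxlog=1
    return Dense2(W, false, f)
end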
It would indeed make it so that both bias and activation would need to be removed from every layer, and activation would be something added separately, just as one today can do
While it has a nice DRY- and SRP-ish touch to it, as I said, it seems far from practical. I can imagine there are more than a handful of "let's keep everything clean this time" frameworks out there which no one uses that take this approach. Maybe in some distant future, when something XLA-ish just optimizes whatever code you write for whatever compute resources you have available, it could turn out to be useful.
It would in this case be a potential new name for the type which in this PR is called
I agree with this. I'm thinking about trying to keep PRs small, so perhaps we just focus on fixing the issues here and then do another round to smooth out the rough edges around the API. There obviously needs to be some kind of direction and I'm not opposing adding the
so let's just switch to the
I think I'm all set as long as bors is happy.
nice, thanks!
bors r+
Build succeeded:
I'm not completely in favour of this API change here; we haven't particularly added conceptual clarity over the Boolean flag. And given that this restricts functionality, we need to be able to discuss bringing the generic function back.
@DhairyaLGandhi : Fwiw,
Using this with mismatched shapes and saving and loading with parameters would cause the same issues, with fewer ways of justifying them too. Having the necessary constructors and adjoints is good and, well, necessary; I am trying to understand how this proposal supersedes the previous one design-wise.
Really, the outer user-facing API being decoupled from the implementation is something we haven't done much of, but it is common enough in development that I don't have strong opinions on it in this case, as long as it keeps bloat manageable while giving users the option to work with the abstractions they prefer or that are better suited for the work at hand.
I got half-way with a PR re-coupling the internals to the API, but got distracted by tests & docstrings. But I noticed the following error --- this works on Flux v0.11.2 but not on master, so I presume but have not proven that this PR is to blame:
@DhairyaLGandhi can you clarify what functionality is restricted? And what do you mean by "the generic function" here? I'm not sure I follow at all. I think this PR is a step in the right direction, towards a simpler API and a more maintainable pile of objects. Especially around "saving and loading with parameters", which seemed to be one of the places the fake-array Zeros caused problems.
This design assumes no checks on making sure the operations it participates in are valid, which can open issues down the line. "The generic function" might be ambiguous; I was referring to the manner in which Zeros was implemented before merging this PR. Taking the input of
This PR also doesn't catch the issue with loading model parameters with Zeros in them. This was an error in the previous implementation as well, but can be worked around by making the struct mutable, which isn't much of a problem with the design of the previous Zeros so to speak, since adding the keyword in there is an orthogonal change, but it has the advantage of holding the shape information so the model is actually reproducible once it is loaded.
As mentioned, a lighter API is desirable to me as well, but it should not sacrifice correctness, which would be a bigger concern to me.
Thanks for expanding.
Can you point out some checks you'd like to see? Or point out what input is allowed by the current state, which gave a friendly error on construction before? We could always catch more things, especially if they are mistakes people are apt to make. (Often these are easier for someone besides the author to see.)
Do you mean loading a model saved from an earlier version? That might not be trivial to support, but if it is meant to be, we should start by saving some test cases somewhere. Or are there particular cases of
Which
Are there cases where the length of the bias vector cannot be inferred from the shape of the weight matrix? The constructors all seem to have no problem doing this. Or what other shape information are you concerned about reproducing?
FWIW, this is how far I got with a PR storing
Here
The shape checking and broadcasting semantics that Julia expects to have the correct result when a mismatch does take place. This also requires being able to reshape consistently.
Typically yes, but also equally importantly, consistently representing the information in the stored model such that it can be manifested back to the loaded model.
Well, it highly depends. Initially, my thought was that checking for a bool in the forward pass isn't too bad for a start, but the implications of it are documented in #873. The complexity that control flow adds to AD, and the special handling in every layer, are examples. The unions in the constructors and the forward passes were also not well received.
The
On loading and saving - yes, it does get slightly tricky, since we need to replace the
To be clear, the tests and adjoints from the PR are fine, and while I wish we could preserve the ease of loading with booleans, the tradeoff with correctness is what I'm trying to understand.
But if there is no bias, there is no shape, so it cannot be wrong. Having an array of zeros which doesn't get trained, or a fake-array which is always zero, is a way to save writing a separate
The only complexity is that, when
The forward pass isn't an issue; worst case we simply add
I don't know what bit of #873 you are thinking of. But we also know more now than we did then, about just how complicated this fake-array story ended up. That's what's driving us to simpler alternatives.
By whom? Before this PR, there was
Not sure what you mean by unions in the forward passes.
OK, thanks for clarifying. But it wasn't mutable before this PR, nor in the alternative one. Is this an approach you were thinking about exploring, or another PR somewhere else? This still seems like loading on ever more complication to compensate for other complication.
But this is exactly what Booleans shine at. The loading is trivially handled. It is preserving this with a complicated
Whether you have a FillArray, a custom version, a scalar Zero, false, or
I don't follow, I thought
Is your concern here about whether parameters from
If this is a requirement, then it would be great to start setting up some kind of test which does this, and which the tagged version of Flux passes. Then we can build against that. But if not, then it doesn't matter, and shouldn't influence our design.
Thank you for this! I think it's important to make sure the argument about correctness is not lost.
It's not about there not being a bias to shape-check; it's about the model passing without the bias where it wouldn't have passed had the bias array existed, or vice versa. Both cases should behave identically.
I don't think we are considering the #873 (comment)
Correct, but how would you handle loading an existing
Regardless, it's a known and simple fix to an issue. The concern is loading the models with multiple layers which have their bias off. And again, while a Boolean would make it easier, it comes at the cost of special casing. It might have only needed a special reshape in Conv, but extending it for other cases is equally easy in both designs.
These methods here: Line 30 in 8b53033
and: Line 111 in 8b53033
Yes, those were, but now
We can't assume that biases are a subset of arrays, which Zeros should be.
The way I see it, the biggest complaints were:
I don't mean to come off as rigid, because there's a LOT of good in this PR, and my thanks and kudos to @DrChainsaw and yourself for validating the motivations behind it, but I also feel it necessary to address the cases which are important to cover.
@DhairyaLGandhi No worries from my side, this is how good software is built. I hope I don't come off as rigid either. Just trying to help out here. :) To me, one of the attractions of open source is that one (at least I, who am not on a payroll for this task) can afford to dwell on decisions like this.
I guess this also requires one to prevent the loading functionality from trying to mutate the Zeros. The use case of trying to just copy over the parameters from one model with biased layers to another which doesn't have biases is probably going to be a bit ambiguous no matter what approach one takes. They are in essence different models imo.
Not sure how relevant, but in the case of
The only use case I could think of for that method is the adjoint method for backprop through an ODE solver from the paper which popularized neural ODEs. For that use case I think it is preferred not to return non-trainable parameters. I don't know if there will be any harm done if one returns zeros and then ignores whatever values come back though (except wasted computation).
Not sure if worth mentioning, but I guess that a fixed AbstractArray type with the correct eltype and size does still fit in the type constraints, or?
I reckon not returning the Zeros would be preferable for NODEs, although I'd check with Chris Rackauckas first to be sure.
Not sure what you're pointing at here?
Fixes #1332, #1277
This is take two on fixing the above issues and is intended to be mutually exclusive to #1374.
In this version Zeros is no longer an AbstractArray and is thus more like a there-is-no-such-parameter marker instead of a this-parameter-is-an-immutable-array-of-zeros type.
I made the change so that the advertised way of disabling bias is through bias=false rather than bias=Zeros(). Zeros itself is a lot less capable here compared to the previous attempt, but that also keeps the number of moving parts down, i.e. no need to test that both the 0-dim version and the full size version work the same. All in all there are fewer specializations compared to #1374.
I think that a rename could definitely be in place. Better names I can think of are bias=Off, bias=None or, as suggested by @mcabbott, bias=False().
The best argument against renaming I can think of is to be a little bit less breaking.
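As a rough illustration of the intended front end (a sketch with an assumed helper name, not the exact code in this PR), the keyword is mapped to either a trainable vector or the marker at construction time:
create_bias(bias::Bool, dims...) = bias ? zeros(Float32, dims...) : Zeros()  # hypothetical helper
create_bias(bias::AbstractArray, dims...) = bias  # a user-supplied bias array is used as-is

# e.g. inside a constructor: Dense(W, create_bias(bias, size(W, 1)), σ)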
PR Checklist
@dhairyagandhi96 (for API changes).