-
-
Notifications
You must be signed in to change notification settings - Fork 612
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow Zeros with dimensions #1402
Conversation
If we are doing this, why don't we just use FillArrays.jl? |
answer: because these also don't have any gradients An alterantive is to use |
Or just The current implementation is more or less In layers with bias, the bias is an array, which has shape, and this has to be preserved, etc. But in layers without bias, the absence of bias doesn't have to be "consistent with the behaviour of the type it replaces" because it's, well, not there. I mean, I don't have a dog. It doesn't have a name, and it also doesn't behave like other dogs when taken for a walk... |
Fwiw, the comment by @mcabbott about "absence of bias" is what made me prefer the current implementation over #1374. I think the code in the PR looks like an improvement over #1374, but it seems like a couple of tests fail due to getindex not being supported. It also looked as if a gradient slipped through somewhere.
Is it possible to get and ELI5 version of this? I found it easy to reason about what |
But changing the behaviour because of "absence of bias" may make it diverge from what is expected from the model. The expectation of what the model is best suited to do is left to the user. Basically saying that not training on the bias is not the same as not having the bias term in some modeling situations. Granted - no bias, no need to compute anything, but in cases where the shape data is necessary, that might not be so trivial without also defining the implementation of the layer without the bias. With this, we can retain the fast paths of pretending that the bias doesn't exist, while giving the option of always holding it to zeros - ie it exists but it's never trained on. Consider a contrived version of a layer function layer(w, b, x)
y = w .* x
y .+ b
end Here the shape of the output depends on the shape of the bias. W = rand(3,3,2)
x = rand(3,3,2) # last dim is batch
layer(W, Flux.Zeros(), x) # mirrors current implementation
size(layer(W, Flux.Zeros(3,3,2,2,2), x )) == size(layer(W, zeros(3,3,2,2,2), x )) This way, we can have the flexibility of choosing how Zeros behaves - both as a traditional bias (with one value per neuron) or made to respect the implementation of a layer. |
I agree it's an important property of broadcasting that it extends dimensions based on all arguments. And it's neat that you can use arbitrary broadcasting in functions you intend to use with Flux, and arbitrary array types, including But the problem we are trying to solve here is much narrower. Some common ML layers exist in a form like For reasons of code re-use & shortness of the manual, we'd like to encode both situations as one layer type. They are, however, different models: they have different numbers of trainable parameters, either counting scalars ( There is no general expectation that pre-trained parameters saved from one model should make sense when loaded into a different model. I mean you are welcome to try, it's all just numbers, but the default behaviour of calling
Sure, if you want a model with bias, but want to exclude it from training (or from some stages of training), that is an orthogonal concern. There are lots of reasons you may wish to freeze some of the parameters present in your model, and we have tools for that. |
@mcabbott maybe you should turn the work you have done on the |
It is not about the narrow scale of the problem per se, and that is what I implied by the traditional use of bias (apologies if I wasn't clear earlier). It is about having the flexibility of treating it as if bias doesn't exist, or treating it as a fixed array or a more generally as a transformation based off of broadcasting. In this approach, we can have both basically, and that makes it strictly superior to making an assumption of whether the bias term will always act in a particular fashion. Notice also in the example above, it isn't a traditional Dense layer we are considering - it is a learnable transformation.
Agree! Giving the users the ability to choose between fixed array vs not existing in a single model seems to be in line with that. To be clear - calling
That is correct, and I don't think this approach changes that in any way. Loading into a different model with incompatible parameters structure would error, however, having multiple layers with bias turned off would not. (ref #1332 and #1277) c = Conv((3,3), 1=>3, bias = Flux.Zeros())
ps = Flux.params(c)
c2 = Conv((3,3), 1=>3)
Flux.loadparams!(c2, ps) # errors as expected
c3 = Conv((3,3), 1 => 3, bias = false)
Flux.loadparams!(c3, ps) # works I haven't exactly seen anything against this approach, which is encouraging. |
I wouldn't mind having this flexibility as a user although I can see how maintainers might consider it a non-necessary requirement.
Just a question for my own understanding: Aren't there alot of things which would be required to load/save a model which are not included in I somehow have the mental model that users can rely on things returned in As an example, won't returning them in |
Maybe we are tripping over terminology here? Bias is a constant shift applied before the nonlinearity. Constant meaning it's independent of which input image etc, of the batch index, but may vary per pixel. It shifts the kink in Common layers with & without trainable bias are often used, and make sense for Flux to support. Ideally in a simple, easy-to-understand, easy-to-maintain way. Flux is not interested in elaborate contortions for 1% performance changes. The example above with In some models, people may wish to replace some other array of trainable parameters with a constant array of zeros. Not just freeze them during some training, but fix permanently to a constant. Such a constant array can often be avoided, as broadcasting tends to extend a single scalar just fine, but when it is needed (e.g. because it affects the output shape), there's a well-tested & maintained package called FillArrays which provides such a thing (and is already loaded by Zygote). If this has particular bugs, then we should fix them. However, the claim is that this is not acceptable, or too inconvenient, because switching back and forth between the trainable array and the constant one is something people are going to do all the time. (Perhaps so often, even, that you wish to make it easy to load parameters from the model with the trainable array, into the model without it, and make all the other parameters still line up? I'm not sure.) Is this a common use case? If so, it should be easy to find some links to papers using such things; there are of course many papers I haven't read. If it is common, then we can discuss how best to support it, whether the constant is always zero, and whether it should share any code with the handling of bias, etc. We're much more likely to get the design right with some actual real-life examples to look at. Or if it's not, then we can move on. |
I can see the constant advocation for FillArrays which has been taken and discussed plenty, and not that it merits repeating, but Using FillArrays used to be my go to after I can see the fundamental disagreement using Zeros to represent a bias at all, and that's alright. I happen to think that this approach (in the PR) would be able to replicate most if not all of what we get from a Bool/ Singleton based approach and still be general enough for extending in the future.
This is false, and has been said so in the past. Not being able to julia> d1 = Dense(1,3, bias = false)
Dense(1, 3)
julia> d2 = Dense(1,3)
Dense(1, 3)
julia> d1.b
Flux.Zeros()
julia> d2.b
3-element Array{Float32,1}:
0.0
0.0
0.0
julia> Flux.loadparams!(d1, Flux.params(d2)) # should error -> can't (shouldn't) load array into Zeros/ loaded model doesn't contain the parameters from `d2`
julia> d1.b
Flux.Zeros()
julia> d2.b
3-element Array{Float32,1}:
0.0
0.0
0.0 This should clearly error, since you are either not indeed loading parameters properly ( @ToucheSir good point! Models with |
I assume you meant to tag @DrChainsaw, but I'm happy to weigh in as well :) Documenting the "user-space" API for biasRegardless of what approach we use to represent bias under the hood, I think it's good to step back and consider how users are going to learn about this feature (especially since intuitiveness/ease of use was a major point in starting this discussion). Currently, there are exactly 0 references to a Maybe it's not bias that's the problem, but
|
@DrChainsaw yes, indeed Thanks! |
I think this has the tagged version's docstrings. The ones for Edit -- actually only the first method is shown in the docs, the docstring for The Dense docstring could use some polish, there are PRs in the works. What do you mean re PyTorch etc? They use a similar
It just iterates all parameters, of any treelike structure. I haven't followed all those issues, and perhaps something fancier is needed there. The bug of #1277 was precisely about an IDDict storing just one copy of the earlier Edit -- If we think about designing other Params-like containers, they will always have to deal with struct fields which aren't parameters, like |
Yes, I agree. When last I looked there was a
OK, thank you, that bit is clearer now.
Seems OK to use a package to solve some problem for exotic models, breaking new ground. If they aren't so exotic, then let's gather some examples to guide our thinking. (Of models with a constant array which affects the size of the output, to be clear.) And find out what, if anything, other frameworks do about such models, I guess. |
That's exactly what I was referring to. Great to see the layer reference getting some TLC.
Not the API, but the structure of the docs. Note the distinct section separators, field layout (in nice tables for TF and large bullets for PyTorch vs non-existent for Flux), prominent and consistent sections for in/output shape, etc. These are all more Documenter.jl issues though, just wanted to stick a pin in it for later.
I wasn't thinking about #1277, but in that context the fix makes sense. However, what if you wanted to The issues I linked at the end were about eliminating the need for
I agree with your assessment, but all these are equally true for the current |
OK, that's not crazy, but then #1408 / #1409 are too strict. Perhaps we take this there?
Sure. I just mean that absent bias seems like a solved problem (which we should be careful not to un-solve) rather than a driver of new designs. |
I hardly find it solved considering it has unsafe behaviour in a variety of scenarios as showed earlier. Even without the rest of the benefits a more sane and predictable behaviour set seems attractive enough to me |
AFAICT, #1407 (
My thoughts:
Apologies if I am jumping into this discussion with imperfect information and conclusions. Please correct me where I am wrong. |
Re simplicity - I have been a proponent, and this is pretty straightforward design. |
Since this topic has come up a couple times recently, I think a good way to get a sense of where everyone stands is to run a quick poll. The choices are as follows:
To start, I've marked out my preferences for all the scenarios I could feasibly think of us running into below:
Alternatively, if we wanted to have a "strict" mode like
|
With respect to using |
This table might be better under #1408 perhaps? Doesn't seem tied to the idea of this PR. |
I included it here because not everyone participating in this thread has commented there, and because it appears to be a sticking point for arguments over how to represent "sometimes params" like bias in general. You could extend a similar argument for the |
OK. But I feel like there's a missing word between "Input" & "Output" -- you're discussing how to represent these, with (I think) an eye towards loading / copying / destructure operations. Can you say more precisely which such operations you have in mind? Still think it would be clearer to discuss such loading / re-structuring / missing thing somewhere else than this thread. |
The reason for having something specific to the Flux case is that fill arrays (like other specialised arrays) are not something who's behavior flux/ zygote should own. Another factor is breaking pretty crucial behaviour such as julia> f = FillArrays.Ones(3,3)
3×3 Ones{Float64}
julia> gradient(f) do x
# do something memory intensive
# that can avoid materializing arrays
sum(x .^ 2)
end
(3×3 Fill{Float64}: entries equal to 2.0,) |
Ah, I think I understand now. I expected gradient((m, x) -> sum(m(x)), Dense(10, 2, bias=FillArrays.Zeros(2)), rand(Float32, 10, 2))[1].bias == nothing To be true OOTB, but it isn't. @mcabbott I was purely focused on (theoretically smarter versions of) |
There's a whole discussion of better AD for sparse & structured arrays, which includes FillArrays. This example is much like |
Of course, but those aren't used as model parameter values by default. Currently users must explicitly choose to use sparse and structured arrays with Flux, whereas anyone who calls |
Whatever happened to just using |
Yes, for For the broader question of what happens if your model may have structured arrays, one answer for how to fill the table above is comes from broadcasting. Maybe
Filling that in on the table, 2/3 depend on the values:
The plan on the AD side, BTW, is to always project gradients down to the number of free, but not necessarily mutable, parameters. Thus the cotangent associated with |
This pr has is needed anyway besides and already handles some of the intended |
This is precisely why I mentioned |
To me, the loadparams behavior is a benefit of this PR, but not the main driver. I wouldn't block this PR over loadparams. |
No intentions to block this PR, the changes themselves seem pretty uncontroversial. My point was that if that 80%+ of the purpose of Zeros (over, say, dispatch or an explicit branch) is to placate an underpowered |
Well we already gain a bunch of consistency in this PR, so I think it's going in the direction of making loadparams more consistent too. |
Not really. This PR lacks a motivation for adding back quite a lot of complexity. No examples of ML models which need fixed arrays have been provided yet. And actual arguments against using the well-tested ones provided by an existing dependency, should someone want them, are also conspicuously absent.
|
Actually, we've had plenty of debate over these in this, and connected issues/PRs. We've also wanted means to keep gamma and such constant for normalisation layers. Plus it's usable in places outside of biases as well, much like an array. Even the current implementation defines the math ops, broadcasting and their gradients, so I am unsure if there's any real concerns there. We've also gone over the FillArrays case which I think you're referring to in your comment. Fast |
Yes, this thread has certainly gone around in circles a lot. On that everyone can agree. By "use case" I mean the arxiv numbers of papers that need this thing, and yes "outside of biases" since bias, by definition really, doesn't involve broadcasting which changes the shape. If "current implementation" means master, then yes, this is also needlessly complex. |
Yeah, these issues have become increasingly unwieldy to the point where I keep forgetting about previous context (e.g. the "adding back" part, I thought these changes were new). I also assumed that not treating
Can #1402 (comment) be solved without any code changes to Flux (such that using Zygote with and without Flux has consistent behaviour, for example)? For FillArrays at least, I feel like you'd have conflicts between the default behaviour of julia> gradient((x, y) -> sum(x .+ y), Zeros(10), Ones(10))
(10-element Fill{Float64}: entries equal to 1.0, 10-element Fill{Float64}: entries equal to 1.0) And julia> gradient((x, y) -> sum(x .+ y), Flux.Zeros(10), Ones(10))
(nothing, 10-element Fill{Float64}: entries equal to 1.0) Maybe there's a way to reconcile the two, but I haven't been able to think of one that doesn't introduce surprising behaviour. |
There's also a perfectly valid API compatible Arxiv numbers aren't really relevant here. The goal is to generalise to existing notion of how layers are written and to have more flexibility for cases that may not conform to the current state. (We can extend this such that we don't need to store gamma if it wouldn't be trained on in norm layers for example). Re If you would be happy with an example of practicality, |
With apologies to @mcabbott and @DrChainsaw :)
This PR relates to the changes in #1379 and moves some important things about. I feel uncomfortable to allow something that naturally translates as a vector or an array over to a special type without being consistent with the behaviour of the type it replaces (ie an array).
So the contentions behind the
<: AbstractArray
were that it complicated the implementation and didn't work well in some cases of moving to and from CUDA.Turns out it doesn't add too much in terms of complexity either.
Some improvements from the current implementation:
loadparams!
as expected while preserving the dims of the various arguments (can add tests)What is currently doesn't do:
cc @mcabbott @DrChainsaw for comments
PR Checklist
@dhairyagandhi96
(for API changes).