Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gradient of sum #900

Closed
cossio opened this issue Feb 16, 2021 · 16 comments
Closed

Gradient of sum #900

cossio opened this issue Feb 16, 2021 · 16 comments

Comments

@cossio
Copy link
Contributor

cossio commented Feb 16, 2021

T = randn(10,10)
ps = Flux.params(T)
opt=ADAM()
for iter = 1:100
    gs = gradient(ps) do
        sum(T)
    end
    Flux.update!(opt, ps, gs)
end

Throws the following error:

ArgumentError: Cannot setindex! to 0.0009999999900000003 for an AbstractFill with value 1.0.

Stacktrace:
[1] setindex! at /home/cossio/.julia/packages/FillArrays/tE9Xq/src/FillArrays.jl:41 [inlined]
[2] _setindex! at ./abstractarray.jl:1176 [inlined]
[3] setindex! at ./abstractarray.jl:1153 [inlined]
[4] macro expansion at ./broadcast.jl:932 [inlined]
[5] macro expansion at ./simdloop.jl:77 [inlined]
[6] copyto! at ./broadcast.jl:931 [inlined]
[7] copyto! at ./broadcast.jl:886 [inlined]
[8] materialize! at ./broadcast.jl:848 [inlined]
[9] materialize!(::FillArrays.Fill{Float64,2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(/),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(/),Tuple{Array{Float64,2},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(-),Tuple{Int64,Float64}}}},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(+),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(sqrt),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(/),Tuple{Array{Float64,2},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(-),Tuple{Int64,Float64}}}}}},Float64}}}},Float64}}) at ./broadcast.jl:845
[10] apply!(::ADAM, ::Array{Float64,2}, ::FillArrays.Fill{Float64,2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}}) at /home/cossio/.julia/packages/Flux/05b38/src/optimise/optimisers.jl:177
[11] update!(::ADAM, ::Array{Float64,2}, ::FillArrays.Fill{Float64,2,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}}) at /home/cossio/.julia/packages/Flux/05b38/src/optimise/train.jl:23
[12] update!(::ADAM, ::Params, ::Zygote.Grads) at /home/cossio/.julia/packages/Flux/05b38/src/optimise/train.jl:29
[13] top-level scope at In[89]:8
[14] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091

@cossio
Copy link
Contributor Author

cossio commented Feb 16, 2021

This is really an issue with Flux, since the gradient is fine.
FluxML/Flux.jl#1510

@cossio cossio closed this as completed Feb 16, 2021
@GiggleLiu
Copy link
Contributor

GiggleLiu commented Feb 18, 2021

I think it is an issue of Zygote, it returns FillArray as the gradient of sum. That is very unfortunate, a lot people around me bump to this bug. https://github.com/FluxML/Zygote.jl/pull/191/files

@cossio
Copy link
Contributor Author

cossio commented Feb 18, 2021

But I think it's fine to say that the gradient of sum is a FillArray. It's a correct answer that is more efficient in most cases.

@willtebbutt
Copy link
Member

I think it is an issue of Zygote, it returns FillArray as the gradient of sum. That is very unfortunate, a lot people around me bump to this bug. https://github.com/FluxML/Zygote.jl/pull/191/files

To my mind, this is perfectly acceptable behaviour -- it's similar in flavour to returning nothing if the gradient w.r.t. an Array input to some function is always 0. Could you explain why you feel that this is a bug?

@GiggleLiu
Copy link
Contributor

GiggleLiu commented Feb 18, 2021

Could you explain why you feel that this is a bug?

When a user writes backward rules, he always assume the adjoint of the output has the same type as the output. Simply because it is impossible to handle various adjoint types.

Here if you have a function loss = sum(f(x)), where the function f(x) returns an Array y. If we assume adjoint_y type is Array or Nothing. Then he might use setindex! in his program and get an error. This kind of bug can not be detected with unit tests.

@cossio 's issue might be related to Flux rather than chain rule. I just want to point out that random output type is a very important source of uncertainties in one's Zygote program.

@willtebbutt
Copy link
Member

When a user writes backward rules, he always assume the adjoint of the output has the same type as the output. Simply because it is impossible to handle various adjoint types.

Right -- I think this is the crux of the problem. Users should not be making this assumption because Zygote has never promised to provide a gradient of the same type as the argument, nor could it feasibly do so. Probably the documentation should be changed to reflect this.

For example, general structs have NamedTuple gradients (hopefully ChainRules's Composites at some point in the future instead) just because you can't represent the gradient of a function w.r.t. some arbitrary struct with another struct of the same type. It's not even always the case that AbstractArrays can have "sensible" gradient representations that are of the same type as the argument -- Fills are actually a good example here, as representing the gradient of a function w.r.t. a Fill turns out to have slightly strange consequences.

Perhaps we should consider some additional functionality to "canonicalise" representations. We've actually been discussing implementing that kind of functionality for ChainRules to make rule-implementers lives more straightforward -- see here. We might be able to optionally incorporate something like this in Zygote.

@GiggleLiu
Copy link
Contributor

GiggleLiu commented Feb 18, 2021

It is good to know someone is taking it seriously. I can see it is hard to unify the adjoint, like the Tranpose{T,Vector{T}} output might have a regular vector as adjoint, also those with the NamedTuple as returns.

I am also thinking how much performance loss if we force the adjoint to have the same type as the output. Julia language used to return normal array type if you call zero on another type (see issue), but now it also changes the behavior

julia> zero([1,2,3]')
1×3 adjoint(::Vector{Int64}) with eltype Int64:
 0  0  0

For composite types, I can imagine it requires users' effort to wrap it with the right type. Wondering if there is a fundamental reason for not doing so?

Also, I am interested to know if there is a benchmark to show returning the same type sacrifises the performance a lot. I am just curious the cons and pros for doing so, without strong bias toward one over another.

@willtebbutt
Copy link
Member

I am also thinking how much performance loss if we force the adjoint to have the same type as the output.

Are you specifically referring to AbstractArrays here, or more general types?

Also, I am interested to know if there is a benchmark to show returning the same type sacrifises the performance a lot. I am just curious the cons and pros for doing so, without strong bias toward one over another.

I'm not aware of any particular benchmarks -- would be interesting to see some.

@GiggleLiu
Copy link
Contributor

Are you specifically referring to AbstractArrays here, or more general types?

I mean more general types.

@willtebbutt
Copy link
Member

willtebbutt commented Feb 24, 2021

I mean more general types.

Okay. For more general types this isn't a matter of performance, it's about achieving the correct / desirable behaviour. There are a couple of reasons why you can't use a type to represent its own tangent in general. Consider an arbitrary struct Foo.

Firstly, as an AD package author, you're not allowed to define +(::Foo, ::Foo) or *(::Real, ::Foo), because that would be type-piracy. However, we need (co)tangents to form a vector space, so we need to be able to define those operations.

Secondly, it's not always the case that the set of points that a given type can represent is the same as its tangent space. For example, suppose that

struct Foo
    x::Float64
    lb::Float64
    ub::Float64
    function Foo(x, lb, ub)
        if x < lb || x > ub
            throw(error("x must be between lb and ub"))
        end
        return new(x, lb, ub)
    end
end

I've used an inner construct to constrain the value of x to be somewhere between lb and ub, so you're now allowed to construct a Foo that violates this assumption. So x ∈ [lb, ub], but the derivative of some function w.r.t. x needn't satisfy this constraint, since the tangent space of x is the entire real line. So there are tangents of Foo that literally cannot be represented by any Foo, so you need to use something else.

Does this make sense?

Of course, there are "special cases" where neither of the above isn't true, such as Float64, and Vector{Float64}. This doesn't entirely pin down what you have to do of course, since there are quite a few special cases floating around. However, from the perspective of implementing an AD system that works all (most) of the time, it's really helpful if to have as few special cases as possible, and to have a simple set of rules for deriving appropriate (co)tangent types, which is why Composites are so helpful.

@GiggleLiu
Copy link
Contributor

GiggleLiu commented Feb 24, 2021

Thanks for you reply. Benefit a lot from your answer!
About the first point, can you explain " we need (co)tangents to form a vector space" a bit more? I do not get it.

In the second point, this example is excellent. I also have such issue in NiLang. This is because the constructor did some checks beyond type information. I think the ideal solution is to add something like new(type, fields....) to access the default constructor in Julia language. Right?

@willtebbutt
Copy link
Member

willtebbutt commented Feb 24, 2021

About the first point, can you explain " we need (co)tangents to form a vector space" a bit more? I do not get it.

Well, strictly speaking, you just need to be able to add tangents to make AD work i.e. accumulation on the reverse-pass in reverse-mode. Scaling is an added bonus really.

Granted, you could take Zygotes approach and define accum, but to me that just seems like a work-around for the problem, rather than a good solution.

In the second point, this example is excellent. I also have such issue in NiLang. This is because the constructor did some checks beyond type information. I think the ideal solution is to add something like new(type, fields....) to access the default constructor in Julia language.

I think you actually can do this with some existing hack that people disapprove of. @oxinabox you mentioned this to me a while ago I think.

I really don't favour that solution though, since it doesn't make sense to me semantically. For example, I really don't want to have objects floating around that claim to be constrained, but aren't actually constrained. This is why I prefer the Composite approach -- it's explicit in what it represents (e.g. a Composite{Foo} represents the (co)tangent of a Foo), doesn't require violating constraints that type authors have placed on their types, always make sense, and lets you use +, rather than having to define something like accum to avoid type piracy.

edit: in short, while I agree that it's probably technically possible (as in, you could make Julia do it) to represent the tangents / cotangents of any given type by another object of the same type, it feels like a hack. I would rather embrace the fact that the (co)tangents of a given type cannot in general by represented by elements of that type, and design AD systems with that in mind.

edit2: there's a separate question as to whether we ought to rename Composite to Tangent to make it more explicit. Tangent{Foo} would be a nice type to have floating around.

@oxinabox
Copy link
Member

I think you actually can do this with some existing hack that people disapprove of. @oxinabox you mentioned this to me a while ago I think.

Yeah, youcan do it, it is something like
@eval force_create(T, fields...) = Expr(:new, T, fields...)

I really don't favour that solution though, since it doesn't make sense to me semantically

Yeah, sometimes inner constructors are enforcing invarients.
Consider

struct UnitCirclePoint
    x
    y
    function CirclePoint(x,y)
        x^2 + y^2 == 1 || error()
        new(x, y)
    end
end

foo(c::UnitCirclePoint) = 2*c.x+3*x.y

the UnitCirclePoint maintains an invarient. It's fields actually occupy a subspace of R^2 that is only 1D (a nicer representation would just store the angle, but maybe author of this code has a reason not to)

Attempting to update c based on the derivative of foo is not well defined.
There are several options. Most clear is to error.
But you could compute the projection back on to the circle (via tan(dx, dy)).
I think there are also other options also.

@GiggleLiu
Copy link
Contributor

GiggleLiu commented Feb 24, 2021

Tangent{Foo}

This name is so much more intuitive! I suddenly understand what you meant previously. Now I tend to agree that the solution of using another type might be better. But I have a new question, why not stick to the rule that type T has a gradient of either T for vectors and scalars and Tangent{T} for composite types. In this way, the complexity of gradient type is still under control. In this way, we will not see and undexpected FillArray.

@oxinabox

Yeah, youcan do it, it is something like
@eval force_create(T, fields...) = Expr(:new, T, fields...)

I didn't know this. learnt a new way to hack Julia, cheers!

Thank both of you for detailed explaining.

Note: Composite is not a good name for discussion because it looks like a general composite type (most types are) rather than a specific type.

@willtebbutt
Copy link
Member

willtebbutt commented Feb 24, 2021

This name is so much more intuitive! I suddenly understand what you meant previously. Now I tend to agree that the solution of using another type might be better.

This is a good point. Have opened an issue :)

But I have a new question, why not stick to the rule that type T has a gradient of either T for vectors and scalars and Tangent{T} for composite types. In this way, the complexity of gradient type is still under control. In this way, we will not see and undexpected FillArray.

Good point. To my mind, the distinction here is performance.

Firstly, correctness: I think it's clear that it is, on some level, not incorrect to return a Fill instead of an Array if all of the elements of the tangent are indeed the same. By correctness I mean it in the sense that they both satisfy the AbstractArray interface (modulo the setindex bit), so would generally expect any function to get the same answer regardless which one you get. So there aren't basic correctness problems in the same way that there are in the situations discussed above.

This brings us to performance: my feeling is that there are likely notable performance benefits to allowing the (co)tangent of an Array to be represented by a Fill in some situations, so it would be a shame to preclude them, since constructing Fills are essentially free and Arrays require time and memory proportional to their size.

Code Complexity: We already allow things like Thunks and Zeros in addition to Arrays to represent the (co)tangents of Arrays -- this is where we've taken the view that the code-complexity vs performance tradeoff favours the additional code-complexity. So I think that the question is whether we think that allowing Fills also provides a favourable tradeoff.

Honestly, I'm torn regarding the usefulness of Fills to represent tangents. I don't think it's hard to concoct reasonable-looking programmes where you would expect to see some kind of performance gain by using a Fill rather than an Array to represent a (co)tangent, but I also don't recall ever encountering situations in the wild in which I found that it was essential to use a Fill rather than an Array to get performant code 🤷

All this being said, something like this functionality could allow us to have the best of both worlds. To be honest, I think that we probably need it anyway to ensure that rule-writers can get the type that they need to write their rule without having to think too much about all of the types that a particular tangent could possibly take, and once you've got something like this, it matters less if you've got a few different types that could represent a tangent.

@GiggleLiu
Copy link
Contributor

GiggleLiu commented Feb 24, 2021

Nice, thanks.

If forcing the grdient types being the same as the input is not a proper solution. There are other ways to helping users finding potential bugs of type mismatch.

I am thinking about adding a debug option to help user identifing type mismatch.
e.g. the previous rrule for inv

function rrule(::typeof(inv), x::AbstractArray)
    Ω = inv(x)
    function inv_pullback(ΔΩ)
        return NO_FIELDS, -Ω' * ΔΩ * Ω'
    end
    return Ω, inv_pullback
end

We can insert some code for debugging

function rrule(::typeof(inv), x::AbstractArray)
    Ω = inv(x)
    function inv_pullback(ΔΩ)
        fs, g = NO_FIELDS, -Ω' * ΔΩ * Ω'
        @debug begin
            if typeof(g) <: Union{Zero, ...}  # special types
                  pass
            elseif ((typeof(x) <: Array || isprimitivetype(typeof(x))) && typeof(x) != typeof(g))
                 "warn: array/scalar type mismatch: function = inv, input type = $(typeof(x)), gradient type = $(typeof(g))"
            elseif (Tangent{typeof(x)} != typeof(g) || field_mismatch(x, g))
                 "warn: tagent type mismatch: function = inv, input type = $(typeof(x)), gradient type = $(typeof(g)), expecting a tangent type"
            end
            return fs, g
        end
    end
    return Ω, inv_pullback
end

When a user run the code in debug mode, he will see some useful information to help debugging. Does this make sense?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants