
WIP: Allow chunk-sizing options and fix leaky generated functions from #27 (#34)

Merged Aug 14, 2015 (23 commits)

Conversation

@jrevels (Member) commented Aug 10, 2015

Basically what the title says. This PR attempts to address some issues with #27, and is being done here since there will be intermittent breakage and a smaller PR will be easier to review.

@jrevels (Member, Author) commented Aug 12, 2015

@Scidom @mlubin Here's an overview of what things look like in this branch right now:

I brought back the chunk-based calculation for gradients, extended it to Jacobians, and came up with a tiling algorithm to implement chunk-based calculation of Hessians as well. The tiling algorithm can be extended to Tensors, but it would require a significant amount of work (generalizing to rank 3 makes my head hurt). I'd rather put off supporting chunk-computing Tensors until after #27 lands, so that we can get everything merged and I can get benchmark results from an "official" version of ForwardDiff.
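For intuition, here is a self-contained toy of the chunk idea (my sketch, in the 0.4-era syntax this PR uses; not ForwardDiff's actual implementation, and Chunk/chunk_gradient! are illustrative names): a dual number carries N partials, so each evaluation of f fills N gradient entries, and the input is swept in chunks of N.

import Base: +, *, sin

immutable Chunk{N,T<:Real} <: Real
    val::T
    eps::NTuple{N,T}
end

+{N,T}(a::Chunk{N,T}, b::Chunk{N,T}) = Chunk(a.val + b.val, ntuple(i -> a.eps[i] + b.eps[i], N))
*{N,T}(a::Chunk{N,T}, b::Chunk{N,T}) = Chunk(a.val * b.val, ntuple(i -> a.eps[i]*b.val + a.val*b.eps[i], N))
sin{N,T}(a::Chunk{N,T}) = Chunk(sin(a.val), ntuple(i -> cos(a.val)*a.eps[i], N))

function chunk_gradient!{N}(out, f, x, ::Type{Val{N}})
    @assert length(x) % N == 0  # same divisibility restriction as this PR
    T = eltype(x)
    duals = [Chunk(xi, ntuple(i -> zero(T), N)) for xi in x]
    for offset in 0:N:(length(x) - N)
        for j in 1:N  # seed unit partials for the j-th input of this chunk
            duals[offset + j] = Chunk(x[offset + j], ntuple(i -> i == j ? one(T) : zero(T), N))
        end
        y = f(duals)
        for j in 1:N  # harvest this chunk's partials, then un-seed
            out[offset + j] = y.eps[j]
            duals[offset + j] = Chunk(x[offset + j], ntuple(i -> zero(T), N))
        end
    end
    return out
end

For example, chunk_gradient!(zeros(4), v -> sin(v[1])*v[2] + v[3]*v[4], rand(4), Val{2}) evaluates f twice and fills two gradient entries per pass; larger chunk sizes trade fewer evaluations of f against more expensive dual arithmetic per operation.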

Following @mlubin's advice, I ended up implementing a ForwardDiffCache type that encapsulates all the behaviors we want in terms of building/fetching "work state". The cool part of this, IMO, is that it opens the door for the non-closure-generated API methods to benefit from caching between calls, without relying on leakiness as in #27. For example, this kind of usage is now supported as of the previous commit:

julia> mycache = HessianCache();

julia> hessian(f, x, cache=mycache)

...where mycache stores the "work state" generated during the call; that state is reused in subsequent calls simply by passing mycache again (the caller doesn't have to do anything with mycache besides pass it to the relevant API methods).
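Concretely, reuse would look something like this (a hypothetical sketch; GradientCache is my assumed gradient analogue of the HessianCache shown above):

julia> gcache = GradientCache();  # hypothetical analogue of HessianCache

julia> for x in [rand(100) for _ in 1:50]
           ForwardDiff.gradient(f, x, cache=gcache)  # work buffers come from gcache instead of being reallocated
       end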

I really like this caching solution for that reason. However, I haven't been able to implement it in a way that matches the performance of the code in the nduals-refactor branch. Take this naive benchmark, for example:

julia> using ForwardDiff

julia> f(x) = sum(sin, x) + prod(tan, x) * sum(sqrt, x);

julia> g = ForwardDiff.gradient(f)
gradf (generic function with 1 method)

julia> h = ForwardDiff.hessian(f)
hessf (generic function with 1 method)

julia> x = rand(100);

julia> @time g(x); # after warmup
0.000123 seconds (1.71 k allocations: 558.609 KB)

julia> @time h(x); # after warmup
0.014564 seconds (23.52 k allocations: 24.293 MB, 19.97% gc time)

julia> x = rand(1000);

julia> @time g(x); # after warmup
0.019061 seconds (20.00 k allocations: 46.601 MB, 32.08% gc time)

julia> @time h(x); # after warmup
12.934909 seconds (3.02 M allocations: 22.491 GB, 16.72% gc time)

Compare the above to the same benchmark on the nduals-refactor branch:


julia> x = rand(100);

julia> @time g(x); # after warmup
0.000108 seconds (1.51 k allocations: 555.359 KB)

julia> @time h(x); # after warmup
0.013671 seconds (3.83 k allocations: 23.992 MB, 20.54% gc time)

julia> x = rand(1000);

julia> @time g(x); # after warmup
0.017159 seconds (17.01 k allocations: 46.555 MB, 33.04% gc time)

julia> @time h(x); # after warmup
12.971819 seconds (40.03 k allocations: 22.446 GB, 15.92% gc time)

The disparity in the reported time/memory usage doesn't bother me a huge amount, but the disparity in allocation count is quite large, and I fear it could be an indicator of poor performance down the road. I'd like to do more intensive profiling to figure out the source - the main difference code-wise between these two branches (w.r.t. this benchmark, which doesn't touch the chunk-based code) is how caching is handled, so I suspect it's something there. I'll try some more stuff tomorrow and report back.
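For reference, the standard tools for this kind of hunt (nothing PR-specific here):

julia> @code_warntype g(x)   # non-concrete (Any) slots point at the instability

julia> Profile.clear(); @profile h(x); Profile.print()   # line-level hotspots

One can also start julia with --track-allocation=user and inspect the generated *.mem files to see exactly which lines allocate.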

@mlubin (Contributor) commented Aug 12, 2015

Could you point to the lines with the caching logic? Sounds like there's type instability somewhere.

@jrevels (Member, Author) commented Aug 13, 2015

Could you point to the lines with the caching logic? Sounds like there's type instability somewhere.

All the caching stuff is defined in cache.jl. There's definitely type instability/ambiguity internally, but I was hoping it was close enough to "surface level" that it wouldn't impact performance (e.g. what we were discussing with having type instability at the API level).

The main problem, I imagine (but still need to test), comes from the fact that I'm using ambiguously-typed Dicts here. This is because the values stored in such a Dict are typed differently depending on chunk_size, so there's no single concrete type that fits all of the values.
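Roughly the situation (a toy sketch, not the actual cache.jl code):

# The cached vector's element type depends on chunk_size, so the Dict's
# value type must be abstract and every lookup comes back abstractly typed.
const workvec_dict = Dict{Int,Vector}()

function get_workvec(chunk_size::Int, len::Int)
    # inferred as ::Vector (abstract), so callers can't specialize on the result
    get!(() -> Array(NTuple{chunk_size,Float64}, len), workvec_dict, chunk_size)
end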

One way I've thought of circumventing this would be to make the fields of ForwardDiffCache generated functions instead of Dicts, like this:

immutable ForwardDiffCache{F}
    fad_type::Type{F}
    workvec_cache::Function
    partials_cache::Function
    zeros_cache::Function
end

function ForwardDiffCache{F}(::Type{F})
    @generated function workvec_cache(args...)
    ...
    end

    @generated function partials_cache(args...)
    ...
    end

    @generated function zeros_cache(args...)
    ...
    end

    return ForwardDiffCache(F, workvec_cache, partials_cache, zeros_cache)
end

Then, I could make the caching more-or-less type stable, since the input type to the generated closures (e.g. workvec_cache) can have a deterministic, one-to-one correspondence with the output type, as it does in nduals-refactor. The two things I need to check are 1) whether the generated closures will infer correctly and 2) whether their state will be gc'd correctly when the object goes out of scope.
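For comparison, here is a minimal non-generated sketch (mine, not code from this PR) that exploits the same one-to-one type correspondence to recover a concrete return type from an Any-valued Dict, using a type assertion instead of generated closures:

immutable ZerosCache
    store::Dict{DataType,Any}
end
ZerosCache() = ZerosCache(Dict{DataType,Any}())

function fetch_zeros{T}(c::ZerosCache, ::Type{T}, n::Int)
    # keyed by T only, for brevity; the assertion hands inference a
    # concrete Vector{T} even though the Dict's value type is Any
    return get!(() -> zeros(T, n), c.store, T)::Vector{T}
end

The assertion costs a runtime check per call, whereas the generated-closure approach would bake the correspondence into dispatch itself.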

@jrevels (Member, Author) commented Aug 13, 2015

If 1) and 2) end up being true (which I'm not at all convinced of until I can test it), then this would actually be a cool pattern for type-inferable, type-variadic, and non-leaky memoization in Julia (which I don't believe is covered by Memoize.jl, per the inference warning at the bottom of its README).


_load_gradvec_with_x_zeros!(gradvec, x, gradzeros)

for i in 1:N:xlen
@mlubin (Contributor) commented on the diff above:

What if xlen isn't divisible by N?

@jrevels (Member, Author) replied:

Then an exception is thrown:

function check_chunk_size(x::Vector, chunk_size::Int)
    @assert length(x) % chunk_size == 0 "Length of input vector is indivisible by chunk size (length(x) = $(length(x)), chunk size = $chunk_size)"
end

See here.
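So, assuming the chunk_size keyword argument this PR introduces, an indivisible pairing fails loudly rather than silently, e.g.:

julia> ForwardDiff.gradient(f, rand(10), chunk_size=3)
ERROR: AssertionError: Length of input vector is indivisible by chunk size (length(x) = 10, chunk size = 3)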

@jrevels (Member, Author) commented Aug 13, 2015

@mlubin @Scidom The latest commits restructure the code to fix the type instabilities in the caching layer, which were bubbling up to the other methods and making them allocation-heavy. Performance is now comparable to nduals-refactor, and I'm basically ready for this to be merged back in pending your review.

jrevels added a commit that referenced this pull request Aug 14, 2015
Allow chunk-sizing options, and a more robust caching layer to fix leaky generated functions
@jrevels jrevels merged commit 4a4285d into nduals-refactor Aug 14, 2015
@jrevels jrevels deleted the api-refactor branch August 14, 2015 19:05