WIP: Allow chunk-sizing options and fix leaky generated functions from #27 #34
Conversation
…d rm dangerously leaky methods. jacobian, hessian, and tensor methods rm'd until they can support the same API structure as ForwardDiff.gradient
…the event that chunk_size equals length(x)
…e API. Also, split fad_api code into multiple files for easier review
@Scidom @mlubin Here's an overview of what things look like in this branch right now: I brought back the chunk-based calculation for gradients, extended it to Jacobians, and came up with a tiling algorithm to implement chunk-based calculation of Hessians as well. The tiling algorithm can be extended to Tensors, but it would require a significant amount of work (generalizing to rank 3 makes my head hurt). I'd rather put off supporting chunk-computing Tensors until after #27 lands, so that we can get everything merged and I can get benchmark results from an "official" version of ForwardDiff.

Following @mlubin's advice, I ended up implementing a cache-passing API:

julia> mycache = HessianCache();
julia> hessian(f, x, cache=mycache)

…where I really like this caching solution for that reason. However, I haven't been able to implement it in a way that matches the performance of the code in the #27 branch:

julia> using ForwardDiff
julia> f(x) = sum(sin, x) + prod(tan, x) * sum(sqrt, x);
julia> g = ForwardDiff.gradient(f)
gradf (generic function with 1 method)
julia> h = ForwardDiff.hessian(f)
hessf (generic function with 1 method)
julia> x = rand(100);
julia> @time g(x); # after warmup
0.000123 seconds (1.71 k allocations: 558.609 KB)
julia> @time h(x); # after warmup
0.014564 seconds (23.52 k allocations: 24.293 MB, 19.97% gc time)
julia> x = rand(1000);
julia> @time g(x); # after warmup
0.019061 seconds (20.00 k allocations: 46.601 MB, 32.08% gc time)
julia> @time h(x); # after warmup
12.934909 seconds (3.02 M allocations: 22.491 GB, 16.72% gc time)

Compare the above to the same benchmark on the #27 branch:
julia> x = rand(100);
julia> @time g(x); # after warmup
0.000108 seconds (1.51 k allocations: 555.359 KB)
julia> @time h(x); # after warmup
0.013671 seconds (3.83 k allocations: 23.992 MB, 20.54% gc time)
julia> x = rand(1000);
julia> @time g(x); # after warmup
0.017159 seconds (17.01 k allocations: 46.555 MB, 33.04% gc time)
julia> @time h(x); # after warmup
12.971819 seconds (40.03 k allocations: 22.446 GB, 15.92% gc time)

The disparity in the reported time/memory usage doesn't bother me a huge amount, but the disparity in allocation count is quite large, and I fear it could be an indicator of potential performance problems. I'd like to do more intensive profiling to figure out the source; the main difference code-wise between these two branches (w.r.t. this benchmark, which doesn't touch the chunk-based code) is how caching is handled, so I suspect it's something there. I'll try some more stuff tomorrow and report back.
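(Not from the PR thread, just a generic diagnostic sketch: one way to track down this kind of allocation-heavy type instability is to inspect the inferred types and per-call allocations of the suspect entry points directly, reusing the benchmark definitions above.)

julia> using ForwardDiff

julia> f(x) = sum(sin, x) + prod(tan, x) * sum(sqrt, x);

julia> g = ForwardDiff.gradient(f);

julia> x = rand(100);

julia> @code_warntype g(x)    # non-concrete (e.g. Any) variable types indicate instabilities

julia> g(x); @allocated g(x)  # allocations of a single call, after warmup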
Could you point to the lines with the caching logic? Sounds like there's type instability somewhere.
All the caching stuff is defined in cache.jl. There's definitely type instability/ambiguity internally, but I was hoping it was close enough to "surface level" that it wouldn't impact performance (e.g. what we were discussing with having type instability at the API level). The main problem, I imagine (but still need to test), comes from the fact that I'm using ambiguously-typed …

One way I've thought of circumventing this would be to make the fields of the cache type generated functions constructed inside the constructor itself:

immutable ForwardDiffCache{F}
    fad_type::Type{F}
    workvec_cache::Function
    partials_cache::Function
    zeros_cache::Function
end

function ForwardDiffCache{F}(::Type{F})
    @generated function workvec_cache(args...)
        ...
    end
    @generated function partials_cache(args...)
        ...
    end
    @generated function zeros_cache(args...)
        ...
    end
    return ForwardDiffCache(F, workvec_cache, partials_cache, zeros_cache)
end

Then, I could make the caching more-or-less type stable, since the input type to the generated closures (e.g. …
If 1) and 2) end up being true (which I'm not at all convinced of until I can test it), then this would actually be a cool pattern for handling type-inferable, type-variadic, and non-leaky memoization in Julia (which I don't believe is covered by Memoize.jl, according to the inference warning at the bottom of its README).
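(For illustration only, in the same Julia 0.4-era syntax as the rest of the thread: a rough, hypothetical sketch of non-leaky, per-cache memoization using an ordinary Dict-backed lookup keyed on argument types rather than the @generated closures proposed above. The names DemoCache and get_workvec! are made up for this sketch, not part of the PR.)

# The memoized storage lives in a value the caller owns, not in
# globally installed generated-function methods, so nothing leaks and
# the memory can be garbage collected along with the cache.
immutable DemoCache
    storage::Dict{Any,Any}
end
DemoCache() = DemoCache(Dict{Any,Any}())

# Fetch (or lazily allocate) a work vector for element type T and
# length xlen. The typeassert keeps the return type concrete for a
# known T, so callers stay type-stable.
function get_workvec!{T}(cache::DemoCache, ::Type{T}, xlen::Int)
    key = (T, xlen)
    if haskey(cache.storage, key)
        return cache.storage[key]::Vector{T}
    else
        workvec = zeros(T, xlen)
        cache.storage[key] = workvec
        return workvec
    end
end

# Usage: repeated calls with the same (T, xlen) reuse one allocation.
mycache = DemoCache()
v1 = get_workvec!(mycache, Float64, 10)
v2 = get_workvec!(mycache, Float64, 10)  # v1 === v2: storage is reused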
_load_gradvec_with_x_zeros!(gradvec, x, gradzeros)
for i in 1:N:xlen
What if xlen isn't divisible by N?
Then an exception is thrown:
function check_chunk_size(x::Vector, chunk_size::Int)
    @assert length(x) % chunk_size == 0 "Length of input vector is indivisible by chunk size (length(x) = $(length(x)), chunk size = $chunk_size)"
end
See here.
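(As a concrete illustration of that check, not taken from the thread: passing a length that isn't a multiple of the chunk size would fail roughly like this.)

julia> check_chunk_size(rand(10), 3)
ERROR: AssertionError: Length of input vector is indivisible by chunk size (length(x) = 10, chunk size = 3)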
… it makes for a cleaner implementation
@mlubin @Scidom The latest commits restructure the code to fix the type instabilities in the caching layer, which were bubbling up to the other methods and making them allocation-heavy. Performance is now comparable to the #27 branch.
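(A hypothetical, simplified illustration of the kind of instability involved, again in the thread's Julia 0.4-era syntax and not the PR's actual code: a field annotated with an abstract type hides the value's concrete type from the compiler, so code touching it is dynamically dispatched and allocation-heavy, whereas a concretely parameterized field keeps that code inferable.)

# Abstractly-typed field: access through `workvec` cannot be inferred.
immutable SlowCache
    workvec::AbstractVector
end

# Concretely parameterized field: the element type is part of the
# cache's type, so access through `workvec` is inferable.
immutable FastCache{T}
    workvec::Vector{T}
end

sumcache(c) = sum(c.workvec)

slow = SlowCache(rand(100))
fast = FastCache(rand(100))

sumcache(slow)  # same result, but dynamic dispatch on the field's runtime type
sumcache(fast)  # fully inferable, no boxing of the result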
…d Python (comparing ForwardDiff.jl with AlgoPy)
Allow chunk-sizing options, and a more robust caching layer to fix leaky generated functions
Basically what the title says. This PR attempts to address some issues with #27, and is being done here since there will be intermittent breakage and a smaller PR will be easier to review.