Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

auto wrapper #130

Closed
wants to merge 2 commits into from
Closed

auto wrapper #130

wants to merge 2 commits into from

Conversation

simonbyrne
Copy link
Member

Avoids need for @gsl_function macro as suggested in #128

@simonbyrne
Copy link
Member Author

@lxvm can you try this out? The following seems to work for me:

using GSL

fquad = x -> x^1.5
ws_size = 100
ws = integration_cquad_workspace_alloc(ws_size)

a = 0
b = 1
result = Cdouble[0]
abserr = Cdouble[0]
nevals = Csize_t[0]
integration_cquad(fquad, a, b, 1e-10, 1e-10, ws, result, abserr, nevals)
result[1]

@lxvm
Copy link
Contributor

lxvm commented Dec 31, 2023

Thanks @simonbyrne, indeed this approach works well and removes the need for any wrapping with gsl_function. I also did a timing comparing my original approach in Integrals.jl to one relying on the pr, i.e. passing the function directly to integration_cquad without wrapping as a Base.CFunction, which appears to be ~2x slower.

branch btime
c869ad9 2.076 μs (10 allocations: 320 bytes)
a93f752 2.547 μs (10 allocations: 320 bytes)
lxvm@ca26f62 2.548 μs (12 allocations: 368 bytes)
28d06e3 2.805 μs (15 allocations: 496 bytes)
pr 6.354 μs (77 allocations: 1.50 KiB)
master 2.503 μs (12 allocations: 528 bytes)
using GSL, Integrals, BenchmarkTools
prob = IntegralProblem((x,p) -> sin(p*x), 0.0, 1.0, 1.0)
alg = GSLIntegration(integration_cquad; wssize=100)
@btime solve($prob, $alg)

Just by looking at flame graphs, it seems that Base.CFunction is type-grounded whereas gsl_function_helper can't be. Presumably this is unavoidable and related to the portability issues of CFunction?

@yuyichao
Copy link
Collaborator

You should be able to specialize gsl_function_helper.

@lxvm
Copy link
Contributor

lxvm commented Dec 31, 2023

Thanks @yuyichao ! That tip helps recover most of the performance. I made a wrapper FixP that I pass to integrate_cquad and specialize it as follows

struct FixP{F<:IntegralFunction,P}
    f::F
    p::P
end

function GSL.gsl_function_helper(x::Cdouble, @specialize(f::FixP))::Cdouble
    try
        Cdouble(isinplace(f) ? (f.f(f.integrand_prototype, x, f.p); only(f.integrand_prototype)) : f.f(x, f.p))
    catch
        NaN
    end
end

which gives a better benchmark within 20% of the original

cache = init(prob, alg)
@btime solve!($cache)
  2.753 μs (68 allocations: 1.08 KiB)

and a flamegraph that still shows a lot of dispatch
image
In comparison to QuadGK.jl, which is probably doing a similar computation, if the dispatch could be removed then the timings would be comparable

alg = QuadGKJL()
cache = init(prob, alg)
@btime solve!($cache)
  137.133 ns (0 allocations: 0 bytes)

@simonbyrne
Copy link
Member Author

@lxvm Please feel free to make another PR!

@simonbyrne
Copy link
Member Author

You should be able to specialize gsl_function_helper.

How would that help when we call fptr = @cfunction(gsl_function_helper, Cdouble, (Cdouble, Any)), which is the non-specialized version?

@yuyichao
Copy link
Collaborator

yuyichao commented Jan 1, 2024

See 28d06e3

The point is that you shouldn't call cfunction with Any as the parameter but should allow it to specialize instead, i.e. using Ref as the parameter type.

Fundamentally the rest isn't very different from passing the function as an arbitrary pointer to a c function but some care is needed to make sure all the allocations are properly rooted. The version implemented above should achieve no allocation and no dynamic dispatch though I haven't really checked carefully and probably won't have time to do so in a few days.

@lxvm
Copy link
Contributor

lxvm commented Jan 3, 2024

@yuyichao That's really clever. Is it correct to say that the same pointer to data is interpreted as a Ref{T} by Julia but as void * to C?

Your branch seems close to optimal and I've updated the timing table to include it and here is also a timing of just the library call

cache = init(prob, alg)
@btime solve!($cache)
  2.130 μs (5 allocations: 176 bytes)

Of the remaining allocations, 3 of them can be elided by specializing function integration_cquad(f::F, a, b, epsabs, epsrel, ws, result, abserr, nevals) where {F} ..., as in lxvm@ca26f62, and the two that remain are the allocation of the mutablegsl_function and RefValue, so that gives

@btime solve!($cache)
  1.991 μs (2 allocations: 48 bytes)

Presumably the gsl_function and RefValue could be pre-allocated and their pointers updated as necessary?

@lxvm lxvm mentioned this pull request Jan 3, 2024
7 tasks
@simonbyrne
Copy link
Member Author

@yuyichao That's interesting: how does it actually work, since we define gsl_function_helper(x::Cdouble, fn::F)::Cdouble, but construct the @cfunction with a Ref{F} as the 2nd argument instead? Why isn't this a method error?

@yuyichao
Copy link
Collaborator

yuyichao commented Jan 3, 2024

I have a updated branch almost ready with more testing.

@yuyichao
Copy link
Collaborator

yuyichao commented Jan 3, 2024

See the updated branch at https://github.com/JuliaMath/GSL.jl/tree/yyc/gsl_func

Either the previous or the new version should be able to achieve optimal performance (i.e. no dynamic dispatch and no allocation) but the compiler's allocation optimization seems to have some corner case/limitation and wasn't able to optimize too complex a structure out. That's why the new version simplified this a little bit making use of the fact that gsl_function is mutable (the old version works for both mutable and immutable gsl_function and should be more or less equivalent to the new version when gsl_function is immutable)

The only case where the allocation isn't completely eliminated is when the callable type is immutable but contains references to mutable fields (immutable struct that are not pointer-free).

And yes the wrapper function needs to specialize on the function parameter, this is to trigger the specialization logic for <:Function types. A custom callable type without being <:Function does not have this limitation. I've added it to all the functions that uses gsl_function.

Presumably the gsl_function and RefValue could be pre-allocated and their pointers updated as necessary?

No absolutely not. That's is very bad/unsafe.

how does it actually work, since we define gsl_function_helper(x::Cdouble, fn::F)::Cdouble, but construct the @cfunction with a Ref{F} as the 2nd argument instead? Why isn't this a method error?

This is somewhat of a special case for cfunction that makes it symmetric with ccall. A ccall with Ref{T} as a parameter type accepts an argument of type T and will pass it in as a pointer to that object. cfunction with Ref as the argument type essentially a reverse of it. The two also support symmetric optimization: ccall of Ref{T} with an bitstype T will allocate it on stack and the cfunction with the same argument type will not assume the pointer points to an heap allocated object either.

@lxvm
Copy link
Contributor

lxvm commented Jan 3, 2024

No absolutely not. That's is very bad/unsafe.

Thanks, I would have wanted to avoid that.

For completeness I updated the table above and can confirm that the allocations are elided,

@btime solve!($cache)
  2.084 μs (0 allocations: 0 bytes)

All of this is great to know, and I'm wondering if it would be appropriate to add to the Julia manual. I'd be happy to start a pr for it based on your comment.

And presumably the performance difference between integration_cquad and quadgk is that the former has to lookup pointers to functions and data whereas the latter can inline them?

@yuyichao
Copy link
Collaborator

yuyichao commented Jan 3, 2024

I have no idea of the specific algorithms or the underlying implementations. If the actual algorithm as well as the implementation/optimization level is exactly the same then the difference would indeed only be on the C-julia interface.

How is the new performance compare to the old version when you pass in your own gsl_function? Or even with c function fwiw. How many times did your callback function get called? Does that take a significant amount of time? What if you remove the try-catch in the gsl_function_helper?

@simonbyrne
Copy link
Member Author

@yuyichao thanks, that was helpful.

Regarding this comment:
master...yyc/gsl_func#diff-ed5863a4bdd6189a79314f8c6cb0d3acefa0fe9235e7df221e6686fedb9efc08R41-R44

If gsl_function was an immutable type, would the main change be that we would instead call pointer_from_objref on a Ref{gsl_function} instead of gsl_function itself?

@yuyichao
Copy link
Collaborator

yuyichao commented Jan 3, 2024

Effectively yes, but the more canonical way is that the creation of Ref{gsl_function} would be done via Base.cconvert(Ref{gsl_function}, ...) and the pointer_from_objref would be done with Base.unsafe_convert(Ref{gsl_function}, ...). This works for mutable type too, but create a RefValue{gsl_function} that needs to be optimized out, which the compiler is able to do in some cases but somehow not for others.

@lxvm
Copy link
Contributor

lxvm commented Jan 3, 2024

I was also wondering why here param_ref is returned with the gsl_function? Does that prevent the param_ref from being GC'd early, even though it is ignored in unsafe_convert?

@yuyichao
Copy link
Collaborator

yuyichao commented Jan 3, 2024

Yes.

@lxvm
Copy link
Contributor

lxvm commented Jan 3, 2024

Indeed, removing the try/catch block returns the fastest version yet. Here are plain benchmarks that should be copy-paste-able. At the end of the day a quadrature scheme's efficiency is measured by the number of function evaluations and there is slight variation in the number of function calls made by each algorithm, but nothing major. So the timings suggest a ~6x algorithm+language overhead for integration_cquad compared to quadgk, which of course becomes less important if the integrand is expensive. I'm glad that this discussion has been fruitful for both the compatibility with runtime callbacks and the performance and I'll continue finishing the wrappers in #131 .

Original test (master)
using QuadGK, GSL, BenchmarkTools, Test

atol = 1e-10
rtol = 1e-10
for p in [1.0, 10.0, 100.0]
    f = x -> sin(x*p)
    a = 0
    b = 1

    ws_size = 1000
    ws = integration_cquad_workspace_alloc(ws_size)
    n1, t1 = try
        result = Cdouble[0]
        abserr = Cdouble[0]
        nevals = Csize_t[0]

        ptr = @cfunction($((x,p) -> f(x)), Cdouble, (Cdouble, Ptr{Cvoid}))
        GC.@preserve ptr begin
            gslf = gsl_function(Base.unsafe_convert(Ptr{Cvoid},ptr), C_NULL)
            integration_cquad(gslf, a, b, atol, rtol, ws, result, abserr, nevals)
            @test result[1]  (cos(a*p)-cos(b*p))/p atol=atol rtol=rtol

            n = Int(nevals[1])
            n, @belapsed integration_cquad($gslf, $a, $b, $atol, $rtol, $ws, $result, $abserr, $nevals)
        end
    catch e
        display(e)
    finally
        integration_cquad_workspace_free(ws)
    end

    segbuf = QuadGK.alloc_segbuf()
    I, E, n2 = quadgk_count(f, a, b; atol, rtol, segbuf)
    @test I  (cos(a*p)-cos(b*p))/p atol=atol rtol=rtol
    t2 = @belapsed quadgk($f, $a, $b; atol=$atol, rtol=$rtol, segbuf=$segbuf)

    @info "Quadrature benchmark" difficulty=p cquad_evals=n1 cquad_time=t1 quadgk_evals=n2 quadgk_time=t2 ratio_evals=n1/n2 ratio_time=t1/t2
end
Original results
┌ Info: Quadrature benchmark
│   difficulty = 1.0
│   cquad_evals = 33
│   cquad_time = 2.6535555555555555e-6
│   quadgk_evals = 15
│   quadgk_time = 1.2850338600451468e-7
│   ratio_evals = 2.2
└   ratio_time = 20.649693662253608
┌ Info: Quadrature benchmark
│   difficulty = 10.0
│   cquad_evals = 95
│   cquad_time = 9.177e-6
│   quadgk_evals = 105
│   quadgk_time = 1.1612e-6
│   ratio_evals = 0.9047619047619048
└   ratio_time = 7.9030313468825355
┌ Info: Quadrature benchmark
│   difficulty = 100.0
│   cquad_evals = 1235
│   cquad_time = 0.000111487
│   quadgk_evals = 855
│   quadgk_time = 1.1026e-5
│   ratio_evals = 1.4444444444444444
└   ratio_time = 10.11128242336296

New test (lxvm/gsl_func2)
using QuadGK, GSL, BenchmarkTools, Test

atol = 1e-10
rtol = 1e-10
for p in [1.0, 10.0, 100.0]
    f = x -> sin(x*p)
    a = 0
    b = 1

    ws_size = 1000
    ws = integration_cquad_workspace_alloc(ws_size)
    n1, t1 = try
        result = Cdouble[0]
        abserr = Cdouble[0]
        nevals = Csize_t[0]
        # gslf = @gsl_function(f)
        integration_cquad(f, a, b, atol, rtol, ws, result, abserr, nevals)
        @test result[1]  (cos(a*p)-cos(b*p))/p atol=atol rtol=rtol
        n = Int(nevals[1])
        n, @belapsed integration_cquad($f, $a, $b, $atol, $rtol, $ws, $result, $abserr, $nevals)
    catch e
        display(e)
    finally
        integration_cquad_workspace_free(ws)
    end

    segbuf = QuadGK.alloc_segbuf()
    I, E, n2 = quadgk_count(f, a, b; atol, rtol, segbuf)
    @test I  (cos(a*p)-cos(b*p))/p atol=atol rtol=rtol
    t2 = @belapsed quadgk($f, $a, $b; atol=$atol, rtol=$rtol, segbuf=$segbuf)

    @info "Quadrature benchmark" difficulty=p cquad_evals=n1 cquad_time=t1 quadgk_evals=n2 quadgk_time=t2 ratio_evals=n1/n2 ratio_time=t1/t2
end
New result
┌ Info: Quadrature benchmark
│   difficulty = 1.0
│   cquad_evals = 33
│   cquad_time = 1.8343e-6
│   quadgk_evals = 15
│   quadgk_time = 1.2889164785553046e-7
│   ratio_evals = 2.2
└   ratio_time = 14.23133329830645
┌ Info: Quadrature benchmark
│   difficulty = 10.0
│   cquad_evals = 95
│   cquad_time = 6.4998e-6
│   quadgk_evals = 105
│   quadgk_time = 1.1675e-6
│   ratio_evals = 0.9047619047619048
└   ratio_time = 5.56728051391863
┌ Info: Quadrature benchmark
│   difficulty = 100.0
│   cquad_evals = 1235
│   cquad_time = 7.5618e-5
│   quadgk_evals = 855
│   quadgk_time = 1.1579e-5
│   ratio_evals = 1.4444444444444444
└   ratio_time = 6.530615769928318

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants