-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
auto wrapper #130
auto wrapper #130
Conversation
d42663b
to
f8a2a54
Compare
@lxvm can you try this out? The following seems to work for me: using GSL
fquad = x -> x^1.5
ws_size = 100
ws = integration_cquad_workspace_alloc(ws_size)
a = 0
b = 1
result = Cdouble[0]
abserr = Cdouble[0]
nevals = Csize_t[0]
integration_cquad(fquad, a, b, 1e-10, 1e-10, ws, result, abserr, nevals)
result[1] |
Thanks @simonbyrne, indeed this approach works well and removes the need for any wrapping with
using GSL, Integrals, BenchmarkTools
prob = IntegralProblem((x,p) -> sin(p*x), 0.0, 1.0, 1.0)
alg = GSLIntegration(integration_cquad; wssize=100)
@btime solve($prob, $alg) Just by looking at flame graphs, it seems that |
You should be able to specialize gsl_function_helper. |
Thanks @yuyichao ! That tip helps recover most of the performance. I made a wrapper struct FixP{F<:IntegralFunction,P}
f::F
p::P
end
function GSL.gsl_function_helper(x::Cdouble, @specialize(f::FixP))::Cdouble
try
Cdouble(isinplace(f) ? (f.f(f.integrand_prototype, x, f.p); only(f.integrand_prototype)) : f.f(x, f.p))
catch
NaN
end
end which gives a better benchmark within 20% of the original cache = init(prob, alg)
@btime solve!($cache)
2.753 μs (68 allocations: 1.08 KiB) and a flamegraph that still shows a lot of dispatch alg = QuadGKJL()
cache = init(prob, alg)
@btime solve!($cache)
137.133 ns (0 allocations: 0 bytes) |
@lxvm Please feel free to make another PR! |
How would that help when we call |
See 28d06e3 The point is that you shouldn't call cfunction with Any as the parameter but should allow it to specialize instead, i.e. using Fundamentally the rest isn't very different from passing the function as an arbitrary pointer to a c function but some care is needed to make sure all the allocations are properly rooted. The version implemented above should achieve no allocation and no dynamic dispatch though I haven't really checked carefully and probably won't have time to do so in a few days. |
@yuyichao That's really clever. Is it correct to say that the same pointer to data is interpreted as a Your branch seems close to optimal and I've updated the timing table to include it and here is also a timing of just the library call cache = init(prob, alg)
@btime solve!($cache)
2.130 μs (5 allocations: 176 bytes) Of the remaining allocations, 3 of them can be elided by specializing @btime solve!($cache)
1.991 μs (2 allocations: 48 bytes) Presumably the |
@yuyichao That's interesting: how does it actually work, since we define |
I have a updated branch almost ready with more testing. |
See the updated branch at https://github.com/JuliaMath/GSL.jl/tree/yyc/gsl_func Either the previous or the new version should be able to achieve optimal performance (i.e. no dynamic dispatch and no allocation) but the compiler's allocation optimization seems to have some corner case/limitation and wasn't able to optimize too complex a structure out. That's why the new version simplified this a little bit making use of the fact that The only case where the allocation isn't completely eliminated is when the callable type is immutable but contains references to mutable fields (immutable struct that are not pointer-free). And yes the wrapper function needs to specialize on the function parameter, this is to trigger the specialization logic for
No absolutely not. That's is very bad/unsafe.
This is somewhat of a special case for |
Thanks, I would have wanted to avoid that. For completeness I updated the table above and can confirm that the allocations are elided, @btime solve!($cache)
2.084 μs (0 allocations: 0 bytes) All of this is great to know, and I'm wondering if it would be appropriate to add to the Julia manual. I'd be happy to start a pr for it based on your comment. And presumably the performance difference between |
I have no idea of the specific algorithms or the underlying implementations. If the actual algorithm as well as the implementation/optimization level is exactly the same then the difference would indeed only be on the C-julia interface. How is the new performance compare to the old version when you pass in your own gsl_function? Or even with c function fwiw. How many times did your callback function get called? Does that take a significant amount of time? What if you remove the try-catch in the gsl_function_helper? |
@yuyichao thanks, that was helpful. Regarding this comment: If |
Effectively yes, but the more canonical way is that the creation of |
I was also wondering why here |
Yes. |
Indeed, removing the try/catch block returns the fastest version yet. Here are plain benchmarks that should be copy-paste-able. At the end of the day a quadrature scheme's efficiency is measured by the number of function evaluations and there is slight variation in the number of function calls made by each algorithm, but nothing major. So the timings suggest a ~6x algorithm+language overhead for Original test (master)using QuadGK, GSL, BenchmarkTools, Test
atol = 1e-10
rtol = 1e-10
for p in [1.0, 10.0, 100.0]
f = x -> sin(x*p)
a = 0
b = 1
ws_size = 1000
ws = integration_cquad_workspace_alloc(ws_size)
n1, t1 = try
result = Cdouble[0]
abserr = Cdouble[0]
nevals = Csize_t[0]
ptr = @cfunction($((x,p) -> f(x)), Cdouble, (Cdouble, Ptr{Cvoid}))
GC.@preserve ptr begin
gslf = gsl_function(Base.unsafe_convert(Ptr{Cvoid},ptr), C_NULL)
integration_cquad(gslf, a, b, atol, rtol, ws, result, abserr, nevals)
@test result[1] ≈ (cos(a*p)-cos(b*p))/p atol=atol rtol=rtol
n = Int(nevals[1])
n, @belapsed integration_cquad($gslf, $a, $b, $atol, $rtol, $ws, $result, $abserr, $nevals)
end
catch e
display(e)
finally
integration_cquad_workspace_free(ws)
end
segbuf = QuadGK.alloc_segbuf()
I, E, n2 = quadgk_count(f, a, b; atol, rtol, segbuf)
@test I ≈ (cos(a*p)-cos(b*p))/p atol=atol rtol=rtol
t2 = @belapsed quadgk($f, $a, $b; atol=$atol, rtol=$rtol, segbuf=$segbuf)
@info "Quadrature benchmark" difficulty=p cquad_evals=n1 cquad_time=t1 quadgk_evals=n2 quadgk_time=t2 ratio_evals=n1/n2 ratio_time=t1/t2
end Original results
New test (lxvm/gsl_func2)using QuadGK, GSL, BenchmarkTools, Test
atol = 1e-10
rtol = 1e-10
for p in [1.0, 10.0, 100.0]
f = x -> sin(x*p)
a = 0
b = 1
ws_size = 1000
ws = integration_cquad_workspace_alloc(ws_size)
n1, t1 = try
result = Cdouble[0]
abserr = Cdouble[0]
nevals = Csize_t[0]
# gslf = @gsl_function(f)
integration_cquad(f, a, b, atol, rtol, ws, result, abserr, nevals)
@test result[1] ≈ (cos(a*p)-cos(b*p))/p atol=atol rtol=rtol
n = Int(nevals[1])
n, @belapsed integration_cquad($f, $a, $b, $atol, $rtol, $ws, $result, $abserr, $nevals)
catch e
display(e)
finally
integration_cquad_workspace_free(ws)
end
segbuf = QuadGK.alloc_segbuf()
I, E, n2 = quadgk_count(f, a, b; atol, rtol, segbuf)
@test I ≈ (cos(a*p)-cos(b*p))/p atol=atol rtol=rtol
t2 = @belapsed quadgk($f, $a, $b; atol=$atol, rtol=$rtol, segbuf=$segbuf)
@info "Quadrature benchmark" difficulty=p cquad_evals=n1 cquad_time=t1 quadgk_evals=n2 quadgk_time=t2 ratio_evals=n1/n2 ratio_time=t1/t2
end
New result
|
Avoids need for
@gsl_function
macro as suggested in #128