Improving ieee754_rem_pio2 #22004
Comments
You can at least do:
diff --git a/base/math.jl b/base/math.jl
index 4d8eb64fc8..ab21d70acf 100644
--- a/base/math.jl
+++ b/base/math.jl
@@ -737,9 +737,9 @@ function ieee754_rem_pio2(x::Float64)
# this is just wrapping up
# https://github.com/JuliaLang/openspecfun/blob/master/rem_pio2/e_rem_pio2.c
- y = [0.0,0.0]
- n = ccall((:__ieee754_rem_pio2, openspecfun), Cint, (Float64,Ptr{Float64}), x, y)
- return (n,y)
+ y = Ref{NTuple{2,Float64}}()
+ n = ccall((:__ieee754_rem_pio2, openspecfun), Cint, (Float64,Ptr{Void}), x, y)
+ return (n,y[])
end
# multiples of pi/2, as double-double (ie with "tail")
Edit: inlining this function also seems to have a big effect. |
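For anyone who wants to try the out-parameter pattern without openspecfun, here is a minimal sketch, assuming a recent Julia (1.x, where `Void` is spelled `Cvoid`). `fill_pair`, `call_with_array` and `call_with_ref` are hypothetical names; `fill_pair` is only a stand-in with the same shape as `__ieee754_rem_pio2` (a double in, two doubles written through a pointer, an int returned) and does no real range reduction.

```julia
# Hypothetical stand-in for a C routine with the signature
#   int f(double x, double *y)   /* writes y[0] and y[1] */
function fill_pair(x::Cdouble, y::Ptr{Cdouble})::Cint
    unsafe_store!(y, x / 2, 1)   # y[0]
    unsafe_store!(y, x / 4, 2)   # y[1]
    return Cint(2)               # placeholder for the returned quadrant count
end

# Array out-parameter: a Vector is heap-allocated on every call.
function call_with_array(fptr::Ptr{Cvoid}, x::Float64)
    y = [0.0, 0.0]
    n = ccall(fptr, Cint, (Cdouble, Ptr{Cdouble}), x, y)
    return (n, (y[1], y[2]))
end

# Ref-to-tuple out-parameter: the callee writes into the Ref's storage
# and the result is read back as an immutable tuple with y[].
function call_with_ref(fptr::Ptr{Cvoid}, x::Float64)
    y = Ref{NTuple{2,Float64}}()
    n = ccall(fptr, Cint, (Cdouble, Ptr{Cvoid}), x, y)
    return (n, y[])
end

# C-callable pointer to the stand-in, used only for this demo.
fptr = @cfunction(fill_pair, Cint, (Cdouble, Ptr{Cdouble}))
println(call_with_array(fptr, 8.0))
println(call_with_ref(fptr, 8.0))
```

Both wrappers should return the same values; the difference is only in what gets allocated to hold the two outputs.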
And if you want to do the llvmcall trick it's:
diff --git a/base/math.jl b/base/math.jl
index 4d8eb64fc8..78afe3e891 100644
--- a/base/math.jl
+++ b/base/math.jl
@@ -737,9 +737,16 @@ function ieee754_rem_pio2(x::Float64)
# this is just wrapping up
# https://github.com/JuliaLang/openspecfun/blob/master/rem_pio2/e_rem_pio2.c
- y = [0.0,0.0]
- n = ccall((:__ieee754_rem_pio2, openspecfun), Cint, (Float64,Ptr{Float64}), x, y)
- return (n,y)
+ Base.llvmcall("""
+ %f = bitcast i8* %1 to i32 (double, [2 x double]*)*
+ %py = alloca [2 x double]
+ %n = call i32 %f(double %0, [2 x double]* %py)
+ %y = load [2 x double], [2 x double]* %py
+ %res0 = insertvalue {i32, [2 x double]} undef, i32 %n, 0
+ %res1 = insertvalue {i32, [2 x double]} %res0, [2 x double] %y, 1
+ ret {i32, [2 x double]} %res1
+ """, Tuple{Cint,NTuple{2,Float64}}, Tuple{Float64,Ptr{Void}}, x,
+ cglobal((:__ieee754_rem_pio2, openspecfun)))
end
# multiples of pi/2, as double-double (ie with "tail") |
Wow, with the llvmcall trick |
Would it be possible to make the optimization without calling into openspecfun? The Base dependency on openspecfun is slated to be removed in the near future, but |
@ararslan I looked into rewriting |
I believe @simonbyrne started on an implementation a while back. |
It would be great to have a pure Julia implementation, and has been on my todo list for a long time. Hopefully this can be part of @pkofod's GSoC project. Doing a range reduction modulo pi usually requires a couple of different strategies: typically Cody-Waite for smallish values, and Payne-Hanek for large ones. I have an (untested) implementation for Payne-Hanek here: The books by Jean-Michel Muller are a good reference for this stuff. |
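To illustrate the Cody-Waite idea mentioned above, here is a minimal two-term sketch, assuming a recent Julia. The names `PIO2_HI`, `PIO2_LO` and `cody_waite_pio2` are hypothetical, and this is not the code in Base or openspecfun. The trick is to split pi/2 into a high part with trailing zero bits, so that `n * PIO2_HI` is exact for moderate `n`, plus a small tail that carries the remaining digits.

```julia
# High part of pi/2: the nearest Float64 to pi/2 with its low 27 significand
# bits cleared, so n * PIO2_HI is exact while |n| stays below about 2^27.
const PIO2_HI = reinterpret(Float64,
    reinterpret(UInt64, Float64(pi) / 2) & 0xfffffffff8000000)

# Tail: what is left of pi/2 after removing PIO2_HI, computed via BigFloat.
const PIO2_LO = Float64(big(pi) / 2 - big(PIO2_HI))

# Reduce x to r with x ≈ n * (pi/2) + r and |r| roughly at most pi/4.
# Only meant for moderate |x|; large arguments need Payne-Hanek instead.
function cody_waite_pio2(x::Float64)
    n = round(x * (2 / pi))                  # nearest multiple of pi/2
    r = (x - n * PIO2_HI) - n * PIO2_LO      # subtract in two steps
    return (Int(n), r)
end

n, r = cody_waite_pio2(1000.0)
reference = big(1000.0) - n * big(pi) / 2    # high-precision check value
println("n = ", n, ", r = ", r)
println("error vs BigFloat: ", Float64(abs(big(r) - reference)))
```

Subtracting `n * (pi/2)` in a single step would instead inherit the rounding error of the Float64 value of pi/2, multiplied by `n`.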
Very good there is someone that is going to work on this within GSoC! I'm eager to see if a pure Julia implementation can outperform the C one :-) |
Let's see... I'm just gonna hijack this thread. I had a go over at https://github.com/pkofod/RemPiO2.jl There are some contribution credit headers missing if this is to be "copied" over to base, but the functionality should be there. Payne Hanek returns a For timing differences, have a look at https://gist.github.com/pkofod/81435213366a19fc5a0d6ce4f1c64c4d . I've updated Now, as you may notice, the timings seem quite different, so I'm sort of wondering if I "cheated" somehow, or failed to take something into account. More than happy to receive comments on this part. |
Is it a problem to pack the hi and lo values in a ((edit: I believe there might still be a bug/issue for the medium/large values in Payne-Hanek...)) Edit: Still working out the kinks in this reduction scheme. There are some small idiosyncrasies that give rounding problems for something like 10 in 10000 values (of course this has to be ironed out) when using the output hi and lo to calculate sin and cos, for example. However, it does seem as if there are some speedups to be found here. For values larger than 2.0^20, sin is something like 2-3 times faster. Let's see if those speedups survive the bughunting :) |
Alright, bugs were not really bugs but floating point "fun". A PR should arrive later today (my time). Timing these things can be tricky, but after speaking to @simonbyrne last night, I'm fairly sure that the following actually shows the difference between the version in base, and the version in the upcoming PR for small values. Still running the benchmarks on the larger cases. While there are only a finite number of Spikes are near multiples of pi/2 where additional operations are required for sufficient precision. Note: the new PR does NOT allocate a new vector each time, so the problem of this issue should be fixed by this. |
Looks great! Impressive work. |
mod2pi is a useful function, being more accurate than mod(x, 2pi), but it's also much slower than mod in the case of Float64 input (for BigFloat, by contrast, it is only very slightly slower). According to @profile, the bottleneck of mod2pi(::Float64) is ieee754_rem_pio2, in particular the allocation of the y array (line 740 of math.jl):
By looking at the function it wraps, https://github.com/JuliaLang/openspecfun/blob/39699a1c1824bf88410cabb8a7438af91ea98f4c/rem_pio2/e_rem_pio2.c, it seems to me that the initial value of the y array isn't relevant, so it should be safe to leave it uninitialized with Vector{Float64}(2). Here is a performance comparison between the current implementation and the implementation I propose:
There should be a speed-up of ~10%: not much, but better than nothing. In addition, the last test shows that the result is exactly equal, not approximately.
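As a side note on the accuracy claim at the top of this issue, here is a small check of why mod2pi is worth keeping fast, assuming a recent Julia. It compares both functions against a BigFloat reference for one arbitrarily chosen input; the variable names are only for illustration.

```julia
x = 1.0e10

# Reference value computed in extended precision, then rounded to Float64.
reference = Float64(mod(big(x), 2 * big(pi)))

# mod2pi reduces with respect to the true 2π; mod(x, 2pi) reduces with
# respect to the Float64 rounding of 2π, so its error grows with x.
err_mod2pi = abs(mod2pi(x) - reference)
err_mod = abs(mod(x, 2pi) - reference)

println("mod2pi error: ", err_mod2pi)
println("mod error:    ", err_mod)
```

The gap between the two errors widens as x grows, since the rounding error in the Float64 value of 2π is multiplied by the number of periods removed.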
I already have a patch ready to be submitted, but I was wondering whether something smarter can be done to avoid allocation, like for sincos in PR #21589. However, writing LLVM code goes way beyond my capabilities.
As a last remark, given how the return value of ieee754_rem_pio2 is used in math.jl, I don't think that y needs to be an array at all (y[1] and y[2] are always used as two distinct elements, not as a single vector). Is it possible to patch openspecfun to redefine ieee754_rem_pio2 in order not to use an array? I don't know if that interface is used elsewhere, though.