Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving ieee754_rem_pio2 #22004

Closed
giordano opened this issue May 21, 2017 · 12 comments · Fixed by #22603
Closed

Improving ieee754_rem_pio2 #22004

giordano opened this issue May 21, 2017 · 12 comments · Fixed by #22603
Labels
maths Mathematical functions performance Must go faster

Comments

@giordano
Copy link
Contributor

mod2pi is a useful function, being more accurate of mod(x, 2pi), but it's also much slower than mod, in the case of Float64 input (instead for BigFloat is only very slightly slower). According to @profile, the bottleneck of mod2pi(::Float64) is ieee754_rem_pio2, in particular the allocation of the y array (line 740 of math.jl):

screenshot_20170521_153744

By looking at the function it wraps, https://github.com/JuliaLang/openspecfun/blob/39699a1c1824bf88410cabb8a7438af91ea98f4c/rem_pio2/e_rem_pio2.c, it seems to me that the initial value of the y array isn't relevant, so it should be safe to initialize at a random value with Vector{Float64{(2). Here is a performance comparison between current implementation and the implementation I propose:

julia> using BenchmarkTools

julia> M = collect(-10pi:1e-4:10pi);

julia> function ieee754_rem_pio2_faster(x::Float64)
           y = Vector{Float64}(2)
           n = ccall((:__ieee754_rem_pio2, Base.Math.openspecfun), Cint, (Float64,Ptr{Float64}), x, y)
           return (n,y)
       end
ieee754_rem_pio2_faster (generic function with 1 method)

julia> @benchmark Base.Math.ieee754_rem_pio2.($M)
BenchmarkTools.Trial: 
  memory estimate:  81.49 MiB
  allocs estimate:  1256640
  --------------
  minimum time:     37.773 ms (0.00% GC)
  median time:      231.986 ms (86.90% GC)
  mean time:        186.099 ms (81.66% GC)
  maximum time:     282.167 ms (87.26% GC)
  --------------
  samples:          27
  evals/sample:     1

julia> @benchmark ieee754_rem_pio2_faster.($M)
BenchmarkTools.Trial: 
  memory estimate:  81.49 MiB
  allocs estimate:  1256640
  --------------
  minimum time:     34.056 ms (0.00% GC)
  median time:      225.573 ms (88.34% GC)
  mean time:        180.034 ms (83.64% GC)
  maximum time:     260.522 ms (89.77% GC)
  --------------
  samples:          29
  evals/sample:     1

julia> Base.Math.ieee754_rem_pio2.(M) == ieee754_rem_pio2_faster.(M)
true

There should be a speed-up of ~10%, not much, but better than nothing. In addition, the last test shows that the result is exactly equal, not approximately.

I already have a patch ready to be submitted, but I was wondering whether something smarter can be done to avoid allocation, like for sincos in PR #21589. However, writing LLVM code goes way beyond my capabilities.

As a last remark, given how the return value of ieee754_rem_pio2 is used in math.jl, I don't think that y needs to be an array at all (y[1] and y[2] are always used as two distinct elements, not as a single vector). Is it possible to patch openspecfun to redefine ieee754_rem_pio2 in order not to use an array? I don't know if that interface is used elsewhere, though.

@yuyichao
Copy link
Contributor

yuyichao commented May 21, 2017

You can at least do

diff --git a/base/math.jl b/base/math.jl
index 4d8eb64fc8..ab21d70acf 100644
--- a/base/math.jl
+++ b/base/math.jl
@@ -737,9 +737,9 @@ function ieee754_rem_pio2(x::Float64)
     # this is just wrapping up
     # https://github.com/JuliaLang/openspecfun/blob/master/rem_pio2/e_rem_pio2.c

-    y = [0.0,0.0]
-    n = ccall((:__ieee754_rem_pio2, openspecfun), Cint, (Float64,Ptr{Float64}), x, y)
-    return (n,y)
+    y = Ref{NTuple{2,Float64}}()
+    n = ccall((:__ieee754_rem_pio2, openspecfun), Cint, (Float64,Ptr{Void}), x, y)
+    return (n,y[])
 end

 # multiples of pi/2, as double-double (ie with "tail")

Edit: inlining this function also seems to have a big effect.

@yuyichao
Copy link
Contributor

yuyichao commented May 21, 2017

And if you want to do the llvmcall trick it's

diff --git a/base/math.jl b/base/math.jl
index 4d8eb64fc8..78afe3e891 100644
--- a/base/math.jl
+++ b/base/math.jl
@@ -737,9 +737,16 @@ function ieee754_rem_pio2(x::Float64)
     # this is just wrapping up
     # https://github.com/JuliaLang/openspecfun/blob/master/rem_pio2/e_rem_pio2.c

-    y = [0.0,0.0]
-    n = ccall((:__ieee754_rem_pio2, openspecfun), Cint, (Float64,Ptr{Float64}), x, y)
-    return (n,y)
+    Base.llvmcall("""
+                  %f = bitcast i8* %1 to i32 (double, [2 x double]*)*
+                  %py = alloca [2 x double]
+                  %n = call i32 %f(double %0, [2 x double]* %py)
+                  %y = load [2 x double], [2 x double]* %py
+                  %res0 = insertvalue {i32, [2 x double]} undef, i32 %n, 0
+                  %res1 = insertvalue {i32, [2 x double]} %res0, [2 x double] %y, 1
+                  ret {i32, [2 x double]} %res1
+                  """, Tuple{Cint,NTuple{2,Float64}}, Tuple{Float64,Ptr{Void}}, x,
+                  cglobal((:__ieee754_rem_pio2, openspecfun)))
 end

 # multiples of pi/2, as double-double (ie with "tail")

@giordano
Copy link
Contributor Author

Wow, with the llvmcall trick mod2pi would be even faster than mod(x, 2pi)! I think you should open a PR, you did all the work ;-)

@ararslan
Copy link
Member

Would it be possible to make the optimization without calling into openspecfun? The Base dependency on openspecfun is slated to be removed in the near future, but mod2pi would stay in Base. So if it's possible to speed this up with a pure-Julia implementation, that will make excising openspecfun that much easier.

@giordano
Copy link
Contributor Author

giordano commented May 21, 2017

@ararslan I looked into rewriting ieee754_rem_pio2 in Julia (that would allow not using a y array), but I'm not able to completely port it, I miss a few lines, like GET_HIGH_WORD(hx,x) or INSERT_WORDS(z, ix - ((int32_t)(e0<<20)), low).

@ararslan
Copy link
Member

I believe @simonbyrne started on an implementation a while back.

@ararslan ararslan added the maths Mathematical functions label May 22, 2017
@simonbyrne
Copy link
Contributor

It would be great to have a pure Julia implementation, and has been on my todo list for a long time. Hopefully this can be part of @pkofod's GSoC project.

Doing a range reduction modulo pi usually requires a couple of different strategies: typically Cody-Waite for smallish values, and Payne-Hanek for large ones. I have an (untested) implementation for Payne-Hanek here:
https://gist.github.com/simonbyrne/d640ac1c1db3e1774cf2d405049beef3

The books by Jean-Michel Muller are a good reference for this stuff.

@giordano
Copy link
Contributor Author

Very good there is someone that is going to work on this within GSoC! I'm eager to see if a pure Julia implementation can outperform the C one :-)

@pkofod
Copy link
Contributor

pkofod commented Jun 25, 2017

Let's see... I'm just gonna hijack this thread.

I had a go over at https://github.com/pkofod/RemPiO2.jl

There are some contribution credit headers missing if this is to be "copied" over to base, but the functionality should be there. Payne Hanek returns a t::TwicePrecision{Float64} such that t.hi+t.lo==sum(y) where y is the vector of [hi, lo] disussed above. The two hi's and two lo's aren't identical between the two implementations, but if I feed them to a sin or cos kernel based on openlibm, I get the same results. Comments are more than welcome

For timing differences, have a look at https://gist.github.com/pkofod/81435213366a19fc5a0d6ce4f1c64c4d . I've updated runtests.jl since that for larger numbers, and they seem to return the same ratio. For the medium case (cody waite) the hi-lo's are exactly the same, but for the large case, it is as explained above.

Now, as you may notice, the timings seem quite different, so I'm sort of wondering if I "cheated" somehow, or failed to take something into account. More than happy to receive comments on this part.

@pkofod
Copy link
Contributor

pkofod commented Jun 26, 2017

Is it a problem to pack the hi and lo values in a TwicePrecision like I do when it's such a light function call? I didn't think it would matter, but removing it and just returning (n, yhi, ylo) seems to get the new/old time ration down to 0.15 (!). This seems like a very drastic difference. A drastic difference to the openlibm version, but aslo almost a factor of 2 compared to the one with the TwicePrecision-version (the one that is online right now).

((edit: I believe there might be a bug/issue still for the medium/large values in payne hanek...))

Edit: Still working out the kinks in this reduction scheme. There are some small idiosyncrasies that gives some rounding problems for something like 10/10000 values (of course this has to be ironed out) when using the output hi and lo for calculating sin and cos for example. However, it does seems as if there are some speedups to be found here. For values larger than 2.0^20, sin is something like 2-3 times faster. Let's see if those speedups survive the bughunting :)

@pkofod
Copy link
Contributor

pkofod commented Jun 28, 2017

Alright, bugs were not really bugs but floating point "fun". A PR should arrive later today (my time).

Timing these things can be tricky, but after speaking to @simonbyrne last night, I'm fairly sure that the following actually shows the difference between the version in base, and the version in the upcoming PR for small values. Still running the benchmarks on the larger cases. While there are only a finite number of Float64s there sure are many! The picture is only based on 840 values of x.

Spikes are near multiples of pi/2 where additional operations are required for sufficient precision.

Note: the new PR does NOT allocate a new vector each time, so the problem of this issue should be fixed by this.

newvsopenspec

ratio
ratio

@KristofferC
Copy link
Member

Looks great! Impressive work.

yuyichao added a commit that referenced this issue Jul 8, 2017
yuyichao added a commit that referenced this issue Jul 8, 2017
yuyichao added a commit that referenced this issue Jul 8, 2017
yuyichao added a commit that referenced this issue Jul 8, 2017
yuyichao added a commit that referenced this issue Jul 8, 2017
yuyichao added a commit that referenced this issue Jul 8, 2017
yuyichao added a commit that referenced this issue Jul 8, 2017
yuyichao added a commit that referenced this issue Jul 8, 2017
yuyichao added a commit that referenced this issue Jul 8, 2017
yuyichao added a commit that referenced this issue Jul 9, 2017
yuyichao added a commit that referenced this issue Jul 9, 2017
yuyichao added a commit to yuyichao/julia that referenced this issue Jul 9, 2017
yuyichao added a commit that referenced this issue Jul 9, 2017
yuyichao added a commit that referenced this issue Jul 9, 2017
yuyichao added a commit that referenced this issue Jul 10, 2017
yuyichao added a commit that referenced this issue Jul 15, 2017
yuyichao added a commit that referenced this issue Jul 17, 2017
yuyichao added a commit to yuyichao/julia that referenced this issue Jul 17, 2017
yuyichao added a commit that referenced this issue Jul 17, 2017
yuyichao added a commit that referenced this issue Jul 17, 2017
yuyichao added a commit that referenced this issue Jul 17, 2017
yuyichao added a commit that referenced this issue Jul 17, 2017
yuyichao added a commit that referenced this issue Jul 17, 2017
yuyichao added a commit that referenced this issue Jul 22, 2017
yuyichao added a commit that referenced this issue Jul 22, 2017
yuyichao added a commit that referenced this issue Jul 24, 2017
yuyichao added a commit that referenced this issue Jul 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maths Mathematical functions performance Must go faster
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants