Improving ieee754_rem_pio2 #22004

giordano · 2017-05-21T13:48:13Z

mod2pi is a useful function, being more accurate of mod(x, 2pi), but it's also much slower than mod, in the case of Float64 input (instead for BigFloat is only very slightly slower). According to @profile, the bottleneck of mod2pi(::Float64) is ieee754_rem_pio2, in particular the allocation of the y array (line 740 of math.jl):

By looking at the function it wraps, https://github.com/JuliaLang/openspecfun/blob/39699a1c1824bf88410cabb8a7438af91ea98f4c/rem_pio2/e_rem_pio2.c, it seems to me that the initial value of the y array isn't relevant, so it should be safe to initialize at a random value with Vector{Float64{(2). Here is a performance comparison between current implementation and the implementation I propose:

julia> using BenchmarkTools

julia> M = collect(-10pi:1e-4:10pi);

julia> function ieee754_rem_pio2_faster(x::Float64)
           y = Vector{Float64}(2)
           n = ccall((:__ieee754_rem_pio2, Base.Math.openspecfun), Cint, (Float64,Ptr{Float64}), x, y)
           return (n,y)
       end
ieee754_rem_pio2_faster (generic function with 1 method)

julia> @benchmark Base.Math.ieee754_rem_pio2.($M)
BenchmarkTools.Trial: 
  memory estimate:  81.49 MiB
  allocs estimate:  1256640
  --------------
  minimum time:     37.773 ms (0.00% GC)
  median time:      231.986 ms (86.90% GC)
  mean time:        186.099 ms (81.66% GC)
  maximum time:     282.167 ms (87.26% GC)
  --------------
  samples:          27
  evals/sample:     1

julia> @benchmark ieee754_rem_pio2_faster.($M)
BenchmarkTools.Trial: 
  memory estimate:  81.49 MiB
  allocs estimate:  1256640
  --------------
  minimum time:     34.056 ms (0.00% GC)
  median time:      225.573 ms (88.34% GC)
  mean time:        180.034 ms (83.64% GC)
  maximum time:     260.522 ms (89.77% GC)
  --------------
  samples:          29
  evals/sample:     1

julia> Base.Math.ieee754_rem_pio2.(M) == ieee754_rem_pio2_faster.(M)
true

There should be a speed-up of ~10%, not much, but better than nothing. In addition, the last test shows that the result is exactly equal, not approximately.

I already have a patch ready to be submitted, but I was wondering whether something smarter can be done to avoid allocation, like for sincos in PR #21589. However, writing LLVM code goes way beyond my capabilities.

As a last remark, given how the return value of ieee754_rem_pio2 is used in math.jl, I don't think that y needs to be an array at all (y[1] and y[2] are always used as two distinct elements, not as a single vector). Is it possible to patch openspecfun to redefine ieee754_rem_pio2 in order not to use an array? I don't know if that interface is used elsewhere, though.

The text was updated successfully, but these errors were encountered:

yuyichao · 2017-05-21T14:16:19Z

You can at least do

diff --git a/base/math.jl b/base/math.jl
index 4d8eb64fc8..ab21d70acf 100644
--- a/base/math.jl
+++ b/base/math.jl
@@ -737,9 +737,9 @@ function ieee754_rem_pio2(x::Float64)
     # this is just wrapping up
     # https://github.com/JuliaLang/openspecfun/blob/master/rem_pio2/e_rem_pio2.c

-    y = [0.0,0.0]
-    n = ccall((:__ieee754_rem_pio2, openspecfun), Cint, (Float64,Ptr{Float64}), x, y)
-    return (n,y)
+    y = Ref{NTuple{2,Float64}}()
+    n = ccall((:__ieee754_rem_pio2, openspecfun), Cint, (Float64,Ptr{Void}), x, y)
+    return (n,y[])
 end

 # multiples of pi/2, as double-double (ie with "tail")

Edit: inlining this function also seems to have a big effect.

yuyichao · 2017-05-21T14:34:04Z

And if you want to do the llvmcall trick it's

diff --git a/base/math.jl b/base/math.jl
index 4d8eb64fc8..78afe3e891 100644
--- a/base/math.jl
+++ b/base/math.jl
@@ -737,9 +737,16 @@ function ieee754_rem_pio2(x::Float64)
     # this is just wrapping up
     # https://github.com/JuliaLang/openspecfun/blob/master/rem_pio2/e_rem_pio2.c

-    y = [0.0,0.0]
-    n = ccall((:__ieee754_rem_pio2, openspecfun), Cint, (Float64,Ptr{Float64}), x, y)
-    return (n,y)
+    Base.llvmcall("""
+                  %f = bitcast i8* %1 to i32 (double, [2 x double]*)*
+                  %py = alloca [2 x double]
+                  %n = call i32 %f(double %0, [2 x double]* %py)
+                  %y = load [2 x double], [2 x double]* %py
+                  %res0 = insertvalue {i32, [2 x double]} undef, i32 %n, 0
+                  %res1 = insertvalue {i32, [2 x double]} %res0, [2 x double] %y, 1
+                  ret {i32, [2 x double]} %res1
+                  """, Tuple{Cint,NTuple{2,Float64}}, Tuple{Float64,Ptr{Void}}, x,
+                  cglobal((:__ieee754_rem_pio2, openspecfun)))
 end

 # multiples of pi/2, as double-double (ie with "tail")

giordano · 2017-05-21T14:56:07Z

Wow, with the llvmcall trick mod2pi would be even faster than mod(x, 2pi)! I think you should open a PR, you did all the work ;-)

ararslan · 2017-05-21T21:49:04Z

Would it be possible to make the optimization without calling into openspecfun? The Base dependency on openspecfun is slated to be removed in the near future, but mod2pi would stay in Base. So if it's possible to speed this up with a pure-Julia implementation, that will make excising openspecfun that much easier.

giordano · 2017-05-21T21:51:59Z

@ararslan I looked into rewriting ieee754_rem_pio2 in Julia (that would allow not using a y array), but I'm not able to completely port it, I miss a few lines, like GET_HIGH_WORD(hx,x) or INSERT_WORDS(z, ix - ((int32_t)(e0<<20)), low).

ararslan · 2017-05-21T22:20:10Z

I believe @simonbyrne started on an implementation a while back.

simonbyrne · 2017-05-24T15:25:56Z

It would be great to have a pure Julia implementation, and has been on my todo list for a long time. Hopefully this can be part of @pkofod's GSoC project.

Doing a range reduction modulo pi usually requires a couple of different strategies: typically Cody-Waite for smallish values, and Payne-Hanek for large ones. I have an (untested) implementation for Payne-Hanek here:
https://gist.github.com/simonbyrne/d640ac1c1db3e1774cf2d405049beef3

The books by Jean-Michel Muller are a good reference for this stuff.

giordano · 2017-05-24T22:07:12Z

Very good there is someone that is going to work on this within GSoC! I'm eager to see if a pure Julia implementation can outperform the C one :-)

pkofod · 2017-06-25T20:38:16Z

Let's see... I'm just gonna hijack this thread.

I had a go over at https://github.com/pkofod/RemPiO2.jl

There are some contribution credit headers missing if this is to be "copied" over to base, but the functionality should be there. Payne Hanek returns a t::TwicePrecision{Float64} such that t.hi+t.lo==sum(y) where y is the vector of [hi, lo] disussed above. The two hi's and two lo's aren't identical between the two implementations, but if I feed them to a sin or cos kernel based on openlibm, I get the same results. Comments are more than welcome

For timing differences, have a look at https://gist.github.com/pkofod/81435213366a19fc5a0d6ce4f1c64c4d . I've updated runtests.jl since that for larger numbers, and they seem to return the same ratio. For the medium case (cody waite) the hi-lo's are exactly the same, but for the large case, it is as explained above.

Now, as you may notice, the timings seem quite different, so I'm sort of wondering if I "cheated" somehow, or failed to take something into account. More than happy to receive comments on this part.

pkofod · 2017-06-26T20:39:03Z

Is it a problem to pack the hi and lo values in a TwicePrecision like I do when it's such a light function call? I didn't think it would matter, but removing it and just returning (n, yhi, ylo) seems to get the new/old time ration down to 0.15 (!). This seems like a very drastic difference. A drastic difference to the openlibm version, but aslo almost a factor of 2 compared to the one with the TwicePrecision-version (the one that is online right now).

((edit: I believe there might be a bug/issue still for the medium/large values in payne hanek...))

Edit: Still working out the kinks in this reduction scheme. There are some small idiosyncrasies that gives some rounding problems for something like 10/10000 values (of course this has to be ironed out) when using the output hi and lo for calculating sin and cos for example. However, it does seems as if there are some speedups to be found here. For values larger than 2.0^20, sin is something like 2-3 times faster. Let's see if those speedups survive the bughunting :)

pkofod · 2017-06-28T10:26:09Z

Alright, bugs were not really bugs but floating point "fun". A PR should arrive later today (my time).

Timing these things can be tricky, but after speaking to @simonbyrne last night, I'm fairly sure that the following actually shows the difference between the version in base, and the version in the upcoming PR for small values. Still running the benchmarks on the larger cases. While there are only a finite number of Float64s there sure are many! The picture is only based on 840 values of x.

Spikes are near multiples of pi/2 where additional operations are required for sufficient precision.

Note: the new PR does NOT allocate a new vector each time, so the problem of this issue should be fixed by this.

ratio

KristofferC · 2017-06-28T10:52:51Z

Looks great! Impressive work.

Fix #22004

Fix JuliaLang#22004

Fix #22004

Fix JuliaLang#22004

Fix #22004

ararslan added the maths Mathematical functions label May 22, 2017

pkofod mentioned this issue May 24, 2017

Ressources pkofod/gsoc-libm#1

Open

JeffBezanson added the performance Must go faster label Jun 1, 2017

ararslan mentioned this issue Jun 16, 2017

Actually decouple special function (openspecfun) from base #22390

Closed

pkofod mentioned this issue Jun 28, 2017

Remove ieee754_rem_pio2 in favor of a rem_pio2_kernel written in Julia. #22603

Merged

yuyichao added a commit that referenced this issue Jul 8, 2017

Improving ieee754_rem_pio2

8135b09

Fix #22004

yuyichao added a commit that referenced this issue Jul 8, 2017

Improving ieee754_rem_pio2

aae4455

Fix #22004

yuyichao added a commit that referenced this issue Jul 8, 2017

Improving ieee754_rem_pio2

4e3af21

Fix #22004

yuyichao added a commit that referenced this issue Jul 8, 2017

Improving ieee754_rem_pio2

70f88e0

Fix #22004

yuyichao added a commit that referenced this issue Jul 8, 2017

Improving ieee754_rem_pio2

340ae86

Fix #22004

yuyichao added a commit that referenced this issue Jul 8, 2017

Improving ieee754_rem_pio2

d2b9c21

Fix #22004

yuyichao added a commit that referenced this issue Jul 8, 2017

Improving ieee754_rem_pio2

65dda36

Fix #22004

yuyichao added a commit that referenced this issue Jul 8, 2017

Improving ieee754_rem_pio2

17f09fa

Fix #22004

yuyichao added a commit that referenced this issue Jul 8, 2017

Improving ieee754_rem_pio2

70f6b71

Fix #22004

yuyichao added a commit that referenced this issue Jul 9, 2017

Improving ieee754_rem_pio2

ee5ead2

Fix #22004

yuyichao added a commit that referenced this issue Jul 9, 2017

Improving ieee754_rem_pio2

826c083

Fix #22004

yuyichao added a commit to yuyichao/julia that referenced this issue Jul 9, 2017

Improving ieee754_rem_pio2

712640c

Fix JuliaLang#22004

yuyichao added a commit that referenced this issue Jul 9, 2017

Improving ieee754_rem_pio2

39d3c44

Fix #22004

yuyichao added a commit that referenced this issue Jul 9, 2017

Improving ieee754_rem_pio2

1b93628

Fix #22004

yuyichao added a commit that referenced this issue Jul 10, 2017

Improving ieee754_rem_pio2

aed9635

Fix #22004

yuyichao added a commit that referenced this issue Jul 15, 2017

Improving ieee754_rem_pio2

ba206a6

Fix #22004

yuyichao added a commit that referenced this issue Jul 17, 2017

Improving ieee754_rem_pio2

f542154

Fix #22004

yuyichao added a commit to yuyichao/julia that referenced this issue Jul 17, 2017

Improving ieee754_rem_pio2

e6bcbe1

Fix JuliaLang#22004

yuyichao added a commit that referenced this issue Jul 17, 2017

Improving ieee754_rem_pio2

a1fd22e

Fix #22004

yuyichao added a commit that referenced this issue Jul 17, 2017

Improving ieee754_rem_pio2

bf39419

Fix #22004

yuyichao added a commit that referenced this issue Jul 17, 2017

Improving ieee754_rem_pio2

b705cc0

Fix #22004

yuyichao added a commit that referenced this issue Jul 17, 2017

Improving ieee754_rem_pio2

6f0601f

Fix #22004

yuyichao added a commit that referenced this issue Jul 17, 2017

Improving ieee754_rem_pio2

c1eb4da

Fix #22004

yuyichao added a commit that referenced this issue Jul 22, 2017

Improving ieee754_rem_pio2

bef3e77

Fix #22004

yuyichao added a commit that referenced this issue Jul 22, 2017

Improving ieee754_rem_pio2

f71401a

Fix #22004

yuyichao added a commit that referenced this issue Jul 24, 2017

Improving ieee754_rem_pio2

cd8f33e

Fix #22004

yuyichao added a commit that referenced this issue Jul 27, 2017

Improving ieee754_rem_pio2

05cefe8

Fix #22004

yuyichao closed this as completed in b1a188c Jul 28, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improving ieee754_rem_pio2 #22004

Improving ieee754_rem_pio2 #22004

giordano commented May 21, 2017

yuyichao commented May 21, 2017 •

edited

Loading

yuyichao commented May 21, 2017 •

edited

Loading

giordano commented May 21, 2017

ararslan commented May 21, 2017

giordano commented May 21, 2017 •

edited

Loading

ararslan commented May 21, 2017

simonbyrne commented May 24, 2017

giordano commented May 24, 2017

pkofod commented Jun 25, 2017

pkofod commented Jun 26, 2017 •

edited

Loading

pkofod commented Jun 28, 2017 •

edited by StefanKarpinski

Loading

KristofferC commented Jun 28, 2017

Improving ieee754_rem_pio2 #22004

Improving ieee754_rem_pio2 #22004

Comments

giordano commented May 21, 2017

yuyichao commented May 21, 2017 • edited Loading

yuyichao commented May 21, 2017 • edited Loading

giordano commented May 21, 2017

ararslan commented May 21, 2017

giordano commented May 21, 2017 • edited Loading

ararslan commented May 21, 2017

simonbyrne commented May 24, 2017

giordano commented May 24, 2017

pkofod commented Jun 25, 2017

pkofod commented Jun 26, 2017 • edited Loading

pkofod commented Jun 28, 2017 • edited by StefanKarpinski Loading

KristofferC commented Jun 28, 2017

yuyichao commented May 21, 2017 •

edited

Loading

yuyichao commented May 21, 2017 •

edited

Loading

giordano commented May 21, 2017 •

edited

Loading

pkofod commented Jun 26, 2017 •

edited

Loading

pkofod commented Jun 28, 2017 •

edited by StefanKarpinski

Loading