
^ is slow #2741

Closed
jtravs opened this issue Apr 3, 2013 · 52 comments
Labels
performance Must go faster

Comments

@jtravs
Contributor

jtravs commented Apr 3, 2013

This is my second day using Julia and I'm getting increasingly excited!

However, I nearly didn't make it even this far, as my early test programs had awful performance, and I nearly gave up with the common statement "beautiful language but the performance sucks". This was after carefully reading the documentation, especially the performance tips section.

Luckily, I happened across this discussion:
https://groups.google.com/d/topic/julia-dev/_UZ2A_Jp8Jc/discussion
and after changing one line of code from using A.^3, where A is a large two dimensional array, to using a nested for loop, I got performance faster than my C++ code (with an almost fair comparison).

I am merely suggesting here that a sentence is added to the performance tips section to note that currently, explicit loops can be much faster than array operations. Such a note would have saved me time, and prevented me from almost abandoning julia.

Thanks for making a truly beautiful (and fast) language,
John

@ViralBShah
Member

Good point. This should be noted in the performance tips. Improving this is clearly on the roadmap: it's not that vectorized code performance is inherently bad, but we need better memory allocation and GC.

Also check out the Devectorize package for automatic devectorization until this is fixed.

@JeffBezanson
Member

Hi @jtravs, if you could show a specific code example, it would help me determine whether there is a performance bug that could be fixed.

@jtravs
Contributor Author

jtravs commented Apr 3, 2013

Hi Jeff (and others), thanks for the responses. Here is the code I first noticed this issue on:

function f(in::Array{Float64,2}, out::Array{Float64,2})
    out = in.^3
end

function g(in::Array{Float64,2}, out::Array{Float64,2})
    for i=1:size(in,2)
        for j=1:size(in,1)
           out[j,i] = in[j,i]*in[j,i]*in[j,i]
        end
    end   
end

A = rand(8192, 200);
P = similar(A)

@time f(A, P);
@time f(A, P);
@time g(A, P);
@time g(A, P);

Q = similar(A)
f(A,P);
g(A,Q);
println(all(Q == P));

I get:

elapsed time: 0.247349689 seconds
elapsed time: 0.218872701 seconds
elapsed time: 0.021574802 seconds
elapsed time: 0.00601552 seconds
true

@johnmyleswhite
Member

One source of your performance trouble is that your f function is spending a bunch of time allocating space, which g doesn't have to do. This happens for two reasons:

  • You create a temporary array in.^3 that you try to assign to out.
  • The assignment to out doesn't actually work for reasons that are a little subtle. You just end up changing the local binding of the out variable rather than editing the contents of the array you pass in. This is one of the things about Julia's syntax that is definitively not intuitive.
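A minimal sketch of the distinction (the function names here are hypothetical, chosen for illustration): plain assignment rebinds the local variable, while indexed assignment writes into the array the caller passed in.

```julia
# Sketch: plain assignment vs. indexed assignment inside a function.
# `f_rebind` and `f_inplace` are illustrative names, not code from the issue.

function f_rebind(a::Array{Float64,2}, out::Array{Float64,2})
    out = a.^3        # allocates a new array and rebinds the local `out`;
    return nothing    # the caller's array is left untouched
end

function f_inplace(a::Array{Float64,2}, out::Array{Float64,2})
    out[:] = a.^3     # still allocates a temporary, but copies the result
    return nothing    # into the caller's array
end

A = [1.0 2.0; 3.0 4.0]
P = zeros(2, 2)
f_rebind(A, P)   # P is still all zeros
f_inplace(A, P)  # P now holds A.^3
```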

@StefanKarpinski
Member

> The assignment to out doesn't actually work for reasons that are a little subtle. You just end up changing the local binding of the out variable rather than editing the contents of the array you pass in. This is one of the things about Julia's syntax that is definitively not intuitive.

I realize my notion of what is or isn't intuitive in programming languages may not be typical, but I don't see how this could possibly do anything else. But, yes, John is totally right about these points.

@johnmyleswhite
Member

The problem is that intuitively accessible and logically consistent are largely distinct properties: Julia might not be able to function in any other way, but the fact that out[:] = foo works while out = foo does not is quite subtle -- in many ways, this distinction requires understanding the nature of pointers.

@timholy
Member

timholy commented Apr 3, 2013

The real problem is that taking a number to a general power is much slower than 3 multiplies.

@JeffBezanson
Member

@timholy go to the head of the class! All of the time is in the pow function. If g is modified to compute in[j,i]^3, the times are the same.

It turns out openlibm pow has a special case for 2, and given the large time difference it looks like it would be worth adding one for 3 as well.
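The rewrite under discussion can be sketched as below. Raising to a small integer power through the generic pow path can be much slower than explicit multiplies; with a libm special case for y == 3 the two should also agree exactly, but in general they can differ by a few ulp, so the sketch only assumes approximate agreement.

```julia
# Two ways to cube a Float64: the generic pow path vs. explicit multiplies.
cube_pow(x::Float64) = x^3
cube_mul(x::Float64) = x * x * x

# The loop in the original report effectively replaces cube_pow with cube_mul.
cube_pow(1.2345), cube_mul(1.2345)
```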

@JeffBezanson
Member

I added cases for 3 and 4 in openlibm. Closing this as unrelated to array expressions.

@JeffBezanson
Member

Maybe we should use system libm on mac? Can you try a quick performance comparison of pow in openlibm vs. mac?

@jtravs
Contributor Author

jtravs commented Apr 4, 2013

Being able to choose the libm would be nice; then, for example, some users could use the optimized Intel libm: http://software.intel.com/en-us/articles/implement-the-libm-math-library

@ViralBShah
Member

We already have the ability to choose between openlibm and the system libm. Perhaps we need to make the libm choice flexible like we do with the BLAS.

Cc: @staticfloat

@ViralBShah
Member

@JeffBezanson See experiments in https://gist.github.com/ViralBShah/5309148

Basically, except for your performance hack for small integer powers, Apple's pow is faster.

@ViralBShah
Member

APSL qualifies as a free software license according to FSF:
http://www.gnu.org/philosophy/apsl.html

It may be worth pulling in a few of these functions into openlibm.

@pao
Member

pao commented Apr 4, 2013

FSF-free doesn't necessarily mean it's license compatible (for instance, APSL is not GPL compatible). I'm not sure what the rules are for MIT.

@ViralBShah
Member

Ah yes, I forgot there are all those fine distinctions. It is such a shame that after all these decades of numerical computation, we do not have a single open-source, high-quality, high-performance libm implementation.

@kmsquire
Member

kmsquire commented Apr 4, 2013

The main reason it's not GPL compatible seems to be that you can't combine it with GPL code and simply release the whole thing as GPL, because the APSL requires that you still include license notices, and the GPL wants to be the whole shebang.

Compliance with the license mainly seems to entail including the required license notices. It explicitly allows linking with other libraries, even non-free ones, as long as the license terms are followed. So linking and releasing don't seem to be an issue, as long as you're willing to comply with the license.

Some obvious issues:

  • It's unclear how prominent the notice needs to be.
  • Neither the files pointed to above nor the directory containing them includes a license text, only a copyright notice, which technically grants no rights at all (though some of the other files do carry license information).
  • The terms of the license may be inconvenient enough that you might not want to include it in openlibm.

The first two might be cleared up by contacting Apple and/or the original developers, if anyone cares to (and hey, maybe the developers would be interested in Julia).

@JeffBezanson
Member

I'd say GPL-incompatible is not acceptable.

@ViralBShah
Member

Yes, incompatibility with GPL is certainly a deal breaker.

@kmsquire
Member

kmsquire commented Apr 4, 2013

No problem, of course, but I'm curious why being incompatible with the GPL is bad for Julia? You don't have any GPL code (yet) in mainline Julia, and there seems to be explicit emphasis on excluding GPL code.

Don't get me wrong: I'm actually a fan of the GPL, though I don't think it's appropriate for everything, and I have no issues with the license choices made for Julia. I'm just not following your reasoning in this case.

@JeffBezanson
Member

There are many GPL libraries everybody uses with julia, so the base system has to be at least GPL-compatible.

@kmsquire
Member

kmsquire commented Apr 4, 2013

Okay, so that would, e.g., allow Julia to be included/embedded in a GPL program. Makes sense, thanks.

@ViralBShah
Member

openlibm is also something that could potentially be used more widely, and it would be nice to be GPL compatible.

@jtravs
Contributor Author

jtravs commented Apr 9, 2013

Is the discussion in this thread:
https://groups.google.com/d/topic/julia-users/c6TmwkOkEeg/discussion
a related issue? Replacing

factor = h/(norm(dr)^3)

(where dr is a 3 element array) with:

denom = sqrt(dr[1]*dr[1] + dr[2]*dr[2] + dr[3]*dr[3])
factor = h/(denom*denom*denom)

gives a very big speed-up.

Thanks,
J

@timholy timholy reopened this Apr 9, 2013
@timholy
Member

timholy commented Apr 9, 2013

It's worth noting that part of the speedup is due to replacing BLAS.norm with the explicit formula, but the majority of the benefit comes from avoiding even the new-and-improved pow function. Changing

denom = sqrt(dr[1]*dr[1] + dr[2]*dr[2] + dr[3]*dr[3])

to

denom = sqrt(dr[1]^2 + dr[2]^2 + dr[3]^2)

leads to a substantial performance hit.
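The three variants compared above can be put side by side in a self-contained sketch (`h` and `dr` are placeholders for the user's simulation values; in modern Julia `norm` lives in the LinearAlgebra stdlib, while in 2013 it was in Base). All three should agree to within a few ulp; the performance differences come from which norm/pow paths each one exercises.

```julia
using LinearAlgebra  # for `norm` in modern Julia

# Three ways to compute h / |dr|^3 for a 3-element vector dr.
factor_norm(h, dr) = h / (norm(dr)^3)          # generic norm and pow

function factor_mul(h, dr)                      # explicit multiplies only
    denom = sqrt(dr[1]*dr[1] + dr[2]*dr[2] + dr[3]*dr[3])
    h / (denom*denom*denom)
end

function factor_sq(h, dr)                       # ^2 inside the sqrt
    denom = sqrt(dr[1]^2 + dr[2]^2 + dr[3]^2)
    h / (denom*denom*denom)
end
```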

@ViralBShah
Member

This issue has a misleading title. Perhaps we should have a new issue.

@simonster
Member

This was optimized into three multiplies, but it isn't optimized after f2b3192 (which was necessary to fix #6506), so ultimately we're waiting here on a fix for the LLVM bug that prevents us from using the powi intrinsic.

@simonster
Member

I suppose another question is if this optimization is safe. I tested (x*x)*(x*x) against x^4 on 100,000 random numbers generated with rand() (code here). It seems that x^4 always gives the correct result to 0.5 ulp (it always matches float64(big(x)^4)) whereas (x*x)*(x*x) gives a less accurate result for half of the inputs. If I have computed it correctly, then the maximum error of (x*x)*(x*x) exceeds 1.8 ulp within this range. For higher powers the result is even less accurate.
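The kind of accuracy check described above can be sketched as follows: measure the error of a Float64 result against a reference computed in extended precision and rounded to nearest, in units of the last place. `ulp_error` is a hypothetical helper written for this sketch, not the code from the linked gist.

```julia
# Error of `approx` as an approximation of x^p, measured in ulps.
function ulp_error(approx::Float64, x::Float64, p::Int)
    ref = Float64(big(x)^p)       # BigFloat result rounded to nearest Float64
    abs(approx - ref) / eps(ref)  # distance in units of the last place
end

x = rand()
err_pairwise = ulp_error((x*x)*(x*x), x, 4)  # pairwise multiplication
err_pow      = ulp_error(x^4, x, 4)          # the ^ operator
```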

@JeffBezanson
Member

That is a very good point.

@StefanKarpinski
Member

That's interesting, because pairwise summation tends to be more accurate than iterative summation. This is a case where you've found pairwise multiplication of the same value to be less accurate than iterative multiplication. Clearly it's a rather different issue, since you're dealing with the same value over and over again, but I wonder whether it's really generalizable. Is iterative always worse than pairwise? Or are there situations where the pairwise optimization gives better results than iterative?

@andrioni
Member

Actually, he compared pairwise against libm pow, not against iterated multiplication. I've just updated the gist with it.

@StefanKarpinski
Member

Ah, sorry. I misunderstood that.

@JeffBezanson
Member

I added a special case to openlibm for y == 4. Perhaps that was ill-advised. However, we are using the powi_llvm intrinsic, which calls whatever pow function it can find (to work around an LLVM bug), probably the system libm.

@simonster
Member

Yes, it seems this must be calling something besides openlibm, which would also explain why ^ is always within 0.5 ulp for me but not quite for @andrioni. If I call openlibm directly with ccall((:pow, Base.Math.libm), Float64, (Float64, Float64), x, 4), then the results are unsurprisingly the same as pairwise multiplication given the special case. For x^5 I can directly compare with openlibm, and it looks like multiplication gives a max error of ~2.6 ulp while openlibm gives ~0.8 ulp and (on my system) ^ gives <0.5 ulp.

@andrioni
Member

^ for me had a maximum error of 0.5008007537332742 ulp. I'm on OS X 10.7.

@simonster
Member

For me ^ is 0.49999994903638395 on Linux. Apparently glibc's pow guarantees that the result is correctly rounded to nearest for all inputs. Also apparently it is (or maybe only used to be?) so slow on pathological inputs that many bug reports were filed.

@timholy
Member

timholy commented Jun 24, 2014

It's definitely not calling libm, because otherwise x^2 would be much slower than x*x; last I knew, we can't inline ccalls.

@JeffBezanson
Member

No, but LLVM does. It replaces pow(x,2) with x*x. This is mildly sketchy, but in this particular case it's quite nice.

@timholy
Member

timholy commented Jun 25, 2014

Ah, didn't realize that. That is nice.

@quinnj
Member

quinnj commented Aug 29, 2014

Any other work to be done here? Seems like this has evolved over a few different issues.

@stevengj
Member

Still waiting on a fix for the LLVM bug so that we can use powi, I think?

@quinnj
Member

quinnj commented Aug 29, 2014

Was that what @Keno referred to as fixed?

@Keno
Member

Keno commented Aug 29, 2014

Yes, that is fixed in LLVM 3.6

@stevengj
Member

So, once we upgrade to LLVM 3.6, we can re-enable the powi intrinsic and close this issue.

@Keno
Member

Keno commented Aug 29, 2014

We can do it right now with #ifdef LLVM36 (will do, hold on).
