
don't short-circuit chained comparisons #16088

Closed
StefanKarpinski opened this issue Apr 28, 2016 · 14 comments
Labels
  • breaking: This change will break code
  • compiler:lowering: Syntax lowering (compiler front end, 2nd stage)
  • good first issue: Indicates a good issue for first-time contributors to Julia
  • help wanted: Indicates that a maintainer wants help on an issue or pull request
  • needs decision: A decision on this change is needed
  • speculative: Whether the change will be implemented is speculative
Comments

@StefanKarpinski
Member

It's harder to lower, more surprising and usually less efficient. We should just lower a < b < c to (a < b) & (b < c).
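For concreteness, a sketch of the two behaviors written as ordinary functions (the chained_* names are illustrative, not what lowering actually produces):

chained_lazy(a, b, c)  = a < b && b < c      # current: second comparison runs only if a < b holds
chained_eager(a, b, c) = (a < b) & (b < c)   # proposed: both comparisons always run

chained_lazy(1, 2, 3) == chained_eager(1, 2, 3) == true
chained_lazy(2, 1, 3) == chained_eager(2, 1, 3) == false   # same result; they differ only in what gets evaluated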

@StefanKarpinski StefanKarpinski added speculative Whether the change will be implemented is speculative needs decision A decision on this change is needed help wanted Indicates that a maintainer wants help on an issue or pull request parser Language parsing and surface syntax good first issue Indicates a good issue for first-time contributors to Julia labels Apr 28, 2016
@JeffBezanson JeffBezanson added compiler:lowering Syntax lowering (compiler front end, 2nd stage) and removed parser Language parsing and surface syntax labels Apr 28, 2016
@yuyichao yuyichao added the breaking This change will break code label Apr 28, 2016
@stevengj
Member

stevengj commented Apr 28, 2016

I wasn't even aware that these short-circuit, but I see that it is documented in the manual. (I guess it comes from Python?) Somehow I doubt that many programs rely on this, but I don't know how to go about doing a survey.

@stevengj
Member

stevengj commented Apr 28, 2016

The manual says "However, the order of evaluations in a chained comparison is undefined". That makes the short-circuit behavior useless, so you might as well get rid of it. Any code that relies on short-circuiting is probably already broken, thanks to the undefined order.

@ArchRobison
Contributor

I'm not sure about the efficiency trade-off. Assuming short-circuit (lazy) semantics, several cases come to mind for a() < b() < c():

  • Evaluation of c() is expensive, and therefore short-circuiting wins.
  • Evaluation of c() is cheap, and therefore the extra branch is the overriding cost. But if c() is that cheap, it's probably all inline and the compiler can estimate its cost fairly well, leading to two subcases of concern:
    • c() is provably safe to evaluate unnecessarily. Then the compiler can convert short-circuit to eager evaluation. (I've worked on a compiler that did this, but I'm not sure about LLVM.)
    • c() is not provably safe to evaluate unnecessarily. Common examples are an indirect load off a possibly null pointer or a possible divide by zero (see the sketch below). Not being able to evaluate c() unnecessarily can hurt vectorization on machines without predication (e.g. not a problem for AVX-512, but a problem for SSE/AVX).

For the last case, there is also a flow-analysis loss for lazy evaluation: with evaluation of c() uncertain, subsequent evaluations "downstream" cannot be proven to be redundant. But modern partial-redundancy elimination (PRE) techniques deal with that. (I'm not sure what the state of PRE is in LLVM. It seemed to be missing from LLVM 3.3, as far as I recall.)
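As a concrete (hypothetical) instance of that last sub-case: with short-circuit semantics the division below is never reached when d == 0, while an eager rewrite evaluates it unconditionally and throws:

# Lazy: when d == 0 the chain stops after 0 < d fails, so div(100, d) never runs.
in_bound_lazy(d)  = 0 < d && d ≤ div(100, d)

# Eager: both comparisons are always evaluated, so d == 0 throws a DivideError.
in_bound_eager(d) = (0 < d) & (d ≤ div(100, d))

in_bound_lazy(0)    # false
in_bound_eager(0)   # throws DivideError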

Maybe we could add a switch to control lazy versus eager evaluation of chained comparisons and see what the impact is on current benchmarks of interest?

@ArchRobison
Contributor

With respect to "order of evaluations in a chained comparison is undefined", I'd be in favor of nailing down the order, regardless of whether we decide on eager or lazy evaluation, because having the order differ between implementations can cause a lot of pain. The optimization advantages of an undefined order, useful on ye olde PDP-11, seem to have evaporated with modern compilers. Even the C++ committee is seriously considering nailing down the order.

@StefanKarpinski
Member Author

Then the compiler can convert short-circuit to eager evaluation.

@ArchRobison: if LLVM was better at doing this, I would agree with your analysis, but it seems to either never do this or be very bad at it. Perhaps that's effectively agreement since LLVM could be improved to make this argument go through. Similarly for the second subcase.

I wholeheartedly agree about making the order of evaluation defined.

@stevengj
Member

stevengj commented Apr 28, 2016

I would think that a lot of chained comparisons in practice are safe to evaluate unnecessarily, e.g. 0 < i ≤ n and similar are probably very common. I just checked, and LLVM 3.7.1 in Julia 0.5 is not able to convert short-circuit to eager evaluation in 0 < i ≤ n for Int types.
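One way to check (a sketch; the exact output depends on the Julia/LLVM version) is to compare the generated code for the chained form against an explicitly eager one:

in_range_chained(i, n) = 0 < i ≤ n           # short-circuiting chain
in_range_eager(i, n)   = (0 < i) & (i ≤ n)   # explicitly eager form

@code_llvm in_range_chained(3, 10)   # per the observation above, still contains a conditional branch
@code_llvm in_range_eager(3, 10)     # two icmps combined with an `and`, no branch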

In the case where one of the operands is very expensive to evaluate unnecessarily, one can always write && explicitly. Conversely, in the current situation you could always write & explicitly, but eager evaluation seems like a better default to me, since "cheap" operands are, at a guess, far more common (assuming eager is faster in that case). Of course, it would be even better if LLVM were just smarter.

@stevengj
Member

stevengj commented Apr 28, 2016

I just did a quick grep through an archive of all registered Julia packages for chained < (or <=) and chained == expressions, and in all cases I could find it was safe (and probably advantageous) to evaluate eagerly. Almost all operands were constants, variables, simple functions like length(A) or size(M,1) or zero(𝜙) or kernel.N𝜙 + 1, or array lookups x[i]. In the few examples where there was a more expensive operand, it was something like lb <= fitness(gt) < ub where eager evaluation is safe because you have to evaluate fitness(gt) anyway.

Of course, I could have missed some; this is hard to grep for, and there were a lot of false positives to sift through. But I think I at least got a representative sampling.

@ArchRobison
Contributor

Thanks for looking. Array lookups can be surprisingly expensive, depending on whether they hit L1 cache or not, and branches can be surprisingly cheap, depending on whether the branch predictor bets right. (Modern CPUs have become a casino.)

I tried the 0 < i ≤ n example in C both ways:

void bar();

void f(int i, int n) {
    if (0<i && i<=n) bar();
}

void g(int i, int n) {
    if ((0<i) & (i<=n)) bar();
}

For 64-bit x86, clang -O3 and icc -O3 leave the first as lazy and the second as eager. gcc 5.2 changed both to the lazy form, suggesting that the gcc developers bet that lazy is faster than eager on 64-bit x86. Story on other processors may differ.

I think timings of benchmarks of interest are the data we need to make a decision on performance grounds, and performance seems to be one of the motivations for this issue.

@mbauman
Member

mbauman commented Apr 28, 2016

I looked at this back when I was introducing the array indexing infrastructure. At the time, it was surprisingly cheaper to have the short-circuiting semantics (#10525 (comment)) — those branches are extremely predictable.

In this specific case, it'd be interesting to try reinterpret(UInt, i - 1) < n; is there a particular form in which LLVM would be able to do that optimization (if it is indeed an optimization)?
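For reference, that trick folds both comparisons of 0 < i ≤ n into a single unsigned comparison (a sketch, valid when n ≥ 0):

in_range_unsigned(i::Int, n::Int) = reinterpret(UInt, i - 1) < reinterpret(UInt, n)

# If i ≤ 0, then i - 1 reinterpreted as a UInt wraps around to a huge value,
# so the single comparison fails, matching 0 < i ≤ n (for n ≥ 0).
all(in_range_unsigned(i, n) == (0 < i ≤ n) for i in -3:3, n in 0:3)   # true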

@stevengj
Member

A quick benchmark:

function foo(a)
       s = 0
       for i in 2:length(a)
           s += a[i-1] < i < a[i]
       end
end
function bar(a)
       s = 0
       for i in 2:length(a)
           s += (a[i-1] < i) & (i < a[i])
       end
end
a = rand(Int, 10^4)
@elapsed(foo(a)) / @elapsed(bar(a))

gives a 5x slowdown from short-circuiting. This seems like a pretty huge penalty, especially considering that the comparisons per se are only part of the loop body. Surprisingly, if I use @inbounds in the loop body, the ratio decreases to 1.5 (i.e. short-circuit is slower, but not by nearly as much). This is confusing to me.

@ArchRobison
Contributor

Nice example. One improvement: make foo and bar return the value of s. Otherwise the benchmark is just searching for out-of-bounds indices. Though even with that fix I'm seeing the crazy big time ratio (about 3.4x for a 10^7 element a). There's a lesson here that I want to figure out.
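For reference, the corrected pair would look like this (same as the quick benchmark above except that both functions return s; the *_fixed names are just for illustration):

function foo_fixed(a)
    s = 0
    for i in 2:length(a)
        s += a[i-1] < i < a[i]
    end
    s
end
function bar_fixed(a)
    s = 0
    for i in 2:length(a)
        s += (a[i-1] < i) & (i < a[i])
    end
    s
end
a = rand(Int, 10^4)
@elapsed(foo_fixed(a)) / @elapsed(bar_fixed(a))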

@ArchRobison
Contributor

The slowdown is surely a branch-misprediction issue, since using a = rand(10^4:10^5, 10^4) in the example brings the ratio down to about 0.9 for me. I tried a variation of the example with lots of delinquent loads (see below), and the ratio was about 1.2x, still in favor of eager.

function foo(a,b)
       s = 0
       for i in 2:length(a)
           s += a[b[i-1]] < i < a[b[i]]
       end
       s
end
function bar(a,b)
       s = 0
       for i in 2:length(a)
           s += (a[b[i-1]] < i) & (i < a[b[i]])
       end
       s
end
a = rand(Int, 10^8)
b = rand(1:10^8, 10^8)
@elapsed(foo(a,b)) / @elapsed(bar(a,b))

@ArchRobison
Contributor

A thought that crossed my mind is that in the quick benchmark the chain's result is used as a value rather than to direct control flow. It might be useful to understand how often that happens compared to the control-flow case.

I personally love the chained form for assertions, but in that scenario the branches are highly predictable unless I'm having a bad day.
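For instance, a typical (hypothetical) assertion of this shape:

# A chained comparison as a bounds assertion; this branch goes the same way
# on essentially every call, so it predicts well.
function safe_get(a::Vector, i::Integer)
    @assert 0 < i ≤ length(a)
    return a[i]
end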

@kshyatt kshyatt added the Hacktoberfest Good for Hacktoberfest participants label Oct 5, 2016
@JeffBezanson JeffBezanson removed the Hacktoberfest Good for Hacktoberfest participants label May 2, 2017
@JeffBezanson JeffBezanson added this to the 1.0 milestone May 2, 2017
@StefanKarpinski
Member Author

Triage: resolved that this doesn't matter much and changing it now would be pointless churn. There are corner cases where the current behavior can be more efficient, and in the cases where it's less efficient, future optimization work can close the gap, whereas the laziness is semantically meaningful and could not be recovered by optimization if it were removed.
