
[X86] cmov conversion hurts binary search performance #39374

Open · nikic opened this issue Dec 14, 2018 · 23 comments
Labels: backend:X86, bugzilla (Issues migrated from bugzilla)

Comments


nikic commented Dec 14, 2018

Bugzilla Link: 40027
Version: trunk
OS: All
CC: @adibiagio, @chandlerc, @topperc, @Dushistov, @ot, @jdm, @LebedevRI, @RKSimon, @rnk, @rotateright

Extended Description

Originally reported at: rust-lang/rust#53823

The cmov conversion pass converts a cmov in Rust's binary search implementation into a branch. Because binary search branches are poorly predicted, this significantly hurts performance. For small data sets, where performance is not dominated by cache behavior, the slowdown is around 2-3x.

https://gist.github.com/nikic/13e907d7097f74a92d082fcc61bc212c has the LLVM IR for the function, as well as generated assembly with cmov conversion enabled (default) and disabled (-x86-cmov-converter=false).

It would be great if the cmov conversion heuristics could be adjusted in some way to avoid converting this case.


nikic commented Dec 14, 2018

For reference, the relevant line in the original Rust code is: https://github.com/rust-lang/rust/blob/059e6a6f57f4e80d527a3cd8a8afe7f51f01af8e/src/libcore/slice/mod.rs#L1423
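
In rough C terms, the hot loop looks like this (a sketch for orientation only; the linked Rust source is the authoritative version, and the names here are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* Rough C rendering of the binary search loop. */
    size_t search(const uint64_t *slice, size_t len, uint64_t key) {
        size_t base = 0, size = len;
        while (size > 1) {
            size_t half = size / 2;
            size_t mid = base + half;
            /* Branchless update: both operands (base, mid) are already
               computed and only the compare is pending -- a natural cmov. */
            base = (slice[mid] > key) ? base : mid;
            size -= half;
        }
        return base;
    }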

The interesting part of the IR:

  %4 = lshr i64 %size.022.i.i, 1
  %5 = add i64 %4, %base.021.i.i
  %6 = getelementptr inbounds [0 x i64], [0 x i64]* %slice.0, i64 0, i64 %5
  %7 = load i64, i64* %6, align 8, !alias.scope !14, !noalias !19
  %8 = icmp ugt i64 %7, %0
  %base.0..i.i = select i1 %8, i64 %base.021.i.i, i64 %5

And assembly with cmov (what we want):

	movq	%rsi, %r9
	shrq	%r9
	leaq	(%r9,%r8), %rcx
	cmpq	%rdx, (%rdi,%rcx,8)
	cmovaq	%r8, %rcx

And with branch (what we get):

	movq	%rsi, %rcx
	shrq	%rcx
	leaq	(%rcx,%rdx), %r9
	cmpq	%r8, (%rdi,%r9,8)
	ja	.LBB0_7
# %bb.6:                                # %bb9.i.i
                                        #   in Loop: Header=BB0_5 Depth=1
	movq	%r9, %rdx
.LBB0_7:                                # %bb9.i.i
                                        #   in Loop: Header=BB0_5 Depth=1


rnk commented Dec 14, 2018

I remember this problem from a while ago, and I wonder if the !unpredictable metadata attachment could solve this problem:
https://llvm.org/docs/LangRef.html#unpredictable-metadata

I see a few instances of it in the test suite, which gives me hope:

$ git grep 'select.*!unpredict' ../llvm
../llvm/test/Transforms/CodeGenPrepare/X86/select.ll:; CHECK-NEXT:    [[SEL:%.*]] = select i1 [[CMP]], float [[DIV]], float 2.000000e+00, !unpredictable !0
../llvm/test/Transforms/CodeGenPrepare/X86/select.ll:  %sel = select i1 %cmp, float %div, float 2.0, !unpredictable !0
../llvm/test/Transforms/Inline/profile-meta.ll:  %sel = select i1 %c, i32 %a, i32 %b, !prof !0, !unpredictable !1
../llvm/test/Transforms/Inline/profile-meta.ll:; CHECK-NEXT:  [[SEL:%.*]] = select i1 %C, i32 %A, i32 %B, !prof !0, !unpredictable !1
../llvm/test/Transforms/SimplifyCFG/PhiEliminate2.ll:; CHECK-NEXT:  %V6 = select i1 %C, i32 %V4, i32 %V5, !prof !0, !unpredictable !1


nikic commented Dec 14, 2018

Just gave !unpredictable a try, and unfortunately it didn't help.

This does sound like something that the pass should take into account -- not sure if this information is still available at that point though.


nikic commented Dec 14, 2018

Looks like a similar optimization in CodeGenPrepare can be disabled via !unpredictable: https://github.com/llvm-mirror/llvm/blob/f2511abd84e3aa730ed68ac30881b0895d03b6d4/lib/CodeGen/CodeGenPrepare.cpp#L5756

But this metadata is not preserved in SelectionDAG.

rotateright commented:

Yes, 'unpredictable' was created for situations like this, but I never looked at getting metadata to survive to a machine IR pass. Taming the heuristic cost calc might be the easier fix.

Which CPU(s) are showing the 2-3x slowdown?

It's possible that a particular sched model and/or generic defaults should be adjusted to avoid this problem.


nikic commented Dec 14, 2018

@spatel: I'm seeing a 2.8x slowdown on an i5-4690 CPU (Haswell), with both the default and native target CPU. On the Rust bug report two other people report similar numbers, though I don't know which CPUs those are.

I don't think that changing scheduling models will help for this particular case (where the transform should be pretty much universally disadvantageous, as the branch is unpredictable).

rotateright commented:

> @spatel: I'm seeing a 2.8x slowdown on an i5-4690 CPU (Haswell), with both the default and native target CPU. On the Rust bug report two other people report similar numbers, though I don't know which CPUs those are.
>
> I don't think that changing scheduling models will help for this particular case (where the transform should be pretty much universally disadvantageous, as the branch is unpredictable).

Yes, I agree - and apparently so did the author of this pass. :)
There's a test in /test/CodeGen/X86/x86-cmov-converter.ll called "BinarySearch", and that's supposed to be guarding this transform from happening on code like yours. Seeing how your machine code differs from that reference should be informative.


nikic commented Dec 14, 2018

So that test case has a structure like:

  %Left.sink = select i1 %tobool, %struct.Node** %Left, %struct.Node** %Right
  %3 = load %struct.Node*, %struct.Node** %Left.sink, align 8

which I think is guarded by https://github.com/llvm-mirror/llvm/blob/0bbe50f2fb47207ac8093bb700edcb75616b7dd5/lib/Target/X86/X86CmovConversion.cpp#L536 in the cmov pass.

Notably, and as the comment there indicates, this is not actually a binary search, but rather a binary tree search, and the code uses "load depending on select" to detect that case.
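
For contrast, a minimal sketch of the tree-search shape that guard is aimed at (hypothetical C, not the actual test case):

    struct Node { long Val; struct Node *Left, *Right; };

    /* Binary *tree* search: the selected value is immediately used as a
       load address -- the "load depending on select" signature the pass
       treats as an unpredictable-branch case where the cmov should stay. */
    struct Node *tree_search(struct Node *N, long Key) {
        while (N && N->Val != Key)
            N = (Key < N->Val) ? N->Left : N->Right;
        return N;
    }

The array binary search instead selects between two already-computed integers, with no load depending on the result, so this guard never fires for it.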


nikic commented Dec 15, 2018

This case seems pretty tricky to me, because on the surface, the pattern is exactly one where cmov conversion should be profitable (both select values are already computed, we're only waiting on the condition) -- under the assumption that the branch is predictable.

The cmov conversion cost modeling currently only models whether the conversion is profitable under the assumption of a reasonably well-predicted branch, but (apart from the tree-search case mentioned above) doesn't make an effort to determine whether or not the branch is expected to be predictable.

And I'm not sure how it could really do that. My best idea is that branch predictability correlates with access locality: branches depending on random-access loads are more likely to be unpredictable, while branches depending on loads from a linear induction variable are likely to be predictable. But that doesn't sound easy to check at the point where cmov conversion operates.
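
To illustrate that intuition (purely illustrative code, not a proposed heuristic implementation):

    #include <stddef.h>

    /* The branch depends on loads from a linear induction variable: its
       direction flips at most once per search, so predictors handle it
       well and a branch costs little here. */
    size_t count_less(const long *a, size_t n, long key) {
        size_t i = 0;
        while (i < n && a[i] < key)
            i++;
        return i;
    }

By contrast, the binary-search compare loads a[base + size/2], an effectively random access whose outcome is close to 50/50 on every iteration.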


nikic commented Dec 17, 2018

Looks like this issue was previously reported as #36492.

rotateright commented:

This commit added a magic-number (why 12.5%?) restriction:
https://reviews.llvm.org/rL310352 (bug 33954)

...so there's still a possibility that we could adjust the code to avoid this case.

If not (or even if we do that), we should figure out how to plumb 'unpredictable' down to this layer and have this pass use that info.


nikic commented Dec 17, 2018

For reference, the computed values in the cost modelling (for the generic target) are:

  Diff[0]: 1
  Diff[1]: 6
  LoopDepth[0].Depth: 8
  LoopDepth[1].Depth: 15


llvmbot commented Feb 19, 2019

*** Bug llvm/llvm-bugzilla-archive#40646 has been marked as a duplicate of this bug. ***


ot commented Jun 30, 2020

I was wondering if there has been any progress on this. One case very similar to binary search where this hits us is binary heap primitives. See for example this code: https://godbolt.org/z/h6hFxg

The condition highlighted is unpredictable, and the cmov conversion has a significant negative impact. We have a workaround that uses inline asm to force a cmov, but supporting that is quite annoying (the real implementation is a template and we have an extension point to define per-comparator workarounds).
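
For reference, such a workaround might look roughly like this (a hypothetical sketch, not our actual template code; assumes GCC/Clang extended asm on x86-64):

    #include <stddef.h>

    /* Force a cmov so the compiler cannot turn the select back into a
       branch. */
    static inline size_t select_cmov(int cond, size_t if_true,
                                     size_t if_false) {
        size_t result = if_false;
        __asm__("testl %1, %1\n\t"
                "cmovneq %2, %0"  /* result = cond ? if_true : if_false */
                : "+r"(result)
                : "r"(cond), "r"(if_true)
                : "cc");
        return result;
    }

The select is then written as base = select_cmov(arr[mid] > key, base, mid).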

It also seems that branch probabilities from FDO data are not taken into account by the optimization.

As in the godbolt link, disabling the optimization globally restores the cmov, but that may have other unwanted side effects. Is there any way to disable cmov conversion on a per-function basis?

Ideally, we'd use the "unpredictable" annotation, but I'm not sure whether it's possible to express that in Clang (is there a corresponding builtin?) and it also looks like the optimization is still not honoring that either.

rotateright commented:

> Ideally, we'd use the "unpredictable" annotation, but I'm not sure whether it's possible to express that in Clang (is there a corresponding builtin?) and it also looks like the optimization is still not honoring that either.

Yes, there is a clang builtin:

  if (__builtin_unpredictable(x > 0)) {
    foo();
  }

https://clang.llvm.org/docs/LanguageExtensions.html#builtin-functions

But no, it does not make a difference on your code example because that information is lost before the problematic pass.

AFAIK, there have been no improvements on this since this bug was filed.


nikic commented Jul 1, 2020

We did make the cost modelling a bit more conservative to avoid some additional regressions in LLVM 10. But nothing that would affect the binary search related cases.

We probably should just make this fully conservative and assume worst-case probabilities for everything, until such a time as accurate profiling information is available this deep in the backend.


llvmbot commented Nov 27, 2021

mentioned in issue llvm/llvm-bugzilla-archive#40646


nikic commented Nov 27, 2021

mentioned in issue llvm/llvm-bugzilla-archive#44539


davidbolvansky commented:

After https://reviews.llvm.org/D118118, this flag/metadata is propagated to the X86 backend, so with the __builtin_unpredictable builtin or unpredictable metadata attached to selects you can force LLVM to generate CMOVs.
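
Applied to the pattern in this issue, the usage would look roughly like this (a hedged sketch with illustrative names, assuming the hint attaches to the select formed from the ternary):

    #include <stddef.h>
    #include <stdint.h>

    size_t lower_bound(const uint64_t *arr, size_t n, uint64_t key) {
        size_t base = 0, size = n;
        while (size > 1) {
            size_t half = size / 2;
            size_t mid = base + half;
            /* The hint now survives to the X86 backend, so this select
               should stay a cmov. */
            base = __builtin_unpredictable(arr[mid] > key) ? base : mid;
            size -= half;
        }
        return base;
    }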


nikic commented Jun 2, 2023

It's nice to have the opt-out, but we should probably still make this optimize well by default. We now have the SelectOptimize pass that is based on branch weights and sensible cost modeling. We should probably schedule it for X86 and delete the cmov pass.

davidbolvansky commented:

Right, optimizing this correctly by default would be great, but that can be very tricky, as the history of GCC/Clang shows.

SelectOptimize may catch this but may regress other cases. I'm not sure it works at least as well as the CMov pass without a PGO profile.
