
[X86] cmov conversion hurts binary search performance #39374

Open · nikic opened this issue Dec 14, 2018 · 23 comments
Labels: backend:X86, bugzilla (Issues migrated from bugzilla)

Comments


nikic commented Dec 14, 2018

Bugzilla Link: 40027
Version: trunk
OS: All
CC: @adibiagio, @chandlerc, @topperc, @Dushistov, @ot, @jdm, @LebedevRI, @RKSimon, @rnk, @rotateright

Extended Description

Originally reported at: rust-lang/rust#53823

The cmov conversion pass converts a cmov in Rust's binary search implementation into a branch. Because binary search branches are poorly predicted, this significantly hurts performance. For small data sets, where performance is not dominated by cache behavior, the slowdown is around 2-3x.

https://gist.github.com/nikic/13e907d7097f74a92d082fcc61bc212c has the LLVM IR for the function, as well as generated assembly with cmov conversion enabled (default) and disabled (-x86-cmov-converter=false).

It would be great if the cmov conversion heuristics could be adjusted in some way to avoid converting this case.


nikic commented Dec 14, 2018

For reference, the relevant line in the original Rust code is: https://github.com/rust-lang/rust/blob/059e6a6f57f4e80d527a3cd8a8afe7f51f01af8e/src/libcore/slice/mod.rs#L1423
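
In rough C terms, the hot loop looks like this (a sketch for orientation only; the linked Rust source is the authoritative version, and the names here are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* Rough C rendering of the binary search loop. */
    size_t search(const uint64_t *slice, size_t len, uint64_t key) {
        size_t base = 0, size = len;
        while (size > 1) {
            size_t half = size / 2;
            size_t mid = base + half;
            /* Branchless update: both operands (base, mid) are already
               computed and only the compare is pending -- a natural cmov. */
            base = (slice[mid] > key) ? base : mid;
            size -= half;
        }
        return base;
    }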

The interesting part of the IR:

  %4 = lshr i64 %size.022.i.i, 1
  %5 = add i64 %4, %base.021.i.i
  %6 = getelementptr inbounds [0 x i64], [0 x i64]* %slice.0, i64 0, i64 %5
  %7 = load i64, i64* %6, align 8, !alias.scope !14, !noalias !19
  %8 = icmp ugt i64 %7, %0
  %base.0..i.i = select i1 %8, i64 %base.021.i.i, i64 %5

And assembly with cmov (what we want):

	movq	%rsi, %r9
	shrq	%r9
	leaq	(%r9,%r8), %rcx
	cmpq	%rdx, (%rdi,%rcx,8)
	cmovaq	%r8, %rcx

And with branch (what we get):

	movq	%rsi, %rcx
	shrq	%rcx
	leaq	(%rcx,%rdx), %r9
	cmpq	%r8, (%rdi,%r9,8)
	ja	.LBB0_7
# %bb.6:                                # %bb9.i.i
                                        #   in Loop: Header=BB0_5 Depth=1
	movq	%r9, %rdx
.LBB0_7:                                # %bb9.i.i
                                        #   in Loop: Header=BB0_5 Depth=1


rnk commented Dec 14, 2018

I remember this problem from a while ago, and I wonder if the !unpredictable metadata attachment could solve this problem:
https://llvm.org/docs/LangRef.html#unpredictable-metadata

I see a few instances of it in the test suite, which gives me hope:

$ git grep 'select.*!unpredict' ../llvm
../llvm/test/Transforms/CodeGenPrepare/X86/select.ll:; CHECK-NEXT:    [[SEL:%.*]] = select i1 [[CMP]], float [[DIV]], float 2.000000e+00, !unpredictable !0
../llvm/test/Transforms/CodeGenPrepare/X86/select.ll:  %sel = select i1 %cmp, float %div, float 2.0, !unpredictable !0
../llvm/test/Transforms/Inline/profile-meta.ll:  %sel = select i1 %c, i32 %a, i32 %b, !prof !0, !unpredictable !1
../llvm/test/Transforms/Inline/profile-meta.ll:; CHECK-NEXT:  [[SEL:%.*]] = select i1 %C, i32 %A, i32 %B, !prof !0, !unpredictable !1
../llvm/test/Transforms/SimplifyCFG/PhiEliminate2.ll:; CHECK-NEXT:  %V6 = select i1 %C, i32 %V4, i32 %V5, !prof !0, !unpredictable !1


nikic commented Dec 14, 2018

Just gave !unpredictable a try, and unfortunately it didn't help.

This does sound like something that the pass should take into account -- not sure if this information is still available at that point though.


nikic commented Dec 14, 2018

Looks like a similar optimization in CodeGenPrepare can be disabled via !unpredictable: https://github.com/llvm-mirror/llvm/blob/f2511abd84e3aa730ed68ac30881b0895d03b6d4/lib/CodeGen/CodeGenPrepare.cpp#L5756

But this metadata is not preserved in SelectionDAG.

rotateright commented:

Yes, 'unpredictable' was created for situations like this, but I never looked at getting metadata to survive to a machine IR pass. Taming the heuristic cost calc might be the easier fix.

Which CPU(s) are showing the 2-3x slowdown?

It's possible that a particular sched model and/or generic defaults should be adjusted to avoid this problem.


nikic commented Dec 14, 2018

@spatel: I'm seeing a 2.8x slowdown on an i5-4690 CPU (Haswell), with both the default and native target CPU. On the Rust bug report two other people report similar numbers, though I don't know which CPUs those are.

I don't think that changing scheduling models will help for this particular case (where the transform should be pretty much universally disadvantageous, as the branch is unpredictable).

rotateright commented:

> @spatel: I'm seeing a 2.8x slowdown on an i5-4690 CPU (Haswell), with both the default and native target CPU. On the Rust bug report two other people report similar numbers, though I don't know which CPUs those are.
>
> I don't think that changing scheduling models will help for this particular case (where the transform should be pretty much universally disadvantageous, as the branch is unpredictable).

Yes, I agree - and apparently so did the author of this pass. :)
There's a test in /test/CodeGen/X86/x86-cmov-converter.ll called "BinarySearch", and that's supposed to be guarding this transform from happening on code like yours. Seeing how your machine code differs from that reference should be informative.


nikic commented Dec 14, 2018

So that test case has a structure like:

  %Left.sink = select i1 %tobool, %struct.Node** %Left, %struct.Node** %Right
  %3 = load %struct.Node*, %struct.Node** %Left.sink, align 8

which I think is guarded by https://github.com/llvm-mirror/llvm/blob/0bbe50f2fb47207ac8093bb700edcb75616b7dd5/lib/Target/X86/X86CmovConversion.cpp#L536 in the cmov pass.

Notably, and as the comment there indicates, this is not actually a binary search, but rather a binary tree search, and the code uses "load depending on select" to detect that case.
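
For contrast, a minimal sketch of the tree-search shape that guard is aimed at (hypothetical C, not the actual test case):

    struct Node { long Val; struct Node *Left, *Right; };

    /* Binary *tree* search: the selected value is immediately used as a
       load address -- the "load depending on select" signature the pass
       treats as an unpredictable-branch case where the cmov should stay. */
    struct Node *tree_search(struct Node *N, long Key) {
        while (N && N->Val != Key)
            N = (Key < N->Val) ? N->Left : N->Right;
        return N;
    }

The array binary search instead selects between two already-computed integers, with no load depending on the result, so this guard never fires for it.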


nikic commented Dec 15, 2018

This case seems pretty tricky to me, because on the surface, the pattern is exactly one where cmov conversion should be profitable (both select values are already computed, we're only waiting on the condition) -- under the assumption that the branch is predictable.

The cmov conversion cost modeling currently only models whether the conversion is profitable under the assumption of a reasonably well-predicted branch, but (apart from the tree-search case mentioned above) doesn't make an effort to determine whether or not the branch is expected to be predictable.

And I'm not sure how it could really do that. My best idea is that branch predictability correlates with access locality: branches depending on random-access loads are more likely to be unpredictable, while branches depending on loads from a linear induction variable are likely to be predictable. But that doesn't sound easy to check at the point where cmov conversion operates.
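
To illustrate that intuition (purely illustrative code, not a proposed heuristic implementation):

    #include <stddef.h>

    /* The branch depends on loads from a linear induction variable: its
       direction flips at most once per search, so predictors handle it
       well and a branch costs little here. */
    size_t count_less(const long *a, size_t n, long key) {
        size_t i = 0;
        while (i < n && a[i] < key)
            i++;
        return i;
    }

By contrast, the binary-search compare loads a[base + size/2], an effectively random access whose outcome is close to 50/50 on every iteration.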


nikic commented Dec 17, 2018

Looks like this issue was previously reported as #36492.

rotateright commented:

This commit added a magic-number (why 12.5%?) restriction:
https://reviews.llvm.org/rL310352 (bug 33954)

...so there's still a possibility that we could adjust the code to avoid this case.

If not (or even if we do that), we should figure out how to plumb 'unpredictable' down to this layer and have this pass use that info.


nikic commented Dec 17, 2018

For reference, the computed values in the cost modelling (for the generic target) are:

  Diff[0]: 1
  Diff[1]: 6
  LoopDepth[0].Depth: 8
  LoopDepth[1].Depth: 15


llvmbot commented Feb 19, 2019

*** Bug llvm/llvm-bugzilla-archive#40646 has been marked as a duplicate of this bug. ***


ot commented Jun 30, 2020

I was wondering if there has been any progress on this. One case very similar to binary search where this hits us is binary heap primitives. See for example this code: https://godbolt.org/z/h6hFxg

The condition highlighted is unpredictable, and the cmov conversion has a significant negative impact. We have a workaround that uses inline asm to force a cmov, but supporting that is quite annoying (the real implementation is a template and we have an extension point to define per-comparator workarounds).
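
For reference, such a workaround might look roughly like this (a hypothetical sketch, not our actual template code; assumes GCC/Clang extended asm on x86-64):

    #include <stddef.h>

    /* Force a cmov so the compiler cannot turn the select back into a
       branch. */
    static inline size_t select_cmov(int cond, size_t if_true,
                                     size_t if_false) {
        size_t result = if_false;
        __asm__("testl %1, %1\n\t"
                "cmovneq %2, %0"  /* result = cond ? if_true : if_false */
                : "+r"(result)
                : "r"(cond), "r"(if_true)
                : "cc");
        return result;
    }

The select is then written as base = select_cmov(arr[mid] > key, base, mid).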

It also seems that branch probabilities from FDO data are not taken into account by the optimization.

As in the godbolt link, disabling the optimization globally restores the cmov, but that may have other unwanted side effects. Is there any way to disable cmov conversion on a per-function basis?

Ideally, we'd use the "unpredictable" annotation, but I'm not sure whether it's possible to express that in Clang (is there a corresponding builtin?) and it also looks like the optimization is still not honoring that either.

rotateright commented:

> Ideally, we'd use the "unpredictable" annotation, but I'm not sure whether it's possible to express that in Clang (is there a corresponding builtin?) and it also looks like the optimization is still not honoring that either.

Yes, there is a clang builtin:

  if (__builtin_unpredictable(x > 0)) {
    foo();
  }

https://clang.llvm.org/docs/LanguageExtensions.html#builtin-functions

But no, it does not make a difference on your code example because that information is lost before the problematic pass.

AFAIK, there have been no improvements on this since this bug was filed.


nikic commented Jul 1, 2020

We did make the cost modelling a bit more conservative to avoid some additional regressions in LLVM 10. But nothing that would affect the binary search related cases.

We probably should just make this fully conservative and assume worst-case probabilities for everything, until such a time as accurate profiling information is available this deep in the backend.


llvmbot commented Nov 27, 2021

mentioned in issue llvm/llvm-bugzilla-archive#40646


nikic commented Nov 27, 2021

mentioned in issue llvm/llvm-bugzilla-archive#44539


davidbolvansky commented:

After https://reviews.llvm.org/D118118, this flag/metadata is propagated to the X86 backend, so with the __builtin_unpredictable builtin or unpredictable metadata attached to selects you can force LLVM to generate CMOVs.
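
Applied to the pattern in this issue, the usage would look roughly like this (a hedged sketch with illustrative names, assuming the hint attaches to the select formed from the ternary):

    #include <stddef.h>
    #include <stdint.h>

    size_t lower_bound(const uint64_t *arr, size_t n, uint64_t key) {
        size_t base = 0, size = n;
        while (size > 1) {
            size_t half = size / 2;
            size_t mid = base + half;
            /* The hint now survives to the X86 backend, so this select
               should stay a cmov. */
            base = __builtin_unpredictable(arr[mid] > key) ? base : mid;
            size -= half;
        }
        return base;
    }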


nikic commented Jun 2, 2023

It's nice to have the opt-out, but we should probably still make this optimize well by default. We now have the SelectOptimize pass that is based on branch weights and sensible cost modeling. We should probably schedule it for X86 and delete the cmov pass.

davidbolvansky commented:

Right, optimizing this correctly by default would be great, but that can be very tricky, as the history of GCC/Clang shows.

SelectOptimize may catch this but may regress other cases. I'm not sure it works at least as well as the CMov pass without a PGO profile.
