
match an std::cmp::Ordering generates less optimized code in nightly #86511

Open
yume-chan opened this issue Jun 21, 2021 · 21 comments
Labels
A-codegen Area: Code generation A-mir-opt Area: MIR optimizations C-bug Category: This is a bug. E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Comments

@yume-chan commented Jun 21, 2021

I'm interested in this comment:

let cmp = f(unsafe { self.get_unchecked(mid) });
// The reason why we use if/else control flow rather than match
// is because match reorders comparison operations, which is perf sensitive.
// This is x86 asm for u8: https://rust.godbolt.org/z/8Y8Pra.
if cmp == Less {
    left = mid + 1;
} else if cmp == Greater {
    right = mid;
} else {
    // SAFETY: same as the `get_unchecked` above
    unsafe { crate::intrinsics::assume(mid < self.len()) };
    return Ok(mid);
}

So I did some testing:


The following code snippet is the "bad" one (matching on a std::cmp::Ordering) from the comment above, but with `f` manually inlined.

use std::cmp::Ordering;
use std::cmp::Ordering::*;

pub fn custom_binary_search_1<'a>(slice: &'a [u8], x: &u8) -> Result<usize, usize> {
    let s = slice;
    let mut left = 0;
    let mut right = s.len();
    while left < right {
        // never overflow because `slice::len()` max is `isize::MAX`.
        let mid = (left + right) / 2;
        // SAFETY: the call is made safe by the following invariants:
        // - `mid >= 0`
        // - `mid < size`: `mid` is limited by `[left; right)` bound.
        let cmp = unsafe { s.get_unchecked(mid).cmp(x) };
        match cmp {
            Less => left = mid + 1,
            Greater => right = mid,
            Equal => return Ok(mid),
        }
    }
    Err(left)
}
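For comparison, here is a sketch of the if/else variant that the standard library comment recommends (the name `custom_binary_search_2` is mine, not from the issue); this is the shape whose codegen the assembly below should match:

```rust
use std::cmp::Ordering::*;

// Hypothetical if/else counterpart of `custom_binary_search_1` above,
// mirroring the workaround quoted from the standard library source.
pub fn custom_binary_search_2(slice: &[u8], x: &u8) -> Result<usize, usize> {
    let mut left = 0;
    let mut right = slice.len();
    while left < right {
        // never overflows because `slice::len()` is at most `isize::MAX`
        let mid = (left + right) / 2;
        // SAFETY: `left <= mid < right <= slice.len()`
        let cmp = unsafe { slice.get_unchecked(mid).cmp(x) };
        if cmp == Less {
            left = mid + 1;
        } else if cmp == Greater {
            right = mid;
        } else {
            return Ok(mid);
        }
    }
    Err(left)
}
```

The only difference from `custom_binary_search_1` is the if/else chain in place of the match; both return `Ok(index)` on a hit and `Err(insertion_point)` on a miss.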

In version 1.53.0-beta.12 (2021-06-12 e7a67cc), release mode, it generates the following x86 assembly (related parts only):

        jmp	.LBB0_3
.LBB0_7:
	add	rdx, 1                     # mid += 1
	mov	rcx, rdx                   # left = mid
	cmp	rcx, rsi                   # while left < right
	jae	.LBB0_9                    # (end loop)
.LBB0_3:
	lea	rdx, [rcx + rsi]           # mid = left + right
	shr	rdx                        # mid /= 2
	cmp	byte ptr [rdi + rdx], r8b  # match s[mid].cmp(x)
	jb	.LBB0_7                    # Less => ...
	je	.LBB0_6                    # Equal => ...
	mov	rsi, rdx                   # Greater => right = mid
	cmp	rcx, rsi                   # while left < right
	jb	.LBB0_3                    # (continue loop)
.LBB0_6:
	xor	eax, eax                   # return (0, mid)
	ret

(This is identical to manually expanding the match into if..else if..else, though that's not the point of this issue.)


In version 1.55.0-nightly (2021-06-20 e82b650), release mode, the generated assembly code is:

	mov	r9, -1                       # constant of std::cmp::Ordering::Less
	jmp	.LBB0_3

.LBB0_5:                                #   in Loop: Header=BB0_3 Depth=1
	mov	rcx, rdx                     # left = mid         
	add	rcx, 1                       # left += 1
	mov	rdx, rsi                     # temp = right

.LBB0_6:                                #   in Loop: Header=BB0_3 Depth=1
	mov	rsi, rdx                     # right = mid
	cmp	rcx, rdx                     # while left < right
	jae	.LBB0_7                      # (end loop)

.LBB0_3:                                # =>This Inner Loop Header: Depth=1
	lea	rdx, [rcx + rsi]             # mid = left + right
	shr	rdx                          # mid /= 2
	xor	eax, eax                     # cmp = 0
	cmp	byte ptr [rdi + rdx], r8b
	setne	al                           # cmp = s[mid] == x ? 0 : 1
	cmovb	rax, r9                      # cmp = s[mid] < x ? -1 : temp
	cmp	rax, -1                      # if cmp == Less
	je	.LBB0_5                      # Less => ...
# %bb.4:                                #   in Loop: Header=BB0_3 Depth=1
	cmp	rax, 1                       # if cmp == Greater
	je	.LBB0_6                      # Greater => ...
# %bb.9:
	ret

It's the same as the "bad" result in the original comment, and clearly much less optimized.

I think matching on an `Ordering` is quite a common use case, so it shouldn't be deoptimized.


Version it worked on

Rust Playground: 1.53.0

Rust Playground: 1.53.0-beta.12 (2021-06-12 e7a67cc)

Version with regression

a55748f

@yume-chan yume-chan added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Jun 21, 2021
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Jun 21, 2021
@JohnTitor JohnTitor added regression-from-stable-to-beta Performance or correctness regression from stable to beta. and removed regression-untriaged Untriaged performance or correctness regression. labels Jun 21, 2021
@fee1-dead (Member)

See also #86391 and #86354.

@LeSeulArtichaut LeSeulArtichaut added this to the 1.54.0 milestone Jun 22, 2021
@apiraino apiraino added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jun 23, 2021
@apiraino (Contributor)

thanks @yume-chan for debugging

paging @Aaron1011 @michaelwoerister as perhaps can share insights about #85702 - is the PR relevant to this issue and the mentioned ones (#86391, #86354)? Do we have a common root cause? Perhaps this one is a duplicate? thanks :-)

@pnkfelix (Member) commented Jul 8, 2021

(I find it hard to believe that #85702 is in any way truly related to the problem described here.)


Update: What I can believe is that the bisection from the description is only at the granularity of nightlies, not individual PRs. @yume-chan, is that correct? (I.e., did the bisection you performed only go down to the level of nightlies, not the finer grain of individual PRs?)

((cargo-bisect-rustc can help with traversing the finer-grained space of individual PRs, FYI.))

@yume-chan (Author) commented Jul 9, 2021

@pnkfelix Yes, I didn't bisect to an exact commit. My versions were taken from Compiler Explorer and the Rust Playground.

Thanks for pointing me to cargo-bisect-rustc; the results are:

searched nightlies: from nightly-2021-05-17 to nightly-2021-05-18
regressed nightly: nightly-2021-05-18
searched commits: from fe72845 to 3e99439
regressed commit: a55748f

bisected with cargo-bisect-rustc v0.6.0

Host triple: x86_64-pc-windows-msvc
Reproduce with:

cargo bisect-rustc --test-dir=rust-lib --start=2021-05-17 --end=2021-05-18 --preserve

@michaelwoerister (Member)

@apiraino I don't think #85702 can be relevant for this issue.

@pnkfelix (Member)

Okay, so the subsequent bisection claims that PR #84993 is the injection point.

@Mark-Simulacrum (Member)

cc @eddyb @nagisa -- was this an expected possible result from #84993? It sounds like it's plausible that LLVM depends on block order in order to generate optimal code, though a little unfortunate in this particular case.

This is likely to hit stable in 1.54, given that it's next week... but I doubt we can do anything in time for that.

@eddyb (Member) commented Jul 24, 2021

It's possible to dig a bit more into this by diffing -C llvm-args=-print-after-all -Z no-parallel-llvm output from before and after the PR. I think it might be possible to reproduce the old order by adding this loop:

    for (bb, _) in traversal::reverse_postorder(&mir) {
        fx.llbb(bb);
    }

before this loop:

    // Codegen the body of each block using reverse postorder
    // FIXME(eddyb) reuse RPO iterator between `analysis` and this.
    for (bb, _) in traversal::reverse_postorder(&mir) {
        fx.codegen_block(bb);
    }

This should be correct (even if a bit slower without reusing the result of reverse_postorder), because fx.llbb will just cause the basic block to be created (and cached) earlier, without changing the contents of any basic block.

@pietroalbini (Member)

This will slip into 1.54 stable, as the release is going to be built today.

@apiraino (Contributor) commented Jul 29, 2021

Assigning priority as discussed in the Zulip thread of the Prioritization Working Group.

@rustbot label -I-prioritize +P-high

@rustbot rustbot added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Jul 29, 2021
@yume-chan (Author)

correct Zulip thread

@Mark-Simulacrum Mark-Simulacrum added regression-from-stable-to-stable Performance or correctness regression from one stable version to another. and removed regression-from-stable-to-beta Performance or correctness regression from stable to beta. labels Aug 5, 2021
@pnkfelix (Member)

If anyone is interested in looking into this further, the steps I would recommend would be:

  1. first confirm if eddyb's instructions above do indeed reproduce the old order.
  2. second, check if that actually addresses the regression.

Feel free to ping me on zulip if you are interested in looking into this, especially if you need help with the steps outlined above.

@pnkfelix pnkfelix added the E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. label Aug 24, 2021
@gcohara (Contributor) commented Sep 20, 2021

I spent some time investigating this today.
I found that the above instructions didn't fix the problem on the most recent commit.
However, they did work when applied to commit a55748f.

@eddyb (Member) commented Sep 20, 2021

Could it be a LLVM 13 change, to make both orders worse?

@gcohara (Contributor) commented Sep 21, 2021

I'm not sure - I don't really have the knowledge necessary to investigate this much further (it's my first issue and I don't know much about LLVM).

Could I check if it's an LLVM issue by taking the MIR output from the fixed version of a55748f and feeding it through LLVM 12 and 13, then comparing the outputs?

What I could also try, if that doesn't work, is to find the specific change to mod.rs where the fix stops working, although it could be (could it?) the result of changes in another file.

@gcohara (Contributor) commented Sep 21, 2021

The fix stops working at commit db002a0.
Looks like it's related to LLVM 13 after all.

@eddyb (Member) commented Sep 21, 2021

cc @nikic

@nikic (Contributor) commented Sep 25, 2021

Could someone please provide a godbolt demonstrating the regression? In https://rust.godbolt.org/z/KnK7xPMoo codegen is pretty much the same between 1.53 and nightly. (The "match" version is still badly optimized -- but that's not new?)

@gcohara (Contributor) commented Sep 27, 2021

This should do it. Interesting that there's no problem with your very similar version.

@workingjubilee (Member) commented Feb 21, 2022

Yes, for that one the optimizer successfully normalizes the code into the same form, or thereabouts, emitting the same assembly on rustc 1.53. It looks like the difference is that by about -Copt-level=2 the LLVM optimizer manages to optimize out the actual switch in the LLVM IR and normalize it into comparisons and jumps.

@pnkfelix (Member)

Visited for P-high review

It seems like there may be opportunities here for a broad set of optimizations, rather than focusing solely on the seeming regression.

I recommend two actions:

  1. We add a benchmark for this case to the runtime benchmarks being started up on perf.rust-lang.org
  2. WG-mir-opt should investigate whether there's some way to optimize this code, rather than leaving it as something that can only be resolved by LLVM.

Having said that, even though there are useful actions and potential owners for those actions, there doesn't seem to be much reason to prioritize this particular performance footgun over the other codegen issues we have. So, downgrading this to P-medium as well.

@rustbot label: -P-high +P-medium

@rustbot rustbot added P-medium Medium priority and removed P-high High priority labels Nov 18, 2022
@pnkfelix pnkfelix added I-slow Issue: Problems and improvements with respect to performance of generated code. A-mir-opt Area: MIR optimizations A-codegen Area: Code generation labels Nov 18, 2022