
match an std::cmp::Ordering generates less optimized code in nightly #86511

Open
yume-chan opened this issue Jun 21, 2021 · 21 comments
Labels
A-codegen Area: Code generation A-mir-opt Area: MIR optimizations C-bug Category: This is a bug. E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Comments

@yume-chan commented Jun 21, 2021

I'm interested in this comment:

let cmp = f(unsafe { self.get_unchecked(mid) });
// The reason why we use if/else control flow rather than match
// is because match reorders comparison operations, which is perf sensitive.
// This is x86 asm for u8: https://rust.godbolt.org/z/8Y8Pra.
if cmp == Less {
    left = mid + 1;
} else if cmp == Greater {
    right = mid;
} else {
    // SAFETY: same as the `get_unchecked` above
    unsafe { crate::intrinsics::assume(mid < self.len()) };
    return Ok(mid);
}

So I did some testing:


The following code snippet is the "bad" one (matching on a std::cmp::Ordering) from the comment above, but with `f` manually inlined.

use std::cmp::Ordering;
use std::cmp::Ordering::*;

pub fn custom_binary_search_1<'a>(slice: &'a [u8], x: &u8) -> Result<usize, usize> {
    let s = slice;
    let mut left = 0;
    let mut right = s.len();
    while left < right {
        // never overflow because `slice::len()` max is `isize::MAX`.
        let mid = (left + right) / 2;
        // SAFETY: the call is made safe by the following invariants:
        // - `mid >= 0`
        // - `mid < size`: `mid` is limited by `[left; right)` bound.
        let cmp = unsafe { s.get_unchecked(mid).cmp(x) };
        match cmp {
            Less => left = mid + 1,
            Greater => right = mid,
            Equal => return Ok(mid),
        }
    }
    Err(left)
}
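For comparison, here is a sketch of the if/else variant that the standard library comment recommends (the name `custom_binary_search_2` is mine, not from the issue); this is the shape whose codegen the assembly below should match:

```rust
use std::cmp::Ordering::*;

// Hypothetical if/else counterpart of `custom_binary_search_1` above,
// mirroring the workaround quoted from the standard library source.
pub fn custom_binary_search_2(slice: &[u8], x: &u8) -> Result<usize, usize> {
    let mut left = 0;
    let mut right = slice.len();
    while left < right {
        // never overflows because `slice::len()` is at most `isize::MAX`
        let mid = (left + right) / 2;
        // SAFETY: `left <= mid < right <= slice.len()`
        let cmp = unsafe { slice.get_unchecked(mid).cmp(x) };
        if cmp == Less {
            left = mid + 1;
        } else if cmp == Greater {
            right = mid;
        } else {
            return Ok(mid);
        }
    }
    Err(left)
}
```

The only difference from `custom_binary_search_1` is the if/else chain in place of the match; both return `Ok(index)` on a hit and `Err(insertion_point)` on a miss.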

In version 1.53.0-beta.12 (2021-06-12 e7a67cc), release mode, it generates the following x86 assembly (related parts only):

        jmp	.LBB0_3
.LBB0_7:
	add	rdx, 1                     # mid += 1
	mov	rcx, rdx                   # left = mid
	cmp	rcx, rsi                   # while left < right
	jae	.LBB0_9                    # (end loop)
.LBB0_3:
	lea	rdx, [rcx + rsi]           # mid = left + right
	shr	rdx                        # mid /= 2
	cmp	byte ptr [rdi + rdx], r8b  # match s[mid].cmp(x)
	jb	.LBB0_7                    # Less => ...
	je	.LBB0_6                    # Equal => ...
	mov	rsi, rdx                   # Greater => right = mid
	cmp	rcx, rsi                   # while left < right
	jb	.LBB0_3                    # (continue loop)
.LBB0_6:
	xor	eax, eax                   # return (0, mid)
	ret

(This is identical to manually expanding the match into if..else if..else, though that's not the point of this issue.)


In version 1.55.0-nightly (2021-06-20 e82b650), release mode, the generated assembly code is:

	mov	r9, -1                       # constant of std::cmp::Ordering::Less
	jmp	.LBB0_3

.LBB0_5:                                #   in Loop: Header=BB0_3 Depth=1
	mov	rcx, rdx                     # left = mid         
	add	rcx, 1                       # left += 1
	mov	rdx, rsi                     # temp = right

.LBB0_6:                                #   in Loop: Header=BB0_3 Depth=1
	mov	rsi, rdx                     # right = mid
	cmp	rcx, rdx                     # while left < right
	jae	.LBB0_7                      # (end loop)

.LBB0_3:                                # =>This Inner Loop Header: Depth=1
	lea	rdx, [rcx + rsi]             # mid = left + right
	shr	rdx                          # mid /= 2
	xor	eax, eax                     # cmp = 0
	cmp	byte ptr [rdi + rdx], r8b
	setne	al                           # cmp = s[mid] == x ? 0 : 1
	cmovb	rax, r9                      # cmp = s[mid] < x ? -1 : temp
	cmp	rax, -1                      # if cmp == Less
	je	.LBB0_5                      # Less => ...
# %bb.4:                                #   in Loop: Header=BB0_3 Depth=1
	cmp	rax, 1                       # if cmp == Greater
	je	.LBB0_6                      # Greater => ...
# %bb.9:
	ret

It's the same as the "bad" result in the original comment, and clearly much less optimized.

I think matching on an `Ordering` is quite a common use case, so it shouldn't be deoptimized.


Version it worked on

Rust Playground: 1.53.0

Rust Playground: 1.53.0-beta.12 (2021-06-12 e7a67cc)

Version with regression

a55748f

@yume-chan yume-chan added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Jun 21, 2021
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Jun 21, 2021
@JohnTitor JohnTitor added regression-from-stable-to-beta Performance or correctness regression from stable to beta. and removed regression-untriaged Untriaged performance or correctness regression. labels Jun 21, 2021
@fee1-dead (Member)

See also #86391 and #86354.

@LeSeulArtichaut LeSeulArtichaut added this to the 1.54.0 milestone Jun 22, 2021
@apiraino apiraino added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jun 23, 2021
@apiraino (Contributor)

thanks @yume-chan for debugging

paging @Aaron1011 @michaelwoerister as perhaps can share insights about #85702 - is the PR relevant to this issue and the mentioned ones (#86391, #86354)? Do we have a common root cause? Perhaps this one is a duplicate? thanks :-)

@pnkfelix (Member) commented Jul 8, 2021

(I find it hard to believe that #85702 is in any way truly related to the problem described here.)


Update: What I can believe is that the bisection from the description is only at the granularity of nightlies, not individual PRs. @yume-chan, is that correct? (I.e., did the bisection you performed only go down to the level of nightlies, not the finer grain of individual PRs?)

((cargo-bisect-rustc can help with traversing the finer-grained space of individual PRs, FYI.))

@yume-chan (Author) commented Jul 9, 2021

@pnkfelix Yes, I didn't bisect to an exact commit. My versions were taken from Compiler Explorer and the Rust Playground.

Thanks for pointing me to cargo-bisect-rustc; the results are:

searched nightlies: from nightly-2021-05-17 to nightly-2021-05-18
regressed nightly: nightly-2021-05-18
searched commits: from fe72845 to 3e99439
regressed commit: a55748f

bisected with cargo-bisect-rustc v0.6.0

Host triple: x86_64-pc-windows-msvc
Reproduce with:

cargo bisect-rustc --test-dir=rust-lib --start=2021-05-17 --end=2021-05-18 --preserve

@michaelwoerister (Member)

@apiraino I don't think #85702 can be relevant for this issue.

@pnkfelix (Member)

Okay, so the subsequent bisection claims that PR #84993 is the injection point.

@Mark-Simulacrum (Member)

cc @eddyb @nagisa -- was this an expected possible result from #84993? It sounds like it's plausible that LLVM depends on block order in order to generate optimal code, though a little unfortunate in this particular case.

This is likely to hit stable in 1.54, given that it's next week... but I doubt we can do anything in time for that.

@eddyb (Member) commented Jul 24, 2021

It's possible to dig a bit more into this by diffing -C llvm-args=-print-after-all -Z no-parallel-llvm output from before and after the PR. I think it might be possible to reproduce the old order by adding this loop:

    for (bb, _) in traversal::reverse_postorder(&mir) {
        fx.llbb(bb);
    }

before this loop:

    // Codegen the body of each block using reverse postorder
    // FIXME(eddyb) reuse RPO iterator between `analysis` and this.
    for (bb, _) in traversal::reverse_postorder(&mir) {
        fx.codegen_block(bb);
    }

This should be correct (even if a bit slower without reusing the result of reverse_postorder), because fx.llbb will just cause the basic block to be created (and cached) earlier, without changing the contents of any basic block.

@pietroalbini (Member)

This will slip into 1.54 stable, as the release is going to be built today.

@apiraino (Contributor) commented Jul 29, 2021

Assigning priority as discussed in the Zulip thread of the Prioritization Working Group.

@rustbot label -I-prioritize +P-high

@rustbot rustbot added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Jul 29, 2021
@yume-chan (Author)

correct Zulip thread

@Mark-Simulacrum Mark-Simulacrum added regression-from-stable-to-stable Performance or correctness regression from one stable version to another. and removed regression-from-stable-to-beta Performance or correctness regression from stable to beta. labels Aug 5, 2021
@pnkfelix (Member)

If anyone is interested in looking into this further, the steps I would recommend would be:

  1. first confirm if eddyb's instructions above do indeed reproduce the old order.
  2. second, check if that actually addresses the regression.

Feel free to ping me on zulip if you are interested in looking into this, especially if you need help with the steps outlined above.

@pnkfelix pnkfelix added the E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. label Aug 24, 2021
@gcohara (Contributor) commented Sep 20, 2021

I spent some time investigating this today.
I found that the above instructions didn't fix the problem on the most recent commit.
However, they did work when applied to commit a55748f.

@eddyb (Member) commented Sep 20, 2021

Could it be a LLVM 13 change, to make both orders worse?

@gcohara (Contributor) commented Sep 21, 2021

I'm not sure - I don't really have the knowledge necessary to investigate this much further (it's my first issue and I don't know much about LLVM).

Could I check if it's an LLVM issue by taking the MIR output from the fixed version of a55748f and feeding it through LLVM 12 and 13, then comparing the outputs?

What I could also try, if that doesn't work, is to find the specific change to mod.rs where the fix stops working, although it could be (could it?) the result of changes in another file.

@gcohara (Contributor) commented Sep 21, 2021

The fix stops working at commit db002a0.
Looks like it's related to LLVM 13 after all.

@eddyb (Member) commented Sep 21, 2021

cc @nikic

@nikic (Contributor) commented Sep 25, 2021

Could someone please provide a godbolt demonstrating the regression? In https://rust.godbolt.org/z/KnK7xPMoo codegen is pretty much the same between 1.53 and nightly. (The "match" version is still badly optimized -- but that's not new?)

@gcohara (Contributor) commented Sep 27, 2021

This should do it. Interesting that there's no problem with your very similar version.

@workingjubilee (Member) commented Feb 21, 2022

Yes, for that one the optimizer successfully normalizes the code into the same form, or thereabouts, emitting the same assembly on rustc 1.53. It looks like the difference is that by about -Copt-level=2 the LLVM optimizer manages to optimize out the actual switch in the LLVM IR and normalize it into comparisons and jumps.

@pnkfelix (Member)

Visited for P-high review

It seems like there may be opportunities here for a broad set of optimizations, rather than focusing solely on the seeming regression.

I recommend two actions:

  1. We add a benchmark for this case to the runtime benchmarks being started up on perf.rust-lang.org
  2. WG-mir-opt should investigate whether there's some way to optimize this code, rather than leaving it as something that can only be resolved by LLVM.

Having said that, even though there are useful actions and potential owners for those actions, there doesn't seem to be much reason to prioritize this particular performance footgun over the other codegen issues we have. So, downgrading this to P-medium as well.

@rustbot label: -P-high +P-medium

@rustbot rustbot added P-medium Medium priority and removed P-high High priority labels Nov 18, 2022
@pnkfelix pnkfelix added I-slow Issue: Problems and improvements with respect to performance of generated code. A-mir-opt Area: MIR optimizations A-codegen Area: Code generation labels Nov 18, 2022