Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collapsing two if-statements to a single if statement can result in a large performance decrease #111583

Open
ClementTsang opened this issue May 15, 2023 · 5 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@ClementTsang
Copy link
Contributor

ClementTsang commented May 15, 2023

Apologies if this has already been reported.


Let's say I have some code that looks like this (this is a simplified version of some code a friend was writing):

pub fn fast(mut ret: u64) -> u64 {
    let mask = (1 << 38) - 1;

    for _ in 0..100_000 {
        let mut speed = 0.0;
        let mut z: f64 = speed;
        speed += 0.200000001;

        for _ in 2..14 {
            z += speed;

            if (z.to_bits() >> 8) & mask == 0 {
                if z % 0.0625 < 1e-13 {
                    println!("{}", z % 0.0625);
                    ret += 1;
                }
            }
        }
    }

    eprintln!("ret: {ret}");
    ret
}

I might be tempted to collapse the if-statement in the middle, since it shouldn't change anything - in fact, clippy will even recommend that I change it to this:

pub fn slow(mut ret: u64) -> u64 {
    let mask = (1 << 38) - 1;

    for _ in 0..100_000 {
        let mut speed = 0.0;
        let mut z: f64 = speed;
        speed += 0.200000001;

        for _ in 2..14 {
            z += speed;

            if (z.to_bits() >> 8) & mask == 0 && z % 0.0625 < 1e-13 {
                println!("{}", z % 0.0625);
                ret += 1;
            }
        }
    }

    eprintln!("ret: {ret}");
    ret
}

However, if I pit these two against each other using criterion, then when I run a bench (on 1.69.0):

➜ cargo bench 2> out.txt

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s


running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

slow                    time:   [7.5115 ms 7.5313 ms 7.5583 ms]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

fast                    time:   [577.02 µs 578.91 µs 581.29 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe

For some reason, collapsing the if branch leads to a massive performance regression! This is surprising as well since from my testing, where I set z = 0, the if branch should never run. Putting the two bits of code on Godbolt seems to also show that there's a bit of a difference in terms of assembly generation (fast, slow).

Furthermore, from some testing, commenting out either the eprintln or the println on both would result in them having similar performance.

I can set up a repo with my exact setup if that will be helpful. Repo with code and benchmark: https://github.com/ClementTsang/collapse_if_slowdown

@workingjubilee workingjubilee added the I-slow Issue: Problems and improvements with respect to performance of generated code. label May 15, 2023
@workingjubilee
Copy link
Member

The usual suspect for this codegen difference, the distinction between & and &&, does not seem to be the culprit, as the following branch still gets different codegen than the original:

            if ((z.to_bits() >> 8) & mask == 0) & (z % 0.0625 < 1e-13) {

@azizghuloum
Copy link
Contributor

Looks like rustc compiles the slow version as if you've written

pub fn slower(mut ret: u64) -> u64 {
    let mask = (1 << 38) - 1;

    for _ in 0..100_000 {
        let mut speed = 0.0;
        let mut z: f64 = speed;
        speed += 0.200000001;

        for _ in 2..14 {
            z += speed;

            let tmp = (z.to_bits() >> 8) & mask == 0 && z % 0.0625 < 1e-13;
            if tmp {
                println!("{}", z % 0.0625);
                ret += 1;
            }
        }
    }

    eprintln!("ret: {ret}");
    ret
}

I compared the mir graphs to confirm and the benchmark numbers also confirm.

That extra variable incurs an additional conditional check at runtime.

@azizghuloum
Copy link
Contributor

azizghuloum commented May 16, 2023

It seems that during THIR -> MIR lowering, there is a case for handling && inside the if, but the && is obscured by a Use expression that is not handled (and gets deoptimized to a temporary variable).

@azizghuloum
Copy link
Contributor

@ClementTsang with the PR that I opened, the bench output is

slow                    time:   [706.99 µs 716.80 µs 727.74 µs]
                        change: [-11.173% -8.6408% -6.1600%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  7 (7.00%) high mild
  7 (7.00%) high severe

fast                    time:   [700.31 µs 714.30 µs 732.70 µs]
                        change: [-5.7267% -2.7412% +0.0544%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) high mild
  12 (12.00%) high severe

which is what you'd expect.

@ClementTsang
Copy link
Contributor Author

Nice!

azizghuloum added a commit to azizghuloum/rust that referenced this issue May 19, 2023
bors added a commit to rust-lang-ci/rust that referenced this issue Sep 1, 2023
…=cjgillot

Lower `Or` pattern without allocating place

cc `@azizghuloum` `@cjgillot`

Related to rust-lang#111583 and rust-lang#111644

While reviewing rust-lang#111644, it occurs to me that while we directly lower conjunctive predicates, which are connected with `&&`, into the desirable control flow, today we don't directly lower the disjunctive predicates, which are connected with `||`, in the similar fashion. Instead, we allocate a place for the boolean temporary to hold the result of evaluating the `||` expression.

Usually I would expect optimization at later stages to "inline" the evaluation of boolean predicates into simple CFG, but rust-lang#111583 is an example where `&&` is failing to be optimized away and the assembly shows that both the expensive operands are evaluated. Therefore, I would like to make a small change to make the CFG a bit more straight-forward without invoking the `as_temp` machinery, and plus avoid allocating the place to hold the boolean result as well.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
3 participants