Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor codegen for derived == on simple 2-field struct #117800

Open
scottmcm opened this issue Nov 11, 2023 · 4 comments
Open

Poor codegen for derived == on simple 2-field struct #117800

scottmcm opened this issue Nov 11, 2023 · 4 comments
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. T-lang Relevant to the language team, which will review and decide on the PR/issue. T-opsem Relevant to the opsem team

Comments

@scottmcm
Copy link
Member

Given a basic struct like this,

#[derive(Copy, Clone, PartialEq, Eq)]
pub struct Entity {
    g: u32,
    i: u32
}

The generated == is suboptimal:

#[no_mangle]
pub fn derived_eq(x: &Entity, y: &Entity) -> bool {
    x == y
}
derived_eq:
        movq    xmm0, qword ptr [rdi]
        movq    xmm1, qword ptr [rsi]
        pcmpeqd xmm1, xmm0
        pshufd  xmm0, xmm1, 80
        movmskpd        eax, xmm0
        cmp     eax, 3
        sete    al
        ret

https://rust.godbolt.org/z/1b1xsnzx6

For comparison, not using short-circuiting

#[no_mangle]
pub fn good_eq(x: &Entity, y: &Entity) -> bool {
    (x.g == y.g) & (x.i == y.i)
}

gives a much-simpler codegen

good_eq:
        mov     rax, qword ptr [rsi]
        cmp     qword ptr [rdi], rax
        sete    al
        ret

This appears to be related to LLVM not knowing whether the second field is poison, as Alive2 confirms that LLVM isn't allowed to convert the former into the latter (at least for the optimized forms): https://alive2.llvm.org/ce/z/bAsJGN

Is there maybe some metadata we could put on the parameter attributes to tell LLVM that reading them isn't poison? It appears that just reading them first, like (same godbolt link above)

#[no_mangle]
pub fn failed_workaround(x: &Entity, y: &Entity) -> bool {
    let Entity { g: g1, i: i1 } = *x;
    let Entity { g: g2, i: i2 } = *y;
    g1 == g2 && i1 == i2
}

still isn't enough for it to remove the short-circuiting, as even though that emits the !noundef loads first, it seems like LLVM's SROAPass moves them behind the branch from &&.


FWIW, clang(trunk) has the same codegen difference: https://cpp.godbolt.org/z/bbaz196GP

It might not have a choice, though, since C++ references are mostly just pointers.

@scottmcm scottmcm added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Nov 11, 2023
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Nov 11, 2023
@Jules-Bertholet
Copy link
Contributor

Jules-Bertholet commented Nov 11, 2023

Miri reports no UB on the following code (Playground):

use core::mem::MaybeUninit;

#[derive(Copy, Clone, PartialEq, Eq)]
#[repr(C)]
pub struct Entity {
    g: u32,
    i: u32,
}

#[no_mangle]
pub fn derived_eq(x: &Entity, y: &Entity) -> bool {
    x == y
}

#[derive(Copy, Clone)]
#[repr(C)]
pub struct EntityUninit {
    g: u32,
    i: MaybeUninit<u32>,
}

fn main() {
    let left_ref = &Entity { g: 1, i: 1 };
    let right_uninit = EntityUninit {
        g: 0,
        i: MaybeUninit::uninit(),
    };
    let right_ref = unsafe { &*(&right_uninit as *const EntityUninit as *const Entity) };
    dbg!(derived_eq(left_ref, right_ref));
}

So LLVM is completely correct in its conservatism here.

@rustbot label A-codegen T-opsem

@scottmcm
Copy link
Member Author

scottmcm commented Nov 11, 2023

Agreed that LLVM is correct to be conservative for the currently-derived code; that's why I opened it as a "what can rust do to let it distinguish these cases". But it's also unnecessarily-conservative in the failed_workaround example -- there we told it !noundef on the loads of all 4 fields before the branches, and MIRI confirms that there's UB for it if I update your example to use it instead of just ==: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=7af918d80f3d5347e661a381fe675489

Here's a zulip lang thread I was just typing up to talk about what the Rust references rules should be to allow us to do something better, which I opened this issue to use to seed a discussion https://rust-lang.zulipchat.com/#narrow/stream/213817-t-lang/topic/Poison-ness.20behind.20references/near/401455323

(The zulip thread ended up with a very similar example https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=159748907525ea9c1f7d0c568ae75eb1 🙂)

@scottmcm scottmcm added the T-lang Relevant to the language team, which will review and decide on the PR/issue. label Nov 11, 2023
@saethlin saethlin added I-slow Issue: Problems and improvements with respect to performance of generated code. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Nov 11, 2023
@rustbot rustbot added A-codegen Area: Code generation T-opsem Relevant to the opsem team labels Nov 11, 2023
@comex
Copy link
Contributor

comex commented Nov 11, 2023

Note that the two assembly snippets seem 100% equivalent. The longer version is still doing the loads unconditionally (which makes sense since the arguments are dereferenceable), and of course x86 has no concept of undef.

Thus, even if that particular IR transformation would be invalid, it's still suboptimal for LLVM not to find some way to optimize this.

(Perhaps by having the optimization insert freeze instructions?)

github-merge-queue bot pushed a commit to bevyengine/bevy that referenced this issue Nov 14, 2023
(This is my first PR here, so I've probably missed some things. Please
let me know what else I should do to help you as a reviewer!)

# Objective

Due to rust-lang/rust#117800, the `derive`'d
`PartialEq::eq` on `Entity` isn't as good as it could be. Since that's
used in hashtable lookup, let's improve it.

## Solution

The derived `PartialEq::eq` short-circuits if the generation doesn't
match. However, having a branch there is sub-optimal, especially on
64-bit systems like x64 that could just load the whole `Entity` in one
load anyway.

Due to complications around `poison` in LLVM and the exact details of
what unsafe code is allowed to do with reference in Rust
(rust-lang/unsafe-code-guidelines#346), LLVM
isn't allowed to completely remove the short-circuiting. `&Entity` is
marked `dereferencable(8)` so LLVM knows it's allowed to *load* all 8
bytes -- and does so -- but it has to assume that the `index` might be
undef/poison if the `generation` doesn't match, and thus while it finds
a way to do it without needing a branch, it has to do something slightly
more complicated than optimal to combine the results. (LLVM is allowed
to change non-short-circuiting code to use branches, but not the other
way around.)

Here's a link showing the codegen today:
<https://rust.godbolt.org/z/9WzjxrY7c>
```rust
#[no_mangle]
pub fn demo_eq_ref(a: &Entity, b: &Entity) -> bool {
    a == b
}
```
ends up generating the following assembly:
```asm
demo_eq_ref:
        movq    xmm0, qword ptr [rdi]
        movq    xmm1, qword ptr [rsi]
        pcmpeqd xmm1, xmm0
        pshufd  xmm0, xmm1, 80
        movmskpd        eax, xmm0
        cmp     eax, 3
        sete    al
        ret
```
(It's usually not this bad in real uses after inlining and LTO, but it
makes a strong demo.)

This PR manually implements `PartialEq::eq` *without* short-circuiting,
and because that tells LLVM that neither the generations nor the index
can be poison, it doesn't need to be so careful and can generate the
"just compare the two 64-bit values" code you'd have probably already
expected:
```asm
demo_eq_ref:
        mov     rax, qword ptr [rsi]
        cmp     qword ptr [rdi], rax
        sete    al
        ret
```

Since this doesn't change the representation of `Entity`, if it's
instead passed by *value*, then each `Entity` is two `u32` registers,
and the old and the new code do exactly the same thing. (Other
approaches, like changing `Entity` to be `[u32; 2]` or `u64`, affect
this case.)

This should hopefully merge easily with changes like
#9907 that also want to change
`Entity`.

## Benchmarks

I'm not super-confident that I got my machine fully consistent for
benchmarking, but whether I run the old or the new one first I get
reasonably consistent results.

Here's a fairly typical example of the benchmarks I added in this PR:

![image](https://github.com/bevyengine/bevy/assets/18526288/24226308-4616-4082-b0ff-88fc06285ef1)

Building the sets seems to be basically the same. It's usually reported
as noise, but sometimes I see a few percent slower or faster.

But lookup hits in particular -- since a hit checks that the key is
equal -- consistently shows around 10% improvement.

`cargo run --example many_cubes --features bevy/trace_tracy --release --
--benchmark` showed as slightly faster with this change, though if I had
to bet I'd probably say it's more noise than meaningful (but at least
it's not worse either):

![image](https://github.com/bevyengine/bevy/assets/18526288/58bb8c96-9c45-487f-a5ab-544bbfe9fba0)

This is my first PR here -- and my first time running Tracy -- so please
let me know what else I should run, or run things on your own more
reliable machines to double-check.

---

## Changelog

(probably not worth including)

Changed: micro-optimized `Entity::eq` to help LLVM slightly.

## Migration Guide

(I really hope nobody was using this on uninitialized entities where
sufficiently tortured `unsafe` could could technically notice that this
has changed.)
@Sp00ph
Copy link
Member

Sp00ph commented Nov 17, 2023

Interestingly, the codegen for this seems to be quite dependent on what exact target-cpu it's compiled for: https://rust.godbolt.org/z/ebPnGToW3 . It seems that enabling AVX doesn't change LLVM's approach of using vector loads and just uses more compact AVX instructions, but using x86-64-v4 then causes it to emit the right instructions. As already mentioned by @scottmcm, this isn't exclusive to Rust either, C++ has the exact same codegen for all three cases. Because of this I'm guessing it's just an LLVM bug, where it only chooses the best option on x86-64-v4, even though that same option doesn't actually require any special target features, and it'd probably be a good idea to open an LLVM bug report about it instead of endlessly fiddling around with the IR emitted by rustc to try and get LLVM to play along.

rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this issue Jan 9, 2024
(This is my first PR here, so I've probably missed some things. Please
let me know what else I should do to help you as a reviewer!)

# Objective

Due to rust-lang/rust#117800, the `derive`'d
`PartialEq::eq` on `Entity` isn't as good as it could be. Since that's
used in hashtable lookup, let's improve it.

## Solution

The derived `PartialEq::eq` short-circuits if the generation doesn't
match. However, having a branch there is sub-optimal, especially on
64-bit systems like x64 that could just load the whole `Entity` in one
load anyway.

Due to complications around `poison` in LLVM and the exact details of
what unsafe code is allowed to do with reference in Rust
(rust-lang/unsafe-code-guidelines#346), LLVM
isn't allowed to completely remove the short-circuiting. `&Entity` is
marked `dereferencable(8)` so LLVM knows it's allowed to *load* all 8
bytes -- and does so -- but it has to assume that the `index` might be
undef/poison if the `generation` doesn't match, and thus while it finds
a way to do it without needing a branch, it has to do something slightly
more complicated than optimal to combine the results. (LLVM is allowed
to change non-short-circuiting code to use branches, but not the other
way around.)

Here's a link showing the codegen today:
<https://rust.godbolt.org/z/9WzjxrY7c>
```rust
#[no_mangle]
pub fn demo_eq_ref(a: &Entity, b: &Entity) -> bool {
    a == b
}
```
ends up generating the following assembly:
```asm
demo_eq_ref:
        movq    xmm0, qword ptr [rdi]
        movq    xmm1, qword ptr [rsi]
        pcmpeqd xmm1, xmm0
        pshufd  xmm0, xmm1, 80
        movmskpd        eax, xmm0
        cmp     eax, 3
        sete    al
        ret
```
(It's usually not this bad in real uses after inlining and LTO, but it
makes a strong demo.)

This PR manually implements `PartialEq::eq` *without* short-circuiting,
and because that tells LLVM that neither the generations nor the index
can be poison, it doesn't need to be so careful and can generate the
"just compare the two 64-bit values" code you'd have probably already
expected:
```asm
demo_eq_ref:
        mov     rax, qword ptr [rsi]
        cmp     qword ptr [rdi], rax
        sete    al
        ret
```

Since this doesn't change the representation of `Entity`, if it's
instead passed by *value*, then each `Entity` is two `u32` registers,
and the old and the new code do exactly the same thing. (Other
approaches, like changing `Entity` to be `[u32; 2]` or `u64`, affect
this case.)

This should hopefully merge easily with changes like
bevyengine#9907 that also want to change
`Entity`.

## Benchmarks

I'm not super-confident that I got my machine fully consistent for
benchmarking, but whether I run the old or the new one first I get
reasonably consistent results.

Here's a fairly typical example of the benchmarks I added in this PR:

![image](https://github.com/bevyengine/bevy/assets/18526288/24226308-4616-4082-b0ff-88fc06285ef1)

Building the sets seems to be basically the same. It's usually reported
as noise, but sometimes I see a few percent slower or faster.

But lookup hits in particular -- since a hit checks that the key is
equal -- consistently shows around 10% improvement.

`cargo run --example many_cubes --features bevy/trace_tracy --release --
--benchmark` showed as slightly faster with this change, though if I had
to bet I'd probably say it's more noise than meaningful (but at least
it's not worse either):

![image](https://github.com/bevyengine/bevy/assets/18526288/58bb8c96-9c45-487f-a5ab-544bbfe9fba0)

This is my first PR here -- and my first time running Tracy -- so please
let me know what else I should run, or run things on your own more
reliable machines to double-check.

---

## Changelog

(probably not worth including)

Changed: micro-optimized `Entity::eq` to help LLVM slightly.

## Migration Guide

(I really hope nobody was using this on uninitialized entities where
sufficiently tortured `unsafe` could could technically notice that this
has changed.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. T-lang Relevant to the language team, which will review and decide on the PR/issue. T-opsem Relevant to the opsem team
Projects
None yet
Development

No branches or pull requests

6 participants