
Severe slowdown when wrapping a [MaybeUninit<T>; N] in a struct #68484

Open
timvermeulen opened this issue Jan 23, 2020 · 5 comments
Labels
A-codegen Area: Code generation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@timvermeulen
Contributor

timvermeulen commented Jan 23, 2020

Consider the following function:

#![feature(maybe_uninit_extra)]

use std::{mem::MaybeUninit, ops::Range};

const N: usize = 10_000;
const RANGE: Range<u32> = 2500..7500;

fn foo() -> u32 {
    unsafe {
        let mut array = MaybeUninit::<[MaybeUninit<u32>; N]>::uninit().assume_init();
        let mut len = 0;

        for value in RANGE {
            array.get_unchecked_mut(len).write(value);
            len += 1;
        }

        (0..len).map(|i| array.get_unchecked(i).read()).sum()
    }
}

This runs as fast as I would expect. But if I put array and len in a struct, like this:

struct S {
    array: [MaybeUninit<u32>; N],
    len: usize,
}

pub fn bar() -> u32 {
    unsafe {
        let mut s = S {
            array: MaybeUninit::uninit().assume_init(),
            len: 0,
        };

        for value in RANGE {
            s.array.get_unchecked_mut(s.len).write(value);
            s.len += 1;
        }

        (0..s.len).map(|i| s.array.get_unchecked(i).read()).sum()
    }
}

This runs about 15x as slowly, even with the following profile settings (although they didn't change anything):

[profile.bench]
lto = true
codegen-units = 1

with the 2020-01-22 nightly toolchain. This difference can be observed with much smaller values of N, too, and blackboxing values didn't make a difference.

Playground with benchmarks
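The playground benchmark itself isn't reproduced here. A rough stand-alone harness in the same vein (my sketch, not the original playground code) can be written on stable Rust using `std::time::Instant` instead of the nightly `#[bench]` harness; `assume_init_read` is the later stabilized spelling of the `read` used above:

```rust
use std::hint::black_box;
use std::mem::MaybeUninit;
use std::ops::Range;
use std::time::Instant;

const N: usize = 10_000;
const RANGE: Range<u32> = 2500..7500;

fn foo() -> u32 {
    unsafe {
        let mut array = MaybeUninit::<[MaybeUninit<u32>; N]>::uninit().assume_init();
        let mut len = 0;

        for value in RANGE {
            array.get_unchecked_mut(len).write(value);
            len += 1;
        }

        (0..len).map(|i| array.get_unchecked(i).assume_init_read()).sum()
    }
}

fn main() {
    let iters = 1_000u32;
    let start = Instant::now();
    let mut acc = 0u32;
    for _ in 0..iters {
        // black_box keeps the optimizer from hoisting the call out of the loop.
        acc = acc.wrapping_add(black_box(foo()));
    }
    let elapsed = start.elapsed();
    println!("~{} ns/iter (acc = {})", elapsed.as_nanos() / iters as u128, acc);
}
```

This is only a crude wall-clock measurement; the `#[bench]` numbers quoted elsewhere in the thread come from the nightly `libtest` harness.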

@jonas-schievink jonas-schievink added A-codegen Area: Code generation I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 23, 2020
@JohnTitor JohnTitor added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Mar 8, 2020
@workingjubilee
Member

workingjubilee commented Jul 6, 2022

So, read() is now spelled assume_init_read(), but otherwise this reproduces almost the same disparity: around 8.3x.

running 2 tests
test bench1 ... bench:          11 ns/iter (+/- 0)
test bench2 ... bench:          91 ns/iter (+/- 0)

However, you claim that this cannot be black-boxed away. I took a look at the generated assembly, and it seems rustc and LLVM look at the first array and simply constant-fold the entire thing into the binary: it just reads a bunch of constant data and sums it. Frankly, it's a miracle there's any work with an array at all! In the latter case, however, the compilers seem to have trouble seeing what they can do, and instead do the math "honestly".
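To illustrate what that constant-folding amounts to (my framing, not part of the original comment): with the range and every write fully known, the whole benchmark reduces to the sum of an arithmetic series, which the optimizer can compute up front.

```rust
// For RANGE = 2500..7500, the summed result is a compile-time-knowable
// arithmetic series: 5000 terms running from 2500 to 7499.
fn main() {
    let sum: u32 = (2500u32..7500).sum();
    // Closed form: count * (first + last) / 2.
    let closed_form = 5000 * (2500 + 7499) / 2;
    assert_eq!(sum, closed_form);
    println!("{}", sum); // prints 24997500
}
```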

So, targeting LLVM's ability to predict what the array will consist of, I black boxed things. Obscuring the value is not enough, since it can still see the RANGE and then has the advantage of emitting exactly the right number of vector adds to hit that entire chunk of memory, not having to waste any time on calculating how much further it can go. So instead, I black box both value and RANGE, which makes the code like so:

#![feature(bench_black_box)]
use std::mem::MaybeUninit;
use std::ops::Range;
use std::hint::black_box;

const N: usize = 400;
const RANGE: Range<u32> = 100..300;

pub fn foo() -> u32 {
    unsafe {
        let mut array = MaybeUninit::<[MaybeUninit<u32>; N]>::uninit().assume_init();
        let mut len = 0;

        for value in black_box(RANGE) {
            array.get_unchecked_mut(len).write(black_box(value));
            len += 1;
        }

        (0..len).map(|i| array.get_unchecked(i).assume_init_read()).sum()
    }
}

struct S {
    array: [MaybeUninit<u32>; N],
    len: usize,
}

pub fn bar() -> u32 {
    unsafe {
        let mut s = S {
            array: MaybeUninit::uninit().assume_init(),
            len: 0,
        };

        for value in black_box(RANGE) {
            s.array.get_unchecked_mut(s.len).write(black_box(value));
            s.len += 1;
        }

        (0..s.len).map(|i| s.array.get_unchecked(i).assume_init_read()).sum()
    }
}

Doing so collapses the disparity almost entirely, as now both functions are being forced to actually think about how much memory they can read:

running 2 tests
test bench1 ... bench:         100 ns/iter (+/- 0)
test bench2 ... bench:         135 ns/iter (+/- 1)

This is still a nonzero gap, but I don't feel it's clearly worth investigating this further without a much better benchmark in this vein. Feel free to reopen this with one.

@workingjubilee workingjubilee closed this as not planned Won't fix, can't repro, duplicate, stale Jul 6, 2022
@nikic
Contributor

nikic commented Jul 6, 2022

I believe the problem here is that LLVM doesn't know that the accesses to s.len and s.array do not alias. We don't tell it, and this fact is very hard to recover through analysis in this example.
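If that diagnosis is right, a workaround consistent with it (my sketch, `bar_local_len` is a hypothetical name, not code from the thread) is to accumulate the length in a local and store it into the struct once: a separate local is a distinct alloca, so LLVM can see the stores into `s.array` don't clobber it and need not reload it after every write.

```rust
use std::mem::MaybeUninit;
use std::ops::Range;

const N: usize = 400;
const RANGE: Range<u32> = 100..300;

struct S {
    array: [MaybeUninit<u32>; N],
    len: usize,
}

pub fn bar_local_len() -> u32 {
    unsafe {
        let mut s = S {
            array: MaybeUninit::uninit().assume_init(),
            len: 0,
        };

        // Track the length in a local. Unlike `s.len`, a distinct local
        // provably cannot alias the writes into `s.array`, so the counter
        // can stay in a register across the loop.
        let mut len = 0;
        for value in RANGE {
            s.array.get_unchecked_mut(len).write(value);
            len += 1;
        }
        s.len = len;

        (0..s.len)
            .map(|i| s.array.get_unchecked(i).assume_init_read())
            .sum()
    }
}

fn main() {
    // Sum of 100..300 is 39900.
    assert_eq!(bar_local_len(), 39900);
}
```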

@workingjubilee
Member

That would make this a duplicate, effectively, of #16515, no?

@timvermeulen
Contributor Author

Doing so collapses the disparity almost entirely

Thanks so much for looking into this! This unfortunately does not really match what I'm seeing locally, although it's certainly nowhere near a 15x difference anymore. Using the 2022-07-17 nightly toolchain, I'm seeing a ~4.5x difference on aarch64-apple-darwin and a ~2.25x difference on x86_64-apple-darwin.

@workingjubilee
Member

Oooh. A 4x disparity on a relatively high-power aarch64 machine, even after fully black boxing things, makes this a lot more Interesting In Particular.
