Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New optimization: Move non-mutable array of Copy type to .rodata #73825

Closed
tesuji opened this issue Jun 28, 2020 · 12 comments
Closed

New optimization: Move non-mutable array of Copy type to .rodata #73825

tesuji opened this issue Jun 28, 2020 · 12 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-mir-opt Area: MIR optimizations C-enhancement Category: An issue proposing an enhancement or a PR with one. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such I-heavy Issue: Problems and improvements with respect to binary size of generated code. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@tesuji
Copy link
Contributor

tesuji commented Jun 28, 2020

Continue the discussion in #73780 (comment).

nbdd0121 said that:

let bar: [u32; 10] = [0,10,20,30,40,50,60,70,80,90];

Semantically just says: put this on the stack. Rust is doing exactly what you ask it to do;
it cannot place foo inside rodata because it is a value.


Consider this code: https://rust.godbolt.org/z/6TBstc

pub fn foo(x: usize) -> i32 {
    let base: [i32; 64] = [
        67, 754, 860, 559, 368, 870, 548, 972,
        141, 731, 351, 664, 32, 4, 996, 741,
        203, 292, 237, 480, 151, 940, 777, 540,
        143, 587, 747, 65, 152, 517, 882, 880,
        712, 595, 370, 901, 237, 53, 789, 785,
        912, 650, 896, 367, 316, 392, 62, 473,
        675, 691, 281, 192, 445, 970, 225, 425,
        628, 324, 322, 206, 912, 867, 462, 92
    ];
    base[x % 64]
}

The Rust compiler puts base on the stack in both opt-level 2 and 3:

The same code in C (I know that C array is mostly pointer): https://godbolt.org/z/WYbKs97E6

#include <stdlib.h>
#include <stdint.h>

uint32_t square(size_t x) {
    uint32_t bases[64] = {
        67, 754, 860, 559, 368, 870, 548, 972,
        141, 731, 351, 664, 32, 4, 996, 741,
        203, 292, 237, 480, 151, 940, 777, 540,
        143, 587, 747, 65, 152, 517, 882, 880,
        712, 595, 370, 901, 237, 53, 789, 785,
        912, 650, 896, 367, 316, 392, 62, 473,
        675, 691, 281, 192, 445, 970, 225, 425,
        628, 324, 322, 206, 912, 867, 462, 92
    };
    return bases[x % 64];
}

clang (with -O1) optimizes out the bases array to .rodata and use that static array directly.
gcc also fails to optimized. In -Oz gcc optimizes it to a memcpy.


Could the same optimization be done in Rust?

@tesuji tesuji changed the title New optimization: Move non-mutable array of COPY type to .rodata New optimization: Move non-mutable array of Copy type to .rodata Jun 28, 2020
@nbdd0121
Copy link
Contributor

nbdd0121 commented Jun 28, 2020

The optimisation is certainly possible in many cases but it is not trivial. E.g. A non-mutable array can still be moved. In C you cannot move an array.

@jonas-schievink jonas-schievink added C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 28, 2020
@Mark-Simulacrum
Copy link
Member

Interestingly, it looks like the array is put in rodata in Rust too, if the array type is change to i32.

@tesuji

This comment was marked as resolved.

@nbdd0121
Copy link
Contributor

@lzutao That's not putting the array on rodata. You can see it contains just 4 elements; it is indeed just LLVM groups 4 u32 stores into one u128 store and that needs the constant on memory. It is very clear from LLVM IR:

Here's the LLVM IR for the debug version:

define i32 @_ZN10playground3foo17ha3dfdbe3e3d033d9E(i64 %x) unnamed_addr #0 !dbg !6 {
start:
  %x.dbg.spill = alloca i64, align 8
  %base = alloca [5 x i32], align 4
  store i64 %x, i64* %x.dbg.spill, align 8
  call void @llvm.dbg.declare(metadata i64* %x.dbg.spill, metadata !14, metadata !DIExpression()), !dbg !20
  call void @llvm.dbg.declare(metadata [5 x i32]* %base, metadata !15, metadata !DIExpression()), !dbg !21
  %0 = bitcast [5 x i32]* %base to i32*, !dbg !22
  store i32 1, i32* %0, align 4, !dbg !22
  %1 = getelementptr inbounds [5 x i32], [5 x i32]* %base, i32 0, i32 1, !dbg !22
  store i32 3, i32* %1, align 4, !dbg !22
  %2 = getelementptr inbounds [5 x i32], [5 x i32]* %base, i32 0, i32 2, !dbg !22
  store i32 4, i32* %2, align 4, !dbg !22
  %3 = getelementptr inbounds [5 x i32], [5 x i32]* %base, i32 0, i32 3, !dbg !22
  store i32 5, i32* %3, align 4, !dbg !22
  %4 = getelementptr inbounds [5 x i32], [5 x i32]* %base, i32 0, i32 4, !dbg !22
  store i32 2, i32* %4, align 4, !dbg !22
  %_3 = urem i64 %x, 5, !dbg !23
  %_6 = icmp ult i64 %_3, 5, !dbg !24
  %5 = call i1 @llvm.expect.i1(i1 %_6, i1 true), !dbg !24
  br i1 %5, label %bb1, label %panic, !dbg !24

bb1:                                              ; preds = %start
  %6 = getelementptr inbounds [5 x i32], [5 x i32]* %base, i64 0, i64 %_3, !dbg !24
  %7 = load i32, i32* %6, align 4, !dbg !24
  ret i32 %7, !dbg !25

panic:                                            ; preds = %start
; call core::panicking::panic_bounds_check
  call void @_ZN4core9panicking18panic_bounds_check17h4b3d0dcda831e378E(i64 %_3, i64 5, %"core::panic::Location"* noalias readonly align 8 dereferenceable(24) bitcast (<{ i8*, [16 x i8] }>* @alloc3 to %"core::panic::Location"*)), !dbg !24
  unreachable, !dbg !24
}

for the release version

define i32 @_ZN10playground3foo17hf1f5714ed87b6704E(i64 %x) unnamed_addr #0 {
start:
  %base = alloca [5 x i32], align 16
  %0 = bitcast [5 x i32]* %base to i8*
  call void @llvm.lifetime.start.p0i8(i64 20, i8* nonnull %0)
  %1 = bitcast [5 x i32]* %base to <4 x i32>*
  store <4 x i32> <i32 1, i32 3, i32 4, i32 5>, <4 x i32>* %1, align 16
  %2 = getelementptr inbounds [5 x i32], [5 x i32]* %base, i64 0, i64 4
  store i32 2, i32* %2, align 16
  %_3 = urem i64 %x, 5
  %3 = getelementptr inbounds [5 x i32], [5 x i32]* %base, i64 0, i64 %_3
  %4 = load i32, i32* %3, align 4
  call void @llvm.lifetime.end.p0i8(i64 20, i8* nonnull %0)
  ret i32 %4
}

So clearly the IR we generate initialises array elements one by one. On the other hand, Clang generates this in O0:

@__const.foo.bases = private unnamed_addr constant [5 x i32] [i32 1, i32 3, i32 4, i32 5, i32 2], align 16

define dso_local i32 @foo(i64 %0) #0 !dbg !7 {
  %2 = alloca i64, align 8
  %3 = alloca [5 x i32], align 16
  store i64 %0, i64* %2, align 8
  call void @llvm.dbg.declare(metadata i64* %2, metadata !19, metadata !DIExpression()), !dbg !20
  call void @llvm.dbg.declare(metadata [5 x i32]* %3, metadata !21, metadata !DIExpression()), !dbg !25
  %4 = bitcast [5 x i32]* %3 to i8*, !dbg !25
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 16 %4, i8* align 16 bitcast ([5 x i32]* @__const.foo.bases to i8*), i64 20, i1 false), !dbg !25
  %5 = load i64, i64* %2, align 8, !dbg !26
  %6 = urem i64 %5, 5, !dbg !27
  %7 = getelementptr inbounds [5 x i32], [5 x i32]* %3, i64 0, i64 %6, !dbg !28
  %8 = load i32, i32* %7, align 4, !dbg !28
  ret i32 %8, !dbg !29
}

So it seems that clang has take care of the const promotion of arrays. LLVM is smart enough to observe that memcpy is not necessary and optimise that away, but it is not yet smart enough to do a const promotion.

@nbdd0121
Copy link
Contributor

nbdd0121 commented Jun 29, 2020

Idea: we can just add an optimisation pass to turn any expr: [T;N] into *&expr when T:Copy

Example:

pub fn foo(x: usize) -> u32 {
    let base: [u32; 5] = [1,3,4,5,2];
    base[x % 5]
}

turns into:

pub fn foo(x: usize) -> u32 {
    let base: [u32; 5] = *&[1,3,4,5,2];
    base[x % 5]
}

This will enable the optimisation!

@nbdd0121
Copy link
Contributor

Probably this is an optimisation that we can make for all large const values. I can totally see large structs being benefits by this as memcpy should be faster than initialising fields one by one. I am not sure whether this is something Rust or LLVM should do. It might be easier to implement on Rust side though because we have more information.

One caveat would be const arrays like [42; 1000], which is cheaper to initialise than copying from rodata.

@bdonlan
Copy link
Contributor

bdonlan commented Jun 29, 2020

Counterpoint: Do rust semantics guarantee that "stack-allocated" arrays have different pointer locations? That is, is this code guaranteed to not panic?

fn foo() {
  let x: [u32; 5] = [1,2,3,4,5];
  let y: [u32; 5] = [1,2,3,4,5];

  assert_ne!(&x as *const [u32; 5], &y as *const [u32; 5])
}

If so, the proposed optimization pass must take this into account and ensure that, if the pointer location is observed, then the array is actually reified into a copy on the stack.

@nbdd0121
Copy link
Contributor

nbdd0121 commented Jul 1, 2020

@bdonlan My proposed optimisation will turn that into

fn foo() {
  let x: [u32; 5] = *&[1,2,3,4,5];
  let y: [u32; 5] = *&[1,2,3,4,5];

  assert_ne!(&x as *const [u32; 5], &y as *const [u32; 5])
}

and that shouldn't change the semantics.

@tesuji
Copy link
Contributor Author

tesuji commented Aug 28, 2020

There are missed-optimizations like #75978. One has to borrow the static array to make optimization happens.
But there are plans to demote static slice,which may cause optimization losses.
Not sure what to do, but I will cc @RalfJung about MIR.

@rustbot modify labels: A-mir-opt A-LLVM

@rustbot rustbot added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-mir-opt Area: MIR optimizations labels Aug 28, 2020
@RalfJung
Copy link
Member

But there are plans to demote static slice,which may cause optimization losses.

No, only mutable things are planned to not be promoted any more (and fallible operations, but not slice creation).

@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023
@rustbot rustbot added the I-heavy Issue: Problems and improvements with respect to binary size of generated code. label May 20, 2024
bors added a commit to rust-lang-ci/rust that referenced this issue Jun 3, 2024
…=<try>

[WIP] mir-opt: promoting const read-only arrays

Modified from a copy of PromoteTemps. It's kind of a hack so nothing fancy and easy to follow and review.
I'll attempt to reuse structures from PromoteTemps when there is [consensus for this pass][zulip].

Compiler is doing more work now with this opt. So I don't think this pass improves compiler performance.
But anyway, for statistics, can I get a perf run?

cc rust-lang#73825

r? ghost

[zulip]: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/Could.20const.20read-only.20arrays.20be.20const.20promoted.3F
bors added a commit to rust-lang-ci/rust that referenced this issue Jun 4, 2024
…=<try>

[WIP] mir-opt: promoting const read-only arrays

Modified from a copy of PromoteTemps. It's kind of a hack so nothing fancy or easy to follow and review.
I'll  to reuse structures from PromoteTemps when there is [consensus for this pass][zulip].

Compiler is doing more work now with this opt. So I don't think this pass improves compiler performance.
But anyway, for statistics, can I get a perf run?

cc rust-lang#73825

r? ghost

### Current status
- Waiting for [consensus][zulip].
- Fail simd tests: tests/assembly/simd-intrinsic-mask-load.rs#x86-avx512
- *~Fail test on nested arrays~*: hack fix, may possibly fail on struct containings arrays.
- Maybe rewrite to [use GVN with mentor from oli][mentor]

[zulip]: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/Could.20const.20read-only.20arrays.20be.20const.20promoted.3F
[mentor]: rust-lang#125916 (comment)
bors added a commit to rust-lang-ci/rust that referenced this issue Jun 10, 2024
…=<try>

[WIP] mir-opt: promoting const read-only arrays

Modified from a copy of PromoteTemps. It's kind of a hack so nothing fancy or easy to follow and review.
I'll  to reuse structures from PromoteTemps when there is [consensus for this pass][zulip].

Compiler is doing more work now with this opt. So I don't think this pass improves compiler performance.
But anyway, for statistics, can I get a perf run?

cc rust-lang#73825

r? ghost

### Current status
- [ ] Waiting for [consensus][zulip].
- [ ] Maybe rewrite to [use GVN with mentor from oli][mentor]
- [x] ~ICE on unstable feature:  tests/assembly/simd-intrinsic-mask-load.rs#x86-avx512.~
  In particular `Simd([literal array])` now transformed to `Simd(array_var)`. Maybe I should ignore array in constructor.
- [x] *~Fail test on nested arrays~*

[zulip]: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/Could.20const.20read-only.20arrays.20be.20const.20promoted.3F
[mentor]: rust-lang#125916 (comment)
bors added a commit to rust-lang-ci/rust that referenced this issue Jun 13, 2024
[WIP] gvn: Promote/propagate const local array

Rewriting of rust-lang#125916 which used PromoteTemps pass.

Fix rust-lang#73825

### Current status

- [ ] Waiting for [consensus](https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/Could.20const.20read-only.20arrays.20be.20const.20promoted.3F).

r? ghost
bors added a commit to rust-lang-ci/rust that referenced this issue Jun 14, 2024
[WIP] gvn: Promote/propagate const local array

Rewriting of rust-lang#125916 which used PromoteTemps pass.

Fix rust-lang#73825

### Current status

- [ ] Waiting for [consensus](https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/Could.20const.20read-only.20arrays.20be.20const.20promoted.3F).

r? ghost
bors added a commit to rust-lang-ci/rust that referenced this issue Jul 14, 2024
gvn: Promote/propagate const local array

Rewriting of rust-lang#125916 which used `PromoteTemps` pass.

This allows promoting constant local arrays as anonymous constants. So that's in codegen for
a local array, rustc outputs `llvm.memcpy` (which is easy for LLVM to optimize) instead of a series
of `store` on stack (a.k.a in-place initialization). This makes rustc on par with clang on this specific case.
See more in rust-lang#73825 or [zulip][opsem] for more info.

[Here is a simple micro benchmark][bench] that shows the performance differences between promoting arrays or not.

[Prior discussions on zulip][opsem].

This patch [saves about 600 KB][perf] (~0.5%) of `librustc_driver.so`.
![image](https://github.com/rust-lang/rust/assets/15225902/0e37559c-f5d9-4cdf-b7e3-a2956fd17bc1)

Fix rust-lang#73825

r? cjgillot

### Unresolved questions
- [ ] Should we ignore nested arrays?
    I think that promoting nested arrays is bloating codegen.
- [ ] Should stack_threshold be at least 32 bytes? Like the benchmark showed.
    If yes, the test should be updated to make arrays larger than 32 bytes.
- [x] ~Is this concerning that  `call(move _1)` is now `call(const [array])`?~
  It reverted back to `call(move _1)`

[opsem]: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/Could.20const.20read-only.20arrays.20be.20const.20promoted.3F
[bench]: rust-lang/rust-clippy#12854 (comment)
[perf]: https://perf.rust-lang.org/compare.html?start=f9515fdd5aa132e27d9b580a35b27f4b453251c1&end=7e160d4b55bb5a27be0696f45db247ccc2e166d9&stat=size%3Alinked_artifact&tab=artifact-size
bors added a commit to rust-lang-ci/rust that referenced this issue Jul 14, 2024
gvn: Promote/propagate const local array

Rewriting of rust-lang#125916 which used `PromoteTemps` pass.

This allows promoting constant local arrays as anonymous constants. So that's in codegen for
a local array, rustc outputs `llvm.memcpy` (which is easy for LLVM to optimize) instead of a series
of `store` on stack (a.k.a in-place initialization). This makes rustc on par with clang on this specific case.
See more in rust-lang#73825 or [zulip][opsem] for more info.

[Here is a simple micro benchmark][bench] that shows the performance differences between promoting arrays or not.

[Prior discussions on zulip][opsem].

This patch [saves about 600 KB][perf] (~0.5%) of `librustc_driver.so`.
![image](https://github.com/rust-lang/rust/assets/15225902/0e37559c-f5d9-4cdf-b7e3-a2956fd17bc1)

Fix rust-lang#73825

r? cjgillot

### Unresolved questions
- [ ] Should we ignore nested arrays?
    I think that promoting nested arrays is bloating codegen.
- [ ] Should stack_threshold be at least 32 bytes? Like the benchmark showed.
    If yes, the test should be updated to make arrays larger than 32 bytes.
- [x] ~Is this concerning that  `call(move _1)` is now `call(const [array])`?~
  It reverted back to `call(move _1)`

[opsem]: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/Could.20const.20read-only.20arrays.20be.20const.20promoted.3F
[bench]: rust-lang/rust-clippy#12854 (comment)
[perf]: https://perf.rust-lang.org/compare.html?start=f9515fdd5aa132e27d9b580a35b27f4b453251c1&end=7e160d4b55bb5a27be0696f45db247ccc2e166d9&stat=size%3Alinked_artifact&tab=artifact-size
@tesuji tesuji closed this as completed Aug 22, 2024
@RalfJung
Copy link
Member

RalfJung commented Aug 22, 2024

Why is this closed now?

To be clear, I am fine with saying "we don't care enough about this optimization to track it in an issue", but closing an issue should come with a comment explaining the reason. :)

@tesuji
Copy link
Contributor Author

tesuji commented Aug 22, 2024

It seems to me that I am the only one who're interested enough for the optimization.
I worked a bit on that but the changes are in limbo state since then.
I lost my interested now so this issue might be not useful anymore.
Feel free to reopen it if I'm wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-mir-opt Area: MIR optimizations C-enhancement Category: An issue proposing an enhancement or a PR with one. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such I-heavy Issue: Problems and improvements with respect to binary size of generated code. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants