-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New optimization: Move non-mutable array of Copy type to .rodata #73825
Comments
The optimisation is certainly possible in many cases but it is not trivial. E.g. A non-mutable array can still be moved. In C you cannot move an array. |
Interestingly, it looks like the array is put in rodata in Rust too, if the array type is change to i32. |
This comment was marked as resolved.
This comment was marked as resolved.
@lzutao That's not putting the array on rodata. You can see it contains just 4 elements; it is indeed just LLVM groups 4 u32 stores into one u128 store and that needs the constant on memory. It is very clear from LLVM IR: Here's the LLVM IR for the debug version: define i32 @_ZN10playground3foo17ha3dfdbe3e3d033d9E(i64 %x) unnamed_addr #0 !dbg !6 {
start:
%x.dbg.spill = alloca i64, align 8
%base = alloca [5 x i32], align 4
store i64 %x, i64* %x.dbg.spill, align 8
call void @llvm.dbg.declare(metadata i64* %x.dbg.spill, metadata !14, metadata !DIExpression()), !dbg !20
call void @llvm.dbg.declare(metadata [5 x i32]* %base, metadata !15, metadata !DIExpression()), !dbg !21
%0 = bitcast [5 x i32]* %base to i32*, !dbg !22
store i32 1, i32* %0, align 4, !dbg !22
%1 = getelementptr inbounds [5 x i32], [5 x i32]* %base, i32 0, i32 1, !dbg !22
store i32 3, i32* %1, align 4, !dbg !22
%2 = getelementptr inbounds [5 x i32], [5 x i32]* %base, i32 0, i32 2, !dbg !22
store i32 4, i32* %2, align 4, !dbg !22
%3 = getelementptr inbounds [5 x i32], [5 x i32]* %base, i32 0, i32 3, !dbg !22
store i32 5, i32* %3, align 4, !dbg !22
%4 = getelementptr inbounds [5 x i32], [5 x i32]* %base, i32 0, i32 4, !dbg !22
store i32 2, i32* %4, align 4, !dbg !22
%_3 = urem i64 %x, 5, !dbg !23
%_6 = icmp ult i64 %_3, 5, !dbg !24
%5 = call i1 @llvm.expect.i1(i1 %_6, i1 true), !dbg !24
br i1 %5, label %bb1, label %panic, !dbg !24
bb1: ; preds = %start
%6 = getelementptr inbounds [5 x i32], [5 x i32]* %base, i64 0, i64 %_3, !dbg !24
%7 = load i32, i32* %6, align 4, !dbg !24
ret i32 %7, !dbg !25
panic: ; preds = %start
; call core::panicking::panic_bounds_check
call void @_ZN4core9panicking18panic_bounds_check17h4b3d0dcda831e378E(i64 %_3, i64 5, %"core::panic::Location"* noalias readonly align 8 dereferenceable(24) bitcast (<{ i8*, [16 x i8] }>* @alloc3 to %"core::panic::Location"*)), !dbg !24
unreachable, !dbg !24
} for the release version define i32 @_ZN10playground3foo17hf1f5714ed87b6704E(i64 %x) unnamed_addr #0 {
start:
%base = alloca [5 x i32], align 16
%0 = bitcast [5 x i32]* %base to i8*
call void @llvm.lifetime.start.p0i8(i64 20, i8* nonnull %0)
%1 = bitcast [5 x i32]* %base to <4 x i32>*
store <4 x i32> <i32 1, i32 3, i32 4, i32 5>, <4 x i32>* %1, align 16
%2 = getelementptr inbounds [5 x i32], [5 x i32]* %base, i64 0, i64 4
store i32 2, i32* %2, align 16
%_3 = urem i64 %x, 5
%3 = getelementptr inbounds [5 x i32], [5 x i32]* %base, i64 0, i64 %_3
%4 = load i32, i32* %3, align 4
call void @llvm.lifetime.end.p0i8(i64 20, i8* nonnull %0)
ret i32 %4
} So clearly the IR we generate initialises array elements one by one. On the other hand, Clang generates this in O0: @__const.foo.bases = private unnamed_addr constant [5 x i32] [i32 1, i32 3, i32 4, i32 5, i32 2], align 16
define dso_local i32 @foo(i64 %0) #0 !dbg !7 {
%2 = alloca i64, align 8
%3 = alloca [5 x i32], align 16
store i64 %0, i64* %2, align 8
call void @llvm.dbg.declare(metadata i64* %2, metadata !19, metadata !DIExpression()), !dbg !20
call void @llvm.dbg.declare(metadata [5 x i32]* %3, metadata !21, metadata !DIExpression()), !dbg !25
%4 = bitcast [5 x i32]* %3 to i8*, !dbg !25
call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 16 %4, i8* align 16 bitcast ([5 x i32]* @__const.foo.bases to i8*), i64 20, i1 false), !dbg !25
%5 = load i64, i64* %2, align 8, !dbg !26
%6 = urem i64 %5, 5, !dbg !27
%7 = getelementptr inbounds [5 x i32], [5 x i32]* %3, i64 0, i64 %6, !dbg !28
%8 = load i32, i32* %7, align 4, !dbg !28
ret i32 %8, !dbg !29
} So it seems that clang has take care of the const promotion of arrays. LLVM is smart enough to observe that memcpy is not necessary and optimise that away, but it is not yet smart enough to do a const promotion. |
Idea: we can just add an optimisation pass to turn any Example: pub fn foo(x: usize) -> u32 {
let base: [u32; 5] = [1,3,4,5,2];
base[x % 5]
} turns into: pub fn foo(x: usize) -> u32 {
let base: [u32; 5] = *&[1,3,4,5,2];
base[x % 5]
} This will enable the optimisation! |
Probably this is an optimisation that we can make for all large const values. I can totally see large structs being benefits by this as memcpy should be faster than initialising fields one by one. I am not sure whether this is something Rust or LLVM should do. It might be easier to implement on Rust side though because we have more information. One caveat would be const arrays like |
Counterpoint: Do rust semantics guarantee that "stack-allocated" arrays have different pointer locations? That is, is this code guaranteed to not panic? fn foo() {
let x: [u32; 5] = [1,2,3,4,5];
let y: [u32; 5] = [1,2,3,4,5];
assert_ne!(&x as *const [u32; 5], &y as *const [u32; 5])
} If so, the proposed optimization pass must take this into account and ensure that, if the pointer location is observed, then the array is actually reified into a copy on the stack. |
@bdonlan My proposed optimisation will turn that into fn foo() {
let x: [u32; 5] = *&[1,2,3,4,5];
let y: [u32; 5] = *&[1,2,3,4,5];
assert_ne!(&x as *const [u32; 5], &y as *const [u32; 5])
} and that shouldn't change the semantics. |
No, only mutable things are planned to not be promoted any more (and fallible operations, but not slice creation). |
…=<try> [WIP] mir-opt: promoting const read-only arrays Modified from a copy of PromoteTemps. It's kind of a hack so nothing fancy and easy to follow and review. I'll attempt to reuse structures from PromoteTemps when there is [consensus for this pass][zulip]. Compiler is doing more work now with this opt. So I don't think this pass improves compiler performance. But anyway, for statistics, can I get a perf run? cc rust-lang#73825 r? ghost [zulip]: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/Could.20const.20read-only.20arrays.20be.20const.20promoted.3F
…=<try> [WIP] mir-opt: promoting const read-only arrays Modified from a copy of PromoteTemps. It's kind of a hack so nothing fancy or easy to follow and review. I'll to reuse structures from PromoteTemps when there is [consensus for this pass][zulip]. Compiler is doing more work now with this opt. So I don't think this pass improves compiler performance. But anyway, for statistics, can I get a perf run? cc rust-lang#73825 r? ghost ### Current status - Waiting for [consensus][zulip]. - Fail simd tests: tests/assembly/simd-intrinsic-mask-load.rs#x86-avx512 - *~Fail test on nested arrays~*: hack fix, may possibly fail on struct containings arrays. - Maybe rewrite to [use GVN with mentor from oli][mentor] [zulip]: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/Could.20const.20read-only.20arrays.20be.20const.20promoted.3F [mentor]: rust-lang#125916 (comment)
…=<try> [WIP] mir-opt: promoting const read-only arrays Modified from a copy of PromoteTemps. It's kind of a hack so nothing fancy or easy to follow and review. I'll to reuse structures from PromoteTemps when there is [consensus for this pass][zulip]. Compiler is doing more work now with this opt. So I don't think this pass improves compiler performance. But anyway, for statistics, can I get a perf run? cc rust-lang#73825 r? ghost ### Current status - [ ] Waiting for [consensus][zulip]. - [ ] Maybe rewrite to [use GVN with mentor from oli][mentor] - [x] ~ICE on unstable feature: tests/assembly/simd-intrinsic-mask-load.rs#x86-avx512.~ In particular `Simd([literal array])` now transformed to `Simd(array_var)`. Maybe I should ignore array in constructor. - [x] *~Fail test on nested arrays~* [zulip]: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/Could.20const.20read-only.20arrays.20be.20const.20promoted.3F [mentor]: rust-lang#125916 (comment)
[WIP] gvn: Promote/propagate const local array Rewriting of rust-lang#125916 which used PromoteTemps pass. Fix rust-lang#73825 ### Current status - [ ] Waiting for [consensus](https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/Could.20const.20read-only.20arrays.20be.20const.20promoted.3F). r? ghost
[WIP] gvn: Promote/propagate const local array Rewriting of rust-lang#125916 which used PromoteTemps pass. Fix rust-lang#73825 ### Current status - [ ] Waiting for [consensus](https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/Could.20const.20read-only.20arrays.20be.20const.20promoted.3F). r? ghost
gvn: Promote/propagate const local array Rewriting of rust-lang#125916 which used `PromoteTemps` pass. This allows promoting constant local arrays as anonymous constants. So that's in codegen for a local array, rustc outputs `llvm.memcpy` (which is easy for LLVM to optimize) instead of a series of `store` on stack (a.k.a in-place initialization). This makes rustc on par with clang on this specific case. See more in rust-lang#73825 or [zulip][opsem] for more info. [Here is a simple micro benchmark][bench] that shows the performance differences between promoting arrays or not. [Prior discussions on zulip][opsem]. This patch [saves about 600 KB][perf] (~0.5%) of `librustc_driver.so`. ![image](https://github.com/rust-lang/rust/assets/15225902/0e37559c-f5d9-4cdf-b7e3-a2956fd17bc1) Fix rust-lang#73825 r? cjgillot ### Unresolved questions - [ ] Should we ignore nested arrays? I think that promoting nested arrays is bloating codegen. - [ ] Should stack_threshold be at least 32 bytes? Like the benchmark showed. If yes, the test should be updated to make arrays larger than 32 bytes. - [x] ~Is this concerning that `call(move _1)` is now `call(const [array])`?~ It reverted back to `call(move _1)` [opsem]: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/Could.20const.20read-only.20arrays.20be.20const.20promoted.3F [bench]: rust-lang/rust-clippy#12854 (comment) [perf]: https://perf.rust-lang.org/compare.html?start=f9515fdd5aa132e27d9b580a35b27f4b453251c1&end=7e160d4b55bb5a27be0696f45db247ccc2e166d9&stat=size%3Alinked_artifact&tab=artifact-size
gvn: Promote/propagate const local array Rewriting of rust-lang#125916 which used `PromoteTemps` pass. This allows promoting constant local arrays as anonymous constants. So that's in codegen for a local array, rustc outputs `llvm.memcpy` (which is easy for LLVM to optimize) instead of a series of `store` on stack (a.k.a in-place initialization). This makes rustc on par with clang on this specific case. See more in rust-lang#73825 or [zulip][opsem] for more info. [Here is a simple micro benchmark][bench] that shows the performance differences between promoting arrays or not. [Prior discussions on zulip][opsem]. This patch [saves about 600 KB][perf] (~0.5%) of `librustc_driver.so`. ![image](https://github.com/rust-lang/rust/assets/15225902/0e37559c-f5d9-4cdf-b7e3-a2956fd17bc1) Fix rust-lang#73825 r? cjgillot ### Unresolved questions - [ ] Should we ignore nested arrays? I think that promoting nested arrays is bloating codegen. - [ ] Should stack_threshold be at least 32 bytes? Like the benchmark showed. If yes, the test should be updated to make arrays larger than 32 bytes. - [x] ~Is this concerning that `call(move _1)` is now `call(const [array])`?~ It reverted back to `call(move _1)` [opsem]: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-opsem/topic/Could.20const.20read-only.20arrays.20be.20const.20promoted.3F [bench]: rust-lang/rust-clippy#12854 (comment) [perf]: https://perf.rust-lang.org/compare.html?start=f9515fdd5aa132e27d9b580a35b27f4b453251c1&end=7e160d4b55bb5a27be0696f45db247ccc2e166d9&stat=size%3Alinked_artifact&tab=artifact-size
Why is this closed now? To be clear, I am fine with saying "we don't care enough about this optimization to track it in an issue", but closing an issue should come with a comment explaining the reason. :) |
It seems to me that I am the only one who're interested enough for the optimization. |
Continue the discussion in #73780 (comment).
nbdd0121 said that:
Consider this code: https://rust.godbolt.org/z/6TBstc
The Rust compiler puts
base
on the stack in both opt-level 2 and 3:The same code in C (I know that C array is mostly pointer): https://godbolt.org/z/WYbKs97E6
clang (with
-O1
) optimizes out thebases
array to.rodata
and use that static array directly.gcc also fails to optimized. In
-Oz
gcc optimizes it to a memcpy.Could the same optimization be done in Rust?
The text was updated successfully, but these errors were encountered: