Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify logical operations CFG #83663

Merged
merged 1 commit into from
Apr 2, 2021
Merged

Simplify logical operations CFG #83663

merged 1 commit into from
Apr 2, 2021

Conversation

AngelicosPhosphoros
Copy link
Contributor

This is basically same commit as e38e954 which was reverted later in 676953f
In both cases, this changes weren't benchmarked.
e38e954 leads to missed optimization described in this issue
676953f leads to missed optimization described in this issue

@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @estebank (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 29, 2021
@AngelicosPhosphoros
Copy link
Contributor Author

I would like to benchmark this changes.

This code like to fix the issue with failed SIMD (#83623)

Code:

pub struct Blueprint {
    pub fuel_tank_size: u32,
    pub payload: u32,
    pub wheel_diameter: u32,
    pub wheel_width: u32,
    pub storage: u32,
}
impl PartialEq for Blueprint{
   fn eq(&self, other: &Self)->bool{
      (self.fuel_tank_size == other.fuel_tank_size)
           && (self.payload == other.payload)
           && (self.wheel_diameter == other.wheel_diameter)
           && (self.wheel_width == other.wheel_width)
           && (self.storage == other.storage)
   }
}

LLVM IR:

; ModuleID = 'eq_test.3a1fbbbh-cgu.0'
source_filename = "eq_test.3a1fbbbh-cgu.0"
target datalayout = "e-m:w-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-windows-msvc"

%Blueprint = type { [0 x i32], i32, [0 x i32], i32, [0 x i32], i32, [0 x i32], i32, [0 x i32], i32, [0 x i32] }

; <eq_test::Blueprint as core::cmp::PartialEq>::eq
; Function Attrs: norecurse nounwind readonly uwtable willreturn
define zeroext i1 @"_ZN59_$LT$eq_test..Blueprint$u20$as$u20$core..cmp..PartialEq$GT$2eq17he4e85086691c6c20E"(%Blueprint* noalias nocapture readonly align 4 dereferenceable(20) %self, %Blueprint* noalias nocapture readonly align 4 dereferenceable(20) %other) unnamed_addr #0 {
start:
  %0 = bitcast %Blueprint* %self to <4 x i32>*
  %1 = load <4 x i32>, <4 x i32>* %0, align 4
  %2 = bitcast %Blueprint* %other to <4 x i32>*
  %3 = load <4 x i32>, <4 x i32>* %2, align 4
  %4 = icmp eq <4 x i32> %1, %3
  %5 = getelementptr inbounds %Blueprint, %Blueprint* %self, i64 0, i32 9
  %_19 = load i32, i32* %5, align 4
  %6 = getelementptr inbounds %Blueprint, %Blueprint* %other, i64 0, i32 9
  %_20 = load i32, i32* %6, align 4
  %_18 = icmp eq i32 %_19, %_20
  %7 = call i1 @llvm.vector.reduce.and.v4i1(<4 x i1> %4)
  %8 = and i1 %7, %_18
  ret i1 %8
}

; Function Attrs: nofree nosync nounwind readnone willreturn
declare i1 @llvm.vector.reduce.and.v4i1(<4 x i1>) #1

attributes #0 = { norecurse nounwind readonly uwtable willreturn "target-cpu"="znver1" }
attributes #1 = { nofree nosync nounwind readnone willreturn }

!llvm.module.flags = !{!0}

!0 = !{i32 7, !"PIC Level", i32 2}

And ASM:

_ZN59_$LT$eq_test..Blueprint$u20$as$u20$core..cmp..PartialEq$GT$2eq17he4e85086691c6c20E:
	vmovdqu	(%rcx), %xmm0
	movl	16(%rcx), %eax
	vpcmpeqd	(%rdx), %xmm0, %xmm0
	cmpl	16(%rdx), %eax
	vmovmskps	%xmm0, %eax
	sete	%cl
	cmpb	$15, %al
	sete	%al
	andb	%cl, %al
	retq

@bjorn3
Copy link
Member

bjorn3 commented Mar 29, 2021

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 29, 2021
@bors
Copy link
Contributor

bors commented Mar 29, 2021

⌛ Trying commit c343862c31c77abdb223df0c6de9a09f17c1557c with merge 6b7dccdb4b4fbb645bdfc176bf01a0f0b9941bb5...

@nagisa
Copy link
Member

nagisa commented Mar 29, 2021

Before this lands an addition of a codegen test will be necessary so that we don't regress this case again.

@bors
Copy link
Contributor

bors commented Mar 29, 2021

☀️ Try build successful - checks-actions
Build commit: 6b7dccdb4b4fbb645bdfc176bf01a0f0b9941bb5 (6b7dccdb4b4fbb645bdfc176bf01a0f0b9941bb5)

@rust-timer
Copy link
Collaborator

Queued 6b7dccdb4b4fbb645bdfc176bf01a0f0b9941bb5 with parent 2917eda, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (6b7dccdb4b4fbb645bdfc176bf01a0f0b9941bb5): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 30, 2021
@mati865
Copy link
Contributor

mati865 commented Mar 30, 2021

Benchmark result shows slight improvement in compilation time for MIR and LLVM passes.
It's too close to the noise to determine runtime changes.

@estebank
Copy link
Contributor

I don't feel comfortable reviewing this. @nagisa would you mind taking this?

@nagisa
Copy link
Member

nagisa commented Mar 30, 2021

r? @nagisa

@rust-highfive rust-highfive assigned nagisa and unassigned estebank Mar 30, 2021
@AngelicosPhosphoros
Copy link
Contributor Author

Benchmark result shows slight improvement in compilation time for MIR and LLVM passes.
It's too close to the noise to determine runtime changes.

Should we merge this and should I continue to work on this then?
@nagisa

@AngelicosPhosphoros
Copy link
Contributor Author

Hacky code from eq2 case (#83623) and #[derive(PartialEq)] cannot be vectorized when comparing arrays of structure on current compiler.
godbolt link

Code from this PR handles vectorizing of slices of such struct fine:

code
#[derive(PartialEq)]
pub struct Blueprint {
    pub fuel_tank_size: u32,
    pub payload: u32,
    pub wheel_diameter: u32,
    pub wheel_width: u32,
    pub storage: u32,
}

#[no_mangle]
pub fn compare_two_arrays(a: &[Blueprint], b: &[Blueprint])->bool{
    a==b
}
compare_two_arrays:
	cmpq	%r9, %rdx
	jne	.LBB0_1
	incq	%rdx
	movl	$16, %r9d
	.p2align	4, 0x90
.LBB0_3:
	decq	%rdx
	sete	%al
	je	.LBB0_6
	vmovdqu	-16(%rcx,%r9), %xmm0
	vpcmpeqd	-16(%r8,%r9), %xmm0, %xmm0
	vmovmskps	%xmm0, %r10d
	cmpb	$15, %r10b
	jne	.LBB0_6
	movl	(%r8,%r9), %r10d
	leaq	20(%r9), %r11
	cmpl	%r10d, (%rcx,%r9)
	movq	%r11, %r9
	je	.LBB0_3
.LBB0_6:
	retq
.LBB0_1:
	xorl	%eax, %eax
	retq

@nagisa
Copy link
Member

nagisa commented Mar 30, 2021

I think it is worthwhile to pursue this based on the assembly & IR that gets produced once this is applied. I would be surprised if the compiler exercised the kinds of code paths we're seeing here to produce good motivation for or against merging this. Microbenchmarks would be a significantly better tool in this situation, I suspect.

I'm somewhat surprised that #62993 was sufficient motivation to revert this in the first place, but from what I can tell it might have happened because there was no information suggesting that the code derived by PartialEq becomes significantly worse without it ("reduces number of basic blocks" seems "meh" enough of an improvement if it regresses an actual code example)

@AngelicosPhosphoros
Copy link
Contributor Author

I wrote a benchmark for more cases.
I run it on toolchains from HEAD~1 and HEAD for my working branch, and used target-cpu=znver1.

So, results:

Legend:

u32s struct:
   5 us32 fields struct
u32s struct/Self - comparison with same cloned vec
u32s struct/Random field - comparison with elements differ only in one random field
u32s struct/First field - comparison with elements differ only in first field
u32s struct/Last field - comparison with elements differ only in last field
u32s struct/Every Second - comparison with vec which every even element is equal and every odd is inequal

String struct:
  Struct with 4 u32 fields and one string field. String fields generated as random char from A to Z.
cmp/u32s struct/Self - with same cloned Vec
cmp/String struct/u32 field - with random u32 field
cmp/String struct/String field - with String field.
HEAD~1
     Running unittests (target\release\deps\bench_u32s-967e14898f5691eb.exe)

Gnuplot not found, using plotters backend
cmp/u32s struct/Self    time:   [6.7752 us 6.7880 us 6.8033 us]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
cmp/u32s struct/Random field
                        time:   [15.176 us 15.204 us 15.234 us]
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe
cmp/u32s struct/First field
                        time:   [2.6016 us 2.6084 us 2.6149 us]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
cmp/u32s struct/Last field
                        time:   [6.7269 us 6.7357 us 6.7459 us]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
cmp/u32s struct/Every Second
                        time:   [12.925 us 12.940 us 12.957 us]
Found 18 outliers among 100 measurements (18.00%)
  18 (18.00%) high severe

     Running unittests (target\release\deps\bench_with_strings-77b5d3b11469681f.exe)

Gnuplot not found, using plotters backend
cmp/String struct/Self  time:   [65.674 us 65.750 us 65.823 us]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
cmp/String struct/u32 field
                        time:   [17.712 us 17.733 us 17.756 us]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
cmp/String struct/String field
                        time:   [10.087 us 10.102 us 10.116 us]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
With PR commit
     Running unittests (target\release\deps\bench_u32s-967e14898f5691eb.exe)
Gnuplot not found, using plotters backend
cmp/u32s struct/Self    time:   [7.1867 us 7.2374 us 7.2920 us]
                        change: [+6.1698% +7.0979% +7.9865%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
cmp/u32s struct/Random field
                        time:   [6.7970 us 6.8126 us 6.8319 us]
                        change: [-54.664% -53.917% -53.142%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
cmp/u32s struct/First field
                        time:   [6.7780 us 6.7868 us 6.7966 us]
                        change: [+158.65% +159.32% +160.07%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe
cmp/u32s struct/Last field
                        time:   [6.7810 us 6.7919 us 6.8034 us]
                        change: [+0.3407% +0.6019% +0.9169%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
cmp/u32s struct/Every Second
                        time:   [6.7663 us 6.7758 us 6.7861 us]
                        change: [-48.465% -48.193% -47.936%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

     Running unittests (target\release\deps\bench_with_strings-77b5d3b11469681f.exe)

Gnuplot not found, using plotters backend
cmp/String struct/Self  time:   [59.658 us 59.743 us 59.833 us]
                        change: [-9.1402% -8.9131% -8.6605%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low severe
  3 (3.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
cmp/String struct/u32 field
                        time:   [3.9043 us 3.9140 us 3.9253 us]
                        change: [-77.864% -77.797% -77.726%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
cmp/String struct/String field
                        time:   [6.9662 us 6.9746 us 6.9833 us]
                        change: [-31.134% -30.958% -30.788%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

So, it looks like that code generated by trunk version is faster when it's execution finishes on first branch and branch predictor correctly predicts this. When branch predictor fails to do this, it is slower 7 times.
SIMD version shows much less variance and it is significantly slower in First field case and faster in worst Random field.
First field case slowed down by 4.1784 microseconds and Random field sped up by 8.3914

I think, it is a clear win.

For anyone who want to look benchmarks code or run it on own machine, I add zipped project.
bench_changes.zip

@nagisa I would add codegen tests now and I wonder if I need to squash my commits myself or it would be done automatically.

@nagisa
Copy link
Member

nagisa commented Mar 30, 2021 via email

@@ -0,0 +1,45 @@
// This test checks that jumps generated by logical operators can be optimized away
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how to write codegen tests so I made this by example.
I can miss something.

@rust-log-analyzer

This comment has been minimized.

@@ -0,0 +1,45 @@
// This test checks that jumps generated by logical operators can be optimized away

// compile-flags: -O3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-Copt-level=3 or perhaps just -O should work.

// This test checks that jumps generated by logical operators can be optimized away

// compile-flags: -O3
// only-64bit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test might fail for some of the 64-bit targets, and may need to be limited to just x86_64, but lets see if this works out, first.

@nagisa
Copy link
Member

nagisa commented Mar 30, 2021

@bors r+ rollup=never

@richkadel
Copy link
Contributor

You may need to set profiler = true in your config.toml

Sorry. I wish that was a default.

@richkadel
Copy link
Contributor

I recommend Ubuntu for generating the new test results. Not Windows.

And if it keeps ignoring the tests, you may need to remove the test results in your build directory and run again.

@richkadel
Copy link
Contributor

Just thinking, you may need to build LLVM, and it has to build with 'optimize = false"

Some constraints I only recently realized. Thanks.

@AngelicosPhosphoros
Copy link
Contributor Author

Can you help me with building LLVM please?

Missed headers
>grep 'No such file or directory'  build/x86_64-unknown-linux-gnu/llvm/buil
d/CMakeFiles/CMakeError.log
CheckIncludeFile.c:1:10: fatal error: malloc/malloc.h: No such file or directory
CheckIncludeFile.c:1:10: fatal error: valgrind/valgrind.h: No such file or directory
CheckIncludeFile.c:1:10: fatal error: mach/mach.h: No such file or directory
CheckIncludeFile.c:1:10: fatal error: histedit.h: No such file or directory
CheckIncludeFile.c:1:10: fatal error: CrashReporterClient.h: No such file or directory
CheckSymbolExists.c:2:10: fatal error: malloc_np.h: No such file or directory
CheckSymbolExists.c:2:10: fatal error: malloc/malloc.h: No such file or directory
CheckSymbolExists.c:2:10: fatal error: os/signpost.h: No such file or directory

@richkadel
Copy link
Contributor

I don't normally have to do anything special, but best practice (for me) in this situation would be:

  1. Completely delete your build directory
  2. Copy the current config.toml.example to config.toml and apply the following:
[llvm]
...
optimize = false # required for `llvm-cov show --debug`
...
[build]
...
profiler = true

I also generally enable [rust] debug = true but that's probably not necessary.

Then:

$ ./x.py test src/test/run-make-fulldeps/coverage --bless

should work.

@AngelicosPhosphoros
Copy link
Contributor Author

AngelicosPhosphoros commented Apr 1, 2021

@richkadel @nagisa

I finally managed to make it compile.
I looks like that I got expected results:

It looks like the only problems are basic block numbers (bb) changed. Should be OK to re-bless if that's the case.

$ ./x.py test src/test/run-make-fulldeps/coverage --bless

As long as the only changed files are spanview files, you're probably good to go (adding the changed expected results to your PR).

If it changes any of the coverage-reports/* expected results, we can take a closer look, but I don't see that in the above failure.

JYI to make it work I changed this:

[llvm]
...
optimize = false
...
[build]
...
profiler = true
...
[target.x86_64-unknown-linux-gnu]
cc = "clang-11"
cxx = "clang++-11"

So, I failed to build LLVM because it used distribution provided C/C++ compiler (gcc) at first.
Your words about optimize = false are true too because using optimize = true deletes coverage reports.

Also, it was very incomfortable to build LLVM on my machine because I needed to limit parallelism since my build constantly killed by OOM on 55 gb of used memory (especially, during linking). Also, rustc compiles very long time (4 hours) if it uses LLVM without optimizations.

Are there any thoughts to move generation of such changes to some build agents? I think, we cut some possible contributors from work because they just haven't such good machines (I have 64 gb RAM on my PC but I don't know anyone other who has).
It woulds be nice if we can call bors to generate commit with such report.

@nagisa
Copy link
Member

nagisa commented Apr 1, 2021

Also, it was very incomfortable to build LLVM on my machine because I needed to limit parallelism since my build constantly killed by OOM on 55 gb of used memory (especially, during linking). Also, rustc compiles very long time (4 hours) if it uses LLVM without optimizations.

Right. Linking will take a ton of memory because debug info is just that huge. There's definitely some improvements to be had around the coverage tests – having to build a unoptimized LLVM any time you need to bless a test suite potentially prevents contributors from making changes at all (what if they only have a machine with, say, 8GB of memory?)

@nagisa
Copy link
Member

nagisa commented Apr 1, 2021

@bors r+

@bors
Copy link
Contributor

bors commented Apr 1, 2021

📌 Commit 87264fa002169ac242f0c74f69a66032274cd88d has been approved by nagisa

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 1, 2021
This is basically same commit as e38e954 which was reverted later in 676953f
In both cases, this changes weren't benchmarked.
e38e954 leads to missed optimization described in [this issue](#62993)
676953f leads to missed optimization described in [this issue](#83623)

Also it changes some src/test/run-make-fulldeps/coverage-spanview/expected_mir_dump* files automatically.
@AngelicosPhosphoros
Copy link
Contributor Author

@nagisa
Just force-pushed with little change:
Removed #[no_mangle] here because it's not valid attribute on impl.

// && chains should not prevent SIMD optimizations for primitives
#[no_mangle]
impl PartialEq for Blueprint{

@richkadel
Copy link
Contributor

having to build a unoptimized LLVM any time you need to bless a test suite potentially prevents contributors from making changes at all (what if they only have a machine with, say, 8GB of memory?)

I agree. I only just made the connection in my head. I'm going to push a PR later this week that removes spanview files and removes the debug output.

The spanview files are no longer as valuable as they used to be (in tests) so that doesn't hurt much.

It's frustrating that I can't call llvm-cov show --debug to show the counters without setting optimize = false for the whole LLVM build. There must be a way to be more granular than that but I'm not aware of it.

But that's not your problem. I think it's worth removing.

cc: @wesleywiser @tmandry

@AngelicosPhosphoros
Copy link
Contributor Author

@nagisa
Copy link
Member

nagisa commented Apr 1, 2021

@bors r+

It's frustrating that I can't call llvm-cov show --debug to show the counters without setting optimize = false

In my experience RelWithDebInfo builds still retain --debug output at least in llc and opt, so that may be an option to pursue. Doesn't solve the issue with linkers though as debug info is a major contributor to the linkers' memory use.

@bors
Copy link
Contributor

bors commented Apr 1, 2021

📌 Commit 4464cc2 has been approved by nagisa

@richkadel
Copy link
Contributor

I assume #83663 will land before #83755, in which case, I'll need to rebase my changes.

But if for some reason #83755 lands first, you can just wipe out the spanview directory:

src/test/run-make-fulldeps/coverage-spanview/

You shouldn't need to re-bless since (AFAICT) only spanview files were changed in this PR. But if you ever do need to rebless, you no longer need to build your own LLVM (once my PR lands). The LLVM debug options are no longer required.

Thanks!

@nikic
Copy link
Contributor

nikic commented Apr 1, 2021

FYI usually debug flags in LLVM are controlled by assertions = true. It shouldn't be necessary to use a debug build of LLVM.

@bors
Copy link
Contributor

bors commented Apr 2, 2021

⌛ Testing commit 4464cc2 with merge d1065e6...

@bors
Copy link
Contributor

bors commented Apr 2, 2021

☀️ Test successful - checks-actions
Approved by: nagisa
Pushing d1065e6 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Apr 2, 2021
@bors bors merged commit d1065e6 into rust-lang:master Apr 2, 2021
@rustbot rustbot added this to the 1.53.0 milestone Apr 2, 2021
@AngelicosPhosphoros AngelicosPhosphoros deleted the simplify_binary_and_to_get_better_asm branch April 2, 2021 10:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.