
LLVM pointer range loop / autovectorization regression #35662

Closed
bluss opened this issue Aug 14, 2016 · 30 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html
I-slow Issue: Problems and improvements with respect to performance of generated code.
P-high High priority
regression-from-stable-to-beta Performance or correctness regression from stable to beta.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@bluss
Member

bluss commented Aug 14, 2016

The benchmarks in crate matrixmultiply version 0.1.8 degrade with MIR enabled. (commit bluss/matrixmultiply@3d83647)

Tested using rustc 1.12.0-nightly (1deb02ea6 2016-08-12).

Typical output:

// -C target-cpu=native
test mat_mul_f32::m127             ... bench:   2,703,773 ns/iter (+/- 636,432)
// -Z orbit=off -C target-cpu=native
test mat_mul_f32::m127             ... bench:     648,817 ns/iter (+/- 22,379)

Sure, the matrix multiplication kernel uses some major muckery that it expects the compiler to optimize down and autovectorize, but since it is technically a regression, it gets a report.
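
For anyone reproducing: the flag pairs shown in the output above can be passed through RUSTFLAGS (a sketch of the invocation, assuming a checkout of the commit above):

RUSTFLAGS="-C target-cpu=native" cargo bench
RUSTFLAGS="-Z orbit=off -C target-cpu=native" cargo bench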

@pmarcelll
Contributor

pmarcelll commented Aug 14, 2016

This might be caused by the LLVM upgrade.

rustc 1.12.0-nightly (7333c4a 2016-07-31) (before the LLVM upgrade):

// -C target-cpu=native
test mat_mul_f32::m127             ... bench:     162,334 ns/iter (+/- 16,429)
// -Z orbit=on -C target-cpu=native
test mat_mul_f32::m127             ... bench:     162,855 ns/iter (+/- 7,302)

rustc 1.12.0-nightly (28ce3e8 2016-08-01) (after the LLVM upgrade, old-trans is still the default):

// -C target-cpu=native
test mat_mul_f32::m127             ... bench:     169,562 ns/iter (+/- 11,118)
// -Z orbit=on -C target-cpu=native
test mat_mul_f32::m127             ... bench:     570,056 ns/iter (+/- 34,4)

Measured on an Intel Haswell processor.

@bluss
Member Author

bluss commented Aug 14, 2016

Thank you, great info.

@eddyb eddyb added I-nominated regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Aug 14, 2016
@eddyb eddyb changed the title MIR autovectorization regression LLVM autovectorization regression Aug 14, 2016
@eddyb
Member

eddyb commented Aug 14, 2016

Thanks to @dikaiosune I produced this minimization (for a different case): https://godbolt.org/g/3Nuofl. Seems to reproduce with clang 3.8 vs running LLVM 3.9's opt -O3.

@eddyb
Member

eddyb commented Aug 14, 2016

@majnemer on IRC pointed out https://reviews.llvm.org/rL268972.

@eddyb eddyb changed the title LLVM autovectorization regression LLVM/MIR autovectorization regression Aug 14, 2016
@eddyb
Member

eddyb commented Aug 14, 2016

@pmarcelll I misread some of those reports, so it's a combination of LLVM 3.9 and MIR trans being used?

On a Haswell, -C target-cpu=native should imply sse4.2 AFAIK, so I'm not sure the linked issue explains the original problem observed here, only @dikaiosune's case.

@eddyb eddyb added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html labels Aug 14, 2016
@bluss
Member Author

bluss commented Aug 15, 2016

@eddyb Haswell has AVX and AVX2 too, doesn't it? So it should imply those as well.
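
One way to isolate which feature matters (a sketch, untested in this thread) would be to pin features explicitly instead of relying on native:

RUSTFLAGS="-C target-feature=+avx,+avx2" cargo bench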

@pmarcelll
Contributor

pmarcelll commented Aug 15, 2016

If my benchmarking is correct and i386 means no SIMD at all, then it's not just an autovectorization regression.

rustc 1.12.0-nightly (7333c4a 2016-07-31):

// -C target-cpu=i386
test mat_mul_f32::m127             ... bench:     209,399 ns/iter (+/- 13,055)
// -Z orbit=on -C target-cpu=i386
test mat_mul_f32::m127             ... bench:     205,028 ns/iter (+/- 11,386)

rustc 1.12.0-nightly (28ce3e8 2016-08-01):

// -C target-cpu=i386
test mat_mul_f32::m127             ... bench:     205,863 ns/iter (+/- 14,618)
// -Z orbit=on -C target-cpu=i386
test mat_mul_f32::m127             ... bench:     512,995 ns/iter (+/- 14,309)

The last one is especially interesting because the i386 version is faster than the haswell version (512,995 ns/iter vs. 570,056 ns/iter).

EDIT: same on the latest nightly.

@eddyb
Member

eddyb commented Aug 15, 2016

Can you reduce this to something that fits on playpen but still exhibits a performance difference?

@bluss
Member Author

bluss commented Aug 15, 2016

This shows the same kind of difference, even though it's quite long, since I wanted to keep the same kernel:

https://play.rust-lang.org/?gist=528273eaf93142b0ce84c5f24de1ccd3&version=nightly&backtrace=0

In this reduced example, just like in matrixmultiply originally, the regressed version still uses vector instructions, just not the same ones, so the vectorization is much less effective.

I found a regression from -Z orbit=off to -Z orbit=on using these command lines on an AVX platform:

rustc -Ctarget-cpu=native -Zorbit=on -C opt-level=3 --test mini.rs
rustc -Ctarget-cpu=native -Zorbit=off -C opt-level=3 --test mini.rs

@eddyb
Member

eddyb commented Aug 15, 2016

I've reduced it to a reproducing (5ns and 7ns) and non-reproducing (5ns and 5ns) version.
The difference between old trans and MIR trans is that MIR trans doesn't aggressively promote everything to copies from .rodata. Even old trans shouldn't, now that I think about it.

But it seems that LLVM has regressed in its ability to optimize out the loop initializing the array literal.
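
In Rust terms, the two lowerings are morally something like the following (an illustrative sketch only, not actual compiler output; kernel_promoted/kernel_loop are hypothetical names):

#![crate_type = "lib"]

// What old trans morally did: promote the literal to .rodata and copy it.
static PROMOTED: [f32; 4] = [0.; 4];

pub fn kernel_promoted() -> f32 {
    let a = PROMOTED; // a memcpy from a constant, trivially folded away
    a[0]
}

// What MIR trans morally does: lower the repeat expression [0.; 4]
// to an element-by-element initialization loop.
pub fn kernel_loop() -> f32 {
    let mut a = [0f32; 4];
    let mut i = 0;
    while i < 4 {
        a[i] = 0.; // the loop LLVM 3.9 no longer optimizes out
        i += 1;
    }
    a[0]
}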

@eddyb
Member

eddyb commented Aug 15, 2016

Looks like an -argpromotion pass and a -scalar-evolution -demanded-bits -slp-vectorizer sequence are missing from rustc's pass setup: https://gist.github.com/eddyb/ec63fd2df8fffb22575f3ada7f5a8e93.
Running opt on the unoptimized IR from rustc results in vectorization.

EDIT: But... only locally? playpen seems to vectorize just fine on "nightly".

EDIT2: I am dumb, "Release" mode is -C opt-level=3, not -O aka -C opt-level=2.

@eddyb
Member

eddyb commented Aug 15, 2016

I've reduced the MIR trans regression to this:

#![crate_type = "lib"]

pub fn kernel() -> f32 {
    let a = [0.; 4];
    a[0]
}

This is optimized out into nothing with old trans but not with MIR trans.

Can someone confirm that before the LLVM update, both old trans and MIR trans produce an empty function (with --emit=llvm-ir -C opt-level=3)?

EDIT: Rewritten to trigger even without llvm.lifetime intrinsics.

EDIT2: Reproduces in clang 3.9, filed https://llvm.org/bugs/show_bug.cgi?id=28987.

@bluss
Member Author

bluss commented Aug 15, 2016

Yes. Using the pre-LLVM-upgrade nightly rustc 1.12.0-nightly (7333c4ac2 2016-07-31), with the command

rustc --emit=llvm-ir -C opt-level=3 -Z orbit nano2.rs

it does produce an empty/simple function without the array init:

; ModuleID = 'nano2.cgu-0.rs'
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: norecurse nounwind uwtable
define float @_ZN5nano26kernel17h90705cd48502cdafE() unnamed_addr #0 {
entry-block:
  ret float 0.000000e+00
}

attributes #0 = { norecurse nounwind uwtable }

@eddyb
Member

eddyb commented Aug 16, 2016

May be fixed by backporting these 4 commits made by @majnemer (old -> new):

EDIT: I've just checked, and these do indeed remove the observable performance differences in the reduced 2x2 matrix multiplication I linked above. However, the dead memset is still there, just with a constant length, which is presumably cheap at small sizes: it most likely writes XMM registers directly, whereas the non-constant memset would have to go into an actual loop writing to memory.

The original 4x4 multiplication is still pretty bad, but at least some smaller cases are less affected.

@eddyb eddyb changed the title LLVM/MIR autovectorization regression LLVM pointer range loop / autovectorization regression Aug 17, 2016
eddyb added a commit to eddyb/rust that referenced this issue Aug 18, 2016
Update LLVM to include 4 backported commits by @majnemer.

Partial fix for rust-lang#35662, should help at least loops on small arrays.

Nominated for backporting into the new beta (not the one that's being released as stable this week).

r? @alexcrichton
@arielb1
Contributor

arielb1 commented Aug 18, 2016

What cases remain after the LLVM improvement?

@nikomatsakis
Contributor

nikomatsakis commented Aug 18, 2016

triage: P-medium

@rust-highfive rust-highfive added P-medium Medium priority and removed I-nominated labels Aug 18, 2016
@nikomatsakis
Contributor

nikomatsakis commented Aug 18, 2016

triage: P-high

Since this is a regression, we'll call it P-high for now, though it's primarily an LLVM problem.

@rust-highfive rust-highfive added P-high High priority and removed P-medium Medium priority labels Aug 18, 2016
@arielb1 arielb1 added regression-from-stable-to-beta Performance or correctness regression from stable to beta. and removed regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. labels Aug 18, 2016
@eddyb
Member

eddyb commented Aug 18, 2016

@arielb1 LLVM can now again figure out that a memset/memcpy has a constant length, but that information arrives too late in the pass pipeline and isn't cleaned up properly.
We could mess with pass ordering: moving GVN after InstCombine helped in one other (memcpy) instance before I looked into this issue, and it seems to be the same thing here, but reordering is risky since we could lose optimizations elsewhere.

Running GVN twice is safer, but could result in slower compile-times.
@nrc Could we investigate the performance implications of changing our LLVM fork to run GVN twice?

@arielb1
Contributor

arielb1 commented Aug 21, 2016

@eddyb

Again. What is the specific code that is slow with the new rustc/llvm and fast otherwise?

BTW, the LLVM function passes are already pretty fast.

@bluss
Member Author

bluss commented Aug 21, 2016

The code in #35662 (comment) is still slow with MIR and fast with -Z orbit=off using rustc 1.13.0-nightly (f883b0bba 2016-08-19), which includes the four LLVM patches.

@bluss
Member Author

bluss commented Aug 21, 2016

The crate matrixmultiply itself has had a workaround applied and has restored performance in version 0.1.9.
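
For reference, the idea behind the workaround is roughly this (a sketch reusing the crate's loop4! macro, not the exact 0.1.9 diff): start from uninitialized memory and write every element with fully unrolled assignments, so no [x; n] initialization loop is emitted for LLVM to pessimize.

use std::mem;

macro_rules! loop4 {
    ($i:ident, $e:expr) => {{
        let $i = 0; $e; let $i = 1; $e;
        let $i = 2; $e; let $i = 3; $e;
    }}
}

pub fn zeroed_ab() -> [[f32; 4]; 4] {
    // uninitialized + explicit unrolled stores instead of [[0.; 4]; 4]
    let mut ab: [[f32; 4]; 4] = unsafe { mem::uninitialized() };
    loop4!(i, loop4!(j, ab[i][j] = 0.));
    ab
}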

@arielb1
Contributor

arielb1 commented Aug 21, 2016

The crate matrixmultiply itself has had a workaround applied and has restored performance in version 0.1.9.

Nice to hear it's not blocking you.

@arielb1
Contributor

arielb1 commented Aug 21, 2016

Problem code:

#![feature(test)]
extern crate test;
use test::Bencher;

pub type T = f32;

const MR: usize = 4;
const NR: usize = 4;

macro_rules! loop4 {
    ($i:ident, $e:expr) => {{
        let $i = 0; $e;
        let $i = 1; $e;
        let $i = 2; $e;
        let $i = 3; $e;
    }}
}

/// 4x4 matrix multiplication kernel
///
/// This does the matrix multiplication:
///
/// C ← α A B
///
/// + k: length of data in a, b
/// + a, b are packed
/// + c has general strides
/// + rsc: row stride of c
/// + csc: col stride of c
#[inline(never)]
pub unsafe fn kernel(k: usize, alpha: T, a: *const T, b: *const T,
                     c: *mut T, rsc: isize, csc: isize)
{
    let mut ab = [[0.; NR]; MR];
    let mut a = a;
    let mut b = b;

    // Compute matrix multiplication into ab[i][j]
    for _ in 0..k {
        let v0: [_; MR] = [at(a, 0), at(a, 1), at(a, 2), at(a, 3)];
        let v1: [_; NR] = [at(b, 0), at(b, 1), at(b, 2), at(b, 3)];
        loop4!(i, loop4!(j, ab[i][j] += v0[i] * v1[j]));

        a = a.offset(MR as isize);
        b = b.offset(NR as isize);
    }

    macro_rules! c {
        ($i:expr, $j:expr) => (*c.offset(rsc * $i as isize + csc * $j as isize));
    }

    // set C = α A B
    for i in 0..MR {
        for j in 0..NR {
            c![i, j] = alpha * ab[i][j];
        }
    }
}

#[inline(always)]
unsafe fn at(ptr: *const T, i: usize) -> T {
    *ptr.offset(i as isize)
}

#[test]
fn test_gemm_kernel() {
    let k = 4;
    let mut a = [1.; 16];
    let mut b = [0.; 16];
    for (i, x) in a.iter_mut().enumerate() {
        *x = i as f32;
    }

    for i in 0..4 {
        b[i + i * 4] = 1.;
    }
    let mut c = [0.; 16];
    unsafe {
        kernel(k, 1., &a[0], &b[0], &mut c[0], 1, 4);
        // col major C
    }
    assert_eq!(&a, &c);
}

#[bench]
fn bench_gemm(bench: &mut Bencher) {
    const K: usize = 32;
    let mut a = [1.; MR * K];
    let mut b = [0.; NR * K];
    for (i, x) in a.iter_mut().enumerate() {
        *x = i as f32;
    }

    for i in 0..NR {
        b[i + i * K] = 1.;
    }
    let mut c = [0.; NR * MR];
    bench.iter(|| {
        unsafe {
            kernel(K, 1., &a[0], &b[0], &mut c[0], 1, 4);
        }
        c
    });
}

@eddyb
Member

eddyb commented Aug 21, 2016

@arielb1 The root problem can be seen in the IR generated by #35662 (comment) - which is left with a constant-length memset that isn't removed due to pass ordering problems. Solving that should help the more complex matrix multiplication code.

@brson
Contributor

brson commented Aug 25, 2016

Is this fixed after #35740?

Edit: Seems no.

@brson brson added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Aug 25, 2016
@eddyb
Member

eddyb commented Aug 26, 2016

I've experimented with this change to LLVM:

diff --git a/lib/Transforms/IPO/PassManagerBuilder.cpp b/lib/Transforms/IPO/PassManagerBuilder.cpp
index df6a48e..da420f3 100644
--- a/lib/Transforms/IPO/PassManagerBuilder.cpp
+++ b/lib/Transforms/IPO/PassManagerBuilder.cpp
@@ -317,6 +317,9 @@ void PassManagerBuilder::addFunctionSimplificationPasses(
   // Run instcombine after redundancy elimination to exploit opportunities
   // opened up by them.
   addInstructionCombiningPass(MPM);
+  if (OptLevel > 1) {
+    MPM.add(createGVNPass(DisableGVNLoadPRE));  // Remove redundancies
+  }
   addExtensionsToPM(EP_Peephole, MPM);
   MPM.add(createJumpThreadingPass());         // Thread jumps
   MPM.add(createCorrelatedValuePropagationPass());

It seems to result in the constant-length memset being removed in the simpler cases.

However, I ended up re-doing the reduction and arrived at something similar.
That testcase does 2ns with old trans (beta) and 9ns with MIR trans + the modified LLVM.
The only real difference is the nesting of the ab array, which optimizes really poorly.
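
For concreteness, the only real difference between the two reductions is the shape of the accumulator, roughly like this (a sketch of the two shapes; these function bodies are mine, not the gisted testcase):

// the shape that reproduces the regression: a nested accumulator
pub fn nested() -> f32 {
    let mut ab = [[0.0f32; 4]; 4];
    for i in 0..4 {
        for j in 0..4 {
            ab[i][j] += (i + j) as f32;
        }
    }
    ab[3][3]
}

// the shape that does not: the same data in a flat array
pub fn flat() -> f32 {
    let mut ab = [0.0f32; 16];
    for i in 0..4 {
        for j in 0..4 {
            ab[i * 4 + j] += (i + j) as f32;
        }
    }
    ab[15]
}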

@eddyb
Member

eddyb commented Aug 29, 2016

I found the remaining problem: initializing the arrays right now uses < (UGT) while our iterators and C++ use != (NE) for the stop condition of the pointer.
Fixing that and running GVN twice fixes the performance of @bluss' benchmark.
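
Spelled out as pointer loops, the two stop conditions look like this (a sketch in user-level Rust; the actual culprit is the compiler-generated initialization loop, and init_lt/init_ne are hypothetical names):

// the stop condition rustc currently emits for [x; n] initialization
pub unsafe fn init_lt(mut p: *mut f32, end: *mut f32) {
    while p < end { // the `<` form LLVM is reluctant to optimize
        *p = 0.;
        p = p.offset(1);
    }
}

// the stop condition slice iterators and C++ use
pub unsafe fn init_ne(mut p: *mut f32, end: *mut f32) {
    while p != end { // the `!=` form LLVM handles well
        *p = 0.;
        p = p.offset(1);
    }
}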

@nikomatsakis
Contributor

I found the remaining problem: initializing the arrays right now uses < (UGT) while our iterators and C++ use != (NE) for the stop condition of the pointer.

This is so beautifully fragile.

@eddyb
Member

eddyb commented Aug 31, 2016

@nikomatsakis See #36124 (comment) for a quick explanation of why LLVM's reluctance is correct in general (even though it has enough information to optimize nested < loops working on nested local arrays)

bors added a commit that referenced this issue Sep 4, 2016
Fix optimization regressions for operations on [x; n]-initialized arrays.

Fixes #35662 by using `!=` instead of `<` as the stop condition for `[x; n]` initialization loops.
Also included is eddyb/llvm@cc2009f, a hack to run the GVN pass twice, another time after InstCombine.
This hack results in removal of redundant `memset` and `memcpy` calls (from loops over arrays).

cc @nrc Can we get performance numbers on this? Not sure if it regresses anything else.
@bluss
Member Author

bluss commented Oct 19, 2016

More or less reopened this issue as #37276. It's not affecting matrixmultiply because I think the uninitialized + assignments workaround is sound (until they take uninitialized away from us).

This issue is left closed since it did end up finding & fixing a problem.
