
LLVM pointer range loop / autovectorization regression #35662

Closed
bluss opened this issue Aug 14, 2016 · 30 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html
I-slow Issue: Problems and improvements with respect to performance of generated code.
P-high High priority
regression-from-stable-to-beta Performance or correctness regression from stable to beta.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@bluss
Member

bluss commented Aug 14, 2016

The benchmarks in crate matrixmultiply version 0.1.8 degrade with MIR enabled. (commit bluss/matrixmultiply@3d83647)

Tested using rustc 1.12.0-nightly (1deb02ea6 2016-08-12).

Typical output:

// -C target-cpu=native
test mat_mul_f32::m127             ... bench:   2,703,773 ns/iter (+/- 636,432)
// -Z orbit=off -C target-cpu=native
test mat_mul_f32::m127             ... bench:     648,817 ns/iter (+/- 22,379)

Sure, the matrix multiplication kernel uses some major muckery that it expects the compiler to optimize down and autovectorize, but since it is technically a regression, it gets a report.
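
For anyone reproducing: the flag pairs shown in the output above can be passed through RUSTFLAGS (a sketch of the invocation, assuming a checkout of the commit above):

RUSTFLAGS="-C target-cpu=native" cargo bench
RUSTFLAGS="-Z orbit=off -C target-cpu=native" cargo bench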

@pmarcelll
Contributor

pmarcelll commented Aug 14, 2016

This might be caused by the LLVM upgrade.

rustc 1.12.0-nightly (7333c4a 2016-07-31) (before the LLVM upgrade):

// -C target-cpu=native
test mat_mul_f32::m127             ... bench:     162,334 ns/iter (+/- 16,429)
// -Z orbit=on -C target-cpu=native
test mat_mul_f32::m127             ... bench:     162,855 ns/iter (+/- 7,302)

rustc 1.12.0-nightly (28ce3e8 2016-08-01) (after the LLVM upgrade, old-trans is still the default):

// -C target-cpu=native
test mat_mul_f32::m127             ... bench:     169,562 ns/iter (+/- 11,118)
// -Z orbit=on -C target-cpu=native
test mat_mul_f32::m127             ... bench:     570,056 ns/iter (+/- 34,4)

Measured on an Intel Haswell processor.

@bluss
Member Author

bluss commented Aug 14, 2016

Thank you, great info.

@eddyb eddyb added I-nominated regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Aug 14, 2016
@eddyb eddyb changed the title MIR autovectorization regression LLVM autovectorization regression Aug 14, 2016
@eddyb
Member

eddyb commented Aug 14, 2016

Thanks to @dikaiosune I produced this minimization (for a different case): https://godbolt.org/g/3Nuofl. Seems to reproduce with clang 3.8 vs running LLVM 3.9's opt -O3.

@eddyb
Member

eddyb commented Aug 14, 2016

@majnemer on IRC pointed out https://reviews.llvm.org/rL268972.

@eddyb eddyb changed the title LLVM autovectorization regression LLVM/MIR autovectorization regression Aug 14, 2016
@eddyb
Member

eddyb commented Aug 14, 2016

@pmarcelll I misread some of those reports, so it's a combination of LLVM 3.9 and MIR trans being used?

On a Haswell, -C target-cpu=native should imply sse4.2 AFAIK, so I'm not sure the linked issue explains the original problem observed here, only @dikaiosune's case.

@eddyb eddyb added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-MIR Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html labels Aug 14, 2016
@bluss
Member Author

bluss commented Aug 15, 2016

@eddyb Haswell has AVX and AVX2 too, doesn't it? So it should imply those as well.
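
One way to isolate which feature matters (a sketch, untested in this thread) would be to pin features explicitly instead of relying on native:

RUSTFLAGS="-C target-feature=+avx,+avx2" cargo bench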

@pmarcelll
Contributor

pmarcelll commented Aug 15, 2016

If my benchmarking is correct and i386 means no SIMD at all, then it's not just an autovectorization regression.

rustc 1.12.0-nightly (7333c4a 2016-07-31):

// -C target-cpu=i386
test mat_mul_f32::m127             ... bench:     209,399 ns/iter (+/- 13,055)
// -Z orbit=on -C target-cpu=i386
test mat_mul_f32::m127             ... bench:     205,028 ns/iter (+/- 11,386)

rustc 1.12.0-nightly (28ce3e8 2016-08-01):

// -C target-cpu=i386
test mat_mul_f32::m127             ... bench:     205,863 ns/iter (+/- 14,618)
// -Z orbit=on -C target-cpu=i386
test mat_mul_f32::m127             ... bench:     512,995 ns/iter (+/- 14,309)

The last one is especially interesting because the i386 version is faster than the haswell version (512,995 ns/iter vs. 570,056 ns/iter).

EDIT: same on the latest nightly.

@eddyb
Member

eddyb commented Aug 15, 2016

Can you reduce this to something that fits on playpen but still exhibits a performance difference?

@bluss
Member Author

bluss commented Aug 15, 2016

This shows the same kind of difference, even though it's quite long, since I wanted to keep the same kernel:

https://play.rust-lang.org/?gist=528273eaf93142b0ce84c5f24de1ccd3&version=nightly&backtrace=0

In this reduced example, just like in matrixmultiply originally, the regressed version still uses vector instructions, just not the same ones, so the vectorization is much less effective.

I found a regression from -Z orbit=off to -Z orbit=on using these command lines on an AVX platform:

rustc -Ctarget-cpu=native -Zorbit=on -C opt-level=3 --test mini.rs
rustc -Ctarget-cpu=native -Zorbit=off -C opt-level=3 --test mini.rs

@eddyb
Member

eddyb commented Aug 15, 2016

I've reduced it to a reproducing (5ns and 7ns) and non-reproducing (5ns and 5ns) version.
The difference between old trans and MIR trans is that MIR trans doesn't aggressively promote everything to copies from .rodata. Even old trans shouldn't, now that I think about it.

But it seems that LLVM has regressed in its ability to optimize out the loop initializing the array literal.
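
In Rust terms, the two lowerings are morally something like the following (an illustrative sketch only, not actual compiler output; kernel_promoted/kernel_loop are hypothetical names):

#![crate_type = "lib"]

// What old trans morally did: promote the literal to .rodata and copy it.
static PROMOTED: [f32; 4] = [0.; 4];

pub fn kernel_promoted() -> f32 {
    let a = PROMOTED; // a memcpy from a constant, trivially folded away
    a[0]
}

// What MIR trans morally does: lower the repeat expression [0.; 4]
// to an element-by-element initialization loop.
pub fn kernel_loop() -> f32 {
    let mut a = [0f32; 4];
    let mut i = 0;
    while i < 4 {
        a[i] = 0.; // the loop LLVM 3.9 no longer optimizes out
        i += 1;
    }
    a[0]
}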

@eddyb
Member

eddyb commented Aug 15, 2016

Looks like an -argpromotion pass and a -scalar-evolution -demanded-bits -slp-vectorizer sequence are missing from rustc's pass setup: https://gist.github.com/eddyb/ec63fd2df8fffb22575f3ada7f5a8e93.
Running opt on the unoptimized IR from rustc results in vectorization.

EDIT: But... only locally? playpen seems to vectorize just fine on "nightly".

EDIT2: I am dumb, "Release" mode is -C opt-level=3, not -O aka -C opt-level=2.

@eddyb
Member

eddyb commented Aug 15, 2016

I've reduced the MIR trans regression to this:

#![crate_type = "lib"]

pub fn kernel() -> f32 {
    let a = [0.; 4];
    a[0]
}

This is optimized out into nothing with old trans but not with MIR trans.

Can someone confirm that before the LLVM update, both old trans and MIR trans produce an empty function (with --emit=llvm-ir -C opt-level=3)?

EDIT: Rewritten to trigger even without llvm.lifetime intrinsics.

EDIT2: Reproduces in clang 3.9, filed https://llvm.org/bugs/show_bug.cgi?id=28987.

@bluss
Member Author

bluss commented Aug 15, 2016

Yes. Using the pre-LLVM-upgrade nightly rustc 1.12.0-nightly (7333c4ac2 2016-07-31), with the command

rustc --emit=llvm-ir -C opt-level=3 -Z orbit nano2.rs

it does produce an empty/simple function without the array init:

; ModuleID = 'nano2.cgu-0.rs'
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: norecurse nounwind uwtable
define float @_ZN5nano26kernel17h90705cd48502cdafE() unnamed_addr #0 {
entry-block:
  ret float 0.000000e+00
}

attributes #0 = { norecurse nounwind uwtable }

@eddyb
Member

eddyb commented Aug 16, 2016

May be fixed by backporting these 4 commits made by @majnemer (old -> new):

EDIT: I've just checked, and these do indeed remove the observable performance differences in the reduced 2x2 matrix multiplication I linked above. However, the dead memset is still there, just with a constant length, which is presumably cheap at small sizes: it most likely writes XMM registers directly, whereas the non-constant memset would have to go into an actual loop writing to memory.

The original 4x4 multiplication is still pretty bad, but at least some smaller cases are less affected.

@eddyb eddyb changed the title LLVM/MIR autovectorization regression LLVM pointer range loop / autovectorization regression Aug 17, 2016
eddyb added a commit to eddyb/rust that referenced this issue Aug 18, 2016
Update LLVM to include 4 backported commits by @majnemer.

Partial fix for rust-lang#35662, should help at least loops on small arrays.

Nominated for backporting into the new beta (not the one that's being released as stable this week).

r? @alexcrichton
@arielb1
Contributor

arielb1 commented Aug 18, 2016

What cases remain after the LLVM improvement?

@nikomatsakis
Contributor

nikomatsakis commented Aug 18, 2016

triage: P-medium

@rust-highfive rust-highfive added P-medium Medium priority and removed I-nominated labels Aug 18, 2016
@nikomatsakis
Contributor

nikomatsakis commented Aug 18, 2016

triage: P-high

Since this is a regression, we'll call it P-high for now, though it's primarily an LLVM problem.

@rust-highfive rust-highfive added P-high High priority and removed P-medium Medium priority labels Aug 18, 2016
@arielb1 arielb1 added regression-from-stable-to-beta Performance or correctness regression from stable to beta. and removed regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. labels Aug 18, 2016
@eddyb
Member

eddyb commented Aug 18, 2016

@arielb1 LLVM can now again figure out that a memset/memcpy has a constant length, but that information arrives too late in the pass pipeline and isn't cleaned up properly.
We could mess with pass ordering: moving GVN after InstCombine helped in one other (memcpy) instance before I looked into this issue, and it seems to be the same thing here, but reordering is risky since we could lose optimizations elsewhere.

Running GVN twice is safer, but could result in slower compile-times.
@nrc Could we investigate the performance implications of changing our LLVM fork to run GVN twice?

@arielb1
Contributor

arielb1 commented Aug 21, 2016

@eddyb

Again. What is the specific code that is slow with the new rustc/llvm and fast otherwise?

BTW, the LLVM function passes are already pretty fast.

@bluss
Member Author

bluss commented Aug 21, 2016

The code in #35662 (comment) is still slow with MIR and fast with -Z orbit=off using rustc 1.13.0-nightly (f883b0bba 2016-08-19), which includes the four LLVM patches.

@bluss
Member Author

bluss commented Aug 21, 2016

The crate matrixmultiply itself has had a workaround applied and has restored performance in version 0.1.9.
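
For reference, the idea behind the workaround is roughly this (a sketch reusing the crate's loop4! macro, not the exact 0.1.9 diff): start from uninitialized memory and write every element with fully unrolled assignments, so no [x; n] initialization loop is emitted for LLVM to pessimize.

use std::mem;

macro_rules! loop4 {
    ($i:ident, $e:expr) => {{
        let $i = 0; $e; let $i = 1; $e;
        let $i = 2; $e; let $i = 3; $e;
    }}
}

pub fn zeroed_ab() -> [[f32; 4]; 4] {
    // uninitialized + explicit unrolled stores instead of [[0.; 4]; 4]
    let mut ab: [[f32; 4]; 4] = unsafe { mem::uninitialized() };
    loop4!(i, loop4!(j, ab[i][j] = 0.));
    ab
}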

@arielb1
Contributor

arielb1 commented Aug 21, 2016

The crate matrixmultiply itself has had a workaround applied and has restored performance in version 0.1.9.

Nice to hear it's not blocking you.

@arielb1
Contributor

arielb1 commented Aug 21, 2016

Problem code:

#![feature(test)]
extern crate test;
use test::Bencher;

pub type T = f32;

const MR: usize = 4;
const NR: usize = 4;

macro_rules! loop4 {
    ($i:ident, $e:expr) => {{
        let $i = 0; $e;
        let $i = 1; $e;
        let $i = 2; $e;
        let $i = 3; $e;
    }}
}

/// 4x4 matrix multiplication kernel
///
/// This does the matrix multiplication:
///
/// C ← α A B
///
/// + k: length of data in a, b
/// + a, b are packed
/// + c has general strides
/// + rsc: row stride of c
/// + csc: col stride of c
#[inline(never)]
pub unsafe fn kernel(k: usize, alpha: T, a: *const T, b: *const T,
                     c: *mut T, rsc: isize, csc: isize)
{
    let mut ab = [[0.; NR]; MR];
    let mut a = a;
    let mut b = b;

    // Compute matrix multiplication into ab[i][j]
    for _ in 0..k {
        let v0: [_; MR] = [at(a, 0), at(a, 1), at(a, 2), at(a, 3)];
        let v1: [_; NR] = [at(b, 0), at(b, 1), at(b, 2), at(b, 3)];
        loop4!(i, loop4!(j, ab[i][j] += v0[i] * v1[j]));

        a = a.offset(MR as isize);
        b = b.offset(NR as isize);
    }

    macro_rules! c {
        ($i:expr, $j:expr) => (*c.offset(rsc * $i as isize + csc * $j as isize));
    }

    // set C = α A B
    for i in 0..MR {
        for j in 0..NR {
            c![i, j] = alpha * ab[i][j];
        }
    }
}

#[inline(always)]
unsafe fn at(ptr: *const T, i: usize) -> T {
    *ptr.offset(i as isize)
}

#[test]
fn test_gemm_kernel() {
    let k = 4;
    let mut a = [1.; 16];
    let mut b = [0.; 16];
    for (i, x) in a.iter_mut().enumerate() {
        *x = i as f32;
    }

    for i in 0..4 {
        b[i + i * 4] = 1.;
    }
    let mut c = [0.; 16];
    unsafe {
        kernel(k, 1., &a[0], &b[0], &mut c[0], 1, 4);
        // col major C
    }
    assert_eq!(&a, &c);
}

#[bench]
fn bench_gemm(bench: &mut Bencher) {
    const K: usize = 32;
    let mut a = [1.; MR * K];
    let mut b = [0.; NR * K];
    for (i, x) in a.iter_mut().enumerate() {
        *x = i as f32;
    }

    for i in 0..NR {
        b[i + i * K] = 1.;
    }
    let mut c = [0.; NR * MR];
    bench.iter(|| {
        unsafe {
            kernel(K, 1., &a[0], &b[0], &mut c[0], 1, 4);
        }
        c
    });
}

@eddyb
Member

eddyb commented Aug 21, 2016

@arielb1 The root problem can be seen in the IR generated by #35662 (comment) - which is left with a constant-length memset that isn't removed due to pass ordering problems. Solving that should help the more complex matrix multiplication code.

@brson
Contributor

brson commented Aug 25, 2016

Is this fixed after #35740?

Edit: Seems no.

@brson brson added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Aug 25, 2016
@eddyb
Member

eddyb commented Aug 26, 2016

I've experimented with this change to LLVM:

diff --git a/lib/Transforms/IPO/PassManagerBuilder.cpp b/lib/Transforms/IPO/PassManagerBuilder.cpp
index df6a48e..da420f3 100644
--- a/lib/Transforms/IPO/PassManagerBuilder.cpp
+++ b/lib/Transforms/IPO/PassManagerBuilder.cpp
@@ -317,6 +317,9 @@ void PassManagerBuilder::addFunctionSimplificationPasses(
   // Run instcombine after redundancy elimination to exploit opportunities
   // opened up by them.
   addInstructionCombiningPass(MPM);
+  if (OptLevel > 1) {
+    MPM.add(createGVNPass(DisableGVNLoadPRE));  // Remove redundancies
+  }
   addExtensionsToPM(EP_Peephole, MPM);
   MPM.add(createJumpThreadingPass());         // Thread jumps
   MPM.add(createCorrelatedValuePropagationPass());

It seems to result in the constant-length memset being removed in the simpler cases.

However, I ended up re-doing the reduction and arrived at something similar.
That testcase does 2ns with old trans (beta) and 9ns with MIR trans + the modified LLVM.
The only real difference is the nesting of the ab array, which optimizes really poorly.
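
For concreteness, the only real difference between the two reductions is the shape of the accumulator, roughly like this (a sketch of the two shapes; these function bodies are mine, not the gisted testcase):

// the shape that reproduces the regression: a nested accumulator
pub fn nested() -> f32 {
    let mut ab = [[0.0f32; 4]; 4];
    for i in 0..4 {
        for j in 0..4 {
            ab[i][j] += (i + j) as f32;
        }
    }
    ab[3][3]
}

// the shape that does not: the same data in a flat array
pub fn flat() -> f32 {
    let mut ab = [0.0f32; 16];
    for i in 0..4 {
        for j in 0..4 {
            ab[i * 4 + j] += (i + j) as f32;
        }
    }
    ab[15]
}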

@eddyb
Member

eddyb commented Aug 29, 2016

I found the remaining problem: initializing the arrays right now uses < (UGT) while our iterators and C++ use != (NE) for the stop condition of the pointer.
Fixing that and running GVN twice fixes the performance of @bluss' benchmark.
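
Spelled out as pointer loops, the two stop conditions look like this (a sketch in user-level Rust; the actual culprit is the compiler-generated initialization loop, and init_lt/init_ne are hypothetical names):

// the stop condition rustc currently emits for [x; n] initialization
pub unsafe fn init_lt(mut p: *mut f32, end: *mut f32) {
    while p < end { // the `<` form LLVM is reluctant to optimize
        *p = 0.;
        p = p.offset(1);
    }
}

// the stop condition slice iterators and C++ use
pub unsafe fn init_ne(mut p: *mut f32, end: *mut f32) {
    while p != end { // the `!=` form LLVM handles well
        *p = 0.;
        p = p.offset(1);
    }
}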

@nikomatsakis
Contributor

I found the remaining problem: initializing the arrays right now uses < (UGT) while our iterators and C++ use != (NE) for the stop condition of the pointer.

This is so beautifully fragile.

@eddyb
Member

eddyb commented Aug 31, 2016

@nikomatsakis See #36124 (comment) for a quick explanation of why LLVM's reluctance is correct in general (even though it has enough information to optimize nested < loops working on nested local arrays)

bors added a commit that referenced this issue Sep 4, 2016
Fix optimization regressions for operations on [x; n]-initialized arrays.

Fixes #35662 by using `!=` instead of `<` as the stop condition for `[x; n]` initialization loops.
Also included is eddyb/llvm@cc2009f, a hack to run the GVN pass twice, another time after InstCombine.
This hack results in removal of redundant `memset` and `memcpy` calls (from loops over arrays).

cc @nrc Can we get performance numbers on this? Not sure if it regresses anything else.
@bluss
Member Author

bluss commented Oct 19, 2016

More or less reopened this issue as #37276. It's not affecting matrixmultiply because I think the uninitialized + assignments workaround is sound (until they take uninitialized away from us).

This issue is left closed since it did end up finding & fixing a problem.
