LLVM pointer range loop / autovectorization regression #35662
Comments
This might be caused by the LLVM upgrade. rustc 1.12.0-nightly (7333c4a 2016-07-31) (before the LLVM upgrade):
rustc 1.12.0-nightly (28ce3e8 2016-08-01) (after the LLVM upgrade, old-trans is still the default):
Measured on an Intel Haswell processor.
Thank you, great info.
Thanks to @dikaiosune I produced this minimization (for a different case): https://godbolt.org/g/3Nuofl. Seems to reproduce with clang 3.8 vs running LLVM 3.9's
@majnemer on IRC pointed out https://reviews.llvm.org/rL268972.
@pmarcelll I misread some of those reports, so it's a combination of LLVM 3.9 and MIR trans being used? On a Haswell,
@eddyb Haswell has AVX + AVX2 too, doesn't it? So it should imply those too.
If my benchmarking is correct and i386 means no SIMD at all, then it's not just an autovectorization regression. rustc 1.12.0-nightly (7333c4a 2016-07-31):
rustc 1.12.0-nightly (28ce3e8 2016-08-01):
The last one is especially interesting because the EDIT: same on the latest nightly.
Can you reduce this to something that fits on playpen but still exhibits a performance difference?
This shows the same kind of difference, even though it's quite long, since I wanted to keep the same kernel: https://play.rust-lang.org/?gist=528273eaf93142b0ce84c5f24de1ccd3&version=nightly&backtrace=0 In this reduced example, just like in matrixmultiply originally, the regressed version still uses vector instructions, just not the same ones, so it's not very effective. I found a regression from orbit=off to orbit=on using these command lines on an AVX platform.
I've reduced it to a reproducing (5ns and 7ns) and non-reproducing (5ns and 5ns) version.
EDIT: But... only locally? playpen seems to vectorize just fine on "nightly". EDIT2: I am dumb, "Release" mode is
I've reduced the MIR trans regression to this:

```rust
#![crate_type = "lib"]

pub fn kernel() -> f32 {
    let a = [0.; 4];
    a[0]
}
```

This is optimized out into nothing with old trans but not with MIR trans. Can someone confirm that before the LLVM update, both old trans and MIR trans produce an empty function (with

EDIT: Rewritten to trigger even without

EDIT2: Reproduces in clang 3.9, filed https://llvm.org/bugs/show_bug.cgi?id=28987.
Yes, using the pre-LLVM-upgrade nightly command, it does have an empty / simple function without the array init.
May be fixed by backporting these 4 commits made by @majnemer (old -> new):
EDIT: I've just checked, and these do indeed remove observable performance differences in the reduced 2x2 matrix multiplication I linked above; however, the dead
The original 4x4 multiplication is still pretty bad, but at least some smaller cases are less affected.
Update LLVM to include 4 backported commits by @majnemer. Partial fix for rust-lang#35662, should help at least loops on small arrays. Nominated for backporting into the new beta (not the one that's being released as stable this week). r? @alexcrichton
What cases remain after the LLVM improvement?
triage: P-medium
triage: P-high. Since this is a regression, we'll call it P-high for now, though it's primarily an LLVM problem.
@arielb1 LLVM can now again figure out that a memset/memcpy has a constant length, but that information arrives too late in the pass pipeline and isn't cleaned up properly. Running GVN twice is safer, but could result in slower compile times.
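To make the pattern concrete, here is a hedged illustration (my own sketch, not the reduction from this thread): a stack array zeroed with `[x; n]` lowers to a constant-length memset, and when every element is overwritten afterwards, that memset is dead. Whether it actually gets deleted depends on GVN/dead-store elimination running after the length is known to be constant.

```rust
// Sketch only: the `[x; n]` init below lowers to a constant-length memset.
// Every element is then overwritten, so the memset is dead; removing it is
// exactly the cleanup that the pass ordering discussed above can miss.
pub fn scale_into_buffer(input: &[f32; 4]) -> f32 {
    let mut buf = [0.0f32; 4]; // candidate dead memset
    for i in 0..4 {
        buf[i] = input[i] * 2.0; // overwrites all of `buf`
    }
    buf.iter().sum()
}
```

The function's observable result is the same either way; the regression is only about whether the redundant zeroing survives into the optimized code.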
Again. What is the specific code that is slow with the new rustc/llvm and fast otherwise? BTW, the LLVM function passes are already pretty fast.
The code in #35662 (comment) is still slow with MIR and fast with -Z orbit=off using
The crate matrixmultiply itself has had a workaround applied and has restored performance in version 0.1.9.
Nice to hear it's not blocking you.
Problem code:

```rust
#![feature(test)]

extern crate test;

use test::Bencher;

pub type T = f32;

const MR: usize = 4;
const NR: usize = 4;

macro_rules! loop4 {
    ($i:ident, $e:expr) => {{
        let $i = 0; $e;
        let $i = 1; $e;
        let $i = 2; $e;
        let $i = 3; $e;
    }}
}

/// 4x4 matrix multiplication kernel
///
/// This does the matrix multiplication:
///
/// C ← α A B
///
/// + k: length of data in a, b
/// + a, b are packed
/// + c has general strides
/// + rsc: row stride of c
/// + csc: col stride of c
#[inline(never)]
pub unsafe fn kernel(k: usize, alpha: T, a: *const T, b: *const T,
                     c: *mut T, rsc: isize, csc: isize)
{
    let mut ab = [[0.; NR]; MR];
    let mut a = a;
    let mut b = b;

    // Compute matrix multiplication into ab[i][j]
    for _ in 0..k {
        let v0: [_; MR] = [at(a, 0), at(a, 1), at(a, 2), at(a, 3)];
        let v1: [_; NR] = [at(b, 0), at(b, 1), at(b, 2), at(b, 3)];
        loop4!(i, loop4!(j, ab[i][j] += v0[i] * v1[j]));
        a = a.offset(MR as isize);
        b = b.offset(NR as isize);
    }

    macro_rules! c {
        ($i:expr, $j:expr) => (*c.offset(rsc * $i as isize + csc * $j as isize));
    }

    // set C = α A B
    for i in 0..MR {
        for j in 0..NR {
            c![i, j] = alpha * ab[i][j];
        }
    }
}

#[inline(always)]
unsafe fn at(ptr: *const T, i: usize) -> T {
    *ptr.offset(i as isize)
}

#[test]
fn test_gemm_kernel() {
    let k = 4;
    let mut a = [1.; 16];
    let mut b = [0.; 16];
    for (i, x) in a.iter_mut().enumerate() {
        *x = i as f32;
    }
    for i in 0..4 {
        b[i + i * 4] = 1.;
    }
    let mut c = [0.; 16];
    unsafe {
        kernel(k, 1., &a[0], &b[0], &mut c[0], 1, 4);
        // col major C
    }
    assert_eq!(&a, &c);
}

#[bench]
fn bench_gemm(bench: &mut Bencher) {
    const K: usize = 32;
    let mut a = [1.; MR * K];
    let mut b = [0.; NR * K];
    for (i, x) in a.iter_mut().enumerate() {
        *x = i as f32;
    }
    for i in 0..NR {
        b[i + i * K] = 1.;
    }
    let mut c = [0.; NR * MR];
    bench.iter(|| {
        unsafe {
            kernel(K, 1., &a[0], &b[0], &mut c[0], 1, 4);
        }
        c
    });
}
```
@arielb1 The root problem can be seen in the IR generated by #35662 (comment), which is left with a constant-length memset that isn't removed due to pass-ordering problems. Solving that should help the more complex matrix multiplication code.
Is this fixed after #35740? Edit: Seems not.
I've experimented with this change to LLVM:

```diff
diff --git a/lib/Transforms/IPO/PassManagerBuilder.cpp b/lib/Transforms/IPO/PassManagerBuilder.cpp
index df6a48e..da420f3 100644
--- a/lib/Transforms/IPO/PassManagerBuilder.cpp
+++ b/lib/Transforms/IPO/PassManagerBuilder.cpp
@@ -317,6 +317,9 @@ void PassManagerBuilder::addFunctionSimplificationPasses(
   // Run instcombine after redundancy elimination to exploit opportunities
   // opened up by them.
   addInstructionCombiningPass(MPM);
+  if (OptLevel > 1) {
+    MPM.add(createGVNPass(DisableGVNLoadPRE)); // Remove redundancies
+  }
   addExtensionsToPM(EP_Peephole, MPM);
   MPM.add(createJumpThreadingPass()); // Thread jumps
   MPM.add(createCorrelatedValuePropagationPass());
```

It seems to result in the constant-length
However, I ended up re-doing the reduction and ended up with something similar.
I found the remaining problem: initializing the arrays right now uses
This is so beautifully fragile.
@nikomatsakis See #36124 (comment) for a quick explanation of why LLVM's reluctance is correct in general (even though it has enough information to optimize nested
Fix optimization regressions for operations on [x; n]-initialized arrays. Fixes #35662 by using `!=` instead of `<` as the stop condition for `[x; n]` initialization loops. Also included is eddyb/llvm@cc2009f, a hack to run the GVN pass twice, another time after InstCombine. This hack results in removal of redundant `memset` and `memcpy` calls (from loops over arrays). cc @nrc Can we get performance numbers on this? Not sure if it regresses anything else.
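For illustration, here is a hedged sketch (my own, written in surface Rust rather than MIR) of the two loop shapes involved. With an `i != n` exit test the trip count is immediately exactly `n`, so LLVM's loop-idiom recognition can turn the fill into a single memset; with `i < n` it must first reason about wrap-around before it can conclude the same thing. Both functions compute identical results.

```rust
// Sketch of the `[x; n]` initialization loop with a `<` exit test
// (the shape MIR trans emitted before the fix).
fn init_lt(x: f32, out: &mut [f32; 8]) {
    let mut i = 0;
    while i < out.len() {
        out[i] = x;
        i += 1;
    }
}

// The same loop with a `!=` exit test (the shape the fix switches to):
// the trip count is trivially `n`, so the loop is easy to recognize as a memset.
fn init_ne(x: f32, out: &mut [f32; 8]) {
    let mut i = 0;
    while i != out.len() {
        out[i] = x;
        i += 1;
    }
}
```

The observable behavior is identical; only how easily LLVM can canonicalize the loop differs.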
More or less reopened this issue as #37276. It's not affecting matrixmultiply because I think the uninitialized + assignments workaround is sound (until they take uninitialized away from us). This issue is left closed since it did end up finding & fixing a problem.
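For context, a hedged sketch of that workaround style (a modern-Rust analogue, not the 2016 crate code, which used `mem::uninitialized`): skip the `[x; n]` initialization loop entirely by starting from uninitialized storage and writing every element explicitly.

```rust
use std::mem::MaybeUninit;

// Sketch: build the 4x4 accumulator without a `[x; n]` init loop, by
// writing each element explicitly into uninitialized storage.
// `MaybeUninit` is the sound present-day replacement for `mem::uninitialized`.
pub fn zeroed_kernel_acc() -> [[f32; 4]; 4] {
    // An array of `MaybeUninit` may legitimately start uninitialized.
    let mut ab: [[MaybeUninit<f32>; 4]; 4] =
        unsafe { MaybeUninit::uninit().assume_init() };
    for row in ab.iter_mut() {
        for cell in row.iter_mut() {
            cell.write(0.0); // explicit per-element assignment
        }
    }
    // Safety: every element was written above.
    unsafe { std::mem::transmute::<_, [[f32; 4]; 4]>(ab) }
}
```

This sidesteps the fragile memset-recognition path discussed in this thread, at the cost of `unsafe` code that must guarantee full initialization.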
The benchmarks in crate matrixmultiply version 0.1.8 degrade with MIR enabled (commit bluss/matrixmultiply@3d83647).
Tested using rustc 1.12.0-nightly (1deb02ea6 2016-08-12). Typical output:
Sure, the matrix multiplication kernel uses some major muckery that it expects the compiler to optimize down and autovectorize, but since it technically is a regression, it gets a report.