Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Miscompilation under target-cpu >= haswell #63791

Closed
alecmocatta opened this issue Aug 21, 2019 · 13 comments · Fixed by #64317
Closed

Miscompilation under target-cpu >= haswell #63791

alecmocatta opened this issue Aug 21, 2019 · 13 comments · Fixed by #64317
Assignees
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness O-x86_64 Target: x86-64 processors (like x86_64-*) P-high High priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@alecmocatta
Copy link
Contributor

alecmocatta commented Aug 21, 2019

This example fails when compiled with target-cpu "haswell" or more recent:

[dependencies]
aes-soft = "=0.3.3"
use aes_soft::{block_cipher_trait::BlockCipher, Aes128};

fn main() {
    let plain = [127, 0, 0, 1, 174, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
    let key = [0; 16];
    let encrypted = [222, 157, 168, 71, 195, 237, 77, 237, 182, 194, 17, 235, 182, 214, 204, 80];

    let output = encrypt(plain, key);
    assert_eq!(output, encrypted);

    println!("success");
}

fn encrypt(input: [u8; 16], key: [u8; 16]) -> [u8; 16] {
    let key = key.into();
    let mut block = input.into();
    let cipher = Aes128::new(&key);
    cipher.encrypt_block(&mut block);
    block.into()
}
> RUSTFLAGS='-C target-cpu=ivybridge' cargo run --release
success
> RUSTFLAGS='-C target-cpu=haswell' cargo run --release
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `[103, 175, 2, 16, 66, 180, 192, 20, 55, 121, 111, 21, 82, 184, 106, 59]`,
 right: `[222, 157, 168, 71, 195, 237, 77, 237, 182, 194, 17, 235, 182, 214, 204, 80]`', src/main.rs:9:5
> RUSTFLAGS='-C target-cpu=broadwell' cargo run --release
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `[103, 175, 2, 16, 66, 180, 192, 20, 55, 121, 111, 21, 82, 184, 106, 59]`,
 right: `[222, 157, 168, 71, 195, 237, 77, 237, 182, 194, 17, 235, 182, 214, 204, 80]`', src/main.rs:9:5
> RUSTFLAGS='-C target-cpu=skylake' cargo run --release
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `[103, 175, 2, 16, 66, 180, 192, 20, 55, 121, 111, 21, 82, 184, 106, 59]`,
 right: `[222, 157, 168, 71, 195, 237, 77, 237, 182, 194, 17, 235, 182, 214, 204, 80]`', src/main.rs:9:5

I bumped into this when compiling with target-cpu=native and assumed it was related to #54688, but after minimising the testcase I don't think it is. My next guess is an llvm bug but I thought I'd make an issue here in case anyone else bumps into it or wants to help investigate.

I've created a self-contained godbolt: https://rust.godbolt.org/z/ChdU6w

The two uses of unsafe I believe are sound. I think I've ruled out target-features being the cause, because this succeeds:

RUSTFLAGS='-C target-cpu=ivybridge -C target-feature=-slow-unaligned-mem-32,+avx2,+bmi,+bmi2,+ermsb,+fma,+invpcid,+lzcnt,+movbe,+false-deps-lzcnt-tzcnt' cargo run --release

Which should enable identical features to haswell given https://github.com/llvm-mirror/llvm/blob/release_90/lib/Target/X86/X86.td#L542-L568.

This occurs on nightly and stables back to 1.34.0, though 1.33.0 and prior seem to work correctly.

Tested on Broadwell (Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz) and Skylake (Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz).

Same issue filed against aes_soft crate: RustCrypto/block-ciphers#51

@jonas-schievink jonas-schievink added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-nominated I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Aug 21, 2019
@gnzlbg
Copy link
Contributor

gnzlbg commented Aug 24, 2019

It would help tremendously if you could reduce that example further.

@mati865
Copy link
Contributor

mati865 commented Aug 24, 2019

You can use C-Reduce to achieve that.

@jonas-schievink jonas-schievink added the E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example label Aug 24, 2019
@alecmocatta
Copy link
Contributor Author

I'll look into using bugpoint and -opt-bisect-limit to narrow this down. Any other suggestions welcome.

@mati865 Thanks for the suggestion, I didn't know C-Reduce worked for Rust! My suspicion is in llvm's optimization passes, i.e. LLVM IR -> assembly, so I'll give llvm's bugpoint a go first.

@bjorn3
Copy link
Member

bjorn3 commented Aug 24, 2019

I didn't know C-Reduce worked for Rust!

It works for pretty much any language. It just does changes like removing a line or changing it a bit, so it often generates invalid C as well. Only a few passes assume C, but those will just not reduce the test case for Rust. See https://embed.cs.utah.edu/creduce/pldi12_talk.pdf

@Aaron1011
Copy link
Member

Unforunately, C-Reduce proved to be too clever when I ran it: it replaced:

if output != encrypted {
    std::process::abort();
}

with

std::process::abort();

While this is technically a 'minimization', it's not a very useful one.

@JakubValtar
Copy link

Not sure if it helps, but git bisect revealed df0466d to be the first bad commit.

@Aaron1011
Copy link
Member

Aaron1011 commented Aug 24, 2019

THe problem appears to be related to the deconstruct function inside un_bit_slice_4x4_with_u16. For some reason, the high-order bit in a (computed as pb(x7, bit + 12, 31)) seems to be getting left off.

The earliest point at which this occurs appears to be the call un_bit_slice_4x4_with_u16(Bs8State(61439, 65535, 0, 4369, 273, 65535, 61439, 4369)). The correct result is (2868640763, 1667457891, 1667457891, 1667457891), but with -C target-cpu=haswell, it gets miscompiled somehow to return (721157115, 1667457891, 1667457891, 1667457891). Note that 2868640763 = 721157115 | (1 << 31).

Unfortunately, the problem disappeared when I made a standalone project with un_bit_slice_4x4_with_u16 and Bs8State.

@JakubValtar
Copy link

JakubValtar commented Aug 25, 2019

I decided to give it a try and managed to simplify the code:

fn main() {
    let x = [0, 0, 0, 0, 0, 0, 0, 1];
    let x = unslice(x).0;
    assert_eq!(x, 1 << 31);
}

#[inline(never)] // can't reproduce without inline(never)
fn unslice(x: [u16; 8]) -> (u32, u32, u32) {
    let a = deconstruct(x, 0);
    let b = deconstruct(x, 1); // needs two calls with different parameter
    (a, b, 0) // can't reproduce without the third value in the tuple
}

#[inline(never)] // can't reproduce without inline(never)
fn deconstruct(x: [u16; 8], bit: u32) -> u32 {
    // this part needs to get vectorized
    pb(x[0], bit + 1, 16) | pb(x[1], bit + 1, 17)
        | pb(x[2], bit + 1, 18) | pb(x[3], bit + 1, 19)
        | pb(x[4], bit + 1, 20) | pb(x[5], bit + 1, 21)
        | pb(x[6], bit + 1, 22) | pb(x[7], bit + 1, 23)
        | pb(x[0], bit, 24) | pb(x[1], bit, 25)
        | pb(x[2], bit, 26) | pb(x[3], bit, 27)
        | pb(x[4], bit, 28) | pb(x[5], bit, 29)
        | pb(x[6], bit, 30) | pb(x[7], bit, 31)
}

fn pb(x: u16, bit: u32, shift: u32) -> u32 {
    u32::from((x >> bit) & 1) << shift // can't reproduce with different order (e.g. mask then shift)
}

Assembly output from 1.33.0 and 1.34.0 is identical, except for this block of AVX instructions:

1.33.0
vpunpckhwd ymm4, ymm3, ymm1
vpunpckhwd ymm0, ymm0, ymm3
vpsrlvd ymm0, ymm4, ymm0
vpsrld ymm0, ymm0, 16
vpunpcklwd ymm1, ymm3, ymm1
vpsrld ymm1, ymm1, 1
vpsrld ymm1, ymm1, 16
vpackusdw ymm0, ymm1, ymm0

1.34.0
vpunpckhwd ymm1, ymm3, ymm1
vpunpckhwd ymm0, ymm0, ymm3
vpsrlvd ymm0, ymm1, ymm0

Looks like some of them got incorrectly optimized away. I'm not familiar with how LLVM vectorizes and optimizes things, so sadly this is as far as I can get on my own.

Edit: Just to confirm, running with -opt-bisect-limit points to SLP Vectorizer on function (deconstruct).

@jonas-schievink jonas-schievink added regression-from-stable-to-stable Performance or correctness regression from one stable version to another. O-x86_64 Target: x86-64 processors (like x86_64-*) and removed E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example labels Aug 25, 2019
@gnzlbg
Copy link
Contributor

gnzlbg commented Aug 27, 2019

I managed to produce a smaller but slightly different reduction. I'm not sure if it also triggers the exact same issue reported here, but a similar asm diff is observed, see the diff in https://gcc.godbolt.org/z/KZVz_O:

pub fn deconstruct(x: [u16; 8], bit: u32) -> u32 {
    fn pb(x: u16, bit: u32, shift: u32) -> u32 {
        // can't reproduce with different order (e.g. mask then shift)
        u32::from((x >> bit) & 1) << shift 
    }
    // this part needs to get vectorized
    pb(x[0], bit + 1, 16) | pb(x[1], bit + 1, 17)
        | pb(x[2], bit + 1, 18) | pb(x[3], bit + 1, 19)
        | pb(x[4], bit + 1, 20) | pb(x[5], bit + 1, 21)
        | pb(x[6], bit + 1, 22) | pb(x[7], bit + 1, 23)
        | pb(x[0], bit, 24) | pb(x[1], bit, 25)
        | pb(x[2], bit, 26) | pb(x[3], bit, 27)
        | pb(x[4], bit, 28) | pb(x[5], bit, 29)
        | pb(x[6], bit, 30) | pb(x[7], bit, 31)
}

@nikomatsakis
Copy link
Contributor

Check-in from the compiler triage meeting:

Summarizing, the previous comments, it seems we have minimized the example and we have bisected to a commit that upgrades LLVM.

This looks likely to be an LLVM bug. I'm going to cc @nikic and @nagisa in the hopes that one of them can help in pushing this to resolution.

I'm marking it as P-high.

@nikomatsakis nikomatsakis added P-high High priority and removed I-nominated labels Aug 29, 2019
@nikic
Copy link
Contributor

nikic commented Sep 1, 2019

The IR for the functions is the same, so this is an issue in the backend, not the SLP vectorizer.

It looks like LLVM correctly determines that one of the vpsrld's is dead based on demanded elements and replaces the packus with a pshufb. A little later we see:

Replacing.2 t289: v32i8 = BUILD_VECTOR undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, Constant:i8<2>, Constant:i8<3>, Constant:i8<6>, Constant:i8<7>, Constant:i8<10>, Constant:i8<11>, Constant:i8<14>, Constant:i8<15>, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, Constant:i8<18>, Constant:i8<19>, Constant:i8<22>, Constant:i8<23>, Constant:i8<26>, Constant:i8<27>, Constant:i8<30>, Constant:i8<31>

With: t292: v32i8 = BUILD_VECTOR undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, Constant:i8<26>, Constant:i8<27>, undef:i8, undef:i8


Combining: t290: v32i8 = X86ISD::PSHUFB t288, t292
Creating new node: t293: v16i16 = bitcast t261
Creating constant: t294: i8 = Constant<-28>
Creating new node: t295: v16i16 = X86ISD::PSHUFLW t293, Constant:i8<-28>
Creating new node: t296: v32i8 = bitcast t295
 ... into: t296: v32i8 = bitcast t295

Which doesn't look right... pshuflw -28 is a no-op operation, but the previous pshufb is not. This combine is performed by matchUnaryPermuteShuffle(), but I haven't looked further than that yet.

@nikic
Copy link
Contributor

nikic commented Sep 7, 2019

LLVM bug: https://bugs.llvm.org/show_bug.cgi?id=43230

@nikic nikic self-assigned this Sep 7, 2019
@nikic
Copy link
Contributor

nikic commented Sep 9, 2019

Fixed & backported to LLVM 9.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness O-x86_64 Target: x86-64 processors (like x86_64-*) P-high High priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants