Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIMD intrinsics often fail to inline #53069

Closed
pcwalton opened this issue Aug 4, 2018 · 3 comments
Closed

SIMD intrinsics often fail to inline #53069

pcwalton opened this issue Aug 4, 2018 · 3 comments
Labels
A-SIMD Area: SIMD (Single Instruction Multiple Data) I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@pcwalton
Copy link
Contributor

pcwalton commented Aug 4, 2018

Here's my code:

#[bench]
fn simd_paeth_1(bench: &mut Bencher) {
    // pi
    unsafe {
        let a = [0x32, 0x43, 0xf6, 0xa8];
        let b = [0x88, 0x5a, 0x30, 0x8d];
        let c = [0x31, 0x31, 0x98, 0xa2];
        let ba = x86::_mm_set_epi16(b[3], b[2], b[1], b[0], a[3], a[2], a[1], a[0]);
        let cc = x86::_mm_set_epi16(c[3], c[2], c[1], c[0], c[3], c[2], c[1], c[0]);
        let all_ones = x86::_mm_set1_epi16(-1);
        bench.iter(|| {
            test::black_box(a);
            test::black_box(b);
            test::black_box(c);
            for i in 0..1000 {
                test::black_box(i);

                // Compute signed distances.
                let spapb = x86::_mm_sub_epi16(ba, cc);
                let spbpa = x86::_mm_shuffle_epi32(spapb, 0b01001110);    // swap dwords
                let spcpc = x86::_mm_add_epi16(spbpa, spapb);

                // Compute absolute distances.
                let papb = x86::_mm_abs_epi16(spapb);
                let pcpc = x86::_mm_abs_epi16(spcpc);
                let pbpa = x86::_mm_shuffle_epi32(papb, 0b01001110);

                // Compute minima.
                let min_bc = x86::_mm_min_epi16(papb, pcpc);
                let min_abc = x86::_mm_min_epi16(pbpa, min_bc);

                // Choose b or c.
                let pick_b_or_c = x86::_mm_cmpeq_epi16(papb, min_bc);
                let b = x86::_mm_slli_si128(ba, 4);
                let b_or_c = x86::_mm_blendv_epi8(b, cc, pick_b_or_c);

                // Choose a if necessary.
                let pick_a = x86::_mm_cmpeq_epi16(pbpa, min_abc);
                let result = x86::_mm_blendv_epi8(ba, b_or_c, pick_a);
                test::black_box(result);
            }
        });
    }
}

Note that mm_blendv_epi8 fails to inline, ruining performance.

This happens a lot and it makes using SIMD intrinsics very annoying. I have to start using inline asm.

@pcwalton pcwalton added I-slow Issue: Problems and improvements with respect to performance of generated code. A-SIMD Area: SIMD (Single Instruction Multiple Data) labels Aug 4, 2018
@hanna-kruppe
Copy link
Contributor

As far as I can tell all the intrinsics that fail to inline require target_features that aren't enabled by default (SSE3 for abs_epi16, SSE4.1 for blendv_epi8). So the fix is to add the appropriate #[target_feature] annotations (though it appears you can't do that on closures).

Not inlining between functions with different target_feature sets is generally required for correctness, but in this case (as in many others) it's unfortunately a big performance footgun.

@gnzlbg
Copy link
Contributor

gnzlbg commented Aug 6, 2018

This is working as intended, and the things that we could improve to make this better already have issues / RFCs filled, so we should probably close this as working as intended / dupplicate.


Note that mm_blendv_epi8 fails to inline, ruining performance.

Executing SSE3 and SSE4.1 intrinsics on hardware that does not support them is undefined behavior and not inlining these intrinsics is required for correctness when, for example, doing run-time feature detection. Otherwise, code like this could fail:

// sse2 function
if has_sse3 {  // detect sse3 at run-time
    // executing this on sse2 hardware is UB
    // so this cannot be speculatively executed, re-ordered out of the if, etc. 
    sse3_intrinsics(); 
}

If you know, your function will only be executed on SSE4.1 hardware, you can use #[target_feature(enable = "4.1")] to compile it for SSE4.1 hardware. If you know your binary will only run on SSE4.1 hardware you can use RUSTFLAGS=-C target-feature=+sse4.1 to compile it for SSE4.1 hardware. Both will fix the performance issue reported here, and both will invoke undefined behavior if, at run-time, your code gets executed on hardware does not support SSE4.1.

These solutions are far from perfect, but the top docs of stdsimd do specifically address this issue, and there is "RFC: target-feature 1.1" which would make it easier to write safe Rust code without performance and safety issues but it hasn't been merged yet: rust-lang/rfcs#2396 Maybe @pcwalton could help pushing it forward.

Also, we should obviously be warning about this, but it was decided that doing so would be the job of the portability lint, and warning about this is hard:

fn foo(x: bool) {  // SSE2
    avx(); // WARNING
    if detect("avx") {
        avx(); // OK (NO WARNING)
    }
    let b = detect("avx");
    if b { avx(); } // OK (NO WARNING)
    if x { avx() } // ??? MIGHT BE OK
}

but that should probably be raised there: https://github.com/rust-lang-nursery/portability-wg (EDIT: reported this here: rust-lang-nursery/portability-wg#17 (comment))

@gnzlbg
Copy link
Contributor

gnzlbg commented Aug 27, 2018

This can be closed as working as intended.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-SIMD Area: SIMD (Single Instruction Multiple Data) I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

No branches or pull requests

4 participants