SIMD intrinsics often fail to inline #53069

pcwalton · 2018-08-04T18:49:31Z

Here's my code:

#[bench]
fn simd_paeth_1(bench: &mut Bencher) {
    // pi
    unsafe {
        let a = [0x32, 0x43, 0xf6, 0xa8];
        let b = [0x88, 0x5a, 0x30, 0x8d];
        let c = [0x31, 0x31, 0x98, 0xa2];
        let ba = x86::_mm_set_epi16(b[3], b[2], b[1], b[0], a[3], a[2], a[1], a[0]);
        let cc = x86::_mm_set_epi16(c[3], c[2], c[1], c[0], c[3], c[2], c[1], c[0]);
        let all_ones = x86::_mm_set1_epi16(-1);
        bench.iter(|| {
            test::black_box(a);
            test::black_box(b);
            test::black_box(c);
            for i in 0..1000 {
                test::black_box(i);

                // Compute signed distances.
                let spapb = x86::_mm_sub_epi16(ba, cc);
                let spbpa = x86::_mm_shuffle_epi32(spapb, 0b01001110);    // swap dwords
                let spcpc = x86::_mm_add_epi16(spbpa, spapb);

                // Compute absolute distances.
                let papb = x86::_mm_abs_epi16(spapb);
                let pcpc = x86::_mm_abs_epi16(spcpc);
                let pbpa = x86::_mm_shuffle_epi32(papb, 0b01001110);

                // Compute minima.
                let min_bc = x86::_mm_min_epi16(papb, pcpc);
                let min_abc = x86::_mm_min_epi16(pbpa, min_bc);

                // Choose b or c.
                let pick_b_or_c = x86::_mm_cmpeq_epi16(papb, min_bc);
                let b = x86::_mm_slli_si128(ba, 4);
                let b_or_c = x86::_mm_blendv_epi8(b, cc, pick_b_or_c);

                // Choose a if necessary.
                let pick_a = x86::_mm_cmpeq_epi16(pbpa, min_abc);
                let result = x86::_mm_blendv_epi8(ba, b_or_c, pick_a);
                test::black_box(result);
            }
        });
    }
}

Note that mm_blendv_epi8 fails to inline, ruining performance.

This happens a lot and it makes using SIMD intrinsics very annoying. I have to start using inline asm.

The text was updated successfully, but these errors were encountered:

hanna-kruppe · 2018-08-05T11:51:36Z

As far as I can tell all the intrinsics that fail to inline require target_features that aren't enabled by default (SSE3 for abs_epi16, SSE4.1 for blendv_epi8). So the fix is to add the appropriate #[target_feature] annotations (though it appears you can't do that on closures).

Not inlining between functions with different target_feature sets is generally required for correctness, but in this case (as in many others) it's unfortunately a big performance footgun.

gnzlbg · 2018-08-06T07:51:11Z

This is working as intended, and the things that we could improve to make this better already have issues / RFCs filled, so we should probably close this as working as intended / dupplicate.

Note that mm_blendv_epi8 fails to inline, ruining performance.

Executing SSE3 and SSE4.1 intrinsics on hardware that does not support them is undefined behavior and not inlining these intrinsics is required for correctness when, for example, doing run-time feature detection. Otherwise, code like this could fail:

// sse2 function
if has_sse3 {  // detect sse3 at run-time
    // executing this on sse2 hardware is UB
    // so this cannot be speculatively executed, re-ordered out of the if, etc. 
    sse3_intrinsics(); 
}

If you know, your function will only be executed on SSE4.1 hardware, you can use #[target_feature(enable = "4.1")] to compile it for SSE4.1 hardware. If you know your binary will only run on SSE4.1 hardware you can use RUSTFLAGS=-C target-feature=+sse4.1 to compile it for SSE4.1 hardware. Both will fix the performance issue reported here, and both will invoke undefined behavior if, at run-time, your code gets executed on hardware does not support SSE4.1.

These solutions are far from perfect, but the top docs of stdsimd do specifically address this issue, and there is "RFC: target-feature 1.1" which would make it easier to write safe Rust code without performance and safety issues but it hasn't been merged yet: rust-lang/rfcs#2396 Maybe @pcwalton could help pushing it forward.

Also, we should obviously be warning about this, but it was decided that doing so would be the job of the portability lint, and warning about this is hard:

fn foo(x: bool) {  // SSE2
    avx(); // WARNING
    if detect("avx") {
        avx(); // OK (NO WARNING)
    }
    let b = detect("avx");
    if b { avx(); } // OK (NO WARNING)
    if x { avx() } // ??? MIGHT BE OK
}

but that should probably be raised there: https://github.com/rust-lang-nursery/portability-wg (EDIT: reported this here: rust-lang-nursery/portability-wg#17 (comment))

gnzlbg · 2018-08-27T14:23:07Z

This can be closed as working as intended.

pcwalton added I-slow Issue: Problems and improvements with respect to performance of generated code. A-SIMD Area: SIMD (Single Instruction Multiple Data) labels Aug 4, 2018

gnzlbg mentioned this issue Aug 6, 2018

Portability checking and target_feature rust-lang-nursery/portability-wg#17

Open

shepmaster closed this as completed Aug 27, 2018

jack-signal mentioned this issue Oct 13, 2020

aesni: Add target_feature annotations to allow intrinsic inlining RustCrypto/block-ciphers#165

Merged

calebzulawski mentioned this issue Jul 16, 2023

Fix #[inline(always)] on closures with target feature 1.1 #111836

Merged

gendx mentioned this issue Sep 8, 2023

p64 load/store intrinsics not properly inlined on arm rust-lang/stdarch#1236

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SIMD intrinsics often fail to inline #53069

SIMD intrinsics often fail to inline #53069

pcwalton commented Aug 4, 2018

hanna-kruppe commented Aug 5, 2018

gnzlbg commented Aug 6, 2018 •

edited

Loading

gnzlbg commented Aug 27, 2018

SIMD intrinsics often fail to inline #53069

SIMD intrinsics often fail to inline #53069

Comments

pcwalton commented Aug 4, 2018

hanna-kruppe commented Aug 5, 2018

gnzlbg commented Aug 6, 2018 • edited Loading

gnzlbg commented Aug 27, 2018

gnzlbg commented Aug 6, 2018 •

edited

Loading