Implement AVX-512 intrinsics #310

Open
alexcrichton opened this issue Jan 29, 2018 · 86 comments
@alexcrichton
Member

General instructions for this can be found at #40, but the list of AVX-512 intrinsics is quite large! This is intended to help track progress but you'll likely want to talk to us out of band to ensure that everything is coordinated.

Intrinsic lists: https://gist.github.com/alexcrichton/3281adb58af7f465cebee49759ae3164

@alexcrichton
Member Author

I think the best instruction set to get started with is probably avx512f as it has the constructors for types that we can use for all the other sets:

["AVX512F"]

@gnzlbg
Contributor

gnzlbg commented Mar 16, 2018

Dissecting one of the interesting intrinsics here:

/// Compute the absolute value of packed 8-bit integers in a, 
/// and store the unsigned results in dst using writemask k (elements 
/// are copied from src when the corresponding mask bit is not set).
__m512i _mm512_mask_abs_epi8 (__m512i src, __mmask64 k, __m512i a);

Here the __mmask64 type appears: a 64-bit mask that LLVM requires us to implement as a <64 x i1> vector, which must be allocated to one of the k mask registers. In AVX-512 the k registers are dedicated mask registers, and what seems even more interesting is that AVX-512 does seem to support i1 as a legal type, lowered to a cleared k register with the first bit either set or unset...

So it would be nice to know how exactly all of this works in LLVM, because i1 types are illegal on all other x86 "targets" (e.g. AVX2). Does anybody know?

Another difference from AVX2: to select values from two u8x32 in AVX2, the mask is an i8x32 with each byte either all-ones or all-zeros, but IIUC an AVX-512 __mmask32 is also usable for this and only needs 32 bits. It would be nice to know whether these two representations (i8x32 as a mask and __mmask32) can interact, and if so, how. Going from __mmask32 to i8x32 can probably be done in LLVM as a sext from <32 x i1> to <32 x i8>, and the opposite with a trunc from <32 x i8> to <32 x i1>, but maybe there is a different way these conversions have to be done.

This affects boolean vectors / masks, because bool8x32 would need to be castable to bool1x32 and vice versa.
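To make the relationship concrete, here is a scalar Rust sketch of the two mask representations and the sext/trunc-style conversions between them (illustrative only; the function names are made up and this is not a proposal for the stdsimd API):

// An AVX2-style mask is one i8 lane per element (all-ones or all-zeros);
// an AVX-512-style mask is one bit per element (what __mmask32 would hold).
fn bytemask_to_bitmask(bytes: [i8; 32]) -> u32 {
    let mut k = 0u32;
    for (i, b) in bytes.iter().enumerate() {
        // Mirrors the trunc <32 x i8> -> <32 x i1> direction, which keeps the
        // low bit of each lane (same as the high bit for canonical masks).
        if (*b & 1) == 1 {
            k |= 1 << i;
        }
    }
    k
}

fn bitmask_to_bytemask(k: u32) -> [i8; 32] {
    let mut bytes = [0i8; 32];
    for i in 0..32 {
        // sext <32 x i1> -> <32 x i8>: a set bit becomes an all-ones lane.
        bytes[i] = if (k >> i) & 1 == 1 { -1 } else { 0 };
    }
    bytes
}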

@hdevalence
Contributor

So it would be nice to know how exactly all of this works in LLVM, because i1 types are illegal on all other x86 "targets" (e.g. AVX2). Does anybody know?

I don't understand how it works, but there are some possibly relevant slides from the 2017 LLVM developers' meeting that may be useful: https://llvm.org/devmtg/2017-03//assets/slides/avx512_mask_registers_code_generation_challenges_in_llvm.pdf

Another possibly relevant point is that AVX512VL extends the mask registers (and the corresponding intrinsics) to 128- and 256-bit vectors. But, at least when using these from C, LLVM will currently just use blend instructions instead of masks: https://godbolt.org/g/FjU1Xn

@gnzlbg
Contributor

gnzlbg commented Apr 6, 2018

But, at least when using these from C, LLVM will currently just use blend instructions instead of masks: https://godbolt.org/g/FjU1Xn

That's a really nice test. Do you know if there is an LLVM bug open for it? I haven't been able to find any.

@hdevalence
Contributor

Hi, have there been any new developments since this was last active? I would like to contribute AVX-512 intrinsics, but I'm not sure what (if anything) is blocking them, so if anyone has any pointers I'd be happy to help!

@gnzlbg
Contributor

gnzlbg commented Aug 15, 2018

You can add any intrinsic that does not use __mmask.. types without issues.

If you want to add an intrinsic that uses __mmask..., you would need to add the mask types first. It is unclear what that would take. A #[repr(simd)] struct __mmask64(i64); might just work, or it might fail spectacularly. AFAIK nobody has tried yet.

@gnzlbg
Contributor

gnzlbg commented Aug 15, 2018

Clang defines masks like __mmask16 as just (https://github.com/llvm-mirror/clang/blob/master/lib/Headers/avx512fintrin.h#L48):

typedef unsigned char __mmask8;
typedef unsigned short __mmask16;
typedef unsigned int __mmask32;
typedef unsigned long long __mmask64;

So maybe just a wrapper struct without #[repr(simd)] would be enough:

pub struct __mmask8(u8);
pub struct __mmask16(u16);
pub struct __mmask32(u32);
pub struct __mmask64(u64);
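Plain type aliases mirroring the C typedefs would be yet another option (just a sketch of the alternative, not a decision):

pub type __mmask8 = u8;
pub type __mmask16 = u16;
pub type __mmask32 = u32;
pub type __mmask64 = u64;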

@hdevalence
Contributor

Cool! I'll give it a try some time this week, RustConf permitting.

@hdevalence
Contributor

Should AVX-512 intrinsics be split into modules corresponding to their feature flag?

This seems sensible except that I'm not sure how it should interact with the AVX512VL extension, since it seems weird to have the 512/256/128-bit versions of the same intrinsic in different places.

@gnzlbg
Contributor

gnzlbg commented Sep 11, 2018

@hdevalence we currently split the functionality into modules corresponding to their target-feature flag and/or cpuid flag. I expect avx512f, avx512vl, etc. to be their own modules, like they are in clang.

This is decided on a case-by-case basis though; whoever opens the PR can get the conversation started. Are there any technical reasons to split it any other way?

@hdevalence
Contributor

Hmm, but the VL flag is orthogonal to the other flags, so for instance the _mm256_madd52hi_epu64 intrinsic requires IFMA and VL. Where should it live?

@gnzlbg
Contributor

gnzlbg commented Sep 11, 2018

@hdevalence in clang they live in an avx512ifmavl header... avx-512 is complicated :/ many intrinsics require two features...

EDIT: typically the ones that require avx512f + avx512{something_else} live in the {something_else} module though.

@hdevalence
Contributor

avx-512 is complicated :/

no kidding... looking at the AVX-512 Venn diagram:
[image: Venn diagram of AVX-512 feature subsets]

it seems like the only CPUs that don't have VL extensions for all of their supported AVX-512 instructions are the Xeon Phi cores, which I think are all cancelled now. So the common case will be that if a CPU supports an instruction, it will almost certainly also support the VL extensions for it.

In that case, maybe it makes sense to split the intrinsics into modules avx512f, avx512ifma, etc., and then within those modules separately gate the VL variants on the avx512vl flag. This is still correct in the edge case that VL is not present, but seems like a more logical grouping... I think clang maybe can't really do this because C doesn't have a module system.

Does this seem like a sensible arrangement?
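A rough sketch of what that grouping could look like (module layout and placeholder bodies are illustrative only; depending on the compiler this may need nightly feature gates such as avx512_target_feature, and the real functions would go through the usual assert_instr machinery):

// Hypothetical layout: the 512-bit IFMA intrinsic and its VL variant live in
// the same avx512ifma module; only the target-feature gates differ.
pub mod avx512ifma {
    use core::arch::x86_64::{__m256i, __m512i};

    /// 512-bit form: gated only on avx512ifma.
    #[target_feature(enable = "avx512ifma")]
    pub unsafe fn _mm512_madd52hi_epu64(a: __m512i, _b: __m512i, _c: __m512i) -> __m512i {
        // Placeholder body; the real implementation would call the LLVM intrinsic.
        a
    }

    /// 256-bit VL variant: gated on avx512ifma + avx512vl, but in the same module.
    #[target_feature(enable = "avx512ifma,avx512vl")]
    pub unsafe fn _mm256_madd52hi_epu64(a: __m256i, _b: __m256i, _c: __m256i) -> __m256i {
        a
    }
}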

@gnzlbg
Contributor

gnzlbg commented Sep 11, 2018

it seems like the only CPUs that don't have VL extensions for all of their supported AVX-512 instructions are the Xeon Phi cores, which I think are all cancelled now,

I don't think we should worry about these. Some of them did not support SSE4.2, and IIRC not AVX2 either (only AVX-512), and IIRC we can't target them with LLVM anyway.

Does this seem like a sensible arrangement?

Sure. If, once we start this way, we discover that putting these in their own modules makes things clearer, we can always do that later.

@hdevalence
Contributor

I started working on this in this branch: https://github.com/hdevalence/stdsimd/tree/avx512 (very rough work).

I'm not sure how to encode the masks. Considering _mm512_abs_epi32 as an example, there are masked versions _mm512_mask_abs_epi32 and _mm512_maskz_abs_epi32 which take a __mmask16 writemask.

In the allintrinsics gist these appear around L20966-20970 as int_x86_avx512_mask_pabs_d_512 (the AVX2 versions are above). However, the clang headers implement the intrinsic by calling __builtin_ia32_selectd_512 with either the src vector (for mask) or the zero vector (for maskz).

Does anyone know if there's an equivalent of __builtin_ia32_selectd_512/__builtin_ia32_selectd_256 accessible in Rust?

@hanna-kruppe

Took a quick look, some notes:

  • Experimentally, __builtin_ia32_selectd_* seem to just generate ordinary select instructions in LLVM IR. I'm not up to date with stdsimd internals, but I'm sure there's a way to generate that.
  • For the mask type, Clang effectively passes it around as a u16 (or however many mask bits are needed) and bitcasts it to the matching <n x i1> vector when passing it to operations that expect a mask, such as select.

@gnzlbg
Contributor

gnzlbg commented Oct 29, 2018

Experimentally, __builtin_ia32_selectd_* seem to just generate ordinary select instructions in LLVM IR. I'm not up to date with stdsimd internals, but I'm sure there's a way to generate that.

We have a simd_select platform intrinsic that's used in packed_simd: https://github.com/rust-lang-nursery/packed_simd/blob/master/src/codegen/llvm.rs#L89

For the mask type, Clang effectively passes it around as a u16 (or however many mask bits are needed) and bitcasts it to the matching <n x i1> vector when passing it to operations that expect a mask, such as select.

We could add a select intrinsic that takes an integer instead of a vector and does this (or extend the current simd_select intrinsic to support an integer mask).

@hdevalence
Contributor

Update: I found the simd_select intrinsic, which seems like the answer to my last question. After some more digging, I realized that the allintrinsics gist seems out of date, and that the pabs intrinsics were removed by llvm-mirror/llvm@a01e768#diff-e468fd7732e17e9d51db49c62038ecfe (I guess you are supposed to do a compare and select and rely on that being lowered correctly in the backend).

It seems like the general pattern is to remove masked versions of builtins and to use selects instead, so maybe it would be good to write a macro that generates the masked versions, and also maybe a macro that generates the VL versions.
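A toy, self-contained sketch of that macro idea, using [i32; 4] and a hand-rolled select in place of the real vector types and simd_select_bitmask, purely to show how the mask (merge) and maskz (zero) variants could be generated from one base operation:

// Stand-in for simd_select_bitmask: bit i of `k` picks lane i of `a` or `b`.
fn select_bitmask(k: u8, a: [i32; 4], b: [i32; 4]) -> [i32; 4] {
    let mut out = [0; 4];
    for i in 0..4 {
        out[i] = if (k >> i) & 1 != 0 { a[i] } else { b[i] };
    }
    out
}

macro_rules! masked_variants {
    ($base:ident, $mask_fn:ident, $maskz_fn:ident) => {
        // Merge-masking: lanes with a clear mask bit are copied from `src`.
        fn $mask_fn(src: [i32; 4], k: u8, a: [i32; 4]) -> [i32; 4] {
            select_bitmask(k, $base(a), src)
        }
        // Zero-masking: lanes with a clear mask bit are zeroed.
        fn $maskz_fn(k: u8, a: [i32; 4]) -> [i32; 4] {
            select_bitmask(k, $base(a), [0; 4])
        }
    };
}

fn abs4(a: [i32; 4]) -> [i32; 4] {
    [a[0].abs(), a[1].abs(), a[2].abs(), a[3].abs()]
}

masked_variants!(abs4, mask_abs4, maskz_abs4);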

@hdevalence
Contributor

Oops, I didn't refresh the page before I posted that, sorry for the confusion.

@hdevalence
Contributor

I updated that branch with definitions of set[r]_epi[8,16,32,64] and _mm512_add_epi[8,16,32,64]; next I'll try adding mask, maskz, and VL variants.

@hdevalence
Contributor

For the mask type, Clang effectively passes it around as a u16 (or however many mask bits are needed) and bitcasts it to the matching <n x i1> vector when passing it to operations that expect a mask, such as select.

We could add a select intrinsic that takes an integer instead of a vector and does this (or extend the current simd_select intrinsic to support an integer mask).

Would this have to be implemented lower in rustc, or is it something that could be done in stdsimd?

@gnzlbg
Contributor

gnzlbg commented Oct 30, 2018

Would this have to be implemented lower in rustc, or is it something that could be done in stdsimd?

IIRC the simd_select intrinsic will error if the mask is not a vector type with the same number of lanes as the other vectors being passed (not necessarily the same bit-width). So, assuming those checks work correctly, I don't think simd_select will accept an integer mask as is.

There is no support in Rust for vectors of i1, so I don't think this can be done with link_llvm_intrinsics either.

The easiest thing would be to add a new intrinsic, e.g. fn simd_select_bitmask<I, V>(integer_mask: I, vec0: V, vec1: V) -> V, that requires I to be, e.g., an unsigned integer type whose bit-width equals the number of lanes in V. Then you can just bitcast I to an <N x i1> and do a select. I could maybe add such an intrinsic later today. @rkruppe, thoughts?
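On the stdsimd side, such an intrinsic could then be declared and used roughly like this (nightly-only sketch; the extern "platform-intrinsic" ABI and the exact way rustc would validate the mask width are assumptions here):

#![feature(platform_intrinsics, repr_simd)]

#[repr(simd)]
#[derive(Copy, Clone)]
struct i32x8(i32, i32, i32, i32, i32, i32, i32, i32);

extern "platform-intrinsic" {
    // Bit i of `mask` picks lane i from `if_true` (set) or `if_false` (clear).
    fn simd_select_bitmask<M, V>(mask: M, if_true: V, if_false: V) -> V;
}

unsafe fn merge(mask: u8, a: i32x8, b: i32x8) -> i32x8 {
    simd_select_bitmask(mask, a, b)
}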

@hanna-kruppe

Sounds good to me.

@hdevalence
Contributor

If anyone has time to implement such an intrinsic (I don't know how to do it myself), I'd like to start adding some AVX512 intrinsics.

@gnzlbg
Contributor

gnzlbg commented Dec 12, 2018

Sorry, it is on my backlog. I tried to get started with it a couple of times, but always ran out of HDD space trying to compile rustc. I'll try to make some space and get it done today.

@gnzlbg
Contributor

gnzlbg commented Dec 13, 2018

@hdevalence update: I can't compile rustc anymore: it takes too long (freezes my PC at some point), needs too much RAM (8 GB is not enough), and needs too much disk space (50 GB of free space apparently isn't enough), so I can't implement anything there properly anymore. It was always a pain to modify rustc due to the high requirements, but I've now tried for almost two days to get a full stage1 build of master without success, so I'm giving up.

@Daniel-B-Smith
Contributor

After additional work, I realized that the problem was that I had the constification wrong. I had tried constifying both const args before reaching out, but had apparently gotten it wrong. The PR isn't finished, but the comparisons are linking properly now.

@minybot
Contributor

minybot commented Aug 8, 2020

Hi, I'm trying to implement _mm512_and_epi32 in crates/core_arch/src/x86/avx512f.rs:

pub unsafe fn _mm512_and_epi32(a: __m512i, b: __m512i) -> __m512i {
    let r = vpandd(a.as_i32x16(), b.as_i32x16());
    transmute(r)
}

extern "C" {
    #[link_name = "llvm.x86.avx512.mask.pand.d.512"]
    fn vpandd(a: i32x16, b: i32x16) -> i32x16;
}

The test is:

unsafe fn test_mm512_and_epi32() {
    let a = _mm512_set_epi32(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1);
    let b = _mm512_set_epi32(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1);

    let r = _mm512_and_epi32(a, b);
    let e = _mm512_set_epi32(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1);

    assert_eq_m512i(r, e);
}

When I run cargo test, it fails with "(signal: 11, SIGSEGV: invalid memory reference)". Any ideas?

I tried compiling _mm512_and_epi32 with clang, and it works there.

@Amanieu
Member

Amanieu commented Aug 8, 2020

I suggest using a debugger to look at the disassembly of the crashing code.

@minybot
Contributor

minybot commented Aug 10, 2020

Hi, I'm trying to implement _mm512_and_epi32 in crates/core_arch/src/x86/avx512f.rs:

pub unsafe fn _mm512_and_epi32(a: __m512i, b: __m512i) -> __m512i {
    let r = vpandd(a.as_i32x16(), b.as_i32x16());
    transmute(r)
}

Update: after modifying it to

transmute(simd_and(a.as_i32x16(), b.as_i32x16()))

it works, and rustc generates a vpandd instruction.
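For reference, the complete working definition then looks roughly like this (attributes follow the style of the existing x86 modules and may differ slightly from the final PR):

#[inline]
#[target_feature(enable = "avx512f")]
#[cfg_attr(test, assert_instr(vpandd))]
pub unsafe fn _mm512_and_epi32(a: __m512i, b: __m512i) -> __m512i {
    // simd_and is the generic vector AND intrinsic; under AVX-512F LLVM
    // selects a vpandd for it, so no link_name declaration is needed.
    transmute(simd_and(a.as_i32x16(), b.as_i32x16()))
}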

@minybot
Contributor

minybot commented Sep 9, 2020

I'm trying to implement _mm512_cvt_roundps_ph (__m512 a, int sae).
The documentation says:
'Exceptions can be suppressed by passing _MM_FROUND_NO_EXC in the sae parameter.'
My understanding is that sae should only be _MM_FROUND_NO_EXC (0x08) or _MM_FROUND_CUR_DIRECTION (0x04).
However, Clang accepts sae values from 0 to 255.

Should we follow Clang, or only accept 4 and 8?

@Amanieu
Member

Amanieu commented Sep 9, 2020

I checked both Clang and GCC and they both pass the full 8 bits on to the underlying instruction: https://www.felixcloutier.com/x86/vcvtps2ph

@minybot
Contributor

minybot commented Sep 9, 2020

I checked both Clang and GCC and they both pass the full 8 bits on to the underlying instruction: https://www.felixcloutier.com/x86/vcvtps2ph

OK, thanks. The documentation I checked is:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=512_cvt_roundps_ph&expand=1354

@minybot
Contributor

minybot commented Sep 17, 2020

I'm trying to implement _mm512_mask_extractf32x4_ps (__m128 src, __mmask8 k, __m512 a, int imm8).
The input mask is a u8, but only 4 of its bits are used:

FOR j := 0 to 3
    i := j*32
    IF k[j]
        dst[i+31:i] := tmp[i+31:i]
    ELSE
        dst[i+31:i] := src[i+31:i]
    FI
ENDFOR

Calling simd_select_bitmask(mask, extract, src) fails with: mismatched lengths: mask length 8 != other vector length 4.

My question is: should I implement a u4 type, or handle this some other way?

@Amanieu
Member

Amanieu commented Sep 17, 2020

You can just mask to keep only the bottom 4 bits and use constify_imm4.
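An illustrative scalar sketch of that suggestion, mirroring the Intel pseudocode above (toy array types instead of __m128/__m512; constify_imm4 and the actual lane extraction are not shown):

fn mask_select_4_lanes(src: [f32; 4], k: u8, tmp: [f32; 4]) -> [f32; 4] {
    // Only the low 4 bits of the __mmask8 are meaningful for a 4-lane result.
    let k = k & 0b1111;
    let mut dst = [0.0f32; 4];
    for j in 0..4 {
        dst[j] = if (k >> j) & 1 != 0 { tmp[j] } else { src[j] };
    }
    dst
}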

@bjorn3
Member

bjorn3 commented Sep 17, 2020

simd_select_bitmask really requires the mask to be a non-existent 4-bit integer type here. This is a bug in rustc.

@minybot
Contributor

minybot commented Sep 17, 2020

simd_select_bitmask really requires the mask to be a non-existent 4-bit integer type here. This is a bug in rustc.

Is there any plan to support 4-bit or 2-bit integer types in the future?
Clang uses a <4 x i1> vector to support extract with a 4-bit mask:
"%3 = select <4 x i1> %extract, <4 x i32> %shuffle, <4 x i32> %1"

AVX512F uses a lot of 4-bit (32x4) and 2-bit (64x2) masks in the _mm_mask_xxxxx intrinsics, whose inputs and outputs are 128-bit.

@Amanieu
Member

Amanieu commented Sep 18, 2020

I had a look in the compiler and it seems that this is a bug in the implementation of simd_select_bitmask: it should accept u8 inputs when the number of lanes is less than 8. simd_bitmask already supports this by returning u8 when the number of lanes is less than 8.

@minybot @bjorn3 Would one of you be willing to make a PR to fix this in rustc? The relevant code is here: https://github.com/rust-lang/rust/blob/f3c923a13a458c35ee26b3513533fce8a15c9c05/compiler/rustc_codegen_llvm/src/intrinsic.rs#L1272

@minybot
Contributor

minybot commented Sep 19, 2020

I had a look in the compiler and it seems that this is a bug in the implementation of simd_select_bitmask: it should accept u8 inputs when the number of lanes is less than 8. simd_bitmask already supports this by returning u8 when the number of lanes is less than 8.

I tried modifying simd_select_bitmask to use a 4-bit mask if the output is f32x4:

    if mask_len > output_len {
        mask_len = output_len;
    }
    // ...
    let i1 = bx.type_i1();
    let i1xn = bx.type_vector(i1, mask_len);
    let m_i1s = bx.trunc(args[0].immediate(), i1xn);
    return Ok(bx.select(m_i1s, args[1].immediate(), args[2].immediate()));

However, it fails with "error: failed to parse bitcode for LTO module: Bitwidth for integer type out of range (Producer: 'LLVM11.0.0-rust-dev' Reader: 'LLVM 11.0.0-rust-dev')".

So does bx.select only accept masks of 8 bits or more?

@bjorn3
Member

bjorn3 commented Sep 19, 2020

You can't truncate an i8 to <4 x i1>. You have to truncate to i4 and then bitcast to <4 x i1>, I think.
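Continuing the snippet above, that suggestion would look roughly like this (a sketch against the rustc_codegen_llvm builder API; type_ix and bitcast are assumed to be the right builder methods here):

    // Truncate the u8 mask to an i4, then bitcast the i4 to <4 x i1>.
    let im = bx.type_ix(mask_len);                  // i4 when mask_len == 4
    let i1 = bx.type_i1();
    let i1xn = bx.type_vector(i1, mask_len);
    let m_im = bx.trunc(args[0].immediate(), im);
    let m_i1s = bx.bitcast(m_im, i1xn);
    return Ok(bx.select(m_i1s, args[1].immediate(), args[2].immediate()));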

@minybot
Contributor

minybot commented Oct 1, 2020

I had a look in the compiler and it seems that this is a bug in the implementation of simd_select_bitmask: it should accept u8 inputs when the number of lanes is less than 8. simd_bitmask already supports this by returning u8 when the number of lanes is less than 8.

@minybot @bjorn3 Would one of you be willing to make a PR to fix this in rustc? The relevant code is here: https://github.com/rust-lang/rust/blob/f3c923a13a458c35ee26b3513533fce8a15c9c05/compiler/rustc_codegen_llvm/src/intrinsic.rs#L1272

There is another solution that doesn't touch simd_select_bitmask: use casts. Take _mm512_mask_extractf32x4_ps (__m128 src, __mmask8 k, __m512 a, int imm8) as an example:
the extracted value is 32x4; cast it up to 32x16, then to 32x8, do the bitmask select there, and cast back down to 32x4.
(There is no direct 128-to-256 cast, only 128-to-512, 512-to-256, and 512-to-128.)

@Amanieu
Member

Amanieu commented Oct 3, 2020

I just went ahead and fixed the issue in rust-lang/rust#77504.

@minybot
Contributor

minybot commented Oct 6, 2020

I just went ahead and fixed the issue in rust-lang/rust#77504.

I tested it, and it works when the mask size is 4.

@minybot
Contributor

minybot commented Dec 21, 2020

For mask operations in AVX-512 such as _kadd_mask32, which adds two masks: according to https://travisdowns.github.io/blog/2019/12/05/kreg-facts.html, the masks have their own hardware registers.
Is there any way to make sure _kadd_mask32 will generate a kaddd instruction?

@Amanieu
Member

Amanieu commented Dec 21, 2020

No, but it's fine since we don't guarantee a particular instruction is used for an intrinsic: we leave it to LLVM to decide whether it is better to use a kadd instruction or a normal add instruction.

@stopbystudent

stopbystudent commented Apr 7, 2021

While working on a private project, I needed masked loading, so I wanted to prepare a PR with implementations for _mm512_mask_load_epi32 and the like. Reading https://github.com/rust-lang/stdarch/blob/master/crates/core_arch/avx512f.md, I found the following:

  • _mm512_mask_load_epi32 //need i1
  • _mm512_maskz_load_epi32 //need i1

What does the "need i1" part mean? I have not found any explanation there.

Currently, I am tempted to implement masked loading like this (as an example):

/// Load packed 32-bit integers from memory into dst using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr must be aligned on a 64-byte boundary or a general-protection exception may be generated.
///
/// [Intel's documentation](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_mask_load_epi32&expand=3305)
#[inline]
#[target_feature(enable = "avx512f")]
#[cfg_attr(test, assert_instr(vmovdqa32))]
pub unsafe fn _mm512_mask_load_epi32(src: __m512i, k: __mmask16, mem_addr: *const i32) -> __m512i {
    let loaded = ptr::read(mem_addr as *const __m512i).as_i32x16();
    let src = src.as_i32x16();
    transmute(simd_select_bitmask(k, loaded, src))
}

which follows how _mm512_maskz_mov_epi32 and _mm512_load_epi32 are implemented. If this sounds correct, I might make a PR in the next few days.

@Amanieu
Member

Amanieu commented Apr 7, 2021

This is incorrect since _mm512_mask_load_epi32 must not cause page faults on the parts of the vector that are masked off. Your version will still cause those page faults.

To support this properly we need to call an LLVM intrinsic directly. However, this intrinsic takes a vector of i1 as an argument, which we cannot represent with Rust types. We need additional support in the compiler to call LLVM intrinsics that take a vector of i1 as a parameter.

@stopbystudent

Makes sense. Many thanks for the explanation.

@jhorstmann
Contributor

Another possible implementation for _mm512_mask_load_epi32 would be to use the asm! feature. I have successfully used the following implementation:

#[inline]
pub unsafe fn _mm512_mask_loadu_epi32(src: __m512i, mask: __mmask16, ptr: *const i32) -> __m512i {
    let mut result: __m512i = src;

    asm!(
        "vmovdqu32 {io}{{{k}}}, [{p}]",
        p = in(reg) ptr,
        k = in(kreg) mask,
        io = inout(zmm_reg) result,
        options(nostack), options(pure), options(readonly)
    );

    result
}
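A usage sketch for the helper above (assuming it is compiled together with the asm! definition and run on an AVX-512F CPU; since vmovdqu32 never touches the masked-off elements, the buffer only needs to hold the selected lanes):

use core::arch::x86_64::{__m512i, _mm512_setzero_si512};

unsafe fn load_first_eight(data: &[i32; 8]) -> __m512i {
    // Mask 0x00ff selects lanes 0..8; lanes 8..16 keep the src value (zero
    // here), and the memory behind them is never read.
    _mm512_mask_loadu_epi32(_mm512_setzero_si512(), 0x00ff, data.as_ptr())
}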

If such an implementation would be OK maintenance-wise, I could try preparing a PR that adds the missing avx512f intrinsics this way.

@Amanieu
Member

Amanieu commented Nov 8, 2021

If such an implementation would be OK maintenance-wise, I could try preparing a PR that adds the missing avx512f intrinsics this way.

Sounds good!

@mert-kurttutan

Just coming from the discussion: rust-lang/portable-simd#28.

Regarding the separation of the avx512f intrinsics and target_feature=avx512f: I now have enough interest and time to investigate it.
My particular case of interest is using zmm_reg for inline assembly (so I need the avx512f target feature rather than the intrinsics), but target_feature=avx512f is not stable yet. If it helps the stabilisation of target_feature, I am willing to work on it under some guidance.
@Amanieu What do you think?

@Amanieu
Member

Amanieu commented Jul 25, 2024

I expect that we will be stabilizing AVX-512 soon, thanks to the hard work of many people in implementing the full set of AVX-512 intrinsics in stdarch.
