_mm_loadl_epi64 doesn't allow reads aligned to 8-byte boundaries #582

brian-armstrong · 2018-10-23T08:45:39Z

std::arch::x86_64::_mm_loadl_epi64 is kind of a weird case. Intel guides say it takes a *const __m128i, but the documentation is unclear on whether this needs to be aligned https://software.intel.com/en-us/node/524242 Note that load (must be aligned) and loadu (need not be aligned) are defined while loadl isn't. I believe the correct definition is that it need be aligned, but only to an 8-byte boundary (instead of the full 16) but I haven't been able to find documentation backing this up.

Clang's intrinsics header actually does go out of its way to allow this to be aligned to an 8-byte boundary, not a 16-byte boundary https://github.com/llvm-mirror/clang/blob/master/lib/Headers/emmintrin.h#L3587

This instruction produces the expected result on a 8- but not 16-byte aligned pointer in clang, but yields a segmentation fault on the same class of pointer in Rust.

The text was updated successfully, but these errors were encountered:

gnzlbg · 2018-10-23T08:59:46Z

You can ask Intel for clarification here: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/363747

It might be interesting to survey what GCC, MSVC, and the Intel compiler do here. If they all support unaligned loads, then doing the same here is the right call.

brian-armstrong · 2018-10-23T09:15:55Z

On a related note, _mm_loadu_si64 appears to be missing.

gnzlbg · 2018-10-23T09:19:05Z

Might be worth it to fill a separate issue to track that (or just send a PR).

brian-armstrong · 2018-10-24T23:55:19Z

I went ahead and asked in that thread, and someone responded to point out that the manual says that in general, alignment isn't required unless otherwise specified. I'm not sure how much stock to put in that for this case, though.

I've been poking at godbolt to try to come up with a convincing argument about what C does one way or the other, but I'm finding the output is hard to reason about, and one _mm_loadl_epi64 seems to yield on average 5 instructions.

gnzlbg · 2018-10-25T09:50:56Z

I went ahead and asked in that thread, and someone responded to point out that the manual says that in general, alignment isn't required unless otherwise specified. I'm not sure how much stock to put in that for this case, though.

So IIUC, because __m128 is a quadword, the pointer does not need to point to an aligned address ?

I've been poking at godbolt to try to come up with a convincing argument about what C does one way or the other, but I'm finding the output is hard to reason about, and one _mm_loadl_epi64 seems to yield on average 5 instructions.

You mentioned before that clang supports unaligned loads via the intrinsic. Does GCC support unaligned loads as well? If so, we should probably support them too and you could just send a PR since that would be a backwards compatible change.

brian-armstrong · 2018-10-25T19:59:05Z

Sure.

#include <stdio.h>
#include <immintrin.h>

int main() {
    int64_t data[64];
    for (int i = 0; i < 64; ++i) {
        data[i] = i;
    }
    __m128i a = _mm_loadl_epi64((const __m128i *)(data + 3));
    _mm_storel_epi64((__m128i *)(data + 5), a);
#ifdef __clang__
    printf("clang! ");
#endif
    printf("%p %ld %lld %lld\n", data + 3, (intptr_t)(data + 3) % 16, data[3], data[5]);
}

$ gcc-7 loadl.c
$ ./a.out
0x7fff54322328 8 3 3

$ clang loadl.c
$ ./a.out
clang! 0x7fff5a63e368 8 3 3

brian-armstrong · 2018-10-25T20:03:23Z

I'm looking at the code in sse2.rs but I don't think I actually get how this works. The body of these functions is the fallback code used if sse2 isn't supported, right? I'm not sure how to modify how the intrinsic itself works.

brian-armstrong · 2018-10-31T18:46:00Z

This works now. Thank you!

gnzlbg · 2018-11-02T14:01:34Z

This PR (rust-lang/rust#55610) updates stdsimd in Rust to a version containing this fix. Once that is merged you should be able to use core::arch/std::arch for this in the following nightly.

brian-armstrong mentioned this issue Oct 27, 2018

allow unaligned reads with _mm_loadl_epi64 #584

Merged

brian-armstrong closed this as completed Oct 31, 2018

talumbau mentioned this issue Apr 30, 2021

Build Failure fix on AVX2 machines replace si64 with epi64 tensorflow/tensorflow#48814

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

_mm_loadl_epi64 doesn't allow reads aligned to 8-byte boundaries #582

_mm_loadl_epi64 doesn't allow reads aligned to 8-byte boundaries #582

brian-armstrong commented Oct 23, 2018

gnzlbg commented Oct 23, 2018

brian-armstrong commented Oct 23, 2018

gnzlbg commented Oct 23, 2018

brian-armstrong commented Oct 24, 2018

gnzlbg commented Oct 25, 2018

brian-armstrong commented Oct 25, 2018

brian-armstrong commented Oct 25, 2018

brian-armstrong commented Oct 31, 2018

gnzlbg commented Nov 2, 2018

_mm_loadl_epi64 doesn't allow reads aligned to 8-byte boundaries #582

_mm_loadl_epi64 doesn't allow reads aligned to 8-byte boundaries #582

Comments

brian-armstrong commented Oct 23, 2018

gnzlbg commented Oct 23, 2018

brian-armstrong commented Oct 23, 2018

gnzlbg commented Oct 23, 2018

brian-armstrong commented Oct 24, 2018

gnzlbg commented Oct 25, 2018

brian-armstrong commented Oct 25, 2018

brian-armstrong commented Oct 25, 2018

brian-armstrong commented Oct 31, 2018

gnzlbg commented Nov 2, 2018