Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_mm_loadl_epi64 doesn't allow reads aligned to 8-byte boundaries #582

Closed
brian-armstrong opened this issue Oct 23, 2018 · 9 comments
Closed

Comments

@brian-armstrong
Copy link
Contributor

std::arch::x86_64::_mm_loadl_epi64 is kind of a weird case. Intel guides say it takes a *const __m128i, but the documentation is unclear on whether this needs to be aligned https://software.intel.com/en-us/node/524242 Note that load (must be aligned) and loadu (need not be aligned) are defined while loadl isn't. I believe the correct definition is that it need be aligned, but only to an 8-byte boundary (instead of the full 16) but I haven't been able to find documentation backing this up.

Clang's intrinsics header actually does go out of its way to allow this to be aligned to an 8-byte boundary, not a 16-byte boundary https://github.com/llvm-mirror/clang/blob/master/lib/Headers/emmintrin.h#L3587

This instruction produces the expected result on a 8- but not 16-byte aligned pointer in clang, but yields a segmentation fault on the same class of pointer in Rust.

@gnzlbg
Copy link
Contributor

gnzlbg commented Oct 23, 2018

You can ask Intel for clarification here: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/363747

It might be interesting to survey what GCC, MSVC, and the Intel compiler do here. If they all support unaligned loads, then doing the same here is the right call.

@brian-armstrong
Copy link
Contributor Author

On a related note, _mm_loadu_si64 appears to be missing.

@gnzlbg
Copy link
Contributor

gnzlbg commented Oct 23, 2018

Might be worth it to fill a separate issue to track that (or just send a PR).

@brian-armstrong
Copy link
Contributor Author

I went ahead and asked in that thread, and someone responded to point out that the manual says that in general, alignment isn't required unless otherwise specified. I'm not sure how much stock to put in that for this case, though.

I've been poking at godbolt to try to come up with a convincing argument about what C does one way or the other, but I'm finding the output is hard to reason about, and one _mm_loadl_epi64 seems to yield on average 5 instructions.

@gnzlbg
Copy link
Contributor

gnzlbg commented Oct 25, 2018

I went ahead and asked in that thread, and someone responded to point out that the manual says that in general, alignment isn't required unless otherwise specified. I'm not sure how much stock to put in that for this case, though.

So IIUC, because __m128 is a quadword, the pointer does not need to point to an aligned address ?

I've been poking at godbolt to try to come up with a convincing argument about what C does one way or the other, but I'm finding the output is hard to reason about, and one _mm_loadl_epi64 seems to yield on average 5 instructions.

You mentioned before that clang supports unaligned loads via the intrinsic. Does GCC support unaligned loads as well? If so, we should probably support them too and you could just send a PR since that would be a backwards compatible change.

@brian-armstrong
Copy link
Contributor Author

Sure.

#include <stdio.h>
#include <immintrin.h>

int main() {
    int64_t data[64];
    for (int i = 0; i < 64; ++i) {
        data[i] = i;
    }
    __m128i a = _mm_loadl_epi64((const __m128i *)(data + 3));
    _mm_storel_epi64((__m128i *)(data + 5), a);
#ifdef __clang__
    printf("clang! ");
#endif
    printf("%p %ld %lld %lld\n", data + 3, (intptr_t)(data + 3) % 16, data[3], data[5]);
}
$ gcc-7 loadl.c
$ ./a.out
0x7fff54322328 8 3 3
$ clang loadl.c
$ ./a.out
clang! 0x7fff5a63e368 8 3 3

@brian-armstrong
Copy link
Contributor Author

I'm looking at the code in sse2.rs but I don't think I actually get how this works. The body of these functions is the fallback code used if sse2 isn't supported, right? I'm not sure how to modify how the intrinsic itself works.

@brian-armstrong
Copy link
Contributor Author

This works now. Thank you!

@gnzlbg
Copy link
Contributor

gnzlbg commented Nov 2, 2018

This PR (rust-lang/rust#55610) updates stdsimd in Rust to a version containing this fix. Once that is merged you should be able to use core::arch/std::arch for this in the following nightly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants