SSE1 support #11193
Conversation
Microbenchmark results at http://clb.confined.space/dump/wasm_simd_results_sse1.html
Eyeballing the results suggests:
In general it looks like one can expect about "twice as slow" SIMD performance from Wasm compared to native, but still quite a bit faster than scalar Wasm. There is likely some amount of noise in the numbers; they are not variance-controlled.
This is not surprising: LLVM does not emit
This is extremely interesting and something we need to dive into. Do you have a specific example where
Does that include floating point min and max instructions or just integer min and max?
@ngzhian, @dtig do you know why this could be?
@ngzhian, @dtig IIRC, we decided not to include rsqrt because it was nondeterministic across platforms. Is that correct?
👍
FYI @seanptmaher as well
You can repro the slowness in
That includes only SSE1 min and max, as benchmarked via the script. Integer min and max will come later in SSE2.
Hmm, for some reason the test
Uploaded rendered documentation at http://clb.confined.space/dump/sse_docs_singlehtml/#porting-simd-code-targeting-webassembly
I've seen that test flake a few times and I've tried to investigate, but I really can't figure it out. It seems perfectly deterministic. Can you try re-running to make sure it is a flake?
I looked at
Looks like CI is self-isolating today. Tried rebasing to latest master, but that still gives network errors on icu and something else on flake8.
The code there looks like this:
The constant However I am surprised to see the operation
The C++ code for the section looks like this:

```cpp
do {
  double start = emscripten_get_now();
  __m128 o0 = _mm_load_ps(src);
  __m128 o1 = _mm_load_ps(src2);
  for(int i = 0; i < N; i += 4)
    o0 = _mm_and_ps(o0, o1);
  _mm_store_ps(dst, o0);
  double end = emscripten_get_now();
  ...
} while(0);
```

so there should be only one `v128.and`. Running wasm-dis on the compiled code (-O3) gives

```wat
(loop $label$3
  (local.set $8
    (v128.and
      (local.get $8)
      (local.get $9)
    )
  )
  (local.set $2
    (i32.lt_u
      (local.get $0)
      (i32.const 16777212)
    )
  )
  (local.set $0
    (i32.add
      (local.get $0)
      (i32.const 4)
    )
  )
  (br_if $label$3
    (local.get $2)
  )
)
```

so definitely only one `v128.and`.
Sorry, missed this yesterday. A general question for when you are running native benchmarks: the baseline to compare with should be forced to SSE4.1 for a more accurate comparison. If not, depending on the machine, it could pick AVX+ enabled code, which makes the comparison not exactly accurate.
This is definitely surprising, sounds like it's falling into the scalar path, or generating more code than necessary. The generic logical operations (pand, por, pxor etc.) should be fairly fast.
This is correct, see corresponding issue on the SIMD repository here - https://github.com/WebAssembly/simd/issues/3.
V8 doesn't optimize much over the .wasm file that's generated by the tools. This looks like there's some loop unrolling that's kicking in. V8 will generate code for individual Wasm operations in the generated binary, and doesn't optimize across operation boundaries.
I think the issue here is that emscripten is generating the 4 v128.and instructions that you mentioned in #11193 (comment) when compiling the entire benchmark_sse1.cpp. This is the scalar portion of the benchmark, which seems wrong. In the bench.wat.txt I attached, the vector code begins at 6889, which looks fine.
V8 is running whatever wasm code is generated, so I think the issue is with the cpp->wasm step. Like Deepti mentioned, it could be a loop unrolling bug.
Looks like you are right - wasm-dissing the whole benchmark, I can also see 4x and instructions in the inner loop, but in a small test case only one gets generated. It is currently not possible to control enabling or disabling autovectorization. However, given that all tests pass (assuming the CI network is now up), I'll leave such investigation for later. At least the autovectorizer is not generating incorrect code, just slow paths.
You should be able to pass
Oh right.. Trying that out, that does not fix the issue; the 4x ands are still there.
I haven't looked too deep, but does this benchmark try to do multiple runs until we reach a confidence interval or something?
@ngzhian Yeah, no it doesn't - that is what I meant above by the benchmark not being variance-controlled.
Got it, sorry I missed that point, thanks for clarifying!
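(As an aside on the variance point: a variance-controlled harness, which this benchmark does not implement, would repeat each timed section until the relative standard deviation of the samples drops below a threshold. A minimal sketch, where `measure_once` is a hypothetical stand-in for one timed run of a benchmark section:)

```cpp
#include <cmath>
#include <functional>
#include <vector>

// Repeat a measurement until the relative standard deviation of the
// samples falls below `tolerance`, or until `max_runs` is reached.
// Returns the mean of the collected samples. Assumes timings are > 0.
double measure_stable(const std::function<double()>& measure_once,
                      double tolerance = 0.05, int max_runs = 100) {
  std::vector<double> samples;
  for (int i = 0; i < max_runs; ++i) {
    samples.push_back(measure_once());
    if (samples.size() < 3) continue; // want a few samples before deciding
    double mean = 0.0, var = 0.0;
    for (double s : samples) mean += s;
    mean /= samples.size();
    for (double s : samples) var += (s - mean) * (s - mean);
    var /= samples.size() - 1; // sample variance
    if (std::sqrt(var) / mean < tolerance) return mean;
  }
  double mean = 0.0;
  for (double s : samples) mean += s;
  return mean / samples.size();
}
```

With a perfectly stable `measure_once`, this returns after the minimum three samples; noisy timers keep it sampling up to the run cap.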
Yay, finally CI is green. This is good to review and land now. |
pmin and pmax opcodes in v8 were wrong after the renumber, this will be fixed in https://chromium-review.googlesource.com/c/v8/v8/+/2212682 |
Hey, if this looks semi-decent, would love to get this landed, I have a further PR for SSE2 ready to submit that depends on this one. |
```c
// Emscripten SIMD support doesn't support MMX/float32x2/__m64.
// However, we support loading and storing 2-vectors, so
// treat "__m64 *" as "void *" for that purpose.
typedef float __m64 __attribute__((__vector_size__(16), __aligned__(16)));
```
Shouldn't this be `__vector_size__(8), __aligned__(8)`? I'm not sure how `_mm_loadl_pi` and `_mm_loadh_pi` can be correct if they are loading 16 bytes rather than 8.
Good point. I don't remember why it was originally a 16-byte thing - in a followup PR that adds SSE2 support this will change to an 8-byte quantity.
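(For illustration, the difference under discussion is just the vector width the attribute declares; the typedef names below are hypothetical, not from the patch:)

```cpp
// Hypothetical side-by-side of the two candidate __m64 definitions.
// The PR currently declares a 16-byte vector; the review suggests 8 bytes,
// which matches what _mm_loadl_pi/_mm_loadh_pi semantically operate on.
typedef float m64_as_in_pr __attribute__((__vector_size__(16), __aligned__(16)));
typedef float m64_proposed __attribute__((__vector_size__(8), __aligned__(8)));

static_assert(sizeof(m64_as_in_pr) == 16,
              "current definition spans a full 128-bit vector");
static_assert(sizeof(m64_proposed) == 8,
              "an 8-byte type matches the MMX register width");
```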
```c
_mm_loadr_ps(const float *__p)
{
  __m128 __v = _mm_load_ps(__p);
  return __builtin_shufflevector(__v, __v, 3, 2, 1, 0);
}
```
These `__builtin_shufflevector`s are breaking the `v128_t` abstraction and relying on the fact that `v128_t` is defined to have four elements. Can you add a comment saying that these really should be wasm shuffle intrinsics as soon as the performance problem is figured out? Also, since we are already using types like `__f32x4` in this file, it would be good to use them here, too, to make the number of lanes explicit.
Let me revisit this in next PR, I have conflicting changes to these lines in the next SSE2 PR.
```c
struct __unaligned {
  float __v;
} __attribute__((__packed__, __may_alias__));
((struct __unaligned *)__p)->__v = ((__f32x4)__a)[0];
```
`__f32x4` and friends are not meant to be exported from wasm_simd128.h, but I suppose if you copied the definitions to this file you would get errors about multiple typedefs of the same thing. Maybe they should be macros in wasm_simd128.h so they can be undefined at the end? Anyway, I think this is fine for now, but it might be something we want to change in the future.
Yeah, I hope it'll be fine - if upstream changes, we should be able to adapt quite easily.
```c
return (__p.__x[0] >> 31)
     | ((__p.__x[1] >> 30) & 2)
     | ((__p.__x[2] >> 29) & 4)
     | ((__p.__x[3] >> 28) & 8);
```
Have you experimented with the prototyped `bitmask` instruction for this one?
That one does not seem to be available yet (WebAssembly/simd#201). That one looks like it would directly implement movemask here, so would be pretty cool to have that.
Added a TODO in code to remember to look at that when it becomes available.
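(Until `bitmask` is available, the scalar fallback above amounts to collecting the four sign bits. A standalone sketch of the same computation, with illustrative names, using `memcpy` to reinterpret the float bits:)

```cpp
#include <cstdint>
#include <cstring>

// Scalar equivalent of _mm_movemask_ps: gather the sign bit of each of
// four floats into the low four bits of the result.
static int movemask_ps_scalar(const float v[4]) {
  int mask = 0;
  for (int i = 0; i < 4; ++i) {
    uint32_t bits;
    std::memcpy(&bits, &v[i], sizeof bits); // reinterpret float as bits
    mask |= (int)(bits >> 31) << i;         // sign bit -> lane bit i
  }
  return mask;
}
```

For example, negative values in lanes 0 and 2 set bits 0 and 2 of the mask.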
```python
if '*************************' in benchmark_results:
    benchmark_results = benchmark_results[:benchmark_results.find('*************************')].strip()

print benchmark_results
```
I thought we had switched to Python 3 already? Or is that just a WIP?
This script is from the old era - not Python 3, unfortunately. We are still dual Python 2+3 in Emscripten. I'll have to see about converting this to be Python 3 friendly, although it's not critical for now.
```cpp
SS_TEST("_mm_movelh_ps", _mm_movelh_ps(_mm_load_ps(src+i), _mm_load_ps(src2+i)));

SETCHART("movemask");
START(); for(int i = 0; i < N; i += 4) { int movemask = ((unsigned int)src[i] >> 31) | (((unsigned int)src[i+1] >> 30) & 2) | (((unsigned int)src[i+2] >> 29) & 4) | (((unsigned int)src[i+3] >> 28) & 8); dst_int += movemask; } ENDSCALAR(checksum_dst(&dst_int), "scalar movemask");
```
Is there a reason this is all one line? I think it would be more readable if it were formatted normally. Same with other long lines with multiple statements below.
The intent in these benchmark files has been to have each line implement its own instruction in full, to keep the large amount of code in this file low. I find it more navigable that way overall; expanding these would make the file more verbose to navigate.
Because each line like this implements the matching SSE code in scalar form, I find it's not that important to have all of it in plain sight - rather to tuck it away, so it feels more convenient to group them like this.
Thanks for the review! Addressed comments; the next SSE2 PR will change `__builtin_shufflevector` and the `__m64` size to avoid merge conflicts.
I'll let @kripken merge in case he has any additional comments.
Ah, it looks like there are merge conflicts anyhow @juj |
Merging this in.. let's iterate in tree on further comments.
This duplicates some of what we've been working on in SIMDe. I added a (rather lengthy, sorry) comment to simd-everywhere/simde#86 to try to get a discussion started on how to proceed without duplicating effort. If anyone has an opinion, I'd be interested in hearing it.
This PR restores SSE1 support that was nuked in SIMD.js -> Wasm SIMD transition.