
SSE1 support #11193

Merged: juj merged 15 commits into emscripten-core:master on May 27, 2020

Conversation

@juj (Collaborator) commented May 19, 2020

This PR restores SSE1 support that was nuked in the SIMD.js -> Wasm SIMD transition.

@juj juj added the SIMD label May 19, 2020
@juj juj requested a review from tlively May 19, 2020 11:38
@juj (Collaborator Author) commented May 19, 2020

Microbenchmark results at http://clb.confined.space/dump/wasm_simd_results_sse1.html

@juj (Collaborator Author) commented May 19, 2020

Eyeballing the results suggests:

  • load and set instructions are quite good, within 2x slowdown range,
  • _mm_setzero_ps is quite fishy; it does not seem like optimizations are kicking in on wasm_f32x4_const(0.f, 0.f, 0.f, 0.f)
  • shuffle instruction generation is definitely off. Tried wasm_v32x4_shuffle instead of __builtin_shufflevector, but that was even slower.
  • min, max, move and store instructions are also in good 2x slower range,
  • binary operations and, andnot, or, xor on the other hand are definitely not reaching the target, as they are 10x+ slower than native.
  • arithmetic add/sub/mul/div are also in comfortable 2x slower range,
  • lack of rcp and rsqrt in Wasm SIMD is quite unfortunate, leading to slow rcp behavior.
  • However, the performance of rsqrt seems good; I wonder if that may be a case where some pattern-matching magic is occurring?
  • compare instructions are also at good 2x slower ballpark, even though some of them are scalarized.

In general it looks like one can expect about "twice as slow" SIMD performance from Wasm compared to native, but still quite a bit faster than scalar Wasm.

There is likely to exist some amount of noise in the numbers; they are not variance-controlled.
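
To make the intrinsic mapping concrete, here is a minimal sketch of how such SSE1 intrinsics can be emulated on top of wasm_simd128.h (simplified for illustration; the actual xmmintrin.h in this PR differs in details):

    #include <wasm_simd128.h>

    typedef v128_t __m128; /* simplified; the real header uses a float vector type */

    static __inline__ __m128 _mm_setzero_ps(void) {
      /* Ideally this compiles down to a single v128.const. */
      return wasm_f32x4_const(0.f, 0.f, 0.f, 0.f);
    }

    static __inline__ __m128 _mm_and_ps(__m128 __a, __m128 __b) {
      return wasm_v128_and(__a, __b);
    }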

@tlively (Member) commented May 19, 2020

FYI @ngzhian @dtig

  • _mm_setzero_ps is quite fishy; it does not seem like optimizations are kicking in on wasm_f32x4_const(0.f, 0.f, 0.f, 0.f)

This is not surprising: LLVM does not emit v128.const because it has not been implemented by v8 yet.

  • shuffle instruction generation is definitely off. Tried wasm_v32x4_shuffle instead of __builtin_shufflevector, but that was even slower.

This is extremely interesting and something we need to dive into. Do you have a specific example where wasm_v32x4_shuffle is slower than __builtin_shufflevector?

  • min, max, move and store instructions are also in good 2x slower range,

Does that include floating point min and max instructions or just integer min and max?

  • binary operations and, andnot, or, xor on the other hand are definitely not reaching the target, as they are 10x+ slower than native.

@ngzhian, @dtig do you know why this could be?

  • arithmetic add/sub/mul/div are also in comfortable 2x slower range,
  • lack of rcp and rsqrt in Wasm SIMD is quite unfortunate, leading to slow rcp behavior.
  • However, the performance of rsqrt seems good; I wonder if that may be a case where some pattern-matching magic is occurring?

@ngzhian, @dtig IIRC, we decided not to include rsqrt because it was nondeterministic across platforms. Is that correct?

  • compare instructions are also at good 2x slower ballpark, even though some of them are scalarized.

In general it looks like one can expect about "twice as slow" SIMD performance from Wasm compared to native, but still quite a bit faster than scalar Wasm.

There is likely to exist some amount of noise in the numbers; they are not variance-controlled.

👍

@tlively (Member) commented May 19, 2020

FYI @seanptmaher as well

@juj (Collaborator Author) commented May 19, 2020

Do you have a specific example where wasm_v32x4_shuffle is slower than __builtin_shufflevector?

You can repro the slowness with the python tests/benchmark_sse1.py script in this repo, replacing the __builtin_shufflevector with wasm_v32x4_shuffle at https://github.com/emscripten-core/emscripten/pull/11193/files#diff-0bb71a2bb9518b5d5dade153d5132a0bR139

Does that include floating point min and max instructions or just integer min and max?

That includes only SSE1 min and max, as benchmarked via the script. Integer min and max come later with SSE2.
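
For reference, the two spellings being compared look roughly like this, using the reversal shuffle from _mm_loadr_ps as an example (a sketch, reusing the simplified __m128 typedef from above):

    /* Relies on the vector type having four lanes: */
    static __inline__ __m128 reverse_builtin(__m128 __v) {
      return __builtin_shufflevector(__v, __v, 3, 2, 1, 0);
    }

    /* The same shuffle via the wasm intrinsic, which measured even slower here: */
    static __inline__ __m128 reverse_intrinsic(__m128 __v) {
      return wasm_v32x4_shuffle(__v, __v, 3, 2, 1, 0);
    }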

@juj (Collaborator Author) commented May 19, 2020

Hmm, for some reason the test FAIL: test_archive_duplicate_basenames (test_other.other) failed on CI; I cannot reproduce it locally.

@sbc100 (Collaborator) commented May 19, 2020

Hmm, for some reason the test FAIL: test_archive_duplicate_basenames (test_other.other) failed on CI; I cannot reproduce it locally.

I've seen that test flake a few times and I've tried to investigate, but I really can't figure it out. It seems perfectly deterministic. Can you try re-running to make sure it is a flake?

@ngzhian (Collaborator) commented May 20, 2020

I looked at _mm_and_ps for a start. I trimmed down benchmark_sse1.cpp to just _mm_and_ps to narrow down the issue, and also to get more sizable generated code. The wasm file is large, so I've uploaded the text file. In particular, line 6741 seems to be where the core of the _mm_and_ps loop is, and it has the constant N=16*1024*1024 there.
I'm not sure why there is so much extra code.

bench.wat.txt

@juj (Collaborator Author) commented May 20, 2020

Looks like CI is self-isolating today. Tried to rebase onto latest master, but that still gives network errors on icu and something else on flake8.

@juj (Collaborator Author) commented May 20, 2020

In particular, line 6741 seems to be where the core of the _mm_and_ps loop is, and it has the constant N=16*1024*1024 there.

The code there looks like this:

    loop  ;; label = @1
      local.get 12
      local.get 11
      v128.and
      local.set 11
      local.get 15
      local.get 14
      v128.and
      local.set 14
      local.get 16
      local.get 13
      v128.and
      local.set 13
      local.get 17
      local.get 10
      v128.and
      local.set 10
      local.get 0
      i32.const 4
      i32.add
      local.tee 0
      i32.const 16777216
      i32.ne
      br_if 0 (;@1;)
    end

The constant 16777216 is the loop termination condition. The idea is to benchmark the operation this many times and then average, to get the overhead per single call (profiling just one call would be too little work). It comes from const int N = 16*1024*1024; in the source file tests/sse/benchmark_sse1.h.
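
Spelled out, the per-operation estimate works like this (a sketch of the arithmetic; emscripten_get_now returns milliseconds):

    const int N = 16*1024*1024;  /* elements, processed 4 per loop iteration */
    double ms = end - start;     /* time for the N/4 iterations of the op */
    double ns_per_op = ms * 1e6 / (N / 4);  /* average cost of one call */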

However, I am surprised to see the operation v128.and appear four times(!) in the body of the loop. That would make sense only if the compiler had done some partial loop unrolling, but to my eye it does not look like the loop iteration count has been reduced accordingly.

@juj (Collaborator Author) commented May 20, 2020

The C++ code for the section looks like this:

 do { 
  double start = emscripten_get_now();; 
  __m128 o0 = _mm_load_ps(src); 
  __m128 o1 = _mm_load_ps(src2); 
  for(int i = 0; i < N; i += 4) 
    o0 = _mm_and_ps(o0, o1); 
  _mm_store_ps(dst, o0); 
  double end = emscripten_get_now(); 
  ...
} while(0);;

so there should be only one and operation per loop body.

Running wasm-dis on the compiled code (-O3) gives

  (loop $label$3
   (local.set $8
    (v128.and
     (local.get $8)
     (local.get $9)
    )
   )
   (local.set $2
    (i32.lt_u
     (local.get $0)
     (i32.const 16777212)
    )
   )
   (local.set $0
    (i32.add
     (local.get $0)
     (i32.const 4)
    )
   )
   (br_if $label$3
    (local.get $2)
   )
  )

so there is definitely only one v128.and instruction present in the wasm code. The question is then: does V8 amplify that to 4x v128.and operations?

@dtig commented May 20, 2020

Sorry, missed this yesterday. A general note for when you are running native benchmarks: force SSE4.1 as the baseline for a more accurate comparison; otherwise, depending on the machine, the compiler could pick AVX+ codegen, which makes the comparison not exactly accurate.

  • binary operations and, andnot, or, xor on the other hand are definitely not reaching the target, as they are 10x+ slower than native.

@ngzhian, @dtig do you know why this could be?

This is definitely surprising; it sounds like it's falling into the scalar path or generating more code than necessary. The generic logical operations (pand, por, pxor etc.) should be fairly fast.

  • lack of rcp and rsqrt in Wasm SIMD is quite unfortunate, leading to slow rcp behavior.
  • However, the performance of rsqrt seems good; I wonder if that may be a case where some pattern-matching magic is occurring?

@ngzhian, @dtig IIRC, we decided not to include rsqrt because it was nondeterministic across platforms. Is that correct?

This is correct; see the corresponding issue on the SIMD repository: https://github.com/WebAssembly/simd/issues/3.

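For context, without native rcp/rsqrt approximations the SSE1 header has to fall back to full-precision operations. A minimal sketch of such an emulation (again using the simplified __m128 typedef from earlier, not the PR's exact code):

    static __inline__ __m128 _mm_rcp_ps(__m128 __a) {
      /* Full-precision divide stands in for the approximate reciprocal. */
      return wasm_f32x4_div(wasm_f32x4_const(1.f, 1.f, 1.f, 1.f), __a);
    }

    static __inline__ __m128 _mm_rsqrt_ps(__m128 __a) {
      /* sqrt + divide; an engine pattern-matching this sequence could
         explain the surprisingly good rsqrt numbers above. */
      return wasm_f32x4_div(wasm_f32x4_const(1.f, 1.f, 1.f, 1.f),
                            wasm_f32x4_sqrt(__a));
    }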

@dtig commented May 20, 2020

so there is definitely only one v128.and instruction present in the wasm code. The question is then: does V8 amplify that to 4x v128.and operations?

V8 doesn't optimize much over the .wasm file that's generated by the tools. This looks like there's some loop unrolling that's kicking in. V8 will generate code for individual Wasm operations in the generated binary, and doesn't optimize across operation boundaries.

@ngzhian (Collaborator) commented May 20, 2020

so definitely only one v128.and instruction present in the wasm code. The question is then - does V8 amplify that to 4x v128.and operations?

I think the issue here is that emscripten is generating the 4 v128.and instructions that you mentioned in #11193 (comment) when compiling the entire benchmark_sse1.cpp. That code is the scalar portion of the benchmark, which seems wrong.

In the bench.wat.txt I attached, the vector code begins at line 6889, which looks fine.

    local.set 10                                                                                        
    loop  ;; label = @1                                                                                 
      local.get 10                                                                                      
      local.get 11                                                                                      
      v128.and                                                                                          
      local.set 10                                                                                      
      local.get 0                                                                                       
      i32.const 16777212                                                                                
      i32.lt_u                                                                                          
      local.set 2                                                                                       
      local.get 0                                                                                       
      i32.const 4                                                                                       
      i32.add                                                                                           
      local.set 0                                                                                       
      local.get 2                                                                                       
      br_if 0 (;@1;)                                                                                    
    end 

V8 is running whatever wasm code is generated, so I think the issue is with the cpp->wasm step. Like Deepti mentioned, it could be a loop unrolling bug.

@juj (Collaborator Author) commented May 20, 2020

Looks like you are right - wasm-dissing the whole benchmark, I can also see 4x and instructions in the inner loop, but in a small test case, only one gets generated.

It is currently not possible to control enabling or disabling autovectorization; -msimd128 always enables autovectorization.

However given that all tests pass (assuming CI network will now be up), I'll leave such investigation for later. At least the autovectorizer is not generating incorrect code, just slow paths.

@tlively (Member) commented May 20, 2020

It is currently not possible to control enabling or disabling autovectorization; -msimd128 always enables autovectorization.

You should be able to pass -disable-loop-vectorization -disable-slp-vectorization -vectorize-loops=false -vectorize-slp=false manually to disable all vectorization.

@juj (Collaborator Author) commented May 20, 2020

Oh right. I tried that out, but it does not fix the issue; the 4x ands are still there.

@ngzhian (Collaborator) commented May 20, 2020

I haven't looked too deeply, but does this benchmark do multiple runs until it reaches a confidence interval or something?
Based on a local build running just the andps test, the variance can be quite high between runs (scalar is 0 since I didn't run any scalar code).

$ for i in $(seq 1 10); do ~/v8/out/x64.release/d8 --experimental-wasm-simd benchmark_andps.js; done
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.530481 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.540674 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.546813 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.098467 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.537694 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.193417 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.146508 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.545502 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.121117 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.530481 }

@juj (Collaborator Author) commented May 20, 2020

@ngzhian Yeah, no, it doesn't; that is what I meant above by the benchmark not being variance-controlled.
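
A minimal sketch of what variance control could look like in the harness (a hypothetical helper, not part of this PR): repeat each timed section and keep the best run, which is less noise-sensitive than a single measurement:

    #include <emscripten/emscripten.h> /* emscripten_get_now() */

    /* Hypothetical: time fn() `reps` times and report the fastest run in ms. */
    static double best_of(int reps, void (*fn)(void)) {
      double best = 1e30;
      for (int r = 0; r < reps; ++r) {
        double t0 = emscripten_get_now();
        fn();
        double t1 = emscripten_get_now();
        if (t1 - t0 < best) best = t1 - t0;
      }
      return best;
    }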

@ngzhian (Collaborator) commented May 20, 2020

Got it, sorry I missed that point; thanks for clarifying!
Btw, I tried -fno-vectorize (from https://llvm.org/docs/Vectorizers.html#the-loop-vectorizer) and it seems to fix the problem of having 4 v128.and in the scalar code; I only see 1 now.

@juj (Collaborator Author) commented May 21, 2020

Yay, finally CI is green. This is good to review and land now.

@ngzhian (Collaborator) commented May 21, 2020

pmin and pmax opcodes in v8 were wrong after the renumbering; this will be fixed in https://chromium-review.googlesource.com/c/v8/v8/+/2212682

@juj (Collaborator Author) commented May 23, 2020

Hey, if this looks semi-decent, I would love to get this landed; I have a further PR for SSE2 ready to submit that depends on this one.

// Emscripten SIMD support doesn't support MMX/float32x2/__m64.
// However, we support loading and storing 2-vectors, so
// treat "__m64 *" as "void *" for that purpose.
typedef float __m64 __attribute__((__vector_size__(16), __aligned__(16)));
Member

Shouldn't this be __vector_size__(8), __aligned__(8)? I'm not sure how _mm_loadl_pi and _mm_loadh_pi can be correct if they are loading 16 bytes rather than 8.

Collaborator Author

Good point. I don't remember why it was originally a 16-byte thing; in a followup PR that adds SSE2 support this will change to an 8-byte quantity.
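
For reference, the change suggested above would look like this (a sketch; per the reply, the actual change lands in the SSE2 follow-up PR):

    typedef float __m64 __attribute__((__vector_size__(8), __aligned__(8)));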

system/include/SSE/xmmintrin.h (outdated; resolved)
_mm_loadr_ps(const float *__p)
{
  __m128 __v = _mm_load_ps(__p);
  return __builtin_shufflevector(__v, __v, 3, 2, 1, 0);
Member

These __builtin_shufflevectors are breaking the v128_t abstraction and relying on the fact that v128_t is defined to have four elements. Can you add a comment saying that these really should be wasm shuffle intrinsics as soon as the performance problem is figured out? Also, since we are already using types like __f32x4 in this file, it would be good to use them here, too, to make the number of lanes explicit.

Collaborator Author

Let me revisit this in the next PR; I have conflicting changes to these lines in the next SSE2 PR.
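
Sketched out, the suggested rewrite of _mm_loadr_ps using the wasm shuffle intrinsic might look like this (deferred to the SSE2 PR as noted):

    static __inline__ __m128 _mm_loadr_ps(const float *__p) {
      __m128 __v = _mm_load_ps(__p);
      return (__m128)wasm_v32x4_shuffle((v128_t)__v, (v128_t)__v, 3, 2, 1, 0);
    }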

struct __unaligned {
  float __v;
} __attribute__((__packed__, __may_alias__));
((struct __unaligned *)__p)->__v = ((__f32x4)__a)[0];
Member

__f32x4 and friends are not meant to be exported from wasm_simd128.h, but I suppose if you copied the definitions to this file you would get errors about multiple typedefs of the same thing. Maybe they should be macros in wasm_simd128.h so they can be undefined at the end? Anyway, I think this is fine for now but it might be something we want to change in the future.

Collaborator Author

Yeah, I hope it'll be fine - if upstream changes, we should be able to adapt quite easily.

return (__p.__x[0] >> 31)
| ((__p.__x[1] >> 30) & 2)
| ((__p.__x[2] >> 29) & 4)
| ((__p.__x[3] >> 28) & 8);
Member

Have you experimented with the prototyped bitmask instruction for this one?

Collaborator Author

That one does not seem to be available yet (WebAssembly/simd#201). It looks like it would directly implement movemask here, so it would be pretty cool to have.

Added a TODO in the code to remember to look at that when it becomes available.
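
Once that instruction lands, _mm_movemask_ps could plausibly collapse to a single intrinsic call. A hedged sketch, using the intrinsic name the proposal eventually received in wasm_simd128.h (not available at the time of this PR):

    static __inline__ int _mm_movemask_ps(__m128 __a) {
      /* i32x4.bitmask gathers each lane's sign bit into the low 4 bits. */
      return wasm_i32x4_bitmask((v128_t)__a);
    }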

if '*************************' in benchmark_results:
  benchmark_results = benchmark_results[:benchmark_results.find('*************************')].strip()

print benchmark_results
Member

I thought we had switched to Python 3 already? Or is that just a WIP?

Collaborator Author

This script is from the old era, not Python 3, unfortunately. We are still dual Python 2 + Python 3 in Emscripten. I'll have to see about converting this to be Python 3 friendly, although it is not critical for now.

tests/sse/benchmark_sse1.cpp (outdated; resolved)
SS_TEST("_mm_movelh_ps", _mm_movelh_ps(_mm_load_ps(src+i), _mm_load_ps(src2+i)));

SETCHART("movemask");
START(); for(int i = 0; i < N; i += 4) { int movemask = ((unsigned int)src[i] >> 31) | (((unsigned int)src[i+1] >> 30) & 2) | (((unsigned int)src[i+2] >> 29) & 4) | (((unsigned int)src[i+3] >> 28) & 8); dst_int += movemask; } ENDSCALAR(checksum_dst(&dst_int), "scalar movemask");
Member

Is there a reason this is all one line? I think it would be more readable if it were formatted normally. Same with other long lines with multiple statements below.

Collaborator Author

The intent in these benchmark files has been to have each line implement its own instruction in full, to keep the large amount of code in this file manageable. I find it easier to navigate that way overall; expanding these would make the file much more verbose to navigate.

Because each line like this implements the matching SSE code in scalar, I find that it's not that important to have all of that "nice in sight", but rather to hide it, so it feels more convenient to group them like this.
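
For comparison, the movemask one-liner above expands to this (the same code, just formatted conventionally):

    SETCHART("movemask");
    START();
    for (int i = 0; i < N; i += 4) {
      int movemask = ((unsigned int)src[i]   >> 31)
                   | (((unsigned int)src[i+1] >> 30) & 2)
                   | (((unsigned int)src[i+2] >> 29) & 4)
                   | (((unsigned int)src[i+3] >> 28) & 8);
      dst_int += movemask;
    }
    ENDSCALAR(checksum_dst(&dst_int), "scalar movemask");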

tests/sse/benchmark_sse1.h (outdated; resolved)
@juj (Collaborator Author) commented May 25, 2020

Thanks for the review! Addressed the comments; the next SSE2 PR will change __builtin_shufflevector and the __m64 size, to avoid merge conflicts.

@juj juj mentioned this pull request May 25, 2020
@tlively (Member) left a comment

I'll let @kripken merge in case he has any additional comments.

@tlively (Member) commented May 27, 2020

Ah, it looks like there are merge conflicts anyhow @juj

@juj juj merged commit 2509ec9 into emscripten-core:master May 27, 2020
@juj (Collaborator Author) commented May 27, 2020

Merging this in; let's iterate in-tree on further comments.

@nemequ commented May 30, 2020

This duplicates some of what we've been working on in SIMDe. I added a (rather lengthy, sorry) comment to simd-everywhere/simde#86 to try to get a discussion started on how to proceed without duplicating effort. If anyone has an opinion, I'd be interested in hearing it.
