
SSE1 support #11193

Merged: juj merged 15 commits into emscripten-core:master on May 27, 2020

Conversation

@juj (Collaborator) commented May 19, 2020

This PR restores SSE1 support that was nuked in the SIMD.js -> Wasm SIMD transition.

@juj juj added the SIMD label May 19, 2020
@juj juj requested a review from tlively May 19, 2020 11:38
@juj (Collaborator Author) commented May 19, 2020

Microbenchmark results at http://clb.confined.space/dump/wasm_simd_results_sse1.html

@juj (Collaborator Author) commented May 19, 2020

Eyeballing the results suggests:

  • load and set instructions are quite good, within 2x slowdown range,
  • _mm_setzero_ps is quite fishy; it does not seem like optimizations are kicking in on wasm_f32x4_const(0.f, 0.f, 0.f, 0.f)
  • shuffle instruction generation is definitely off. Tried wasm_v32x4_shuffle instead of __builtin_shufflevector, but that was even slower.
  • min, max, move and store instructions are also in good 2x slower range,
  • binary operations and, andnot, or, xor on the other hand are definitely not reaching the target, as they are 10x+ slower than native.
  • arithmetic add/sub/mul/div are also in comfortable 2x slower range,
  • lack of rcp and rsqrt in Wasm SIMD is quite unfortunate, leading to slow rcp behavior.
  • However, the performance of rsqrt seems good; I wonder if that may be a case where some pattern-matching magic is occurring?
  • compare instructions are also at good 2x slower ballpark, even though some of them are scalarized.

In general it looks like one can expect about "twice as slow" SIMD performance from Wasm compared to native, but still quite a bit faster than scalar Wasm.

There is likely to exist some amount of noise in the numbers; they are not variance-controlled.
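
To make the intrinsic mapping concrete, here is a minimal sketch of how such SSE1 intrinsics can be emulated on top of wasm_simd128.h (simplified for illustration; the actual xmmintrin.h in this PR differs in details):

    #include <wasm_simd128.h>

    typedef v128_t __m128; /* simplified; the real header uses a float vector type */

    static __inline__ __m128 _mm_setzero_ps(void) {
      /* Ideally this compiles down to a single v128.const. */
      return wasm_f32x4_const(0.f, 0.f, 0.f, 0.f);
    }

    static __inline__ __m128 _mm_and_ps(__m128 __a, __m128 __b) {
      return wasm_v128_and(__a, __b);
    }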

@tlively (Member) commented May 19, 2020

FYI @ngzhian @dtig

  • _mm_setzero_ps is quite fishy; it does not seem like optimizations are kicking in on wasm_f32x4_const(0.f, 0.f, 0.f, 0.f)

This is not surprising: LLVM does not emit v128.const because it has not been implemented by v8 yet.

  • shuffle instruction generation is definitely off. Tried wasm_v32x4_shuffle instead of __builtin_shufflevector, but that was even slower.

This is extremely interesting and something we need to dive into. Do you have a specific example where wasm_v32x4_shuffle is slower than __builtin_shufflevector?

  • min, max, move and store instructions are also in good 2x slower range,

Does that include floating point min and max instructions or just integer min and max?

  • binary operations and, andnot, or, xor on the other hand are definitely not reaching the target, as they are 10x+ slower than native.

@ngzhian, @dtig do you know why this could be?

  • arithmetic add/sub/mul/div are also in comfortable 2x slower range,
  • lack of rcp and rsqrt in Wasm SIMD is quite unfortunate, leading to slow rcp behavior.
  • However, the performance of rsqrt seems good; I wonder if that may be a case where some pattern-matching magic is occurring?

@ngzhian, @dtig IIRC, we decided not to include rsqrt because it was nondeterministic across platforms. Is that correct?

  • compare instructions are also at good 2x slower ballpark, even though some of them are scalarized.

In general it looks like one can expect about "twice as slow" SIMD performance from Wasm compared to native, but still quite a bit faster than scalar Wasm.

There is likely to exist some amount of noise in the numbers; they are not variance-controlled.

👍

@tlively (Member) commented May 19, 2020

FYI @seanptmaher as well

@juj (Collaborator Author) commented May 19, 2020

Do you have a specific example where wasm_v32x4_shuffle is slower than __builtin_shufflevector?

You can repro the slowness with the python tests/benchmark_sse1.py script in this repo, replacing the __builtin_shufflevector with wasm_v32x4_shuffle at https://github.com/emscripten-core/emscripten/pull/11193/files#diff-0bb71a2bb9518b5d5dade153d5132a0bR139

Does that include floating point min and max instructions or just integer min and max?

That includes only SSE1 min and max, as benchmarked via the script. Integer min and max come later with SSE2.
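
For reference, the two spellings being compared look roughly like this, using the reversal shuffle from _mm_loadr_ps as an example (a sketch, reusing the simplified __m128 typedef from above):

    /* Relies on the vector type having four lanes: */
    static __inline__ __m128 reverse_builtin(__m128 __v) {
      return __builtin_shufflevector(__v, __v, 3, 2, 1, 0);
    }

    /* The same shuffle via the wasm intrinsic, which measured even slower here: */
    static __inline__ __m128 reverse_intrinsic(__m128 __v) {
      return wasm_v32x4_shuffle(__v, __v, 3, 2, 1, 0);
    }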

@juj (Collaborator Author) commented May 19, 2020

Hmm, for some reason the test FAIL: test_archive_duplicate_basenames (test_other.other) failed on CI; I cannot reproduce it locally.

@sbc100 (Collaborator) commented May 19, 2020

Hmm, for some reason the test FAIL: test_archive_duplicate_basenames (test_other.other) failed on CI; I cannot reproduce it locally.

I've seen that test flake a few times and I've tried to investigate, but I really can't figure it out. It seems perfectly deterministic. Can you try re-running to make sure it is a flake?

@ngzhian (Collaborator) commented May 20, 2020

I looked at _mm_and_ps for a start. I trimmed down benchmark_sse1.cpp to just _mm_and_ps to narrow down the issue, and also to get more sizable generated code. The wasm file is large, so I've uploaded the text file. In particular, line 6741 seems to be where the core of the _mm_and_ps loop is, and it has the constant N=16*1024*1024 there.
I'm not sure why there is so much extra code.

bench.wat.txt

@juj (Collaborator Author) commented May 20, 2020

Looks like CI is self-isolating today. Tried to rebase onto latest master, but that still gives network errors on icu and something else on flake8.

@juj (Collaborator Author) commented May 20, 2020

In particular, line 6741 seems to be where the core of the _mm_and_ps loop is, and it has the constant N=16*1024*1024 there.

The code there looks like this:

    loop  ;; label = @1
      local.get 12
      local.get 11
      v128.and
      local.set 11
      local.get 15
      local.get 14
      v128.and
      local.set 14
      local.get 16
      local.get 13
      v128.and
      local.set 13
      local.get 17
      local.get 10
      v128.and
      local.set 10
      local.get 0
      i32.const 4
      i32.add
      local.tee 0
      i32.const 16777216
      i32.ne
      br_if 0 (;@1;)
    end

The constant 16777216 is the loop termination condition. The idea is to benchmark the operation this many times and then average, to get the overhead per single call (profiling just one call would be too little work). It comes from const int N = 16*1024*1024; in the source file tests/sse/benchmark_sse1.h.
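
Spelled out, the per-operation estimate works like this (a sketch of the arithmetic; emscripten_get_now returns milliseconds):

    const int N = 16*1024*1024;  /* elements, processed 4 per loop iteration */
    double ms = end - start;     /* time for the N/4 iterations of the op */
    double ns_per_op = ms * 1e6 / (N / 4);  /* average cost of one call */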

However, I am surprised to see the operation v128.and appear four times(!) in the body of the loop. That would make sense only if the compiler had done some partial loop unrolling, but to my eye it does not look like the loop iteration count has been reduced accordingly.

@juj (Collaborator Author) commented May 20, 2020

The C++ code for the section looks like this:

 do { 
  double start = emscripten_get_now();; 
  __m128 o0 = _mm_load_ps(src); 
  __m128 o1 = _mm_load_ps(src2); 
  for(int i = 0; i < N; i += 4) 
    o0 = _mm_and_ps(o0, o1); 
  _mm_store_ps(dst, o0); 
  double end = emscripten_get_now(); 
  ...
} while(0);;

so there should be only one and operation per loop body.

Running wasm-dis on the compiled code (-O3) gives

  (loop $label$3
   (local.set $8
    (v128.and
     (local.get $8)
     (local.get $9)
    )
   )
   (local.set $2
    (i32.lt_u
     (local.get $0)
     (i32.const 16777212)
    )
   )
   (local.set $0
    (i32.add
     (local.get $0)
     (i32.const 4)
    )
   )
   (br_if $label$3
    (local.get $2)
   )
  )

so there is definitely only one v128.and instruction present in the wasm code. The question is then: does V8 amplify that to 4x v128.and operations?

@dtig commented May 20, 2020

Sorry, missed this yesterday. A general note for when you are running native benchmarks: force SSE4.1 as the baseline for a more accurate comparison; otherwise, depending on the machine, the compiler could pick AVX+ codegen, which makes the comparison not exactly accurate.

  • binary operations and, andnot, or, xor on the other hand are definitely not reaching the target, as they are 10x+ slower than native.

@ngzhian, @dtig do you know why this could be?

This is definitely surprising; it sounds like it's falling into the scalar path or generating more code than necessary. The generic logical operations (pand, por, pxor etc.) should be fairly fast.

  • lack of rcp and rsqrt in Wasm SIMD is quite unfortunate, leading to slow rcp behavior.
  • However, the performance of rsqrt seems good; I wonder if that may be a case where some pattern-matching magic is occurring?

@ngzhian, @dtig IIRC, we decided not to include rsqrt because it was nondeterministic across platforms. Is that correct?

This is correct; see the corresponding issue on the SIMD repository: https://github.com/WebAssembly/simd/issues/3.

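For context, without native rcp/rsqrt approximations the SSE1 header has to fall back to full-precision operations. A minimal sketch of such an emulation (again using the simplified __m128 typedef from earlier, not the PR's exact code):

    static __inline__ __m128 _mm_rcp_ps(__m128 __a) {
      /* Full-precision divide stands in for the approximate reciprocal. */
      return wasm_f32x4_div(wasm_f32x4_const(1.f, 1.f, 1.f, 1.f), __a);
    }

    static __inline__ __m128 _mm_rsqrt_ps(__m128 __a) {
      /* sqrt + divide; an engine pattern-matching this sequence could
         explain the surprisingly good rsqrt numbers above. */
      return wasm_f32x4_div(wasm_f32x4_const(1.f, 1.f, 1.f, 1.f),
                            wasm_f32x4_sqrt(__a));
    }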

@dtig commented May 20, 2020

so there is definitely only one v128.and instruction present in the wasm code. The question is then: does V8 amplify that to 4x v128.and operations?

V8 doesn't optimize much over the .wasm file that's generated by the tools. This looks like there's some loop unrolling that's kicking in. V8 will generate code for individual Wasm operations in the generated binary, and doesn't optimize across operation boundaries.

@ngzhian (Collaborator) commented May 20, 2020

so definitely only one v128.and instruction present in the wasm code. The question is then - does V8 amplify that to 4x v128.and operations?

I think the issue here is that emscripten is generating the 4 v128.and instructions that you mentioned in #11193 (comment) when compiling the entire benchmark_sse1.cpp. That code is the scalar portion of the benchmark, which seems wrong.

In the bench.wat.txt I attached, the vector code begins at line 6889, which looks fine.

    local.set 10                                                                                        
    loop  ;; label = @1                                                                                 
      local.get 10                                                                                      
      local.get 11                                                                                      
      v128.and                                                                                          
      local.set 10                                                                                      
      local.get 0                                                                                       
      i32.const 16777212                                                                                
      i32.lt_u                                                                                          
      local.set 2                                                                                       
      local.get 0                                                                                       
      i32.const 4                                                                                       
      i32.add                                                                                           
      local.set 0                                                                                       
      local.get 2                                                                                       
      br_if 0 (;@1;)                                                                                    
    end 

V8 is running whatever wasm code is generated, so I think the issue is with the cpp->wasm step. Like Deepti mentioned, it could be a loop unrolling bug.

@juj (Collaborator Author) commented May 20, 2020

Looks like you are right - wasm-dissing the whole benchmark, I can also see 4x and instructions in the inner loop, but in a small test case, only one gets generated.

It is currently not possible to control enabling or disabling autovectorization; -msimd128 always enables autovectorization.

However given that all tests pass (assuming CI network will now be up), I'll leave such investigation for later. At least the autovectorizer is not generating incorrect code, just slow paths.

@tlively (Member) commented May 20, 2020

It is currently not possible to control enabling or disabling autovectorization; -msimd128 always enables autovectorization.

You should be able to pass -disable-loop-vectorization -disable-slp-vectorization -vectorize-loops=false -vectorize-slp=false manually to disable all vectorization.

@juj (Collaborator Author) commented May 20, 2020

Oh right. I tried that out, but it does not fix the issue; the 4x ands are still there.

@ngzhian (Collaborator) commented May 20, 2020

I haven't looked too deeply, but does this benchmark do multiple runs until it reaches a confidence interval or something?
Based on a local build running just the andps test, the variance can be quite high between runs (scalar is 0 since I didn't run any scalar code).

$ for i in $(seq 1 10); do ~/v8/out/x64.release/d8 --experimental-wasm-simd benchmark_andps.js; done
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.530481 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.540674 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.546813 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.098467 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.537694 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.193417 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.146508 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.545502 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.121117 }
,{ "chart": "chart", "category": "andps", "scalar": 0.000000, "simd": 0.530481 }

@juj (Collaborator Author) commented May 20, 2020

@ngzhian Yeah, no, it doesn't; that is what I meant above by the benchmark not being variance-controlled.
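
A minimal sketch of what variance control could look like in the harness (a hypothetical helper, not part of this PR): repeat each timed section and keep the best run, which is less noise-sensitive than a single measurement:

    #include <emscripten/emscripten.h> /* emscripten_get_now() */

    /* Hypothetical: time fn() `reps` times and report the fastest run in ms. */
    static double best_of(int reps, void (*fn)(void)) {
      double best = 1e30;
      for (int r = 0; r < reps; ++r) {
        double t0 = emscripten_get_now();
        fn();
        double t1 = emscripten_get_now();
        if (t1 - t0 < best) best = t1 - t0;
      }
      return best;
    }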

@ngzhian (Collaborator) commented May 20, 2020

Got it, sorry I missed that point; thanks for clarifying!
Btw, I tried -fno-vectorize (from https://llvm.org/docs/Vectorizers.html#the-loop-vectorizer) and it seems to fix the problem of having 4 v128.and in the scalar code; I only see 1 now.

@juj (Collaborator Author) commented May 21, 2020

Yay, finally CI is green. This is good to review and land now.

@ngzhian (Collaborator) commented May 21, 2020

pmin and pmax opcodes in v8 were wrong after the renumbering; this will be fixed in https://chromium-review.googlesource.com/c/v8/v8/+/2212682

@juj (Collaborator Author) commented May 23, 2020

Hey, if this looks semi-decent, I would love to get this landed; I have a further PR for SSE2 ready to submit that depends on this one.

// Emscripten SIMD support doesn't support MMX/float32x2/__m64.
// However, we support loading and storing 2-vectors, so
// treat "__m64 *" as "void *" for that purpose.
typedef float __m64 __attribute__((__vector_size__(16), __aligned__(16)));
Member

Shouldn't this be __vector_size__(8), __aligned__(8)? I'm not sure how _mm_loadl_pi and _mm_loadh_pi can be correct if they are loading 16 bytes rather than 8.

Collaborator Author

Good point. I don't remember why it was originally a 16-byte thing; in a followup PR that adds SSE2 support this will change to an 8-byte quantity.
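
For reference, the change suggested above would look like this (a sketch; per the reply, the actual change lands in the SSE2 follow-up PR):

    typedef float __m64 __attribute__((__vector_size__(8), __aligned__(8)));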

system/include/SSE/xmmintrin.h (outdated; resolved)
_mm_loadr_ps(const float *__p)
{
  __m128 __v = _mm_load_ps(__p);
  return __builtin_shufflevector(__v, __v, 3, 2, 1, 0);
Member

These __builtin_shufflevectors are breaking the v128_t abstraction and relying on the fact that v128_t is defined to have four elements. Can you add a comment saying that these really should be wasm shuffle intrinsics as soon as the performance problem is figured out? Also, since we are already using types like __f32x4 in this file, it would be good to use them here, too, to make the number of lanes explicit.

Collaborator Author

Let me revisit this in the next PR; I have conflicting changes to these lines in the next SSE2 PR.
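
Sketched out, the suggested rewrite of _mm_loadr_ps using the wasm shuffle intrinsic might look like this (deferred to the SSE2 PR as noted):

    static __inline__ __m128 _mm_loadr_ps(const float *__p) {
      __m128 __v = _mm_load_ps(__p);
      return (__m128)wasm_v32x4_shuffle((v128_t)__v, (v128_t)__v, 3, 2, 1, 0);
    }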

struct __unaligned {
  float __v;
} __attribute__((__packed__, __may_alias__));
((struct __unaligned *)__p)->__v = ((__f32x4)__a)[0];
Member

__f32x4 and friends are not meant to be exported from wasm_simd128.h, but I suppose if you copied the definitions to this file you would get errors about multiple typedefs of the same thing. Maybe they should be macros in wasm_simd128.h so they can be undefined at the end? Anyway, I think this is fine for now but it might be something we want to change in the future.

Collaborator Author

Yeah, I hope it'll be fine - if upstream changes, we should be able to adapt quite easily.

return (__p.__x[0] >> 31)
| ((__p.__x[1] >> 30) & 2)
| ((__p.__x[2] >> 29) & 4)
| ((__p.__x[3] >> 28) & 8);
Member

Have you experimented with the prototyped bitmask instruction for this one?

Collaborator Author

That one does not seem to be available yet (WebAssembly/simd#201). It looks like it would directly implement movemask here, so it would be pretty cool to have.

Added a TODO in the code to remember to look at that when it becomes available.
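
Once that instruction lands, _mm_movemask_ps could plausibly collapse to a single intrinsic call. A hedged sketch, using the intrinsic name the proposal eventually received in wasm_simd128.h (not available at the time of this PR):

    static __inline__ int _mm_movemask_ps(__m128 __a) {
      /* i32x4.bitmask gathers each lane's sign bit into the low 4 bits. */
      return wasm_i32x4_bitmask((v128_t)__a);
    }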

if '*************************' in benchmark_results:
  benchmark_results = benchmark_results[:benchmark_results.find('*************************')].strip()

print benchmark_results
Member

I thought we had switched to Python 3 already? Or is that just a WIP?

Collaborator Author

This script is from the old era, not Python 3, unfortunately. We are still dual Python 2 + Python 3 in Emscripten. I'll have to see about converting this to be Python 3 friendly, although it is not critical for now.

tests/sse/benchmark_sse1.cpp (outdated; resolved)
SS_TEST("_mm_movelh_ps", _mm_movelh_ps(_mm_load_ps(src+i), _mm_load_ps(src2+i)));

SETCHART("movemask");
START(); for(int i = 0; i < N; i += 4) { int movemask = ((unsigned int)src[i] >> 31) | (((unsigned int)src[i+1] >> 30) & 2) | (((unsigned int)src[i+2] >> 29) & 4) | (((unsigned int)src[i+3] >> 28) & 8); dst_int += movemask; } ENDSCALAR(checksum_dst(&dst_int), "scalar movemask");
Member

Is there a reason this is all one line? I think it would be more readable if it were formatted normally. Same with other long lines with multiple statements below.

Collaborator Author

The intent in these benchmark files has been to have each line implement its own instruction in full, to keep the large amount of code in this file manageable. I find it easier to navigate that way overall; expanding these would make the file much more verbose to navigate.

Because each line like this implements the matching SSE code in scalar, I find that it's not that important to have all of that "nice in sight", but rather to hide it, so it feels more convenient to group them like this.
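
For comparison, the movemask one-liner above expands to this (the same code, just formatted conventionally):

    SETCHART("movemask");
    START();
    for (int i = 0; i < N; i += 4) {
      int movemask = ((unsigned int)src[i]   >> 31)
                   | (((unsigned int)src[i+1] >> 30) & 2)
                   | (((unsigned int)src[i+2] >> 29) & 4)
                   | (((unsigned int)src[i+3] >> 28) & 8);
      dst_int += movemask;
    }
    ENDSCALAR(checksum_dst(&dst_int), "scalar movemask");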

tests/sse/benchmark_sse1.h (outdated; resolved)
@juj (Collaborator Author) commented May 25, 2020

Thanks for the review! Addressed the comments; the next SSE2 PR will change __builtin_shufflevector and the __m64 size, to avoid merge conflicts.

@juj juj mentioned this pull request May 25, 2020
@tlively (Member) left a comment

I'll let @kripken merge in case he has any additional comments.

@tlively (Member) commented May 27, 2020

Ah, it looks like there are merge conflicts anyhow @juj

@juj juj merged commit 2509ec9 into emscripten-core:master May 27, 2020
@juj (Collaborator Author) commented May 27, 2020

Merging this in; let's iterate in-tree on further comments.

@nemequ commented May 30, 2020

This duplicates some of what we've been working on in SIMDe. I added a (rather lengthy, sorry) comment to simd-everywhere/simde#86 to try to get a discussion started on how to proceed without duplicating effort. If anyone has an opinion, I'd be interested in hearing it.
