
Add WebAssembly SIMD128 implementation and Node.JS support #825

Merged · 8 commits · Jul 10, 2023

Conversation

@easyaspi314 (Contributor) commented Mar 14, 2023

  • Support WebAssembly SIMD128 via the NEON path
    • The Emscripten SDK includes the SIMDeverywhere polyfill for <arm_neon.h>
    • Since SIMD128 seems to be designed after a subset of NEON, the double-width NEON path maps 1:1 and gets full speed with few modifications
      • Note: XXH3_NEON_LANES != 8 is bad, since SIMDe scalarizes half-width vectors
    • On v8, about 2x faster than scalar on AArch64, 3-4x faster on x86_64
    • Firefox seems to be slightly faster on x86, slightly slower on aarch64
  • Adds a few fixes for compiling with Node.JS (make NODE_JS=1)
    • Requires linking nodefs and noderawfs to access local files
    • Hooks into node.js directly using inline JavaScript to reach tty.isatty() from Node, instead of the broken isatty() in libc (which always returns 1 for stdin/stdout/stderr)
  • Adds wasm/asmjs to the welcome message
  • Removes emcc output files on make clean
  • Adds basic CI tests for both WASM and asm.js (the latter of which is painfully slow lol)
  • Also, since I needed to modify XXH3_scrambleAcc_neon, I found a stupidly obvious optimization to the standard NEON path.

Because WebAssembly is JIT-compiled and JavaScript timer precision is deliberately limited (Spectre mitigations), results may vary.

NEON gets almost native speed on v8, but x86_64 does not, mainly because the instruction sets don't match up exactly. Firefox's aarch64 JIT seems to be considerably slower when it comes to WASM.

xxhsum -b on an AArch64 Google Tensor G1 (Cortex-X1 @ 2.80 GHz):

| Browser/engine | Scalar XXH3_64b | Scalar XXH128 | WASM128 XXH3_64b | WASM128 XXH128 |
|---|---|---|---|---|
| Node 18.13 / v8 10.2.154 (termux) | 8421.7 MB/s | 8161.7 MB/s | 16485.4 MB/s | 16355.0 MB/s |
| Firefox Nightly for Android 112.0a1 | 7850.5 MB/s | 7870.8 MB/s | 6148.4 MB/s | 5861.7 MB/s |

x86_64 (with AVX2) Core i7-8650U @ 2.11 GHz (turbo 4.20 GHz):

| Browser/engine | Scalar XXH3_64b | Scalar XXH128 | WASM128 XXH3_64b | WASM128 XXH128 |
|---|---|---|---|---|
| Node 18.14 / v8 10.2.154 (mingw64) | 3605.8 MB/s | 3473.8 MB/s | 9658.2 MB/s | 11535.5 MB/s |
| Firefox for Windows 110.0.1 | 6406.9 MB/s | 6004.3 MB/s | 9249.1 MB/s | 9288.8 MB/s |

(Not sure why XXH128 is that much faster on x86 but not ARM. Maybe code alignment memes?)

@easyaspi314 easyaspi314 changed the title Add preliminary WebAssembly SIMD128 implementation [WIP] Add WebAssembly SIMD128 implementation Mar 14, 2023
@Cyan4973 (Owner) commented Mar 14, 2023

The added source code is pretty clean; this is easy to review.

It seems you want to test it a little longer?
In that case, could you mark this PR as a draft, so I don't wonder whether it needs to be merged now or later?

@easyaspi314 easyaspi314 marked this pull request as draft March 14, 2023 21:24
@easyaspi314 (Author)

So apparently marking it as a draft cancels CI tests...

@easyaspi314 easyaspi314 marked this pull request as ready for review March 14, 2023 21:25
@easyaspi314 easyaspi314 marked this pull request as draft March 14, 2023 21:37
@easyaspi314 easyaspi314 marked this pull request as ready for review March 14, 2023 21:37
@easyaspi314 easyaspi314 force-pushed the wasm128 branch 3 times, most recently from e5ceb21 to 23bca55 Compare March 14, 2023 21:55
@easyaspi314 easyaspi314 changed the title [WIP] Add WebAssembly SIMD128 implementation Add WebAssembly SIMD128 implementation and Node.JS support Mar 14, 2023
@easyaspi314 (Author)

Ok, I think I figured everything out. I want to add various versions of the Emscripten SDK to the matrix, but that isn't a huge priority. I mostly want to test the recent node.js versions.

Amusingly, due to the clock() granularity (a result of Spectre mitigations), xxhsum -bi0 has only a few possible results.

@easyaspi314 (Author)

This seems ready to merge; porting the NEON code was dead simple. Sorry, I had to do it the old-fashioned way because CI tests were being ignored.

@Cyan4973 (Owner) left a comment


This looks good to me.
It fits well into the existing code base.

I would recommend publishing a set of benchmarks as part of this PR description, as they can serve as a reference for future users of the WASM128 code path.

@easyaspi314 (Author) commented Mar 14, 2023

> This looks good to me. It fits well into the existing code base.
>
> I would recommend publishing a set of benchmarks as part of this PR description, as they can serve as a reference for future users of the WASM128 code path.

A basic xxhsum -b benchmark is up now.

Also, I managed to get a benchmark on Firefox. It isn't as impressive, but it is still a solid improvement on x86 and only a slight decrease on aarch64.

@easyaspi314 (Author) commented Mar 15, 2023

Actually... I have an idea.

emcc emulates NEON intrinsics with SIMDe, and the intrinsics are so similar that I am considering scrapping the SIMD128 intrinsics and using NEON with some slight modifications.

I actually get the same, if not better, performance if I just replace the multiply in scrambleAcc and put an asm guard on the secret pointer to keep the compiler from turning it into literals.

Screw these intrinsics. I'm putting wasm on XXH_NEON with a couple of ifdefs.

@easyaspi314 (Author) commented Mar 15, 2023

🦀 SIMD128 intrinsics are GONE 🦀

SIMDe generates the same thing, and I was actually able to make it faster, now reaching 16.3 GB/s on my phone (mostly by stopping Clang from constant-folding kSecret).

I also found a stupidly obvious optimization to XXH3_scrambleAcc_neon. 🤦‍♀️

I will get some new benchmarks tomorrow and probably squash.

 - Link in `nodefs` and `noderawfs`
 - Use Node's `tty.isatty()` via inline JS instead of the broken libc `isatty()`
 - Used with `make NODE_JS=1`
Currently only one version of emcc; testing Node 16, 17, and 18.

The cache is used because the emsdk has to cache each library, which takes
a while.
@easyaspi314 (Author) commented Mar 15, 2023

Apparently I can't figure out how git rebase works and my commits doubled lol

Edit: fixed. Aside from the CI test commit there is no trace of the intrinsics now 😏

The Emscripten SDK includes arm_neon.h from SIMDeverywhere to port NEON projects.

Since SIMD128 is very similar to NEON without half vectors, the double NEON path
maps perfectly to WASM SIMD128 and can reach full speed with little modification.
(As a matter of fact I was able to make it slightly faster).

Note that if XXH3_NEON_LANES is not a multiple of 4, SIMDe will scalarize the
single width NEON paths, so 8 is strongly recommended.

Also, I found an optimization to scrambleAcc which should have been obvious.
It JITs to the same thing that slowed down SSE4 and NEON.
@easyaspi314 (Author) commented Mar 15, 2023

Ok, so apparently aarch64 Firefox has slowed down even more.

However, I am not that concerned, since aarch64 is likely going to be running V8 or WebKit anyway (Firefox for Android has something like a 0.6% usage share), and it still runs at a reasonable speed. Looking over the source code, it might be that the compiler can't hoist the shuffle masks out of the loop, but there isn't much I can do about that: shuffle masks are inlined with the instruction. The only users who would probably notice are on ARM Macs and AArch64 Windows/Linux.

Also, I want to see how Safari (WebKit) on iOS fares, since emrun can host a web server. Apparently WebKit doesn't support SIMD.

@easyaspi314 (Author)

The last commit is just documentation.

@Cyan4973 (Owner) commented Jun 30, 2023

Coming back to this topic, trying to re-acquire the context.

Initially, this PR was providing a new vector code path for WASM.
Doing so would improve performance on WASM target, which currently can only use the scalar code path.
This code was initially reviewed, and accepted.

Quickly after, @easyaspi314 discovered that the WASM compiler emcc could actually ingest NEON intrinsics directly, resulting in similar performance, but removing the need for additional dedicated code path, which is good for maintenance.
The PR was then substantially altered to take advantage of this property, resulting in a generally shortened implementation.

Then, it was discovered that the "similar performance" statement (relative to dedicated WASM SIMD128 intrinsics) is not entirely correct.
It holds when running under node.js on an aarch64 target.
But on x64, some performance is lost (compared to WASM SIMD128 intrinsics), because NEON intrinsics are different enough from AVX ones. The counterpoint is that using NEON intrinsics for WASM still delivers a hefty speed benefit compared to scalar, so the PR is still a net improvement.
Another problematic issue is that Firefox for Android actually suffers from the new NEON intrinsic code path, even though the local target is aarch64, to the point of being slower than the scalar code path, making this PR no longer a "guaranteed positive". The counterpoint is that Firefox probably has a negligible presence on aarch64.
It would therefore have been appropriate to complete the picture with benchmarks from another engine (Chrome, maybe?) presumed more representative of the aarch64 scenario.

Anyway, this is what I recall from memory, and since this PR's content has changed substantially since it was first validated, it deserves another review.

@easyaspi314 , is that a correct assessment of the situation ?

@easyaspi314 (Author) commented Jun 30, 2023

Node.JS and Chrome use the same JavaScript engine, V8.

| Browser | Render engine | JS engine |
|---|---|---|
| Chrome, Chromium | Chromium | V8 |
| Edge | Chromium | V8 |
| Opera, Brave, etc. | Chromium | V8 |
| Node.JS | N/A | V8 |
| Firefox | Gecko | SpiderMonkey |
| Safari | WebKit | JavaScriptCore |

I am not that concerned about the Firefox AArch64 problem:

  1. Firefox on AArch64 is quite uncommon:
     • Apple mandates that all iOS browsers use WebKit, so Firefox isn't using its own engine there
     • Firefox on Android is niche, with only 100M downloads despite being on the Play Store since 2010
  2. The only thing preventing the same 16 GB/s as V8 is a missed optimization we have no control over (see also: MSVC x86). However, browsers are constantly updated, so performance will only improve, even if the binaries aren't recompiled. (Judging from the performance, it looks like SIMD isn't/wasn't ready for AArch64 and was scalarized.)
  3. Since it is so uncommon, I would put it in the same category as the XXH32 AArch64 issue, where we give up maximum performance on specific targets (in this case Firefox, instead of the A53 and X1) for better, more balanced performance on the majority.

I should update and run some new benchmarks on the latest versions, and get a Safari benchmark, as iOS 16 should have SIMD now. (Unfortunately my MBP is out of commission, so Safari x64 might be tricky to bench unless I can find a WebKit build for Windows.)

As for validation: the NEON intrinsics we use are literally a 1:1 translation to SIMD128, aside from the umlal needing an additional add, so it is as optimal as native WASM intrinsics unless I am missing something.

@easyaspi314 (Author) commented Jun 30, 2023

Found that I can run WebKit via Epiphany on Linux.

WebKit x64 isn't doing too hot.

AMD Ryzen 5 3600
Ubuntu 22.04 LTS on WSL2 
Epiphany Technology Preview 45.alpha-2-g08992b26f+
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.4 Safari/605.1.15

scalar

this.program 0.8.1 by Yann Collet
compiled as 32-bit wasm/asmjs little endian with Clang 17.0.0
Sample of 100 KB...
 1#XXH32                         :     102400 ->    62087 it/s ( 6063.2 MB/s)
 3#XXH64                         :     102400 ->   117833 it/s (11507.1 MB/s)
 5#XXH3_64b                      :     102400 ->    79804 it/s ( 7793.4 MB/s)
11#XXH128                        :     102400 ->    77467 it/s ( 7565.1 MB/s)

simd128

this.program 0.8.1 by Yann Collet
compiled as 32-bit wasm/asmjs + simd128 little endian with Clang 17.0.0
Sample of 100 KB...
 1#XXH32                         :     102400 ->    62770 it/s ( 6129.9 MB/s)
 3#XXH64                         :     102400 ->   123634 it/s (12073.7 MB/s)
 5#XXH3_64b                      :     102400 ->    50899 it/s ( 4970.7 MB/s)
11#XXH128                        :     102400 ->    51623 it/s ( 5041.3 MB/s)

Apparently, this is because WebKit converts extmul to a zero-extend + full multiply without a special case, and is then forced to scalarize, because it will only emit an i64x2 multiply if AVX-512 is enabled. (NEON, x64). I might open a PR/issue for this to be fixed because this is pathetic 💀

However, I don't quite understand what is up with Firefox's aarch64 JIT. I might have to boot up my raspi to get at the JIT debugger. (Edit: Really? No Linux aarch64 builds?! Do I seriously have to install tiny11 for this?)

@easyaspi314 (Author) commented Jul 1, 2023

Well I got a build of Firefox's jsshell for AArch64 that can dump native code, but the standard emcc wrappers don't work with it. I might have to write my own WASM glue. 💀

I am truly baffled as to how the performance is worse, since the reason isn't blatant like WebKit's: every instruction should lower directly to its NEON counterpart.

Also, I tried adjusting Firefox's JIT parameters, to no avail. And it isn't the baseline JIT being used, because when I disable the optimizing JIT it drops to 2.9 GB/s.

@Cyan4973 (Owner) commented Jul 5, 2023

So, to summarize the situation:
this PR influences the performance of XXH3 when compiled for WASM;
under Node.JS, it's much faster (2x-3x),
but in a web browser environment, whether Chrome or Firefox,
it's only a bit faster on x64 CPUs (+50%)
and slower on aarch64 CPUs (-20% to -30%),
even though the WASM code path uses NEON intrinsics.

Correct?

@easyaspi314 (Author) commented Jul 6, 2023

> So, to summarize the situation:
> this PR influences the performance of XXH3 when compiled for WASM;
> under Node.JS, it's much faster (2x-3x),
> but in a web browser environment, whether Chrome or Firefox,
> it's only a bit faster on x64 CPUs (+50%)
> and slower on aarch64 CPUs (-20% to -30%),
> even though the WASM code path uses NEON intrinsics.
>
> Correct?

Basically, although it's a little mixed up.

For all intents and purposes, Chrome and Node.JS are the same: I have confirmed that the benchmarks are nearly identical, because they use the same JS engine (V8) under the hood. Chrome pairs V8 with the Chromium HTML renderer; Node.JS pairs V8 with an OS interface.

Rough breakdown of the browser situation:

"Common name" JS Engine Paired HTML engine Uses aarch64 x64 Notes
V8 V8 Chromium Node.js, Chrome, Chromium (Browser), Opera, Brave, Vivaldi, etc much faster faster Most popular, most important
Firefox SpiderMonkey Gecko Firefox, Tor Browser, Pale Moon, etc slower faster AArch64 rare, simple JIT bug?
WebKit JavaScriptCore WebKit Safari, Epiphany slower (speculation) slower SIMD is new, simple JIT bug.

Firefox and Chrome (and derivatives) are on a rapid release cycle; the only time you see an old JS engine is an older version of Node.JS (which doesn't matter, because this is already a performance improvement there), an unsupported OS, or someone who refuses to update. You also rarely run into someone stuck with an old toolchain, because of how emcc is distributed.

WebKit only recently added SIMD128 so it is unsurprising that there are a few kinks. This one is very simple and could be fixed in a few hours once I figure out how to contribute.

Firefox AArch64 is probably also a simple bug, though I can't tell what it is from the source code, given that it should be a direct translation. I mostly need to find a minimal reproducible example of the codegen bug so I can report it.

However, the important thing is that the generated code is optimal at the WASM level, and the rest is a problem with the upstream browsers. I say merge, and then we try to fix the Firefox and WebKit bugs.

Additionally, the most important target is V8, because most browsers are Chromium-based, and Node.JS affects server-side performance, which is very significant on a large server.

@@ -55,6 +55,12 @@ else
EXT =
endif

ifeq ($(NODE_JS),1)
@Cyan4973 (Owner):

This environment variable NODE_JS=1 must be set at compilation time.
It should probably be documented in README.md.

@Cyan4973 (Owner) commented Jul 9, 2023

I'm going to insist a bit regarding the documentation of the new NODE_JS build variable,
since I don't know if I would be able to complete it myself.
I see NODE_JS used in tests, but I don't know whether this variable is meant only for tests, or only for node.js specifically, whether it can be ignored at (presumably) the cost of performance, etc.

@easyaspi314 (Author) commented Jul 10, 2023

Done. I also did things a bit more correctly, added a helpful note to the make check output if it fails, and removed the RUN_ENV override.

NODE_JS adds proper terminal detection (since the default emsdk isatty is a no-op) and, more importantly, gives unrestricted access to the filesystem.

@Cyan4973 Cyan4973 merged commit c234b94 into Cyan4973:dev Jul 10, 2023