Add WebAssembly SIMD128 implementation and Node.JS support #825
Conversation
The added source code is pretty clean; this is easy to review. It seems you want to test it a little longer?
So apparently marking it as a draft cancels CI tests...
(Force-pushed from e5ceb21 to 23bca55.)
Ok, I think I figured everything out. I want to add various versions of the Emscripten SDK to the matrix, but that isn't a huge priority. I mostly want to test the recent Node.js versions. Amusingly, due to the …
This seems to be ready to merge; porting the NEON code was dead simple. Sorry, I had to do it the old-fashioned way because CI tests were ignored.
This looks good to me.
It fits well in the existing code base.
I would recommend publishing a set of benchmarks as part of this PR description, as it can serve as a reference for future users of the WASM SIMD128 code path.
A basic `xxhsum -b` benchmark is up now. Also, I managed to get a bench on Firefox. It isn't as impressive, but it is still a solid improvement on x86 and only a slight decrease on aarch64.
Actually... I have an idea. emcc emulates NEON intrinsics with SIMDe, and I literally get the same output. Screw these intrinsics, I'm putting wasm on the NEON code path.
🦀 SIMD128 intrinsics are GONE 🦀 SIMDe generates the same thing, and I was actually able to make it faster, now reaching 16.3 GB/s on my phone (mostly due to stopping Clang from constant folding kSecret). I also found a stupidly obvious optimization to scrambleAcc. I will get some new benchmarks tomorrow and probably squash.
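For reference, a minimal sketch of the kind of optimizer barrier meant by "stopping Clang from constant folding kSecret". The helper name below is made up and this is not the PR's actual code; the empty inline-asm trick is just the usual way to make a pointer opaque to the optimizer:

```c
/* Hypothetical helper (illustration only): an empty inline-asm statement
 * with a "+r" constraint forces the compiler to treat p as unknown, so it
 * cannot assume p still points at a compile-time constant and fold the
 * secret bytes into immediates. No actual instructions are emitted. */
static const unsigned char* hide_from_optimizer(const unsigned char* p)
{
    __asm__("" : "+r"(p));
    return p;
}
```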
- Link in `nodefs` and `noderawfs`
- Use Node's `tty.isatty()` via inline JS instead of the broken libc `isatty()`
- Used with `make NODE_JS=1`
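As an illustration of the `isatty()` point above, here is a minimal sketch using Emscripten's `EM_JS` macro. The function name is made up and this is not necessarily the exact code in the commit; it only shows the general shape of asking Node directly instead of relying on the libc stub:

```c
#include <emscripten.h>

/* Hypothetical helper: ask Node's tty module whether fd is a terminal.
 * Falls back to 0 when not running under Node.js (no require available). */
EM_JS(int, node_isatty, (int fd), {
    if (typeof require !== 'undefined') {
        return require('tty').isatty(fd) ? 1 : 0;
    }
    return 0;
});
```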
Currently only one version of emcc, testing Node 16, 17, and 18. A cache is used because the emsdk has to build and cache each system library, which takes a bit.
Apparently I can't figure out how … Edit: fixed. Aside from the CI test commit, there is no trace of the intrinsics now 😏
The Emscripten SDK includes `arm_neon.h` from SIMD Everywhere (SIMDe) to port NEON projects. Since SIMD128 is very similar to NEON without half vectors, the double-width NEON path maps perfectly to WASM SIMD128 and can reach full speed with little modification. (As a matter of fact, I was able to make it slightly faster.) Note that if `XXH3_NEON_LANES` is not a multiple of 4, SIMDe will scalarize the single-width NEON paths, so 8 is strongly recommended. Also, I found an optimization to scrambleAcc which should have been obvious.
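To make the `XXH3_NEON_LANES` recommendation concrete, a build could pin it at compile time for wasm targets. This is a sketch under the assumption stated above (SIMDe scalarizes half-width vectors), not necessarily how the PR enforces it:

```c
/* Sketch of a compile-time guard for wasm builds: keep XXH3_NEON_LANES a
 * multiple of 4 (8 recommended) so SIMDe only ever sees full-width vectors. */
#if defined(__wasm_simd128__)
#  ifndef XXH3_NEON_LANES
#    define XXH3_NEON_LANES 8
#  elif (XXH3_NEON_LANES % 4) != 0
#    error "XXH3_NEON_LANES should be a multiple of 4 on wasm: SIMDe scalarizes half vectors"
#  endif
#endif
```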
It JITs to the same thing that slowed down SSE4 and NEON.
Ok, so apparently aarch64 Firefox has slowed down even more. However, I am not that concerned, since aarch64 is likely going to be using V8 or WebKit anyway (Firefox for Android has something like a 0.6% usage share lol), and it is still a reasonable speed. Looking over the source code, it might be that the compiler can't hoist the shuffle masks out of the loop, but there isn't much I can do, since shuffle masks are inlined with the instruction. The only ones that would probably use it are ARM Macs and AArch64 Windows/Linux users.
The last commit is just documentation.
Coming back to this topic, trying to re-acquire the context. Initially, this PR was providing a new vector code path for WASM. Quickly after, @easyaspi314 discovered that the WASM compiler (emcc) already translates the existing NEON intrinsics to SIMD128 through SIMDe, so the dedicated SIMD128 intrinsics were dropped in favor of the NEON code path. Then, it was discovered that the "similar performance" (compared to hand-written WASM SIMD128 intrinsics) statement is not entirely correct on every engine. Anyway, this is what I recall from memory, and since this PR's content has changed substantially since it was first validated, it deserves another review. @easyaspi314, is that a correct assessment of the situation?
V8 is Chrome's JavaScript engine, and Node.js uses it as well.
I am not that concerned about the Firefox AArch64 problem.
I should update and run some new benchmarks on the latest versions, as well as get a Safari benchmark, since iOS 16 should have SIMD now. (Unfortunately, my MBP is out of commission, so Safari x64 might be tricky to bench unless I can find a WebKit build for Windows.) As for validation, the NEON intrinsics we use are literally a 1:1 translation to SIMD128, aside from the umlal needing an additional add, so it is as optimal as hand-written wasm intrinsics unless I am missing something.
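To make the "1:1 aside from umlal" point concrete, here is a simplified side-by-side sketch (not the PR's actual code; the helper name is invented): NEON's `vmlal_u32` multiplies and accumulates in a single instruction, while WASM SIMD128 only has a widening multiply, so the accumulate becomes a separate add:

```c
#if defined(__ARM_NEON)
#  include <arm_neon.h>
/* acc[i] += (uint64_t)lo[i] * hi[i] for both lanes, as a single umlal */
static uint64x2_t mul_acc(uint64x2_t acc, uint32x2_t lo, uint32x2_t hi)
{
    return vmlal_u32(acc, lo, hi);
}
#elif defined(__wasm_simd128__)
#  include <wasm_simd128.h>
/* Same operation: widening multiply of the low u32 lanes, then an explicit add */
static v128_t mul_acc(v128_t acc, v128_t lo, v128_t hi)
{
    v128_t prod = wasm_u64x2_extmul_low_u32x4(lo, hi);
    return wasm_i64x2_add(acc, prod);
}
#endif
```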
Found that I can run WebKit via Epiphany on Linux. WebKit x64 isn't doing too hot.
scalar:

```
this.program 0.8.1 by Yann Collet
compiled as 32-bit wasm/asmjs little endian with Clang 17.0.0
Sample of 100 KB...
 1#XXH32     :  102400 ->  62087 it/s ( 6063.2 MB/s)
 3#XXH64     :  102400 -> 117833 it/s (11507.1 MB/s)
 5#XXH3_64b  :  102400 ->  79804 it/s ( 7793.4 MB/s)
11#XXH128    :  102400 ->  77467 it/s ( 7565.1 MB/s)
```

simd128:

```
this.program 0.8.1 by Yann Collet
compiled as 32-bit wasm/asmjs + simd128 little endian with Clang 17.0.0
Sample of 100 KB...
 1#XXH32     :  102400 ->  62770 it/s ( 6129.9 MB/s)
 3#XXH64     :  102400 -> 123634 it/s (12073.7 MB/s)
 5#XXH3_64b  :  102400 ->  50899 it/s ( 4970.7 MB/s)
11#XXH128    :  102400 ->  51623 it/s ( 5041.3 MB/s)
```

Apparently, this is because WebKit converts … However, I don't quite understand what is up with Firefox's aarch64 JIT. I might have to boot up my raspi so I can get the JIT debugger. (Edit: Really? No Linux aarch64 builds?! Do I seriously have to install tiny11 for this?)
Well, I got a build of Firefox's jsshell for AArch64 that can dump native code, but the standard emcc wrappers don't work with it, so I might have to write my own WASM glue. 💀 I am truly baffled as to how the performance is worse, since the reason isn't blatant like WebKit's: all instructions should be lowered to their direct NEON counterparts. I also tried adjusting the JIT parameters on Firefox, to no avail. And it isn't the baseline JIT being used, because when I disable the optimizing JIT it is only 2.9 GB/s.
So, to summarize the situation, … Correct?
Basically, although a little mixed up. For all intents and purposes, Chrome and Node.js are the same: I have confirmed that the benchmarks are nearly identical because they use the same JS engine (V8) under the hood. Chrome pairs V8 with the Chromium HTML renderer, while Node.js pairs V8 with an OS interface. Rough map of the browser situation:

- V8: Chrome, Edge, other Chromium derivatives, and Node.js
- SpiderMonkey: Firefox
- JavaScriptCore: Safari and other WebKit browsers (e.g. Epiphany)
Firefox and Chrome (and derivatives) are on a rapid release cycle, and the only time you see an old JS engine is an older version of Node.js (which doesn't matter, because it is already a performance improvement), an unsupported OS, or someone who refuses to update. You rarely run into the issue where someone is stuck with an old toolchain.

WebKit only recently added SIMD128, so it is unsurprising that there are a few kinks. This one is very simple and could be fixed in a few hours once I figure out how to contribute. Firefox AArch64 is probably also a simple bug that I can't identify from the source code itself, given that it should be a direct translation; I mostly need to find a minimal reproducible example for the codegen bug so I can report it.

However, the important thing is that the generated code is optimal at the WASM level, and the remaining problems are with the upstream browsers. I say merge, and then we try to fix the Firefox and WebKit bugs. Additionally, the most important target is V8, because most browsers are Chromium-based, and Node.js affects server-side performance, which is very significant on a large server.
```diff
@@ -55,6 +55,12 @@ else
 EXT =
 endif

+ifeq ($(NODE_JS),1)
```
This environment variable, `NODE_JS=1`, must be set at compilation time. It should probably be documented in `README.md`.
I'm going to insist a bit regarding the documentation of the new `NODE_JS` make variable.
Done. I also did things a bit more correctly, added a helpful note to the `make check` output if it fails, and removed the …
- Uses the NEON code path through `<arm_neon.h>` (SIMDe)
- `XXH3_NEON_LANES != 8` IS BAD since SIMDe scalarizes half vectors
- Node.js support (`make NODE_JS=1`):
  - Links in `nodefs` and `noderawfs` to access local files
  - Uses `tty.isatty()` from Node instead of the broken `isatty()` in libc (which always returns `1` for stdin/stdout/stderr)
- Adds `wasm/asmjs` to the welcome message
- Removes `emcc` output files on `make clean`
Because WebAssembly is JIT-compiled and JavaScript timer precision is limited due to Spectre mitigations, the results may vary.
NEON gets almost native speed on V8, but x86_64 does not, mainly because the instruction sets don't match up exactly. Firefox's aarch64 JIT seems to be considerably slower when it comes to WASM.
`xxhsum -b` on an AArch64 Google Tensor G1 (Cortex-X1 @ 2.80 GHz):

x86_64 (with AVX2) Core i7-8650U @ 2.11 GHz (turbo 4.20 GHz):
(Not sure why XXH128 is that much faster on x86 but not ARM. Maybe code alignment memes?).