Investigate WASM as a HAL executable format #2863
That's a good analysis @benvanik! (sorry for sneaking into the issue; I just found it after searching for mentions of Wasmer on GitHub) On the Wasmer side we just landed a big refactor that added a lot of interesting new features for Wasmer 1.0:
We are also thinking about a C++ layer on top of the wasm-c-api to make the integration even easier. On the plus side, Dart (also from Google) and Swift have also integrated Wasmer into their codebases :) Note: wasmtime development might be affected by this: https://twitter.com/tschneidereit/status/1293868141953667074
Thanks for the notes @syrusakbary! It's often hard to distill progress and it's really helpful to have a contributor give a brain dump like that :) It looks like https://github.com/wasmerio/wasmer/tree/master/lib/c-api is much more recent than https://github.com/wasmerio/wasmer-c-api, which is great! I didn't catch that and have updated the notes above. Not even sure how I found the other link :)

Having working iOS examples would, I think, be very interesting for a lot of people - most of my searching around the net yielded a lot of "can I use wasm on iOS?" and a bunch of shrugs, so having something to point to would really help get people motivated to try out wasmer.

Do you happen to know mechanically how wasmer uses LLVM (as in: static library, shared library, shell exec on tool binaries, etc.)? One concern we have with a library using LLVM is that we need to track LLVM head very closely and LLVM doesn't have the most well-defined API - keeping things building in sync without ending up compiling in two versions of LLVM (and the associated pain/preventing such) is a worry :) I see some references to Inkwell but am not familiar with it.

To support additional architectures, is there more needed than ensuring they are supported in config? https://github.com/wasmerio/wasmer/blob/master/lib/compiler-llvm/src/config.rs#L148-L172 (not sure if there are hidden dependencies on features only available in certain configurations, etc.)

Thanks again for chiming in!
I started a PR to create a new codegen target for WASM on the IREE side, but didn't get far beyond getting it to build. I was going to start with just doing naive, scalar codegen to get things plumbed. Then I was thinking of proceeding to figure out how to create externs for specific high-level operations that we want to provide manually, and possibly making a quasi-compiled VMLA-like thing.
Just pulling down the wasmer C-API, it looks like it properly anonymizes the symbols for its LLVM dep. The shared library is 6.4 MB on x64 Linux. So it seems workable out of the box without needing to worry about LLVM version conflicts.
Nice!
Just adding them in config.rs along with the architecture feature in the cargo config should suffice :)
/cc @abrown if interested
@bhack thanks for the cc, I never would have seen this 😄. A couple of thoughts:
/cc @mingqiusun, @jlb6740, @rrwinterton
I think you could also be quite interested in the status of Flexible vectors /cc @nicolasvasilache WebAssembly/flexible-vectors#7
It's really fun to see the interest in this direction, and I'm wondering if we want to host some kind of more collaborative group discussion about ways forward?

From the IREE side, a good integration with WebAssembly is something we've had our eyes on from the beginning, and (aside from some early prototypes) it was something we were holding off on until the MLIR-based tooling was further along (purely out of a desire to get that level of the stack right before forking off in this direction). The potential benefits to deployability, portability, and security are what can bridge the gap for ML systems between compilers and runtimes, allowing us to have the best of both worlds. In any case, this issue represents our belief that the time is right for our project to go there.

Speak up if there is any interest in a broader discussion on this.
I think that's a great idea--send me a link and I'll be there.
@stellaraccident Do you will also the exploration avantgrade for TFRT on this topic?
@bhack Maybe there were some typos? Not quite parsing the question...
What I meant about runtimes: does TFRT have its own plan on this topic? Or could this be explored by IREE and eventually contributed back?
IREE is still positioned pretty well to be a delegate under TFRT to compile and execute subgraphs that conform to its limitations. It hasn't been on either project's critical path to do that work (aside from a POC integration I did last year to convince myself that it was feasible), but we try not to lose line of sight to the option. Afaik, TFRT isn't really targeting solutions at this level at present, but I generally wouldn't be surprised if systems that need portable execution and distribution end up finding WebAssembly to be a reasonable way to achieve that. Balancing that, of course, is that for HPC code, it still seems a bit early and in need of some more bake time (i.e. fixed-width SIMD is not fully launched and still has gaps with respect to what is needed to achieve the best performance, and most people are looking towards scalable vectors as the next tier). Definitely interesting times...
Yes, interesting times, especially regarding the portability impact on in-browser and out-of-browser runtimes: https://blog.tensorflow.org/2020/03/introducing-webassembly-backend-for-tensorflow-js.html
@jbingham (Google) gave an interesting presentation in July at https://www.w3.org/2020/06/machine-learning-workshop/talks/a_proposed_web_standard_to_load_and_run_ml_models_on_the_web.html There is also an interesting FAQ at https://github.com/webmachinelearning/model-loader/blob/master/explainer.md
Such a proposal is exactly what we would like to prevent from happening :)
For our work, we'll be focusing on compilers and making appropriately low-level representations and tools performant, portable and secure. Fixed function/op-based solutions still have a place in ML for the time being, but they come with significant challenges that we are no longer willing to accept -- and we'd like the lower level tooling to grow to fill the gap. Predictions are pointless, so I won't make any estimates as to when the switch happens :) But that's the tack we're taking -- and I don't think we're talking about years of work.
It's always nice to understand in which direction the different forces are pushing 😉
I've mentioned this thread in two GitHub tickets/threads related to the next W3C machine learning virtual workshop in September, so you can find the reference here if you want to comment.
Tentative Q1 target for this.
Some related updates if you are interested in the topic:
I have a functional WASM HAL backend for IREE at #5096 using WAMR that can run MNIST, BERT, and our other supported models. It's very slow right now, probably due to how it naïvely allocates/copies memory. When trying to clean up that memory allocation, we ran into a blocking issue trying to map between WAMR's memory allocation APIs and how IREE models drivers/devices/executables, though: #5137. A few options for moving forward are mentioned at the bottom of that issue.
Hi, just wondering what the status of this bug is and what your dependencies/needs are with respect to multi-memory & SIMD. I have done a little digging and I think the following summarises the status of some projects mentioned above:
State:
Questions:
Thanks in advance, really cool project.
@Cypher1 WAMR supports SIMD for AArch64 now.
Nice work @xwang98! Small addition: Wasmer has also fully supported SIMD since 2.0 (and multi-memory, reference types, and it even runs on Android!)
@Cypher1 hi! I think our only major blocker now is the memory issue (#5137) - which we believe is mostly just the APIs exposed by all the engines assuming they allocate and own the memory instance for each instantiated module. Multi-memory would be nice as it would let us partition the local scratch memory from the shared bulk storage memory, but I think if the engines allowed independent memory creation and assignment we could make things work even without multi-memory. In browser land (where we'd also love to run with wasm) we'd want to be able to use a SharedArrayBuffer across multiple loaded wasm modules.

SIMD reaching maturity is exciting! Our resident SIMD+GEMM guru @bjacob strongly feels that we need a few more instructions to reach reasonable performance (WebAssembly/simd#127 (comment) + WebAssembly/relaxed-simd#9) - it looks like one dot instruction went in but I'm not sure about the details there. If we can get past the blocking memory issue then having something working that we could measure and test with new instructions would make for easier progress on any such additions to the instruction set. The motivator is that a proper dot instruction can yield 3-4x performance improvements in GEMM and would be worth just about any effort to get implemented given how much GEMM dominates most (non-classic-vision) ML models (in speech/translation/sequence-to-sequence text models GEMM is often 90%+ of the total execution time!).

We'd love to have v8 wired up as well as the other engines (wasmtime/wamr/etc). Our main issues are around build system/toolchain complexity - any dep we added to the core code would need to be something that we could make work with both cmake and bazel and build across the major platforms (mac/win/linux). This is one reason why we were investigating the embedded engines to start - no/optional JITs that have more platform-specific behavior, no alternative languages/toolchains (rust), no custom build systems (gn), etc. A good alternative that would be worth exploring to work around this is putting each engine in its own shared object that keeps it out of the main build (like a plugin/extension), and we are fairly well set up to handle that code-wise with just a few tweaks. Ideally we'd be able to use the wasm-c-api for everything once it properly supports multi-memory/independent memory allocation/importing memory/etc. as in #5137 - then we could just build the engine using its own build system/toolchain and load it at runtime with no complex dependency management/toolchain/build goo.

We are still really excited about getting wasm working - both standalone (android/ios/desktop/etc) and on the web - and would be happy to refresh our prototypes with new APIs or try out proposals!
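As a rough illustration of the "use the wasm-c-api for everything" path, a loader sketch against the standard wasm-c-api header (as shipped by wasmtime/wasmer/etc.) might look like the following. This is a sketch under assumptions, not an endorsed integration: some engine versions use array-based rather than vec-based signatures for imports/args, and the nullary entry point is just a placeholder.

```cpp
// Sketch only: canonical wasm-c-api usage; signatures vary between engines
// and releases (vec-based shown here).
#include "wasm.h"

// Instantiates a HAL executable module (no imports needed - we only want its
// memory and exports) and calls the first exported function by index.
bool RunFirstExport(const wasm_byte_vec_t* binary) {
  wasm_engine_t* engine = wasm_engine_new();
  wasm_store_t* store = wasm_store_new(engine);

  wasm_module_t* module = wasm_module_new(store, binary);
  if (!module) return false;

  wasm_extern_vec_t imports = WASM_EMPTY_VEC;
  wasm_instance_t* instance =
      wasm_instance_new(store, module, &imports, /*trap=*/nullptr);

  wasm_extern_vec_t exports;
  wasm_instance_exports(instance, &exports);
  wasm_func_t* entry = wasm_extern_as_func(exports.data[0]);

  bool ok = false;
  if (entry) {
    // Placeholder invocation: a real dispatch would pass buffer offsets/counts.
    wasm_val_vec_t args = WASM_EMPTY_VEC;
    wasm_val_vec_t results = WASM_EMPTY_VEC;
    wasm_trap_t* trap = wasm_func_call(entry, &args, &results);
    ok = (trap == nullptr);
    if (trap) wasm_trap_delete(trap);
  }

  wasm_extern_vec_delete(&exports);
  wasm_instance_delete(instance);
  wasm_module_delete(module);
  wasm_store_delete(store);
  wasm_engine_delete(engine);
  return ok;
}
```

The appeal of this shape is exactly what the comment above describes: the engine behind `wasm.h` can be built with its own toolchain and loaded as a shared object, keeping it out of the core cmake/bazel builds.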
Oh, the other thing we need to investigate is the best approach to multithreading in the various engines. Ideally we'd be able to load a module and then call into it concurrently from multiple threads (by assigning the wasm stack pointer to unique thread-specific locations). That lets us have N threads without needing to instantiate the same module N times. Worst case we do a full N threads * M modules load, but it'd be much better if we didn't have to. I believe at least one engine we looked into stored invocation state on the module instance, preventing this from being possible without fully reloading the module. It's the equivalent of having to dlopen the same shared object and get a unique instance of it for every thread you wanted to call functions from, which is not great :)
There's a bunch of reasons to be interested in wasm as a distribution format for executables. This issue will track notes on the feasibility of this approach, how it could be implemented in IREE, and some open questions.
At a high level we can treat each HAL executable as a WASM binary with multiple entry points (exports) as we do with dylibs for the LLVM AOT backend, store the wasm binary embedded in the module as we do with all other executables, and have a host-derived HAL backend that uses a wasm runtime library to load/cache/invoke the wasm. With this approach we can likely reuse the current LLVM AOT compiler target with different target/linker options almost verbatim.
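As a purely hypothetical illustration (the names and signatures below are not the real IREE executable ABI), the source compiled into such a wasm binary could look roughly like this, with one exported symbol per entry point, mirroring the dylib backend:

```cpp
// Hypothetical sketch only - illustrative names/signatures, not IREE's ABI.
// Built for wasm32, each extern "C" function becomes one exported entry point
// of the HAL executable.
#include <cstdint>

extern "C" {

int32_t dispatch_add_f32(const float* lhs, const float* rhs, float* out,
                         uint32_t element_count) {
  // Pointers here resolve inside the module's single linear memory; the host
  // HAL allocator hands out buffers from that same region (see Runtime below).
  for (uint32_t i = 0; i < element_count; ++i) out[i] = lhs[i] + rhs[i];
  return 0;
}

int32_t dispatch_fill_i8(uint8_t* out, uint32_t length, uint32_t pattern) {
  for (uint32_t i = 0; i < length; ++i) out[i] = (uint8_t)pattern;
  return 0;
}

}  // extern "C"
```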
IREE Implementation
Compiler
We can reuse the existing LLVM target backend with changes only to how we set up the compiler and serialize the resulting binary - LLVMAOTTarget and LLVMIRTarget are two examples that already exist.
Work to link multiple executables together (#1587), such that we ideally end up with a single executable per module with multiple entry points, will be very useful here to reduce overhead (only one wasm runtime needed, etc.).
Runtime
A majority of the runtime work is identical to the existing dylib, llvmjit, and vmla HAL drivers. All of these share code in iree/hal/host/ for things like work scheduling.
A custom iree::hal::Allocator will be required as wasm runtimes can only access a single contiguous memory address range, and we need to suballocate within that if we want to ensure zero-copy behavior. This most closely aligns with the DEVICE_LOCAL|HOST_VISIBLE memory type in that the device here (the wasm runtime) can cheaply manipulate the memory, while HOST_LOCAL|DEVICE_VISIBLE memory would require a copy to use. There are likely some other gotchas here to play with. See open questions below. See the WAMR example.
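A minimal sketch of what such suballocation could look like (illustrative only - this is not the iree::hal::Allocator interface, just a bump allocator over the runtime's linear memory that hands out offsets instead of host pointers):

```cpp
// Sketch: suballocating from a single contiguous wasm linear memory region so
// that buffers passed to wasm entry points require no copies.
#include <cstddef>
#include <cstdint>
#include <optional>

class WasmLinearMemorySuballocator {
 public:
  // |base| points at the runtime's linear memory; |capacity| is its size.
  WasmLinearMemorySuballocator(uint8_t* base, size_t capacity)
      : base_(base), capacity_(capacity) {}

  // Returns a byte offset into the wasm memory, or std::nullopt if exhausted.
  // Offsets (not host pointers) are what get passed to wasm entry points,
  // giving DEVICE_LOCAL|HOST_VISIBLE-style zero-copy behavior.
  std::optional<size_t> Allocate(size_t size, size_t alignment = 16) {
    size_t offset = (head_ + alignment - 1) & ~(alignment - 1);
    if (offset + size > capacity_) return std::nullopt;
    head_ = offset + size;
    return offset;
  }

  // Host-visible view of an allocation for reading/writing results in place.
  uint8_t* HostPointer(size_t offset) { return base_ + offset; }

  // e.g. reset between submissions; a real allocator would track lifetimes.
  void Reset() { head_ = 0; }

 private:
  uint8_t* base_;
  size_t capacity_;
  size_t head_ = 0;
};
```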
Toys
WASM Runtime Notes
There are a few big WASM-specific runtimes with various levels of build-time, runtime, and architecture support. This excludes any that simply wrap v8/JSC, as we don't need arbitrary JS, WASI, or other complex bridging layers and instead just need access to the global memory and export table. Directly using system libraries (such as JavaScriptCore) may be the only option in some environments, while on platforms with the ability to allocate executable pages we have a lot more freedom.
There are a lot of runtimes: https://github.com/appcypher/awesome-wasm-runtimes
Many are experimental or specialty (a blockchain wasm VM, etc.). I've listed the most popular/relevant/still-active ones here and excluded any (such as WAVM) that require LLVM in their deployment.
v8
Much more full-featured than we need, but also one of the fastest/most ubiquitous runtimes. Not sure if there's a minimal build that only includes what's required for wasm - the runtime+JIT+etc. can be several MB.
JavaScriptCore
The only real option (besides an interpreter) on iOS. Supports WebAssembly on device and in the simulator. Can't find a signal as to when SIMD will be supported (likely after the first spec MVP is published).
See open questions below; it's unclear how JIT is supported on App Store releases.
wasmtime
One of the bigger/more complete runtimes. Currently only targets x86-64 and aarch64 (on linux). They claim new backends are planned but the timeline is unclear.
Wasmer
WAMR
Focused on breadth of architectures and small size, looking pretty similar to our needs: x86-32/64, armv7, aarch64, mips, with/without MMU. Recompiles the WASM into a custom target-specific AOT format, which can be done either automatically or offline. Would work well with our pipeline cache model (translate and cache the AOT binary, load that via mmap for execution). A rough embedding sketch is included after these notes.
Embedding guide
Toolchain guide
Performance notes
Supports both JIT and interpreter (for unsupported archs)
Claims to be near-native performance
SIMD support not present - could be contributed
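Rough embedding sketch referenced above, based on WAMR's documented embedding flow; treat the signatures as approximate since they shift slightly between releases, and error handling is abbreviated:

```cpp
// Sketch against WAMR's wasm_export.h embedding API (load -> instantiate ->
// lookup -> call); signatures are approximate across releases.
#include <cstdint>
#include <cstdio>
#include <vector>
#include "wasm_export.h"

bool RunEntryPoint(std::vector<uint8_t>& wasm_bytes, const char* entry_name) {
  char error_buf[128];
  if (!wasm_runtime_init()) return false;

  wasm_module_t module =
      wasm_runtime_load(wasm_bytes.data(), (uint32_t)wasm_bytes.size(),
                        error_buf, sizeof(error_buf));
  if (!module) { std::fprintf(stderr, "load: %s\n", error_buf); return false; }

  // The instance heap is the single linear memory a HAL allocator would
  // suballocate buffers from.
  wasm_module_inst_t instance = wasm_runtime_instantiate(
      module, /*stack_size=*/64 * 1024, /*heap_size=*/16 * 1024 * 1024,
      error_buf, sizeof(error_buf));

  bool ok = false;
  if (instance) {
    wasm_function_inst_t func =
        wasm_runtime_lookup_function(instance, entry_name, /*signature=*/nullptr);
    wasm_exec_env_t exec_env =
        wasm_runtime_create_exec_env(instance, /*stack_size=*/64 * 1024);
    if (func && exec_env) {
      uint32_t argv[4] = {0};  // i32 args/results packed into this array
      ok = wasm_runtime_call_wasm(exec_env, func, /*argc=*/4, argv);
      if (!ok) std::fprintf(stderr, "%s\n", wasm_runtime_get_exception(instance));
    }
    if (exec_env) wasm_runtime_destroy_exec_env(exec_env);
    wasm_runtime_deinstantiate(instance);
  }
  wasm_runtime_unload(module);
  wasm_runtime_destroy();
  return ok;
}
```

For the AOT path the flow is the same: the `.aot` artifact produced offline is what gets cached and passed to `wasm_runtime_load`, matching the pipeline cache model described above.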
wasm3
A pure interpreter that translates to a custom bitcode format and uses threaded dispatch (like IREE's VM). (Mostly) pure C with no executable-page requirement, so it'll run just about anywhere. Take the performance breakdown with a big grain of salt (it's from the beginning of the year).
Open Questions
SIMD spec op availability
SIMD spec: https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md
We should confirm we have access to the core NN-friendly ops that are required. There's a proposal to add integer dotprod instructions but it looks like @bjacob commented on the spec here noting that the most useful dotprod op form is still missing: WebAssembly/simd#127 (comment)
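For reference, the semantics of the one dot instruction that did land in the fixed-width SIMD proposal (i32x4.dot_i16x8_s) can be sketched in scalar form below; the deeper i8-to-i32 accumulating form discussed in the linked comments is the one GEMM inner loops really want. This is a scalar reference of my reading of the spec, not production code:

```cpp
// Scalar reference for i32x4.dot_i16x8_s: pairwise signed i16*i16 products,
// adjacent pairs summed into four i32 lanes.
#include <cstdint>

void i32x4_dot_i16x8_s(const int16_t a[8], const int16_t b[8], int32_t out[4]) {
  for (int lane = 0; lane < 4; ++lane) {
    out[lane] = (int32_t)a[2 * lane] * b[2 * lane] +
                (int32_t)a[2 * lane + 1] * b[2 * lane + 1];
  }
}
```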
iOS
It's extremely difficult to tell, but it seems like JavaScriptCore on iOS, when used by an application, can JIT and load WebAssembly. Whether this requires special entitlements is unclear (oh Apple). Recent issues indicate that the global context has a WebAssembly object that works and that the iOS simulator supports it as well (https://trac.webkit.org/changeset/264801/webkit). Workarounds that involve using WebKit (WKWebView) are a no-go as they run JSC out of process, cannot share memory, and can only marshal strings across the API.
Multiple memory spaces
WASM was defined to support multiple memory spaces (linear regions of addressable memory) - think x86 segments (what's old is new again!). This is interesting to us as the actual fixed-size heap required for wasm can then be fixed to a maximum of our shared memory size (accessible from multiple concurrent invocations) and buffers can be passed in/out via other memories.
Unfortunately this isn't supported in MVP (or AFAICT any current runtime), though the multi-memory spec proposal is active and extends things to support what we'd need.
Without this we must ensure that all buffers are allocated from the single wasm memory region. This is not difficult to accomplish (via a custom iree::hal::Allocator) and since the same behavior is needed for GPUs it's possible we can share code (something like VMA, if not VMA itself). The scheduling code we emit in the compiler for allocations can help here as the same behavior we'll want for devices with discrete memory (out-of-process CPU/GPU/TPU) we'll want for WASM, so for example the ringbuffer used for most transients can be allocated directly from wasm memory.
wasm64
Though provisionally spec'd, 64-bit wasm is not really gaining traction just yet. The major limitation without it is a 4GB address space (or possibly smaller depending on the runtime, which may use some bits for tracking). Multi-memory would alleviate some of the pressure here as we could add multiple chunks, but at the point that we are streaming through GBs of data in a single dispatch we've probably got other issues. Since SPIR-V also has 32-bit limitations I think this is fine.