Serialized wasmtime modules are extremely large #2318
Oh, and the script that runs it is here. It's extremely simple, but maybe I forgot something important.
This commit compresses `FunctionAddressMap` by performing a simple coalescing of adjacent `InstructionAddressMap` descriptors if they describe the same source location. This is intended to handle the common case where a sequence of machine instructions describes a high-level wasm instruction. For the module on bytecodealliance#2318 this reduces the cache entry size from 306MB to 161MB.
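A minimal sketch of the coalescing idea described above. The struct and field names here are illustrative, not wasmtime's actual definitions:

```rust
// Simplified stand-in for wasmtime's `InstructionAddressMap`: each entry
// maps a contiguous range of machine code back to a wasm source offset.
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct AddrMapEntry {
    pub code_offset: u32, // start of the machine-code range
    pub code_len: u32,    // length of the machine-code range
    pub srcloc: u32,      // wasm source offset this range maps to
}

// Coalesce adjacent entries that describe the same source location by
// extending the previous entry instead of keeping a separate descriptor.
pub fn coalesce(entries: &[AddrMapEntry]) -> Vec<AddrMapEntry> {
    let mut out: Vec<AddrMapEntry> = Vec::new();
    for &e in entries {
        match out.last_mut() {
            Some(prev)
                if prev.srcloc == e.srcloc
                    && prev.code_offset + prev.code_len == e.code_offset =>
            {
                prev.code_len += e.code_len;
            }
            _ => out.push(e),
        }
    }
    out
}
```

Since a single wasm instruction commonly lowers to several machine instructions, runs of entries sharing one source location collapse into a single descriptor.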
The cache entry size is effectively the bincode-encoded size of this structure, which I believe is roughly analogous to the in-memory size of that structure. Adding some simple instrumentation, I was able to get the encoded byte size of each of the fields:
Clearly I've yet to investigate the Windows discrepancy here; that's quite worrisome too!
Oh, and after that PR, the breakdown looks like:
Update `Module::{serialize,deserialize}` to use variable-length integers with `bincode` to make the output artifacts smaller. Locally this reduces the size of bytecodealliance#2318 from 160 to 110 MB, a 30% decrease in size! Deserialization performance is slightly slower, but seemingly within the range of noise locally for me.
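Bincode's variable-length option is its own scheme, but the principle is the same as LEB128 (the varint format wasm itself uses): small integers occupy one or two bytes instead of a fixed eight. A minimal sketch of unsigned LEB128:

```rust
// Unsigned LEB128 encoding: emit 7 bits per byte, using the high bit as
// a continuation flag. Offsets and lengths in an address map are mostly
// small, so this is far more compact than fixed 8-byte integers.
pub fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // high bit set: more bytes follow
    }
}
```

For example, `1` encodes as the single byte `0x01` rather than eight bytes of mostly zeros, which is where the ~30% size reduction comes from.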
This commit reduces the size of `InstructionAddressMap` from 24 bytes to 8 bytes by dropping the `code_len` field and reducing `code_offset` to `u32` instead of `usize`. The intention is primarily to make the in-memory version take up less space, and the hunch is that the `code_len` is largely not necessary since most entries in this map are always adjacent to one another. The `code_len` field is now implied by the `code_offset` field of the next entry in the map. This isn't as big an improvement to serialized module size as bytecodealliance#2321 or bytecodealliance#2322, primarily because of the switch to variable-length encoding. Despite this, though, it shaves about 10MB off the encoded size of the module from bytecodealliance#2318.
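A sketch of how the dropped `code_len` can be recovered, assuming entries stay sorted and adjacent. The types here are illustrative, not wasmtime's actual definitions:

```rust
// After dropping `code_len`, each entry records only where its range
// starts; the length is implied by the gap to the next entry's offset.
// The last entry is assumed to extend to the end of the function body.
pub struct CompactEntry {
    pub code_offset: u32,
    pub srcloc: u32,
}

pub fn implied_code_len(entries: &[CompactEntry], index: usize, func_len: u32) -> u32 {
    let start = entries[index].code_offset;
    let end = match entries.get(index + 1) {
        Some(next) => next.code_offset,
        None => func_len,
    };
    end - start
}
```

This trades a field per entry for the invariant that entries are sorted and contiguous, which the map already maintains.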
Added in c4e1022. I think the original reason (which I'm not entirely knowledgeable of) may no longer be applicable? In any case this is a significant difference on Windows from other platforms, because it makes loads/stores of wasm code have manual checks instead of relying on the guard page, causing runtime and compile-time slowdowns on Windows only. I originally rediscovered this when investigating bytecodealliance#2318 and saw that both the compile time of the module in question and the trap information tables were much larger than they were on Linux. Removing this Windows-specific configuration fixed the discrepancies, and afterwards Linux and Windows were basically the same.
Ok, I've got one more linked improvement as well as #2326, which should account for the discrepancy on Windows.
With current master plus #2324 the cache entry size of the original wasm module is down to 97MB, and the cache entry sizes should be roughly the same on Linux and Windows. @whitequark, for your use cases is that still too large? It's still pretty large, at a >4× increase over the input, but further changes may require more invasive tweaks than the low-hanging fruit picked off so far.
Seems excellent to me. Any chance you can do a point release of wasmtime-py with the updated interpreter? I have to test on a fairly heterogeneous set of machines, so it'd be much easier if I didn't have to distribute patched wheels as well.
Sure! I'll wait until #2324 is in and then I'll do a release of wasmtime.
I don't know how easy
(and we ensured they were sorted)
@fitzgen turns out bincode has an option for just that!
Ok, wasmtime-py 0.21.0 is now published, @whitequark, if you'd be able to test it out.
I'll take a look over the next few days!
Were there any breaking changes? It's a bit unfortunate that I'll have to rebuild all the packages, as that makes it a bit harder to get other people to compare several different versions.
I've rebuilt yowasp-yosys and yowasp-nextpnr-ice40 with wasmtime 0.21.0 on my Linux machine.
Right now our release process for wasmtime is just "major bump everything", and we're then trying to keep the external packages like wasmtime-py and wasmtime-go in sync with the main repo. We should probably allocate space for minor version bumps of everything as well, though... Glad to hear the results!
Ah, ok, if it's in sync with the main repo then it might actually be pretty handy. I wasn't sure exactly how it worked.
@alexcrichton I've tried running yowasp-yosys and yowasp-nextpnr-ice40 on the cheapest and slowest laptop I found nearby, and it's only something like 2× slower on both cold and warm startup than my XPS 13. I have no idea how you managed to achieve that, and I'm quite impressed. As far as I'm concerned this is enough performance.
Nice! Thanks again for the report and helping to investigate!
I'm using custom module serialization based caching in YoWASP, primarily to be able to show a message during cold startup (which can take many minutes on slow machines), but also to avoid applying zstd compression that noticeably increases warm startup latency.
The problem is that the cached modules are extremely large—so large that it causes a near-pathological amount of disk I/O. (On Windows VMs, due to the I/O, warm startup actually becomes slower compared to wasmtime's own caching. On bare metal Windows on SSDs it still gets faster.)
On Linux, yosys.wasm (21M wasm file) results in a 290M serialized module, an increase of ~14×. On Windows, for some reason, that very same file results in a ~580M serialized module, an increase of ~28×. The native Yosys binary for x86_64 is 17M; even once you account for libc and libstdc++, it really should not take over half a gigabyte of space.
I'm filing this issue here at the request of @alexcrichton.