Serialized wasmtime modules are extremely large #2318
Oh, and the script that runs it is here. It's extremely simple, but maybe I forgot something important.
This commit compresses `FunctionAddressMap` by performing a simple coalescing of adjacent `InstructionAddressMap` descriptors if they describe the same source location. This is intended to handle the common case where a sequence of machine instructions describes a high-level wasm instruction. For the module on bytecodealliance#2318 this reduces the cache entry size from 306MB to 161MB.
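A minimal sketch of the coalescing idea described above. The struct and field names here are illustrative, not wasmtime's actual definitions:

```rust
// Simplified stand-in for wasmtime's `InstructionAddressMap`: each entry
// maps a contiguous range of machine code back to a wasm source offset.
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct AddrMapEntry {
    pub code_offset: u32, // start of the machine-code range
    pub code_len: u32,    // length of the machine-code range
    pub srcloc: u32,      // wasm source offset this range maps to
}

// Coalesce adjacent entries that describe the same source location by
// extending the previous entry instead of keeping a separate descriptor.
pub fn coalesce(entries: &[AddrMapEntry]) -> Vec<AddrMapEntry> {
    let mut out: Vec<AddrMapEntry> = Vec::new();
    for &e in entries {
        match out.last_mut() {
            Some(prev)
                if prev.srcloc == e.srcloc
                    && prev.code_offset + prev.code_len == e.code_offset =>
            {
                prev.code_len += e.code_len;
            }
            _ => out.push(e),
        }
    }
    out
}
```

Since a single wasm instruction commonly lowers to several machine instructions, runs of entries sharing one source location collapse into a single descriptor.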
The cache entry size is effectively the bincode-encoded size of this structure, which I believe is roughly analogous to the in-memory size of that structure. Adding some simple instrumentation, I was able to get the encoded byte size of each of the fields:
Clearly I've yet to investigate the Windows discrepancy here; that's quite worrisome too!
Oh, and after that PR, the breakdown looks like:
Update `Module::{serialize,deserialize}` to use variable-length integers with `bincode` to make the output artifacts smaller. Locally this reduces the size of bytecodealliance#2318 from 160 to 110 MB, a 30% decrease in size! Deserialization performance is slightly slower, but seemingly within the range of noise locally for me.
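Bincode's variable-length option is its own scheme, but the principle is the same as LEB128 (the varint format wasm itself uses): small integers occupy one or two bytes instead of a fixed eight. A minimal sketch of unsigned LEB128:

```rust
// Unsigned LEB128 encoding: emit 7 bits per byte, using the high bit as
// a continuation flag. Offsets and lengths in an address map are mostly
// small, so this is far more compact than fixed 8-byte integers.
pub fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // high bit set: more bytes follow
    }
}
```

For example, `1` encodes as the single byte `0x01` rather than eight bytes of mostly zeros, which is where the ~30% size reduction comes from.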
This commit reduces the size of `InstructionAddressMap` from 24 bytes to 8 bytes by dropping the `code_len` field and reducing `code_offset` to `u32` instead of `usize`. The intention is primarily to make the in-memory version take up less space, and the hunch is that the `code_len` is largely not necessary since most entries in this map are always adjacent to one another. The `code_len` field is now implied by the `code_offset` field of the next entry in the map. This isn't as big an improvement to serialized module size as bytecodealliance#2321 or bytecodealliance#2322, primarily because of the switch to variable-length encoding. Despite this, though, it shaves about 10MB off the encoded size of the module from bytecodealliance#2318.
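A sketch of how the dropped `code_len` can be recovered, assuming entries stay sorted and adjacent. The types here are illustrative, not wasmtime's actual definitions:

```rust
// After dropping `code_len`, each entry records only where its range
// starts; the length is implied by the gap to the next entry's offset.
// The last entry is assumed to extend to the end of the function body.
pub struct CompactEntry {
    pub code_offset: u32,
    pub srcloc: u32,
}

pub fn implied_code_len(entries: &[CompactEntry], index: usize, func_len: u32) -> u32 {
    let start = entries[index].code_offset;
    let end = match entries.get(index + 1) {
        Some(next) => next.code_offset,
        None => func_len,
    };
    end - start
}
```

This trades a field per entry for the invariant that entries are sorted and contiguous, which the map already maintains.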
Added in c4e1022. I think the original reason (which I'm not entirely knowledgeable of) may no longer be applicable? In any case this is a significant difference on Windows from other platforms, because it makes loads/stores of wasm code have manual checks instead of relying on the guard page, causing runtime and compile-time slowdowns on Windows only. I originally rediscovered this when investigating bytecodealliance#2318 and saw that both the compile time of the module in question and the trap information tables were much larger than they were on Linux. Removing this Windows-specific configuration fixed the discrepancies, and afterwards Linux and Windows were basically the same.
Ok, I've got one more linked improvement as well as #2326, which should account for the discrepancy on Windows.
With current master plus #2324 the cache entry size of the original wasm module is down to 97MB, and the cache entry sizes should be roughly the same on Linux and Windows. @whitequark, for your use cases is that still too large? It's still pretty large, at a >4× increase over the input, but further changes may require more invasive tweaks than the low-hanging fruit picked off so far.
Seems excellent to me. Any chance you can do a point release of wasmtime-py with the updated interpreter? I have to test on a fairly heterogeneous set of machines, so it'd be much easier if I didn't have to distribute patched wheels as well.
Sure! I'll wait until #2324 is in and then I'll do a release of wasmtime.
I don't know how easy
(and we ensured they were sorted)
@fitzgen turns out bincode has an option for just that!
Ok, wasmtime-py 0.21.0 is now published, @whitequark, if you'd be able to test it out.
I'll take a look over the next few days!
Were there any breaking changes? It's a bit unfortunate that I'll have to rebuild all the packages, as that makes it a bit harder to get other people to compare several different versions.
I've rebuilt yowasp-yosys and yowasp-nextpnr-ice40 with wasmtime 0.21.0 on my Linux machine.
Right now our release process for wasmtime is just "major bump everything", and we're then trying to keep the external packages like wasmtime-py and wasmtime-go in sync with the main repo. We should probably allocate space for minor version bumps of everything as well, though... Glad to hear the results!
Ah, ok, if it's in sync with the main repo then it might actually be pretty handy. I wasn't sure exactly how it worked.
@alexcrichton I've tried running yowasp-yosys and yowasp-nextpnr-ice40 on the cheapest and slowest laptop I found nearby, and it's only something like 2× slower on both cold and warm startup than my XPS 13. I have no idea how you managed to achieve that, and I'm quite impressed. As far as I'm concerned this is enough performance.
Nice! Thanks again for the report and helping to investigate!
I'm using custom module serialization based caching in YoWASP, primarily to be able to show a message during cold startup (which can take many minutes on slow machines), but also to avoid applying zstd compression that noticeably increases warm startup latency.
The problem is that the cached modules are extremely large—so large that it causes a near-pathological amount of disk I/O. (On Windows VMs, due to the I/O, warm startup actually becomes slower compared to wasmtime's own caching. On bare metal Windows on SSDs it still gets faster.)
On Linux, yosys.wasm (21M wasm file) results in a 290M serialized module, an increase of ~14×. On Windows, for some reason, that very same file results in a ~580M serialized module, an increase of ~28×. The native Yosys binary for x86_64 is 17M; even once you account for libc and libstdc++, it really should not take over half a gigabyte of space.
I'm filing this issue here at the request of @alexcrichton.