
Use mmap'd *.cwasm as a source for memory initialization images #3787

Merged (12 commits into bytecodealliance:main) Feb 10, 2022

Conversation

@alexcrichton (Member) commented:

This commit addresses #3758 and makes it possible to avoid memfd_create when loading a module from a precompiled binary stored on disk. In this situation we already mmap the file from disk, and we can use the same technique that memfd uses, where a copy-on-write mapping is made whenever a module is instantiated. This means that all Unix platforms, not only Linux, can benefit from copy-on-write so long as the module comes from a precompiled module on disk.

The first commit here is a refactoring to enable this functionality on Linux. After the first commit we avoid creating a memfd and instead map the raw underlying *.cwasm into memories. This involved moving the creation of the memory image to compile time of a Module rather than construction time, as well as aligning the data section to ensure it shows up at a page-aligned offset in the file (which is required by mmap). The second commit then enables this support on macOS, which involved some #[cfg] work followed by tweaking the madvise logic to instead blow away the mapping (no reuse on systems without madvise(DONTNEED), as there's no portable way to reset the CoW overlay).

I tried for a bit to get this working on Windows, but while I could get things to compile I don't believe the same technique we're using here for Unix works on Windows. Windows appears to reject mapping a file onto a pre-existing region allocated with VirtualAlloc, meaning that all attempts to map the file into memory have failed so far for me. This StackOverflow question seems to suggest that this may simply not work on Windows unless we use undocumented APIs. In any case the major benefit of this PR is avoiding extra file descriptors on Unix for modules created from files on disk, so while having Windows support would be nice it's not necessarily required.

@github-actions github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Feb 10, 2022

@cfallin (Member) left a comment:
A few initial comments; I'll read this over and digest more tomorrow. But: I'm really excited to see this!

I'm obviously curious about benchmarks but I'm sure those are coming :-) Definitely would want to see effect on both instantiation and runtime perf (we want to make sure the mmap-of-disk vs mmap-of-memfd doesn't have any ill effects on CoW performance once the file is hot in pagecache).

One possibly subtle thing to note in Module API documentation or elsewhere: bad things will happen if the file on disk changes while the Module exists. This is already very much a bad idea prior to this PR, because we use the mmap for other things (JIT code! wasm_data!) too, but mapping the live Wasm heap data from the file brings the risk front-of-mind, and makes me think we should warn about it. It's not an unreasonable expectation that a "load a thing from a file" API would read the file once and then be done with it, otherwise.

A very conservative answer to the above risk would be a new Module constructor that is something like Module::map_file_direct(File), and then explicitly avoid the direct-mmap for the existing API that just takes a filename -- though I'm ambivalent about that as it would mean perf benefits are hidden in a non-default place that the average user might not find either.

crates/environ/src/module.rs (review thread: outdated, resolved)
crates/runtime/src/memfd.rs (review thread: outdated, resolved)
crates/runtime/src/mmap_vec.rs (review thread: outdated, resolved)
@tschneidereit (Member) commented:

A very conservative answer to the above risk would be a new Module constructor that is something like Module::map_file_direct(File), and then explicitly avoid the direct-mmap for the existing API that just takes a filename -- though I'm ambivalent about that as it would mean perf benefits are hidden in a non-default place that the average user might not find either.

Given that, as you say, this already is a bad idea, I think the downside you point out here weighs heavily. Is there any actual safety downside to this relative to the current situation? If not, then we should either make the current behavior safer by default, or continue making use of the benefits we get from this approach. (I also think that it's fine to assume continued existence of files like this, the argument being "don't shared libraries work this way, too?" or something similarly handwavy in that direction.)

@alexcrichton (Member, Author) commented:

For performance numbers I actually naively assumed it wouldn't really matter here. I don't think our microbenchmarks will show much today, though, since it's all just hitting the kernel's page cache. I don't know enough about mmap performance from files to know whether this is what we'd continue to see out in the wild.

What I got though for "instantiate large module" is below. In this benchmark it was purely instantiation so only setting up vma mappings and not actually doing anything with them:

parallel/pooling/rustpython.wasm: with 1 thread
                        time:   [5.1166 us 5.1196 us 5.1226 us]

parallel/pooling/rustpython.wasm: with 1 thread (loaded from *.cwasm)
                        time:   [5.0117 us 5.0143 us 5.0170 us]

I then modified the benchmark to write 0, from Rust, to every single page of memory after instantiation succeeded and I got:

parallel/pooling/rustpython.wasm: with 1 thread
                        time:   [4.0414 ms 4.0418 ms 4.0422 ms]

parallel/pooling/rustpython.wasm: with 1 thread (loaded from *.cwasm)
                        time:   [4.0682 ms 4.0686 ms 4.0690 ms]

Running this a few times seemed consistent, although the "load all the pages" numbers had enough variance that I think the two are roughly equivalent.

The full diff for the benchmark was https://gist.github.com/alexcrichton/7e555a405f2815ec69fcac659b5d85de and the first numbers were collected without the changes to the instantiate function which write to memory.

@alexcrichton (Member, Author) commented:

In terms of safety, Module::deserialize_file is already unsafe because loading arbitrary data is not memory safe, so adding "this needs to stay as-expected for the entire duration of the Module's lifetime" to the list of requirements doesn't seem so bad. I don't think it makes the safety any worse relative to what we have today?

@alexcrichton (Member, Author) commented:

Hm something I just thought of, I know that on Linux you can't write to an executable that's currently in use in some other process, which is exactly what we want here. I found that MAP_DENYWRITE seems like it would do this as an option to mmap but the man page says that it's an ignored flag nowadays due to denial-of-service attacks (I guess it lets you just exclusively lock files for yourself against everyone else's will). Do others know how we could get this behavior, though? That might help the prevent-future-writes problem a bit.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 10, 2022
In working on bytecodealliance#3787 I see now that our coverage of loading precompiled
files specifically is somewhat lacking, so this adds a config option to
the fuzzers which, if enabled, will round-trip all compiled modules
through the filesystem to test out the mmapped-file case.
@cfallin (Member) commented Feb 10, 2022:

In terms of safety, Module::deserialize_file is already unsafe because loading arbitrary data is not memory safe, so adding "this needs to stay as-expected for the entire duration of the Module's lifetime" to the list of requirements doesn't seem so bad. I don't think it makes the safety any worse relative to what we have today?

Yup, that sounds like the right path to me; the risk was already there, it's just good to warn about it now :-)

Re: benchmarking -- a thought occurred to me: this is mostly improving module-load performance, so it would be interesting to benchmark module loading! Perhaps a measured inner loop of (module load, instantiate, terminate). In theory, with a module that has a very large heap (ahem SpiderMonkey), we should see load times that are O(heap size) without this PR, and O(1)-ish with this PR.

The above isn't really necessary to get this in I think -- with no negative impacts to instantiation or pagefault cost during runtime, and with the RSS benefit, I'm already convinced it's a good thing; but it would be a good way to demonstrate the benefit if you want to do that.

@cfallin (Member) commented Feb 10, 2022:

Hm something I just thought of, I know that on Linux you can't write to an executable that's currently in use in some other process, which is exactly what we want here. I found that MAP_DENYWRITE seems like it would do this as an option to mmap but the man page says that it's an ignored flag nowadays due to denial-of-service attacks (I guess it lets you just exclusively lock files for yourself against everyone else's will). Do others know how we could get this behavior, though? That might help the prevent-future-writes problem a bit.

This sent me down a very interesting rabbit hole: apparently, until recently, MAP_DENYWRITE did work. Indeed, if I strace /bin/ls on my system I see the flag included in the mmaps of system libraries. A quick test shows that while one binary executes on my system, I get ETXTBSY when trying to open the file O_RDWR from another process. (Superficially one can "write" to it by copying over it or editing with a text editor, but this really just replaces the directory entry, and doesn't open the existing file for writing.)

But it seems that MAP_DENYWRITE is going away -- was removed from the kernel last August -- and so this same issue, if it is one, will exist for system binaries/libraries too.

So just the warning and the fact that the API is unsafe seems enough to me at least!

alexcrichton added a commit that referenced this pull request Feb 10, 2022
@cfallin (Member) left a comment:

Did a finer-grained pass just now; a few little nits but overall this looks great! Thanks again for implementing the idea -- I think it will be a significant benefit in various use-cases.

@@ -249,9 +296,12 @@ impl ModuleTranslation<'_> {
     let mut offset = 0;
     for (memory, pages) in page_contents {
         let mut page_offsets = Vec::with_capacity(pages.len());
-        for (byte_offset, page) in pages {
+        for (page_index, page) in pages {
@cfallin (Member) commented:

Interesting, was this a bug in the original code? I see the insertion above contents.entry(page_index) so this is correct now, but seemed to be using a page index as a byte offset previously. Or was the first tuple element used as a page index elsewhere?

@alexcrichton (Member, Author) replied:

I looked at a few other places but I think this was just a typo because everywhere else named and used this as a page index. I was a bit worried myself though and had to do a few double-takes as I updated this!

// The `init_memory` method of `MemoryInitialization` is used here to
// do most of the validation for us, and otherwise the data chunks are
// collected into the `images` array here.
let mut images: PrimaryMap<MemoryIndex, Vec<u8>> = PrimaryMap::default();
@cfallin (Member) commented:

images seems to be created up to num_defined_memories below, but it's a PrimaryMap<MemoryIndex, _>; could we either use DefinedMemoryIndex or fill it up to the total memory count?

This seems a bit different than #3782 as we are type-safe wrt the index, but would just lead to an index-out-of-bounds if there is an imported memory I think...

@alexcrichton (Member, Author) replied:

I ended up using MemoryIndex here instead of DefinedMemoryIndex because the interface to init_memory works on MemoryIndex (as an initializer can be for any memory) and otherwise translating between the two would require extra callbacks in the InitMemory::Runtime case.

Otherwise, though, none of the specialized initialization techniques work with imported memories anyway, so I don't think anything is lost by using MemoryIndex, since in the cases where the optimization is applicable the two are equal. I'll double-check these areas though and make sure they're all prepared to use MemoryIndex.

self.data_align = Some(page_size);
let mut map = PrimaryMap::with_capacity(images.len());
let mut offset = 0u32;
for (defined_memory, mut image) in images {
@cfallin (Member) commented:

Here also the index variable defined_memory seems to imply that images only contains DefinedMemoryIndexes; we should be consistent with what we do above. map contains an entry for every memory, defined or imported, so this loop is correct if images is over all memories as well; just a naming issue I think.

@alexcrichton (Member, Author) replied:

Ah yeah this is a historical name, good catch though and definitely needed a rename.

crates/jit/src/instantiate.rs (review thread: outdated, resolved)
crates/runtime/build.rs (review thread: outdated, resolved)
crates/runtime/src/memfd.rs (review thread: outdated, resolved)
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 10, 2022
This commit moves function names in a module out of the
`wasmtime_environ::Module` type and into separate sections stored in the
final compiled artifact. Spurred on by bytecodealliance#3787 to look at module load
times I noticed that a huge amount of time was spent in deserializing
this map. The `spidermonkey.wasm` file, for example, has a 3MB name
section which is a lot of unnecessary data to deserialize at module load
time.

The names of functions are now split out into their own dedicated
section of the compiled artifact and metadata about them is stored in a
more compact format at runtime by avoiding a `BTreeMap` and instead
using a sorted array. Overall this improves deserialize times by up to
80% for modules with large name sections since the name section is no
longer deserialized at load time and it's lazily paged in as names are
actually referenced.
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Feb 10, 2022
This commit has a few minor updates and some improvements to the
instantiation benchmark harness:

* A `once_cell::unsync::Lazy` type is now used to guard creation of
  modules/engines/etc. This makes running individual benchmarks much
  faster since the harness no longer compiles the modules for all the
  other benchmarks that are filtered out. Unfortunately I couldn't find
  a way in criterion to test whether a `BenchmarkId` is filtered out or
  not, so we rely on runtime laziness to initialize on the first run for
  the benchmarks that need it.

* All files located in `benches/instantiation` are now loaded for
  benchmarking instead of a hardcoded list. This makes it a bit easier
  to throw files into the directory and have them benchmarked instead of
  having to recompile when working with new files.

* Finally a module deserialization benchmark was added to measure the
  time it takes to deserialize a precompiled module from disk (inspired
  by discussion on bytecodealliance#3787)

While I was at it I also upped some limits to be able to instantiate
cfallin's `spidermonkey.wasm`.
@alexcrichton (Member, Author) commented:

a thought occurred to me: this is mostly improving module-load performance, so it would be interesting to benchmark module loading

This is a great idea and something I forgot! It inspired me to write this up as a benchmark: #3790. My first attempt at benchmarking showed no improvement from this PR, but I was alarmed at the relatively slow deserialize time, which spawned #3789. Upon further thought, though, I remembered that memfd creation is lazy and consequently not affected by what I was benchmarking (purely deserialization, not deserialization plus instantiation).

I then updated the existing code to eagerly construct the memfd, and that regressed relative to this PR by ~10x, with most of the time spent in write and memcpy (movement from data segments to the image). So it looks like this can definitely help improve first-instantiate time, and we can probably make memfd creation non-lazy after this.


Anyway I will get to the rest of the review in a moment now...

alexcrichton added a commit that referenced this pull request Feb 10, 2022
* Move function names out of `Module`


* Fix a typo

* Fix compiled module determinism

Need to sort not only afterwards but also first, to ensure the data of
the name section is consistent.
* Skip memfd creation with precompiled modules

This commit updates the memfd support internally to not actually use a
memfd if a compiled module originally came from disk via the
`wasmtime::Module::deserialize_file` API. In this situation we already
have a file descriptor open and there's no need to copy a module's heap
image to a new file descriptor.

To facilitate a new source of `mmap`, the currently memfd-specific logic
of creating a heap image is generalized to a new form of
`MemoryInitialization` which is attempted for all modules at
module-compile-time. This means that the serialized artifact on disk
will have the memory image in its entirety waiting for us. Furthermore,
the memory image is carefully padded and aligned to the target system's
page size, notably meaning that the data section in the final object
file is page-aligned and its size is also page-aligned.

This means that when a precompiled module is mapped from disk we can
reuse the underlying `File` to mmap all initial memory images. The
offset within the memory-mapped file can differ for memfd-vs-not, but
that's just another piece of state to track in the memfd implementation.

In the limit this waters down the term "memfd" for this technique of
quickly initializing memory, because we no longer use memfd
unconditionally (only when the backing file isn't available). This
does, however, open up an avenue for porting this support to other OSes
in the future: while `memfd_create` is Linux-specific, both macOS and
Windows support mapping a file with copy-on-write. That porting isn't
done in this PR and is left for a future refactoring.

Closes bytecodealliance#3758

* Enable "memfd" support on all unix systems

Cordon off the Linux-specific bits and enable the memfd support to
compile and run on platforms like macOS which have a Linux-like `mmap`.
This only works if a module is mapped from a precompiled module file on
disk, but that's better than not supporting it at all!

* Always align data segments

No need to have conditional alignment since their sizes are all aligned
anyway.

* Fix some confusing logic/names around memory indexes

These functions all work with memory indexes, not specifically defined
memory indexes.
@cfallin (Member) commented Feb 10, 2022:

I then updated the existing code to more eagerly construct the memfd, and that regressed relative to this PR by ~10x, with most of the time spent in write and memcpy (movment from data segments to the image). So it looks like this can definitely help improve first-instantiate time and we can probably make memfd creation un-lazy after this.

This is great! And yeah, I agree, given that the memfd state is now cheap to create on load (a wrapper around the Arc<File> with some metadata basically) there's no reason for it to be lazy anymore.

alexcrichton added a commit that referenced this pull request Feb 10, 2022
@alexcrichton alexcrichton merged commit c0c368d into bytecodealliance:main Feb 10, 2022
@alexcrichton alexcrichton deleted the reuse-cwasm-mmap branch February 10, 2022 21:40