Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate faulting addresses are valid to fault on #6028

Merged
merged 6 commits into from
Mar 17, 2023

Conversation

alexcrichton
Copy link
Member

This commit adds a defense-in-depth measure to Wasmtime which is intended to mitigate the impact of CVEs such as GHSA-ff4p-7xrq-q5r8. Currently Wasmtime will catch SIGSEGV signals for WebAssembly code so long as the instruction which faulted is an allow-listed instruction (aka has a trap code listed for it). With the recent security issue, however, the problem was that a wasm guest could exploit a compiler bug to access memory outside of its sandbox. If the access was successful there's no real way to detect that, but if the access was unsuccessful then Wasmtime would happily swallow the SIGSEGV and report a nominal trap. To embedders, this might look like nothing is going awry.

The new strategy implemented here in this commit is to attempt to be more robust towards these sorts of failures. When a SIGSEGV is raised the faulting pc is recorded but additionally the address of the inaccessible location is also record. After the WebAssembly stack is unwound and control returns to Wasmtime which has access to a Store Wasmtime will now use this inaccessible faulting address to translate it to a wasm address. This process should be guaranteed to succeed as WebAssembly should only be able to access a well-defined region of memory for all linear memories in a Store.

If no linear memory in a Store could contain the faulting address, then Wasmtime now prints a scary message and aborts the process. The purpose of this is to catch these sorts of bugs, make them very loud errors, and hopefully mitigate impact. This would continue to not mitigate the impact of a guest successfully loading data outside of its sandbox, but if a guest was doing a sort of probing strategy trying to find valid addresses then any invalid access would turn into a process crash which would immediately be noticed by embedders.

While I was here I went ahead and additionally took a stab at #3120. Traps due to SIGSEGV will now report the size of linear memory and the address that was being accessed in addition to the bland "access out of bounds" error. While this is still somewhat bland in the context of a high level source language it's hopefully at least a little bit more actionable for some. I'll note though that this isn't a guaranteed contextual message since only the default configuration for Wasmtime generates SIGSEGV on out-of-bounds memory accesses. Dynamically bounds-checked configurations, for example, don't do this.

Testing-wise I unfortunately am not aware of a great way to test this. The closet equivalent would be something like an unsafe method Config::allow_wasm_sandbox_escape. In lieu of adding tests, though, I can confirm that during development the crashing messages works just fine as it took awhile on macOS to figure out where the faulting address was recorded in the exception information which meant I had lots of instances of recording an address of a trap not accessible from wasm.

@github-actions github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Mar 15, 2023
@github-actions
Copy link

Subscribe to Label Action

cc @peterhuene

This issue or pull request has been labeled: "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

  • peterhuene: wasmtime:api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@github-actions github-actions bot added the fuzzing Issues related to our fuzzing infrastructure label Mar 15, 2023
@github-actions
Copy link

Subscribe to Label Action

cc @fitzgen

This issue or pull request has been labeled: "fuzzing"

Thus the following users have been cc'd because of the following labels:

  • fitzgen: fuzzing

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable; most of my comments are nits/requests for clarified comments. Overall this is a very nice backstop to ensure we are more likely to catch codegen bugs; so, thanks!

A high-level question as well: a pagefault from within a hostcall won't trigger the warning, right? Or do we enter the handler if an instance in the store is on the stack at all?

crates/runtime/src/lib.rs Outdated Show resolved Hide resolved
crates/runtime/src/memory.rs Show resolved Hide resolved
crates/runtime/src/traphandlers.rs Outdated Show resolved Hide resolved
//
// In the original Gecko sources this was copied from the structure here
// is `pragma(pack(4))` which should be the same as an unaligned i64
// here which doesn't increase the overall alignment of the structure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we link the Gecko source (e.g. on searchfox) for reference?

Is repr(packed) equivalent to align=4 here because the current struct offset is already 4-aligned? If so, could we say this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repr(packed) reduces alignment to 1 I think, but it turns out that Rust supports repr(packed(4)) which exactly matches what C does so I've switched to that instead of the packed newtype wrapper.

crates/wasmtime/src/store.rs Outdated Show resolved Hide resolved
return None;
}
let mut fault = None;
for instance in self.instances.iter() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we aren't worried about linear-time behavior here because this is only within a store, not across all live instances in the engine; and way may have some number of instances (due to a component instantiation perhaps) but not an extremely large number. Can we note that here? And, are there circumstances in the future where this may be a problem?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's the rough idea, that the total number of linear memories and instances is likely quite small per-store. I'm also sort of banking on everything being of the mindset "well it's ok if traps are a bit slower". If this becomes an issue though I think we can push more maps outward into the store to more efficiently (probably log(n) style) calculate which instance/memory a linear memory address belongs to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only traps that are semi-hot are wasi exits, which don't have faulting addresses and so should never enter this function anyways, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right yeah wasm exits won't execute this. I don't know of any where the loop would be hot other than synthetic "instantiate 10000 things and then repeatedly trap", but I'll definitely bolster the comments here.

@alexcrichton
Copy link
Member Author

A high-level question as well: a pagefault from within a hostcall won't trigger the warning, right? Or do we enter the handler if an instance in the store is on the stack at all?

Correct, yeah. We've actually got tests which ensure that a host-triggered segfault isn't caught by Wasmtime and aborts the process (as you'd expect a segfault in a language like Rust to do so). The first thing the signal handlers do is test whether the faulting pc is a wasm pc, and it defers to the default behavior of each signal if it's not a wasm pc. If it's a wasm pc though we fully catch the trap assuming it's expected that the wasm instruction had the kind of fault.

Copy link
Member

@fitzgen fitzgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modulo nitpicks that Chris pointed out and my tiny suggestion for the fatal error message, this looks great! Really excited to have this extra backstop in place, and getting better trap error messages on top of that is icing on the cake. Well done!

This commit adds a defense-in-depth measure to Wasmtime which is
intended to mitigate the impact of CVEs such as GHSA-ff4p-7xrq-q5r8.
Currently Wasmtime will catch `SIGSEGV` signals for WebAssembly code so
long as the instruction which faulted is an allow-listed instruction
(aka has a trap code listed for it). With the recent security issue,
however, the problem was that a wasm guest could exploit a compiler bug
to access memory outside of its sandbox. If the access was successful
there's no real way to detect that, but if the access was unsuccessful
then Wasmtime would happily swallow the `SIGSEGV` and report a nominal
trap. To embedders, this might look like nothing is going awry.

The new strategy implemented here in this commit is to attempt to be
more robust towards these sorts of failures. When a `SIGSEGV` is raised
the faulting pc is recorded but additionally the address of the
inaccessible location is also record. After the WebAssembly stack is
unwound and control returns to Wasmtime which has access to a `Store`
Wasmtime will now use this inaccessible faulting address to translate it
to a wasm address. This process should be guaranteed to succeed as
WebAssembly should only be able to access a well-defined region of
memory for all linear memories in a `Store`.

If no linear memory in a `Store` could contain the faulting address,
then Wasmtime now prints a scary message and aborts the process. The
purpose of this is to catch these sorts of bugs, make them very loud
errors, and hopefully mitigate impact. This would continue to not
mitigate the impact of a guest successfully loading data outside of its
sandbox, but if a guest was doing a sort of probing strategy trying to
find valid addresses then any invalid access would turn into a process
crash which would immediately be noticed by embedders.

While I was here I went ahead and additionally took a stab at bytecodealliance#3120.
Traps due to `SIGSEGV` will now report the size of linear memory and the
address that was being accessed in addition to the bland "access out of
bounds" error. While this is still somewhat bland in the context of a
high level source language it's hopefully at least a little bit more
actionable for some. I'll note though that this isn't a guaranteed
contextual message since only the default configuration for Wasmtime
generates `SIGSEGV` on out-of-bounds memory accesses. Dynamically
bounds-checked configurations, for example, don't do this.

Testing-wise I unfortunately am not aware of a great way to test this.
The closet equivalent would be something like an `unsafe` method
`Config::allow_wasm_sandbox_escape`. In lieu of adding tests, though, I
can confirm that during development the crashing messages works just
fine as it took awhile on macOS to figure out where the faulting address
was recorded in the exception information which meant I had lots of
instances of recording an address of a trap not accessible from wasm.
@alexcrichton alexcrichton added this pull request to the merge queue Mar 16, 2023
@alexcrichton alexcrichton removed this pull request from the merge queue due to a manual request Mar 16, 2023
@alexcrichton alexcrichton enabled auto-merge March 16, 2023 20:27
@alexcrichton alexcrichton added this pull request to the merge queue Mar 16, 2023
@alexcrichton
Copy link
Member Author

@uweigand I was wondering if I could enlist your help to debug a failure here. This has failed in CI with this error:

---- traps::wasm_fault_address_reported_by_default stdout ----
thread 'traps::wasm_fault_address_reported_by_default' panicked at 'bad error: error while executing at wasm backtrace:
    0:   0x25 - <unknown>!start

Caused by:
    0: memory fault at wasm address 0xdeadb000 in linear memory of size 0x10000
    1: wasm trap: out of bounds memory access', tests/all/traps.rs:1308:5

where this is the module being executed:

            (module
                (memory 1)
                (func $start
                    i32.const 0xdeadbeef
                    i32.load
                    drop)
                (start $start)
            )

and the test is asserting:

    let err = format!("{err:?}");
    assert!(
        err.contains("memory fault at wasm address 0xdeadbeef in linear memory of size 0x10000"),
        "bad error: {err}"
    );

so it seems like s390x is reporting the faulting address of the load instruction is something page-rounded (or something like that?). I don't know how to hook up gdb to qemu but in theory this means that the address coming out of siginfo_t's si_addr field in the signal handler doesn't have the 0xeef at the end and instead ends in 0x000 (I think).

Do you know if this is an s390x-specific behavior? Are faulting addresses reported at their page aligned boundary instead of the address itself?

@alexcrichton alexcrichton removed this pull request from the merge queue due to a manual request Mar 16, 2023
@uweigand
Copy link
Member

Do you know if this is an s390x-specific behavior? Are faulting addresses reported at their page aligned boundary instead of the address itself?

Yes, this is correct. Faulting addresses are rounded to their 4k page boundary on s390x.

@alexcrichton
Copy link
Member Author

Ok good to know it's expected! I'll disable this particular test for s390x in that case.

s390x rounds faulting addresses to 4k boundaries.
@alexcrichton alexcrichton enabled auto-merge March 17, 2023 14:32
@alexcrichton alexcrichton added this pull request to the merge queue Mar 17, 2023
@alexcrichton alexcrichton merged commit 28371bf into bytecodealliance:main Mar 17, 2023
@alexcrichton alexcrichton deleted the validate-segfault branch March 17, 2023 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fuzzing Issues related to our fuzzing infrastructure wasmtime:api Related to the API of the `wasmtime` crate itself
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants