Subscriptions and shared memory have a loophole which can lead to UB #143

Woyten · 2020-02-16T16:45:02Z

Overview

Bad news: syscalls::subscribe and syscalls::allow have a loophole. Their implementations rely on destructors being called before the subscribed closure/memory slice gets out of scope. Unfortunately, this is not guaranteed by Rust and can easily be circumvented using mem::forget, for example.

Solutions

I see different approaches to the problem which are not necessarily mutually exclusive

Use a closured-based scope instead of a `drop`-based one

Example

pub fn subscribe<C: Consumer<T>, T, F: FnOnce()>(
    driver_number: usize,
    subscribe_number: usize,
    payload: &mut T,
    execute_while_subscription_is_active: F,
) -> Result<(), SubscribeError> {
    ...
}

PRO: Capture of stack values and references to the stack is possible
PRO: Easy interaction with the main loop via mutable references
CON: Will result in nested closures when multiple subscriptions are applied (although this could be mitigated using combinators)
CON: It is not possible for a driver to subscribe once on initialization and use that subscription for the remaning lifetime of the application (e.g. ParallelSleepDriver)
CON: UB is still possible after the panic/oom handler is called

Use static subscriptions

Example

pub fn subscribe2<C: Consumer<T>, T>(
    driver_number: usize,
    subscribe_number: usize,
    userdata: &'static T,
) -> Result<(), SubscribeError> {
    ...
}

PRO: Mulitple subscriptions can be applied without resulting in nested closures
PRO: Drivers can subscribe once on initilization and use the subscription for the remaining lifetime of the application
PRO: Protection from UB even if the code jumps to panic/alloc error handler
CON: Capture of stack-based values is not possible
CON: Interaction with the main loop requires shared mutability (e.g. Cells or RefCells) and is limited to objects with a static lifetime

Keep the situation as is

Consider subscribe and allow safe although they aren't. A justification could be that UB can only occur if constructors are avoided – which is unlikely to happen intentionally – or if a panic/oom handler is called – which, in future, could be solved by reporting to the kernel that the application is in an invalid state.

Vision

I would propose to further investigate a combined solution. Given the buttons API as an example to be generalized over other drivers, the buttons API should have:

A conventional version offering a subscription model based on closure scopes
An asynchronous version which does a static subscription on initialization and offers convenient async methods, e.g. wait_for_button_pressed

In order to facilitate working with static subscriptions, we would use the approach that @janvrwhy proposes in https://github.com/tock/design-explorations/blob/master/size_comparison/futures/src/tock_static.rs.

The text was updated successfully, but these errors were encountered:

alistair23 · 2020-02-18T21:03:18Z

I like the closure based approach, that is what I have seen from other Rust libraries and kind of lines up with general Rust-iness.

In order to facilitate working with static subscriptions, we would use the approach that @janvrwhy proposes in https://github.com/tock/design-explorations/blob/master/size_comparison/futures/src/tock_static.rs.

Does that actually work? I thought Rust would still throw an error if you wrap something that isn't Sync in something that is. At least that is how Arc works

jrvanwhy · 2020-02-18T21:10:09Z

Does that actually work? I thought Rust would still throw an error if you wrap something that isn't Sync in something that is. At least that is how Arc works

Yes, it works. With unsafe impl Sync, we're promising to the compiler that TockStatic won't allow 2 different threads to get references to the object contained within. I believe TockStatic is sound provided it is only usable in contexts where the userspace is truly single-threaded (as libtock-rs's runtime is).

alistair23 · 2020-02-18T21:53:28Z

Cool 👍

torfmaster · 2020-02-19T20:58:39Z

CON: UB is still possible after the panic/oom handler is called

Could that be mitigated by marking the app as panicked in the kernel and disabling calling of callbacks?

torfmaster · 2020-02-19T21:04:37Z

I have concerns about the usability of the closure-based approach to subscriptions. I already find it hard to combine results from different subscriptions in the drop-based approach. The closure based approach will make this even harder.

Although the sync API of the library will feel very C-ish then I'm in favour of using the static subscription model because I believe it to be easier to use (and probably more suitable for internal driver implementations).

gendx · 2020-02-20T15:26:06Z

I'm not sure static subscriptions are the way to go.

Use static subscriptions

PRO: Drivers can subscribe once on initilization and use the subscription for the remaining lifetime of the application

Is it a "can" or a "must"?

A concern I have is how can you subscribe for a limited amount of time only? For example, subscribe to button touches only when your app is waiting for the user to touch a button. Although this may be simulated with static objects containing some subscription status, this could become contrived quite quickly.

CON: Capture of stack-based values is not possible

CON: Interaction with the main loop requires shared mutability (e.g. Cells or RefCells) and is limited to objects with a static lifetime

I fear that limiting interactions to static objects may be too limiting. From my experience with https://github.com/google/OpenSK, we capture local variables all the time.

On the other hand, it seems that Tock's syscalls were mostly designed with static callbacks and global variables in mind - at least that's what happening all the time with libtock-c - rather than local lambda functions that we have in Rust (which led to bugs like tock/tock#1376 - and similar memory corruption bugs would arise if a destructor isn't run with libtock-rs).

In any case, as a revamp of Tock syscalls is under way with Tock 2.0 discussions (tock/tock#1607), I'd definitely weigh in the ability to create usable and safe abstractions with libtock-rs on top of it.

My take on this is that the current subscribe/callback model is not well-designed because it only works if one is sure to unsubscribe before leaving the scope (or you can get use-after-free problems if a callback arrives later, due to dereferencing variables from a stack that doesn't exist anymore). And "callbacks" are not really "callbacks" because we need to yield to trigger them. So a subscribe/poll model without "control-flow jumps" due to callbacks would be a better model IMO.

I don't see how we could get rid of allow's issues though.

jrvanwhy · 2020-05-27T18:43:03Z

In Tock's current (pre-2.0) implementation, a shared allow region cannot be revoked. As a result, all buffers passed to allow must have static lifetime.

Unfortunately, this means we cannot share buffers on the stack, and further than all accesses to the buffers must be done in a single instruction (to avoid race conditions with the kernel).

I'm expecting to rework the system call API, so I'll address that as part of the rework. Unfortunately, it won't be nearly as nice to work with as the current setup. This may be worth addressing in Tock 2.0.

gendx · 2020-05-28T13:35:33Z

In Tock's current (pre-2.0) implementation, a shared allow region cannot be revoked. As a result, all buffers passed to allow must have static lifetime.

How is that? I thought that in the Tock API, allowing a null pointer was supposed to revoke the old buffer. Or is that driver-specific? Or is there a bug in the kernel implementation of this API?

Unfortunately, this means we cannot share buffers on the stack, and further than all accesses to the buffers must be done in a single instruction (to avoid race conditions with the kernel).

Although implementing that properly is likely driver-specific, shouldn't the command & callback syscalls delimit sections where a buffer is "owned" by the kernel vs. by the application? For example, the driver is not supposed to use an allowed buffer before a command is issued to take some action, nor after it has issued a "data transferred" callback?

jrvanwhy · 2020-05-28T18:15:28Z

How is that? I thought that in the Tock API, allowing a null pointer was supposed to revoke the old buffer. Or is that driver-specific? Or is there a bug in the kernel implementation of this API?

The documentation was incorrect, I fixed it in tock/tock#1831. Currently, an app cannot revoke a capsule's access to an allow-ed region, only ask the capsule to stop accessing it. Because we do not currently have const allow, that means the capsule could touch that memory at any time, hence allow-ed memory must always be treated as shared memory.

Although implementing that properly is likely driver-specific, shouldn't the command & callback syscalls delimit sections where a buffer is "owned" by the kernel vs. by the application? For example, the driver is not supposed to use an allowed buffer before a command is issued to take some action, nor after it has issued a "data transferred" callback?

There's a distinction between typical use case patterns, and what's possible. Ideally, yes. However, recall that in Tock's threat model applications do not completely trust capsules. If the application were to treat the buffer in a way that allows a capsule to invoke undefined behavior, that would be a vulnerability in the application.

gendx · 2020-05-29T09:25:02Z

How is that? I thought that in the Tock API, allowing a null pointer was supposed to revoke the old buffer. Or is that driver-specific? Or is there a bug in the kernel implementation of this API?

The documentation was incorrect, I fixed it in tock/tock#1831.

I disagree that the documentation was incorrect and that changing it fixed the problem. I'd rather say that the code was incorrect in that it didn't implement the documented API.

Currently, an app cannot revoke a capsule's access to an allow-ed region, only ask the capsule to stop accessing it. Because we do not currently have const allow, that means the capsule could touch that memory at any time, hence allow-ed memory must always be treated as shared memory.

I don't see how this can possibly work. This means that once a userspace application allows a region to a capsule, the capsule has read-write access to it forever, and therefore the application cannot use this memory anymore (memory aliasing rules would otherwise be violated).

The allow-subscribe-command-yield pattern to transmit data back from the kernel to a userspace buffer has been a common pattern for a while, e.g. to fill in a buffer with rng data or to receive a USB packet. But these changes in the API documentation mean that any kernel -> userspace data transfer cannot happen with allowed buffers - the only option left is to transmit this data one word at a time via the return value of a syscall.

Requiring that allowed buffers are 'static doesn't solve any of that. Without revocation of allowed regions, there's no way a userspace application can ever reclaim back the memory, to either read an incoming packet or to write the next outbound packet.

torfmaster · 2020-05-29T16:41:05Z

I don't see how this can possibly work. This means that once a userspace application allows a region to a capsule, the capsule has read-write access to it forever, and therefore the application cannot use this memory anymore (memory aliasing rules would otherwise be violated).

Maybe I have a wrong understanding of the aliasing rules to apply. As far as I understand they should apply inside the same application, mainly to give the compiler the possibility to perform only admissible optimizations. However, allowing a buffer to an external source is not affected by that (disclamer below). There are many use cases where an application allowing a buffer to an external source (as DMA, MMIO, shared memory via IPC, ...) and a reasonable set of rules for memory safety should allow for these use cases. As a Cell can allow for inner mutability a correct implementation of a SharedMemory can allow for mutability from outside (at least that's my hope).

Making these buffers 'static seems to be the only way for me to safely share buffers between a capsule and an app (with the reasoning from this issue). This, however makes a usage of buffers as in OpenSK impossible (where the buffer is kept on the stack and is unallowed after usage).

Disclaimer:
There are many issues which can occur and which have to be solved:

memory has to be clobbered on context switches which may affect the buffer
the lifetime has to be static as mem::forget would otherwise allow of reusing memory which is still writable by a capsule
read access to the shared buffer has to be protected by volatile reads and/or UnsafeCell to protect from incorrect optimizations.

gendx · 2020-05-29T17:48:03Z

I don't see how this can possibly work. This means that once a userspace application allows a region to a capsule, the capsule has read-write access to it forever, and therefore the application cannot use this memory anymore (memory aliasing rules would otherwise be violated).

Maybe I have a wrong understanding of the aliasing rules to apply. As far as I understand they should apply inside the same application, mainly to give the compiler the possibility to perform only admissible optimizations. However, allowing a buffer to an external source is not affected by that (disclamer below). There are many use cases where an application allowing a buffer to an external source (as DMA, MMIO, shared memory via IPC, ...) and a reasonable set of rules for memory safety should allow for these use cases. As a Cell can allow for inner mutability a correct implementation of a SharedMemory can allow for mutability from outside (at least that's my hope).

Making these buffers 'static seems to be the only way for me to safely share buffers between a capsule and an app (with the reasoning from this issue). This, however makes a usage of buffers as in OpenSK impossible (where the buffer is kept on the stack and is unallowed after usage).

Disclaimer:
There are many issues which can occur and which have to be solved:

memory has to be clobbered on context switches which may affect the buffer

I don't think that context switches are strong enough as a boundary. The Tock kernel can preempt an application at any point, e.g. due to an interrupt.

Consider the case of a "receive packet" driver (USB, TCP, etc.). Upon interrupt due to the hardware receiving a packet, Tock will preempt the application, then handle the interrupt which in turn likely invokes the corresponding capsule, which is allowed to modify the previously allowed buffer.

If this preemption happens while the userspace app was copying data to/from this allowed buffer, then the copied buffer will contain half of old data and half of new data.

Another example is the RNG driver. The userspace app wants to wait for the driver to have filled the buffer with random data to read back the buffer. Otherwise, it could be that half of the buffer contains non-random data, which then breaks security properties. If the driver is allowed to write data on the allowed buffer at any time, then it could for example clear the buffer with zeros, which breaks the security properties expected by the app.

the lifetime has to be static as mem::forget would otherwise allow of reusing memory which is still writable by a capsule

read access to the shared buffer has to be protected by volatile reads and/or UnsafeCell to protect from incorrect optimizations.

In terms of protection, I think that a memory barrier on syscalls to delimit a section between allow(ptr) and allow(null) within which the buffer pointed to by ptr is exclusively owned by the kernel and therefore not allowed to be touched by the app is reasonable.
But this requires that the buffer is exclusively owned by the app in the section after allow(null).

I don't see how UnsafeCell protects against anything. And in general, using cells in the buffer (e.g. [Cell(u8); N]) doesn't add any protection against the pitfalls I mentioned, because even if it makes reads/writes of each byte in the buffer atomic (but this was atomic anyway on any reasonable CPU), reading/writing the buffer as a whole is not atomic, so you can end up with a corrupted copy that contains half of a packet and half of another packet.
The logical unit is not each individual byte (I don't see how it could be memory unsafe to read/write a byte), but the buffer as a whole (which should be treated as a &mut [u8]).

To me, the only reasonable solution is to fix the Tock kernel, so that an allow(null) syscall removes access from the underlying buffer from any capsule, and always gives ownership back to the userspace application.

gendx · 2020-05-29T18:00:52Z

Actually, I remember thinking about this problem and found back a side comment in tock/tock#1761 (comment)

By the way, what prevents a "malicious" capsule from "collecting" an AppSlice when an app calls the allow syscall, and then keeping the AppSlice forever and continue using it, even after the app has disallowed it? How does this fall into the threat model? Shouldn't the capsules instead ask the kernel every time they need to access an application slice - so that disallowed slices are not accessible anymore?

I think this should be addressed on the kernel side, by changing the API around AppSlice to prevent capsules from accessing buffers after the slice is disallowed.

jrvanwhy · 2020-05-29T18:34:57Z

I think this should be addressed on the kernel side, by changing the API around AppSlice to prevent capsules from accessing buffers after the slice is disallowed.

I agree, but this is something to be solved in upstream Tock. Until then, libtock-rs needs to work with the kernel's current behavior.

gendx · 2020-06-02T08:11:46Z

I think this should be addressed on the kernel side, by changing the API around AppSlice to prevent capsules from accessing buffers after the slice is disallowed.

I agree, but this is something to be solved in upstream Tock. Until then, libtock-rs needs to work with the kernel's current behavior.

As I explained, I don't think there's any reasonable way to work with the kernel's currently documented API, apart to create trivial programs like blink. As a simple example, how do you plan to make the blink_random program work with the constraints of the documented API?

In particular, filling the following buffer from the kernel in a loop.

libtock-rs/examples/blink_random.rs

Lines 18 to 20 in eee2028

    
           let mut buf = [0; 64]; 
        
           loop { 
        
               drivers.rng.fill_buffer(&mut buf).await?;

However, speaking of behavior, we can note that the "current kernel behavior" actually provides stronger guarantees than what is offered in the documentation, in the sense that most drivers (LEDs, buttons, rng, etc.) will release any callback upon subscribe(null ptr) and release any application slice upon allow(null ptr). So if we want to match the kernel behavior, we don't need to change anything in userspace for these drivers!

My take on this is that we should rather focus efforts on making sure the kernel can provide these stronger guarantees for all drivers (even with a "malicious" capsule), than trying to cope with a broken API in userspace.

jrvanwhy · 2020-06-02T18:48:33Z

However, speaking of behavior, we can note that the "current kernel behavior" actually provides stronger guarantees than what is offered in the documentation, in the sense that most drivers (LEDs, buttons, rng, etc.) will release any callback upon subscribe(null ptr) and release any application slice upon allow(null ptr). So if we want to match the kernel behavior, we don't need to change anything in userspace for these drivers!

I think you understand this, but with the current kernel design/implementation, we cannot rely on that behavior (as under the Tock threat model that would be a vulnerability in libtock-rs).

My take on this is that we should rather focus efforts on making sure the kernel can provide these stronger guarantees for all drivers (even with a "malicious" capsule), than trying to cope with a broken API in userspace.

I agree this should be changed. I'm not sufficiently familiar with kernel internals to make a suggestion on how to do so, and I don't have time to look into it right now. I'm unsure whether this is a Tock 2.0 change or not.

I think this discussion should continue in a non-libtock-rs-specific form (e.g. the tock-dev mailinglist or an issue in the tock/tock repository).

gendx · 2020-06-03T10:02:04Z

I agree this should be changed. I'm not sufficiently familiar with kernel internals to make a suggestion on how to do so, and I don't have time to look into it right now. I'm unsure whether this is a Tock 2.0 change or not.

I think this discussion should continue in a non-libtock-rs-specific form (e.g. the tock-dev mailinglist or an issue in the tock/tock repository).

I definitely agree, I'll create an issue in tock/tock. I think that this would qualify as a Tock 2.0 change (or even earlier if any other release happens in 1.x).

Allowed is the type that will be returned by Platform::allow() in the successful case. It represents a memory buffer shared with the kernel. It uses raw pointers and volatile writes to perform read/write accesses to the memory to avoid encountering undefined behavior. This is the first step towards fixing tock#129 and tock#143.

222: Add the Allowed type to libtock_platform. r=hudson-ayers a=jrvanwhy Allowed is the type that will be returned by `Platform::allow()` in the successful case. It represents a memory buffer shared with the kernel. It uses raw pointers and volatile writes to perform read/write accesses to the memory to avoid encountering undefined behavior. This is the first step towards fixing #129 and #143. Co-authored-by: Johnathan Van Why <jrvanwhy@google.com>

jrvanwhy · 2022-02-09T02:08:45Z

The Tock 2.0 crates have a different Subscribe design that addresses this unsoundness.

torfmaster mentioned this issue May 23, 2020

Put shared buffer into UnsafeCell #185

Closed

This was referenced Jun 3, 2020

[RFC] Fixing the loophole in unallow & unsubscribe tock/tock#1905

Closed

Core working group notes from June 05 tock/tock#1919

Merged

This was referenced Jul 2, 2020

[WIP] Introduce a libtock-platform crate under libtock-core #211

Closed

Create a libtock-platform crate including syscalls + async traits. #217

Closed

jrvanwhy mentioned this issue Jul 17, 2020

Add the Allowed type to libtock_platform. #222

Merged

jrvanwhy closed this as completed Feb 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Subscriptions and shared memory have a loophole which can lead to UB #143

Subscriptions and shared memory have a loophole which can lead to UB #143

Woyten commented Feb 16, 2020

alistair23 commented Feb 18, 2020

jrvanwhy commented Feb 18, 2020

alistair23 commented Feb 18, 2020

torfmaster commented Feb 19, 2020

torfmaster commented Feb 19, 2020

gendx commented Feb 20, 2020

Use static subscriptions

jrvanwhy commented May 27, 2020

gendx commented May 28, 2020

jrvanwhy commented May 28, 2020

gendx commented May 29, 2020

torfmaster commented May 29, 2020

gendx commented May 29, 2020

gendx commented May 29, 2020

jrvanwhy commented May 29, 2020

gendx commented Jun 2, 2020

jrvanwhy commented Jun 2, 2020

gendx commented Jun 3, 2020

jrvanwhy commented Feb 9, 2022

Subscriptions and shared memory have a loophole which can lead to UB #143

Subscriptions and shared memory have a loophole which can lead to UB #143

Comments

Woyten commented Feb 16, 2020

Overview

Solutions

Use a closured-based scope instead of a drop-based one

Use static subscriptions

Keep the situation as is

Vision

alistair23 commented Feb 18, 2020

jrvanwhy commented Feb 18, 2020

alistair23 commented Feb 18, 2020

torfmaster commented Feb 19, 2020

torfmaster commented Feb 19, 2020

gendx commented Feb 20, 2020

Use static subscriptions

jrvanwhy commented May 27, 2020

gendx commented May 28, 2020

jrvanwhy commented May 28, 2020

gendx commented May 29, 2020

torfmaster commented May 29, 2020

gendx commented May 29, 2020

gendx commented May 29, 2020

jrvanwhy commented May 29, 2020

gendx commented Jun 2, 2020

jrvanwhy commented Jun 2, 2020

gendx commented Jun 3, 2020

jrvanwhy commented Feb 9, 2022

Use a closured-based scope instead of a `drop`-based one