Async usercall interface for SGX enclaves #515

Merged (19 commits into fortanix:master, Jan 18, 2024)

Conversation

@vn971 vn971 commented Aug 28, 2023

Entering and exiting an SGX enclave is costly in terms of performance. It is much more efficient to keep executing within the enclave and communicate with the enclave-runner by passing messages. The tokio runtime can be used for such asynchronous communication.
This PR provides very basic support for this in EDP, but changes to mio and tokio still need to be upstreamed. These changes are fully backwards compatible; your existing enclaves will continue to run as expected.

Credits for this PR go to:
Mohsen: #404
YxC: #441

This commit is an attempt to have the async-usercalls finally merged into the main codebase (master branch).

@vn971 vn971 force-pushed the vas/async-usercalls-scratch branch from 12728f3 to f917b37 on August 29, 2023
@vn971 vn971 force-pushed the vas/async-usercalls-scratch branch 5 times, most recently from c2eae70 to 7dfaf7d on September 8, 2023
@raoulstrackx raoulstrackx (Contributor) left a comment:

Most things I found are minor, or can/should be picked up as part of another ticket.

Resolved review threads:
  • ipc-queue/src/position.rs (multiple threads)
  • intel-sgx/enclave-runner/src/usercalls/abi.rs
  • intel-sgx/enclave-runner/src/usercalls/mod.rs
  • intel-sgx/async-usercalls/src/callback.rs
  • intel-sgx/async-usercalls/src/lib.rs
  • .travis.yml
  • intel-sgx/async-usercalls/src/io_bufs.rs
  • intel-sgx/async-usercalls/src/queues.rs
@raoulstrackx (Contributor) commented:

Added one last comment related to MakeSend and ticket #530, but will approve the merge once this passes further testing.

@DragonDev1906 commented:

Short Questions (I have not read all the changes):

  • What exactly is meant by "async usercall interface"? As far as I can tell, the enclave_runner side already uses futures and thus allows writing UsercallExtensions with async functions. Does this change only affect the internals of rust-sgx (i.e. is it invisible from the enclave or runner), does it change the runner interface, or does it change how the enclave code can interact with the outside?
  • Are there examples of how the async usercall interface is used (from the application developer's side, if it doesn't just affect internals)?

At the moment I'm not sure what exactly is meant by "async usercall interface".

@raoulstrackx (Contributor) commented:

Good question @DragonDev1906, I've updated the description of this PR to make things clearer. Let me know if you still have questions. This PR doesn't have examples, but we'll add some once the changes to mio and tokio have been upstreamed and things are easier to use.

@DragonDev1906 commented Oct 16, 2023

Nice, I've had a few issues with dependencies that rely on tokio with the net feature, which made it impossible to use them. Thank you for the clarification.

I do have two more questions (though I'm not sure if this is the right place to ask them):
At the moment I only have sync code in the enclave, with a custom runner (using tokio and handling TLS termination, where I don't need it in the enclave) responsible for pushing data received from other systems to the enclave. Basically I'm just sending a continuous list of commands with data and processing any results returned from the enclave.

  • Will enclaves without async code, where the runner doesn't have to wait for the enclave to finish before sending the next command, benefit from this change? (I think it might be a good idea not to use async code in the enclave (application code) to lower the complexity, at least if the problem can be converted to a list of commands to execute, though please correct me if I'm wrong on that part.)
  • The second one is a bit less related to async usercalls, but since you've asked if I had more questions ...
    I'm currently contemplating which implementation is best for the situation stated above (goal: high throughput; I'm getting a medium amount of data but also need some computation on it, mainly hashing and signature verification, so I might even end up compute-bound):
    1. Communicate via TCP (no custom runner needed); likely slow because it needs to go into kernel space.
    2. Communicate via the existing usercall extensions (hence the question of whether there will be performance benefits in this situation).
    3. Communicate via the async usercall interface (unless that only makes sense when the enclave itself runs async code).
    4. Use the enclave in library mode. At the moment I have no idea how to estimate the performance of this approach, as it basically means no async at all (as far as I can tell) and having to wait for the previous call to finish before continuing. It may save on serialization and deserialization (unless that's done automatically anyway), but I think it gives less flexibility than a (buffered) TCP or (async) usercall extension.

I plan to test the throughput of those options, but perhaps you already have some experience or suggestions about which option may be the slowest or most inefficient, especially library mode, and whether such a system would even benefit from the async usercall interface changes. (It could also be useful to have such a comparison of communication options somewhere in the docs.)

(so many questions, sorry)

@raoulstrackx (Contributor) commented Oct 17, 2023

No worries @DragonDev1906

Will enclaves without async code, where the runner doesn't have to wait for the enclave to finish before sending the next command, benefit from this change?

No, without changes to your code, this PR doesn't have any impact for you.

... which implementation is best for the above stated situation...
i. Communicate via TCP (no custom runner needed), likely slow because it needs to go into kernel space

If you use the changes in this PR to build an async enclave, your code will be a bit more readable. The biggest change is that you don't need to enter/exit the enclave to request new commands or return responses. If the enclave is compute-heavy, the performance benefit of that may be minimal. Async code works best when it no longer blocks on I/O but can do something useful while it waits for some event. Based on your description, you may already be doing that with a custom runner.

ii. Communicate via the existing Usercall Extensions

See previous answer

iii. Communicate via the async usercall interface (unless that only makes sense when the enclave itself runs async code).

Yes, that only makes sense if the enclave runs async code.

iv. Use the enclave in library mode.

That seems unrelated to whether you write sync or async code.

@DragonDev1906 commented:

Biggest change would be that you don't need to enter/exit the enclave to request new commands/return responses.

Just to see if I understood that correctly: the changes in this PR (when using the new async interface) mean that multiple usercalls can/will be batched into a single ECALL (enter/exit), with the ability to use async code to send multiple usercalls without waiting for the response. But there still needs to be at least one ECALL (for the entire batch) before the runner can process the usercalls, and the same for the way back, correct?


Just some info if you're interested, @raoulstrackx:

Based on your description, you may already be doing that with a custom runner.

Yeah, my enclave is not waiting for any responses for requests sent out (that's handled outside the enclave) and only blocks while trying to read new commands (currently via TCP) or writing results (also via TCP), but new commands don't depend on previous results unless something goes wrong.

If you use the changes in this PR to build an async enclave, your code will be a bit more readable. [...]
Based on your description, you may already be doing that with a custom runner.

I've thought about implementing it in an "enclave requests the data and waits for the response" way, where async usercalls would likely be a big performance benefit and/or be a lot more readable. My conclusion was that there is a rather big trade-off:

  • If the enclave does the requests directly, without an intermediary, it needs to terminate TLS. That is necessary for most use cases, but for me the data integrity is provided in the data itself, using hashes, merkle trees and signatures, so TLS termination in the enclave only added complexity.
  • If I have a simple intermediary that just strips TLS and the enclave sends requests to get some data (typical async model), the requesting logic would be simpler, as the runner wouldn't have to know what data is needed next, but it hides things a malicious runner could do. Additionally, if there is a bug in the enclave code (e.g. requesting the wrong data), the enclave code would need to be updated, not just the runner code (in our case updating the runner code is a lot easier).
  • With the approach I've now chosen (which I hopefully won't regret): the runner provides the data and the enclave just checks its validity (and whether anything is missing). The runner clearly has the ability to decide the order of the data/commands (which it kind of could do anyway, but not as easily), and a change to the order only requires updating the runner code. The enclave code is simpler: it doesn't need much code for network communication, it doesn't need async code, and its execution can be deterministic given the input (hard to do with async), which makes auditing the enclave code easier. But that comes at additional complexity in the code generating the commands to run, and thus a bigger chance that the entire system stops working until the runner is updated. The main disadvantage is having to split the processing logic from the data fetching.

I'm not yet sure if this architecture is going to bite me at some point. It's good to know that there will be an efficient way to implement it in an "enclave asks for data" way, should the need arise because a complete separation of fetching and logic gets too difficult.

@raoulstrackx (Contributor) commented:

@DragonDev1906 sorry, I forgot to reply to your comment.

The changes in this PR (when using the new async interface) are going to mean that multiple usercalls can/will be batched into a single ECALL (enter/exit),

Strictly speaking: yes, but I think you misunderstood a bit how EDP is expected to be used. The idea is to run an entire application in the enclave. So the single ECALL you refer to comes from the enclave-runner calling the enclave for the very first time. This eventually leads to the enclave calling your main function within its boundaries. From then on, all usercalls can be done asynchronously from within the enclave. See also the enclave execution lifecycle.

For questions/comments not specifically related to this PR, let's switch to the #rust-sgx channel in the Runtime-Encryption Slack workspace.

fn make_progress(&self, deferred: &[Identified<Usercall>]) -> usize {
    let sent = self.core.try_send_multiple_usercalls(deferred);
    if sent == 0 {
        self.core.send_usercall(deferred[0]);
A reviewer (Contributor) commented:

I think there's a potential runtime error here if someone calls make_progress with an empty slice for deferred. I don't think there's currently a place that calls this function with an empty slice, but someone could change the code in the future.
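A minimal sketch of the suggested guard, assuming simplified stand-in types for the real Identified<Usercall> and sender core (the excerpt above is truncated, so everything after the blocking send is illustrative):

struct Core;

impl Core {
    // Stub: pretend every usercall was sent.
    fn try_send_multiple_usercalls(&self, usercalls: &[u64]) -> usize {
        usercalls.len()
    }
    // Stub: a blocking send of a single usercall.
    fn send_usercall(&self, _usercall: u64) {}
}

struct Provider {
    core: Core,
}

impl Provider {
    fn make_progress(&self, deferred: &[u64]) -> usize {
        // Returning early avoids the out-of-bounds panic on `deferred[0]`
        // below when the slice is empty.
        if deferred.is_empty() {
            return 0;
        }
        let sent = self.core.try_send_multiple_usercalls(deferred);
        if sent == 0 {
            self.core.send_usercall(deferred[0]);
            return 1;
        }
        sent
    }
}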

A reviewer (Contributor) replied:

Good catch!

arai-fortanix previously approved these changes Nov 27, 2023
        return;
    }
    let sent = self.make_progress(&self.deferred);
    let mut not_sent = self.deferred.split_off(sent);
A reviewer (Contributor) commented:

Is there a reason why we're using the methodology here of doing split_off/clear/append rather than using drain(0..sent) or something like a VecDeque?

I suspect that the answer is that the common case is expected to be that make_progress() should normally be able to process the entire queue, which makes the split_off() and append() effectively no-ops. Which is fine, but it's probably useful to leave a comment for the next person who looks at this code and wonders the same thing.

@raoulstrackx raoulstrackx replied Jan 2, 2024:
Yes it works, but your drain(..sent) seems more readable. I'll update.
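A small self-contained sketch of the drain-based variant agreed on above; deferred stands in for the queue of pending usercalls and sent for how many make_progress reported as sent:

fn main() {
    let mut deferred: Vec<u32> = vec![10, 20, 30, 40];
    let sent = 2;
    // drain(..sent) removes the sent prefix in place, replacing the
    // split_off/clear/append sequence with a single call.
    deferred.drain(..sent);
    assert_eq!(deferred, vec![30, 40]);
}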

Resolved review thread: intel-sgx/async-usercalls/src/batch_drop.rs
}
let mut wrote = 0;
for buf in bufs {
    wrote += self.write(buf);
A reviewer (Contributor) commented:

If self.write() ever returns 0, we can break out of the loop. This might be an efficiency concern if the slice of bufs is long and we're frequently writing to a nearly-full buffer.

A reviewer (Contributor) replied:

There is even a strong requirement to return immediately when write returns 0: following writes may still succeed and thus increase the wrote counter. The caller would then incorrectly assume that the first slices were written while the latter ones weren't, which may not be the case.
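A sketch of the early-return behavior both comments converge on; write_bufs and the toy writer below are hypothetical stand-ins, not code from this PR:

fn write_bufs(bufs: &[&[u8]], mut write: impl FnMut(&[u8]) -> usize) -> usize {
    let mut wrote = 0;
    for buf in bufs {
        let n = write(buf);
        wrote += n;
        // Once a write returns 0 (the buffer is full), stop immediately so
        // that `wrote` only ever covers a contiguous prefix of `bufs`.
        if n == 0 {
            break;
        }
    }
    wrote
}

fn main() {
    // A toy writer that accepts at most 5 bytes in total.
    let mut capacity = 5usize;
    let bufs: [&[u8]; 3] = [b"abc", b"defg", b"hi"];
    let wrote = write_bufs(&bufs, |buf| {
        let n = buf.len().min(capacity);
        capacity -= n;
        n
    });
    assert_eq!(wrote, 5); // "abc" fully, "de" of "defg", then stop
}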

Review threads: intel-sgx/async-usercalls/src/lib.rs (two, both resolved)
    n => &returns[..n],
};
// 2. try to lock the mutex, if successful, receive all pending callbacks and put them in the hash map
let mut guard = match self.callbacks.try_lock() {
A reviewer (Contributor) commented:

It's unclear to me why the try_lock() is useful here. It would seem simpler, as efficient, and possibly more correct to write this as something like:

let mut guard = self.callbacks.lock().unwrap();
for (id, cb) in self.callback_rx.try_iter() {
    guard.insert(id, cb);
}

What am I missing?

@raoulstrackx raoulstrackx replied Jan 3, 2024:

I think this is meant as an optimization: when self.callbacks.try_lock() fails, it means multiple threads are competing for this resource, and the code then avoids receiving more callbacks. I can imagine that self.callback_rx.try_iter() is a potentially time-costly operation.
But I don't understand why this is correct in all cases. When the system is under very heavy load, the result of the usercall may be received in step 1 before the callback is added to self.callbacks. That eventually leads to the callback never being called. There's also a memory leak: the callback is eventually received in step 2, but never gets removed from the hash map.
@mzohreva Are we missing something?

A reviewer (Contributor) replied:

I think you are right, this should be simplified.

pub fn insert(&mut self, value: T) -> u32 {
    let initial_id = self.next_id;
    loop {
        let id = self.next_id;
A reviewer (Contributor) commented:

There's an extremely unlikely but severe potential performance problem with this approach for allocating ids. Suppose you have 2^32 - 1 calls that never terminate, and then a single other call which is fast but you never have more than one of those outstanding at once. Each time you make the fast call, you'll need to do 2^32 map lookups to find the available id.

I don't think we need to worry about this in practice.
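For reference, a self-contained sketch of the wrap-around id allocation being discussed, with a plain HashMap standing in for the PR's actual storage (names are illustrative):

use std::collections::HashMap;

struct CallMap<T> {
    map: HashMap<u32, T>,
    next_id: u32,
}

impl<T> CallMap<T> {
    pub fn insert(&mut self, value: T) -> u32 {
        let initial_id = self.next_id;
        loop {
            let id = self.next_id;
            self.next_id = self.next_id.wrapping_add(1);
            if !self.map.contains_key(&id) {
                // Free id found. In the pathological case described above,
                // nearly the entire 2^32 id space is scanned to get here.
                self.map.insert(id, value);
                return id;
            }
            // All 2^32 ids in use: avoid spinning forever.
            assert_ne!(self.next_id, initial_id, "id space exhausted");
        }
    }
}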

The same reviewer added:

But it does make me wonder if we might want some way to limit the length of the call queue to something smaller than 2^32, either blocking when the queue is full or switching to making the call synchronously in that case. This might not be useful for DSM, but I can imagine some type of EDP application in which enclave threads generate asynchronous calls faster than they can be serviced, and you might want the ability to either throttle the enclave threads when the queue gets full or switch to synchronous ocalls, instead of growing the queue without bounds and panicking when you run out of memory.
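A hypothetical sketch of the throttling idea: cap the number of outstanding async usercalls and block the enclave thread at the cap. Nothing like this exists in the PR; CallLimiter and its methods are illustrative only:

use std::sync::{Condvar, Mutex};

struct CallLimiter {
    outstanding: Mutex<usize>,
    cv: Condvar,
    cap: usize,
}

impl CallLimiter {
    fn new(cap: usize) -> Self {
        CallLimiter { outstanding: Mutex::new(0), cv: Condvar::new(), cap }
    }

    // Call before enqueueing an async usercall; blocks while the number of
    // outstanding calls is at the cap.
    fn acquire(&self) {
        let mut n = self.outstanding.lock().unwrap();
        while *n >= self.cap {
            n = self.cv.wait(n).unwrap();
        }
        *n += 1;
    }

    // Call once a usercall's result has been handled.
    fn release(&self) {
        *self.outstanding.lock().unwrap() -= 1;
        self.cv.notify_one();
    }
}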

A reviewer (Contributor) replied:

I agree this is a potential problem, but it can/should be fixed as a separate issue. @arai-fortanix what do you think?

A reviewer (Contributor) replied:

Fixing later seems fine.

A reviewer (Contributor) replied:

Created issue #550 for this

Vasili Novikov added 3 commits January 17, 2024 11:51
Credits for this commit go to:
Mohsen: fortanix#404
YxC: fortanix#441

This commit is an attempt to have the async-usercalls finally merged into the main codebase (master branch).
@vn971 vn971 force-pushed the vas/async-usercalls-scratch branch from 273c394 to 2c49170 on January 17, 2024
This change ports the old tests from .travis.yml to the new GitHub Actions setup used on the current master branch.

@vn971 vn971 force-pushed the vas/async-usercalls-scratch branch from 2bf3f6c to 51ea191 on January 17, 2024
@raoulstrackx raoulstrackx added this pull request to the merge queue Jan 18, 2024
Merged via the queue into fortanix:master with commit fe323cb Jan 18, 2024
1 check passed