
Execution driver uses NodeSyncState for certificate execution #3323

Merged 5 commits into main from mlogan-exec-driver on Jul 21, 2022

Conversation

@mystenmark (Contributor) commented Jul 20, 2022

This takes care of the performance TODOs in execution driver (for now, at least).

It also makes it acceptable for checkpoints to pend digests for which the node doesn't yet have certs (they will be fetched).

Note that this does not automatically fetch parent certs, and assumes that they will be enqueued as well (or are already executed). If we're not comfortable with this assumption we can certainly add parent fetching to NodeSyncState.

@mystenmark changed the title from "Mlogan exec driver" to "Execution driver uses NodeSyncState for certification execution" Jul 20, 2022
@mystenmark force-pushed the mlogan-exec-driver branch from 7c6b4a1 to ec51269 July 20, 2022 05:23
@mystenmark requested review from gdanezis and lxfind July 20, 2022 05:24
@mystenmark changed the title from "Execution driver uses NodeSyncState for certification execution" to "Execution driver uses NodeSyncState for certificate execution" Jul 20, 2022
@mystenmark marked this pull request as ready for review July 20, 2022 05:24
@gdanezis (Collaborator) commented:

> Note that this does not automatically fetch parent certs, and assumes that they will be enqueued as well (or are already executed). If we're not comfortable with this assumption we can certainly add parent fetching to NodeSyncState.

Unsure what the result of this is: can we guarantee that all previous certs will be enqueued, or do we need extra logic for this? I am particularly concerned about the precursor transactions to shared objects, which, if not present, may block the execution of a shared-object cert, slowing everyone down.

@gdanezis (Collaborator) left a comment:

There is a serious amount of stuff in the PR which is quite complex -- due to the nature of what we do. Do try to provide a block interface to drive / request execution rather than a tx-by-tx interface -- it will make our job down the line easier. I am still unclear whether we fetch deps for shared-object transactions -- but I think not.
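
To make the suggestion concrete, here is a minimal sketch (not from the PR) of the kind of block-oriented interface being asked for; the trait and the digest type are hypothetical stand-ins:

```rust
// Hypothetical digest type standing in for the PR's TransactionDigest.
type TransactionDigest = [u8; 32];

// Hypothetical trait contrasting the two interface shapes.
trait ExecutionRequester {
    // Per-transaction interface: one call per digest.
    fn request_execution(&self, digest: TransactionDigest);

    // Block interface: handing over a whole batch at once lets the driver
    // schedule, batch, and parallelize across the full set of digests.
    fn request_execution_block(&self, digests: Vec<TransactionDigest>);
}
```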

```rust
.handle_execution_request(pending_transactions.iter().map(|(_, digest)| *digest))
// zip results back together with seq
.zip(stream::iter(pending_transactions.iter()))
// filter out errors
```
@gdanezis (Collaborator):

Should we not do something about the errors? This is the place where, if some dependent previous transaction has not been processed, objects will not be available, etc. We should probably print / log something about these? And also ensure we process them using full sync?

```diff
  // this pattern for limiting concurrency is from
  // https://github.com/tokio-rs/tokio/discussions/2648
  let limit = Arc::new(Semaphore::new(MAX_NODE_SYNC_CONCURRENCY));
  let mut stream = Box::pin(stream);

- while let Some(DigestsMessage { digests, peer, tx }) = stream.next().await {
+ while let Some(DigestsMessage { sync_arg, peer, tx }) = stream.next().await {
```
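
For reference, a minimal self-contained sketch of the tokio semaphore pattern linked in that comment; the constant and the per-item work are placeholders, not the PR's actual code:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Placeholder for the PR's MAX_NODE_SYNC_CONCURRENCY constant.
const MAX_CONCURRENCY: usize = 16;

#[tokio::main]
async fn main() {
    let limit = Arc::new(Semaphore::new(MAX_CONCURRENCY));
    let mut handles = Vec::new();

    for digest in 0..100u64 {
        // Acquire an owned permit *before* spawning, which caps the number
        // of in-flight tasks at MAX_CONCURRENCY.
        let permit = limit.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            let _permit = permit; // released when dropped at end of task
            // ... process `digest` here ...
            let _ = digest;
        }));
    }

    for h in handles {
        h.await.unwrap();
    }
}
```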
@gdanezis (Collaborator):

Heads up: efficient parallel execution is best implemented if you pass in blocks of transactions rather than passing transactions in one at a time.

@mystenmark (Contributor Author):

What's your argument here for efficiency? I can imagine some theoretical benefits (e.g. cache coherency, less synchronization overhead in the channels) that might be obtained from better batching, but I wouldn't expect that to have a noticeable effect here.

Also, keep in mind that even if we pass in TXes in blocks, we are going to want to farm the execution out to multiple tasks for execution parallelism anyway. (This is a TODO right now - I'm waiting until I have a working devnet to do that so I can measure how much of a speedup it is.)

```rust
    })?;
}
trace!(?parent, ?digest, "waiting for parent");
// Since we no longer hold the semaphore permit, can be sure that our parent will be
```
@gdanezis (Collaborator):
Not sure this is good enough to prevent a deadlock: what if something else takes this permit and also blocks? Is that possible?

@mystenmark (Contributor Author):
If you read from the top of process_digest, you'll see that the only blocking task we do while holding the permit is downloading the cert and effects, after which we drop the permit. Downloading will always complete (successfully or otherwise), so permits cannot be held indefinitely.
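
A minimal sketch of the permit scoping described here, with placeholder types and functions (not the PR's actual process_digest):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Placeholder types standing in for the PR's certificate/effects types.
type Digest = u64;
type Effects = Vec<Digest>;

async fn download_cert_and_effects(_digest: Digest) -> Effects {
    // In the real code this is a network fetch that always completes,
    // successfully or otherwise.
    vec![]
}

async fn wait_for_parent(_parent: Digest) {
    // In the real code this waits until the parent cert has been executed.
}

async fn process_digest(limit: Arc<Semaphore>, digest: Digest) {
    // The permit is held only across the bounded download step...
    let permit = limit.clone().acquire_owned().await.unwrap();
    let effects = download_cert_and_effects(digest).await;
    drop(permit);
    // ...so waiting on parents happens without the permit, and a blocked
    // child cannot exhaust the concurrency slots needed by its parents.
    for parent in effects {
        wait_for_parent(parent).await;
    }
}
```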

```rust
        .forget_effects(&effects.effects.digest());
}
CertAndEffects::Validator(cert) => {
    self.state.handle_certificate(cert).await?;
```
@gdanezis (Collaborator):
This will not work, of course, if we do not have the dependencies.

@mystenmark (Contributor Author):
Correct - an error will be reported to the caller.

```rust
) -> impl Stream<Item = SuiResult> {
    let futures: FuturesOrdered<_> = checkpoint_contents
        .iter()
        .map(|digests| {
```
@gdanezis (Collaborator):
Honestly, I would make the base case the one where we send and process a whole block of transactions, and the special case the one where we have a block of 1.

```rust
    digests: impl Iterator<Item = TransactionDigest>,
) -> impl Stream<Item = SuiResult> {
    let futures: FuturesOrdered<_> = digests
        .map(|digest| {
```
@gdanezis (Collaborator):
Same as above, better to handle blocks of transactions.

```rust
where
    A: AuthorityAPI + Send + Sync + 'static + Clone,
{
    async fn handle_digest(&self, follower: &Follower<A>, digests: ExecutionDigests) -> SuiResult {
```
@gdanezis (Collaborator):
Here again, I would send blocks of transactions from the follower on follower block boundaries.

@mystenmark (Contributor Author) commented:

> There is a serious amount of stuff in the PR which is quite complex -- due to the nature of what we do.

Yes, my apologies - this code got fairly complex when dealing with the "wait for finality / trustworthy effects" problem, and it hasn't gotten any simpler since then. If I had time I would probably love to rewrite it and make it simpler, but we may be stuck with it for now.

@mystenmark (Contributor Author) commented:

>> Note that this does not automatically fetch parent certs, and assumes that they will be enqueued as well (or are already executed). If we're not comfortable with this assumption we can certainly add parent fetching to NodeSyncState.
>
> Unsure what the result of this is: can we guarantee that all previous certs will be enqueued, or do we need extra logic for this? I am particularly concerned about the precursor transactions to shared objects, which, if not present, may block the execution of a shared-object cert, slowing everyone down.

That's a good point - I was thinking more about checkpoints (since that's what motivated this work), and in that case I'm pretty sure it's impossible to have an orphaned tx in a fragment.

I will add support for parent fetching/execution.

@mystenmark force-pushed the mlogan-exec-driver branch from ec51269 to b6ac1c4 July 20, 2022 20:25
@mystenmark (Contributor Author) commented:

I thought some more about fetching parents. I think this issue was somewhat underexplored in the old code. Let's restrict the discussion to shared-object TXes that have been scheduled on a validator (checkpoints and nodesync are simple because we have final effects in hand in those cases):

  • It's not actually guaranteed that anyone else knows what the parent certs are yet - someone has to be the first validator to try executing the cert and determine whether there are lock errors - which, if there are, will not tell us what the missing certs are.
  • We do know that the signers of the cert can execute the cert, of which at least f+1 are honest.
  • So we could ask the signers of the cert to execute it for us - all honest signers must succeed and return effects.
  • However, we can't actually trust the effects given to us by any single signer (as sync_authority_source_to_destination does); we need to observe f+1 identical effects before we can trust them.
    • A byzantine validator could tell us about a made-up cert digest that never existed and send us chasing our tail.
    • More worryingly, if the point of doing this work is to avoid deadlocking (or suffering high latency), we shouldn't open the door for a byzantine validator to feed us a set of parents that it believes will trigger a deadlock (or high latency).

Ok, so, all that said, it seems that to do this correctly, after encountering owned-object lock errors, we should execute the cert on other validators until we get an f+1 quorum that all give us identical effects in response. Then we can ask those same validators for the missing certs (recursively). This could potentially cause O(n^2) network traffic, so it seems we should be cautious about doing this. Also, this is a decent chunk of additional code.
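
For illustration only, a minimal sketch of the f+1 check described above; the digest type and helper name are hypothetical, and the real check would weight validators by stake rather than counting them:

```rust
use std::collections::HashMap;

// Hypothetical digest type; in the real code this would be an effects digest.
type EffectsDigest = [u8; 32];

// Return the effects digest once at least f + 1 validators have reported
// identical effects: any set of f + 1 reports must contain at least one
// honest validator, so the matching effects can be trusted.
fn quorum_effects(
    reports: impl IntoIterator<Item = EffectsDigest>,
    f: usize,
) -> Option<EffectsDigest> {
    let mut counts: HashMap<EffectsDigest, usize> = HashMap::new();
    for digest in reports {
        let count = counts.entry(digest).or_insert(0);
        *count += 1;
        if *count >= f + 1 {
            return Some(digest);
        }
    }
    None
}
```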

Alternatively, the entire problem can be avoided by putting the responsibility on the client to push certs to as many validators as it can (which is what QuorumDriver already does) - if it fails to do so, and if gossip also fails to propagate the certs in question, then yes, some additional latency is possible. But I feel like going with the simpler, client-driven model is "the Sui way" and avoids a lot of potential pitfalls.

Currently, the only user of the execution driver that doesn't also have true effects to start with is the shared-object tx case. So my questions are:

  1. Do we expect any other users?
  2. What is the likely impact of not doing parent fetching in this case? I believe it should be small - we already have two completely independent mechanisms for mitigating this issue (gossip and quorum driver). I think before we add a third mechanism we should wait for some evidence that the existing mechanisms are insufficient.

@mystenmark requested a review from gdanezis July 20, 2022 21:05
@mystenmark (Contributor Author) commented:

Discussed parent syncing with @gdanezis - the decision is to add parent syncing so that the execution driver is guaranteed to make progress and acts as a backstop to the fallible processes of gossip and quorum driver.

@mystenmark (Contributor Author) commented:

Re processing transactions in blocks - will do some benchmarking or profiling to determine whether that is necessary. Checkpoint sync (when it is working) may also serve as a good test case since it will eliminate much of the bookkeeping overhead.

@mystenmark enabled auto-merge (squash) July 21, 2022 17:04
@mystenmark merged commit 6e78098 into main Jul 21, 2022
@mystenmark deleted the mlogan-exec-driver branch July 21, 2022 17:15