
Make unified scheduler's new task code fallible #1071

Closed

Conversation

@ryoqun ryoqun commented Apr 26, 2024

Problem

Currently, there's no way for the unified scheduler to propagate errors back to the callers (the replay stage) until the bank is frozen.

So, the dead-block marking by the replay stage could be delayed by maliciously-crafted blocks.

Summary of Changes

Make the new-task code path return `Result`s so that, when new tasks are about to be submitted to the unified scheduler, a previously-scheduled transaction's error is forcibly returned, notifying the replay stage earlier than at the block boundary.

This PR is preparation for the last major piece of unified scheduler functionality: proper shutdown.

EDIT: Note that this PR only changes the interfaces; the actual implementation still doesn't return errors. So there is no functional change in this PR. The immediately following PR will actually implement the shutdown (warning: the impl is as robust as I could make it, but quite complex at the same time...).

(also this PR contains a bunch of minor unrelated cleanups...) (EDIT: these are reverted for ease of review)

context: extracted from #1122
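
For illustration only, a rough sketch of the shape this gives the scheduler-facing trait; `recover_error_after_abort` follows the diff excerpt later in this thread, while the other names and the error type are assumptions of this sketch, not the exact signatures in the diff:

```rust
// Hypothetical, simplified sketch of the fallible new-task code path.
type TransactionError = String; // stand-in for the real error type
struct SanitizedTransaction;    // stand-in for the real transaction type

trait InstalledSchedulerSketch {
    /// Submitting a new task is now fallible: an `Err` here means a
    /// previously-scheduled transaction has already failed, so the replay
    /// stage can learn about it before reaching the block boundary.
    fn schedule_execution(
        &self,
        transaction_with_index: (&SanitizedTransaction, usize),
    ) -> Result<(), TransactionError>;

    /// After `schedule_execution()` has returned `Err(_)`, recover the error
    /// of the first bad transaction that aborted the scheduler.
    fn recover_error_after_abort(&mut self) -> TransactionError;
}
```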

@ryoqun ryoqun requested a review from apfitzge April 26, 2024 06:23

@apfitzge apfitzge left a comment


(also this PR contains a bunch of minor unrelated cleanups...)

I think these may be distracting me from the actual functional diff here.

In what manner would we expect the scheduling of new tasks to fail? afaict this does not ever return an error in this PR?
I thought the major issue here is that failed transaction results (not scheduling) do not get properly propagated back to the replay thread.

@ryoqun ryoqun force-pushed the unified-scheduler-fallible-new-task branch from 29ca732 to 410dbc5 Compare April 27, 2024 06:10

ryoqun commented Apr 30, 2024

(also this PR contains a bunch of minor unrelated cleanups...)

I think these may be distracting me from the actual functional diff here.

Fair point. I think I stuffed too much into this prep PR...

In what manner would we expect the scheduling of new tasks to fail? afaict this does not ever return an error in this PR? I thought the major issue here is that failed transaction results (not scheduling) do not get properly propagated back to the replay thread.

Again, thanks for raising a good question... That led me to rethink the impl to begin with (thus the delayed reply...). I'm reorganizing the PR queue; the renewed first prep PR is this: #1126

Also, I started to draft up the rationale for this seemingly odd function signature here: #1122

I'll close this PR for now.

@ryoqun ryoqun closed this Apr 30, 2024
@ryoqun ryoqun reopened this May 2, 2024
@ryoqun ryoqun force-pushed the unified-scheduler-fallible-new-task branch 2 times, most recently from 4e18ba0 to 71e36c9 Compare May 2, 2024 13:55
/// That said, calling this multiple times is completely acceptable after the error observation
/// from `schedule_execution()`. While it's not guaranteed, the same `.clone()`-ed errors of
/// the first bad transaction are usually returned across invocations,
fn recover_error_after_abort(&mut self) -> TransactionError;
Member Author


This is the new fn for the fallible new-task code path.

@ryoqun ryoqun force-pushed the unified-scheduler-fallible-new-task branch 2 times, most recently from 52cd340 to 89461f5 Compare May 2, 2024 14:17
@ryoqun ryoqun force-pushed the unified-scheduler-fallible-new-task branch from 89461f5 to 27922d5 Compare May 2, 2024 14:19
// Lastly, this non-atomic nature is intentional for optimizing the fast code-path
let mut scheduler_guard = self.inner.scheduler.write().unwrap();
let scheduler = scheduler_guard.as_mut().unwrap();
return Err(scheduler.recover_error_after_abort());
Member Author


While the current WIP `.recover_error_after_abort()` impl panics with `todo!()`, this code will never be reached, because the current WIP `.schedule_execution()` impl never returns `Err(_)` to begin with.
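
For context, a rough, self-contained sketch of the control flow the excerpt above sits in; the type and method names only loosely mirror `BankWithScheduler::schedule_transaction_executions()` and are assumptions of this sketch, not the real implementation:

```rust
use std::sync::RwLock;

// Stand-ins so the sketch is self-contained; not the real types.
type TransactionError = String;

struct Scheduler;
impl Scheduler {
    // In this sketch scheduling always succeeds, mirroring the current WIP
    // impl where the Err branch below is unreachable.
    fn schedule_execution(&self, _tx: &str, _index: usize) -> Result<(), ()> {
        Ok(())
    }
    fn recover_error_after_abort(&mut self) -> TransactionError {
        "error of the first bad transaction".to_string()
    }
}

struct SchedulerInner {
    scheduler: RwLock<Option<Scheduler>>,
}

struct BankWithSchedulerSketch {
    inner: SchedulerInner,
}

impl BankWithSchedulerSketch {
    fn schedule_transaction_executions<'a>(
        &self,
        txs_with_indexes: impl Iterator<Item = (&'a str, usize)>,
    ) -> Result<(), TransactionError> {
        for (tx, index) in txs_with_indexes {
            // Fast path: a read lock is enough while the scheduler is healthy.
            let result = {
                let guard = self.inner.scheduler.read().unwrap();
                guard.as_ref().unwrap().schedule_execution(tx, index)
            };
            if result.is_err() {
                // Slow path: the scheduler aborted earlier; take the write
                // lock and surface the recovered error to the replay stage.
                let mut guard = self.inner.scheduler.write().unwrap();
                return Err(guard.as_mut().unwrap().recover_error_after_abort());
            }
        }
        Ok(())
    }
}
```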

@ryoqun ryoqun requested a review from apfitzge May 2, 2024 14:22

ryoqun commented May 2, 2024

I'll close this PR for now.

I changed my mind yet again. I reopened this PR and rebooted it for yet another code review.

Also, I started to draft up the rationale for this seemingly odd function signature here: #1122

This revival is mainly because the mother PR (#1122) got too big.

In what manner would we expect the scheduling of new tasks to fail? afaict this does not ever return an error in this PR? I thought the major issue here is that failed transaction results (not scheduling) do not get properly propagated back to the replay thread.

Again, thanks for raising a good question... That led me to rethink the impl to begin with (thus the delayed reply...). I'm reorganizing the PR queue; the renewed first prep PR is this: #1126

Hope I documented the context in detail in the source code this time, in this PR...

(also this PR contains a bunch of minor unrelated cleanups...)

I think these may be distracting me from the actual functional diff here.

Fair point. I think I stuffed too much into this prep PR...

I reverted them in this PR now.


apfitzge commented May 6, 2024

It's difficult to tell if this is the correct interface without seeing the implementation of how we check for errors.
Is it strictly related to sending txs via the channel, or are we checking some shared variable?

If the former, I can see how this interface makes sense. If we're checking some shared variable for an error, it seems it'd make sense to have separate calls: `schedule_batches`, `check_for_errors`.
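
Purely to make the two shapes being contrasted concrete (the `schedule_batches` / `check_for_errors` names come from the comment above; everything else here is hypothetical):

```rust
// Stand-in types so both sketches compile.
struct Task;
type TransactionError = String;

// Shape A: combined — scheduling itself is fallible, as proposed in this PR.
trait CombinedSketch {
    fn schedule_execution(&self, task: Task) -> Result<(), TransactionError>;
}

// Shape B: split — infallible scheduling plus a separate error poll, which
// would fit an implementation that inspects some shared error variable.
trait SplitSketch {
    fn schedule_batches(&self, tasks: Vec<Task>);
    fn check_for_errors(&self) -> Option<TransactionError>;
}
```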


ryoqun commented May 7, 2024

It's difficult to tell if this is the correct interface without seeing the implementation of ...

Thanks for trying to review this PR again. Seems I failed to start a constructive code-review session by teasing too much with this interface-only split PR...

Thanks for your patience, and let's pivot the review style. I created #1211 as a full-blown, review-ready PR, which contains this PR's changes plus the actual implementation.

... how we check for errors. Is it strictly related to sending txs via the channel, or are we checking some shared variable?

If the former, I can see how this interface makes sense. If we're checking some shared variable for an error, it seems it'd make sense to have separate calls: `schedule_batches`, `check_for_errors`.

The actual implementation is kind of a hybrid: the initial error-condition detection is piggybacked on sending txs via the channel, and the actual error retrieval (and internal thread joining) is done by checking (or memoizing) some shared variable in a separate call (`recover_error_after_abort()`). That said, this separation is rather quickly abstracted away at the immediately higher layer, `BankWithScheduler::schedule_transaction_executions()`, for runtime efficiency.
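
A minimal sketch of that hybrid, under assumed names (a std mpsc channel stands in for the scheduler's real task channel, and a memoized error slot stands in for the shared variable; none of this is the actual implementation):

```rust
use std::sync::{mpsc, Mutex};

type TransactionError = String; // stand-in for the real error type
struct Task;                    // stand-in for a scheduled transaction task

struct PooledSchedulerSketch {
    // Sending a task doubles as the cheap error-condition check: if the
    // internal scheduler thread has aborted and dropped its receiver, the
    // send fails and the caller sees an error without extra synchronization.
    task_sender: mpsc::Sender<Task>,
    // The concrete error of the first bad transaction is recovered (and the
    // internal threads joined) lazily, then memoized here for later calls.
    recovered_error: Mutex<Option<TransactionError>>,
}

impl PooledSchedulerSketch {
    fn schedule_execution(&self, task: Task) -> Result<(), ()> {
        // Initial detection piggybacked on the channel send itself.
        self.task_sender.send(task).map_err(|_| ())
    }

    fn recover_error_after_abort(&mut self) -> TransactionError {
        let mut slot = self.recovered_error.lock().unwrap();
        slot.get_or_insert_with(|| {
            // The real impl would join its handler threads and take the error
            // they reported; a placeholder stands in for that here.
            TransactionError::from("error of the first bad transaction")
        })
        .clone()
    }
}
```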

}

fn recover_error_after_abort(&mut self) -> TransactionError {
todo!("in later pr...");


I am not a fan of letting `todo!` into the master branch. I'd much rather see the error recovery code in this PR. OR in some initial PR with `dead_code`, and then this PR simply uses those changes.

It's too easy to forget a `todo!` when reviewing, since GitHub's view is often limited.

@ryoqun ryoqun May 9, 2024


I am not a fan of letting `todo!` into the master branch. I'd much rather see the error recovery code in this PR.

Hmm, how about closing this PR and switching to reviewing #1211? As that PR is a superset of this one, there's no `todo!()` there. Or, ...

OR in some initial PR with `dead_code`, and then this PR simply uses those changes.

... if the size of that PR isn't acceptable to review in one go for you, I can chunk the PR accordingly.

It's too easy to forget a `todo!` when reviewing, since GitHub's view is often limited.

I think we can leave some checkboxes in the PR description, if that works, so we don't forget.

IMO, an explicit `todo!()` isn't so different from the implicit (undocumented-but-definitely-existing) leak sources in master. And the remaining todos exist only in my mind at the moment... ;) Speaking of which, I can dump them somewhere and maintain the list if that's helpful.


Alright, that's fine. Let's just move to #1211.

/// previously-scheduled bad transaction, which terminates further block verification. So,
/// almost always, the returned error isn't due to the merely scheduling of the current
/// transaction itself. At this point, calling this does nothing anymore while it's still safe
/// to do. As soon as notified, callers is expected to stop processing upcoming transactions of

Suggested change
/// to do. As soon as notified, callers is expected to stop processing upcoming transactions of
/// to do. As soon as notified, callers are expected to stop processing upcoming transactions of

Member Author


Oops: d853f8f (contains bonus wording changes).

Also, this commit is cherry-picked for #1211.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 93.05556% with 5 lines in your changes missing coverage. Please review.

Project coverage is 82.1%. Comparing base (9403ca6) to head (d853f8f).
Report is 81 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #1071     +/-   ##
=========================================
- Coverage    82.1%    82.1%   -0.1%     
=========================================
  Files         880      880             
  Lines      235665   235714     +49     
=========================================
+ Hits       193716   193736     +20     
- Misses      41949    41978     +29     


ryoqun commented May 10, 2024

#1071 (comment):

Alright, that's fine. Let's just move to #1211.

Closing this PR in favor of #1211.
