
Make unified scheduler's new task code fallible #1071

Closed

Conversation

@ryoqun ryoqun commented Apr 26, 2024

Problem

Currently, there's no way for the unified scheduler to propagate errors back to the callers (the replay stage) until the bank is frozen.

So, the dead-block marking by the replay stage could be delayed by maliciously-crafted blocks.

Summary of Changes

Make the new-task code path return `Result`s so that, when new tasks are about to be submitted to the unified scheduler, a previously-scheduled transaction's error is forcibly returned, notifying the replay stage earlier than at the block boundary.

This PR is preparation for the last major piece of unified scheduler functionality: proper shutdown.

EDIT: Note that this PR only changes the interfaces; the actual implementation still doesn't return errors. So there is no functional change in this PR. The immediately following PR will actually implement the shutdown (warning: the impl is as robust as I could make it, but quite complex at the same time...).

(also this PR contains a bunch of minor unrelated cleanups...) (EDIT: these are reverted for ease of review)

context: extracted from #1122
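
For illustration only, a rough sketch of the shape this gives the scheduler-facing trait; `recover_error_after_abort` follows the diff excerpt later in this thread, while the other names and the error type are assumptions of this sketch, not the exact signatures in the diff:

```rust
// Hypothetical, simplified sketch of the fallible new-task code path.
type TransactionError = String; // stand-in for the real error type
struct SanitizedTransaction;    // stand-in for the real transaction type

trait InstalledSchedulerSketch {
    /// Submitting a new task is now fallible: an `Err` here means a
    /// previously-scheduled transaction has already failed, so the replay
    /// stage can learn about it before reaching the block boundary.
    fn schedule_execution(
        &self,
        transaction_with_index: (&SanitizedTransaction, usize),
    ) -> Result<(), TransactionError>;

    /// After `schedule_execution()` has returned `Err(_)`, recover the error
    /// of the first bad transaction that aborted the scheduler.
    fn recover_error_after_abort(&mut self) -> TransactionError;
}
```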

@ryoqun ryoqun requested a review from apfitzge April 26, 2024 06:23

@apfitzge apfitzge left a comment


(also this PR contains a bunch of minor unrelated cleanups...)

I think these may be distracting me from the actual functional diff here.

In what manner would we expect the scheduling of new tasks to fail? afaict this does not ever return an error in this PR?
I thought the major issue here is that failed transaction results (not scheduling) do not get properly propagated back to the replay thread.

@ryoqun ryoqun force-pushed the unified-scheduler-fallible-new-task branch from 29ca732 to 410dbc5 Compare April 27, 2024 06:10

ryoqun commented Apr 30, 2024

(also this PR contains a bunch of minor unrelated cleanups...)

I think these may be distracting me from the actual functional diff here.

Fair point. I think I stuffed too much into this prep PR...

In what manner would we expect the scheduling of new tasks to fail? afaict this does not ever return an error in this PR? I thought the major issue here is that failed transaction results (not scheduling) do not get properly propagated back to the replay thread.

Again, thanks for raising a good question... That led me to rethink the impl to begin with (thus the delayed reply...). I'm reorganizing the PR queue; the renewed first prep PR is this: #1126

Also, I started to draft up the rationale for this seemingly odd function signature here: #1122

I'll close this PR for now.

@ryoqun ryoqun closed this Apr 30, 2024
@ryoqun ryoqun reopened this May 2, 2024
@ryoqun ryoqun force-pushed the unified-scheduler-fallible-new-task branch 2 times, most recently from 4e18ba0 to 71e36c9 Compare May 2, 2024 13:55
/// That said, calling this multiple times is completely acceptable after the error observation
/// from `schedule_execution()`. While it's not guaranteed, the same `.clone()`-ed errors of
/// the first bad transaction are usually returned across invocations,
fn recover_error_after_abort(&mut self) -> TransactionError;
Member Author


This is the new fn for the fallible new-task code path.

@ryoqun ryoqun force-pushed the unified-scheduler-fallible-new-task branch 2 times, most recently from 52cd340 to 89461f5 Compare May 2, 2024 14:17
@ryoqun ryoqun force-pushed the unified-scheduler-fallible-new-task branch from 89461f5 to 27922d5 Compare May 2, 2024 14:19
// Lastly, this non-atomic nature is intentional for optimizing the fast code-path
let mut scheduler_guard = self.inner.scheduler.write().unwrap();
let scheduler = scheduler_guard.as_mut().unwrap();
return Err(scheduler.recover_error_after_abort());
Member Author


While the current WIP `.recover_error_after_abort()` impl panics with `todo!()`, this code will never be reached, because the current WIP `.schedule_execution()` impl never returns `Err(_)` to begin with.
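
For context, a rough, self-contained sketch of the control flow the excerpt above sits in; the type and method names only loosely mirror `BankWithScheduler::schedule_transaction_executions()` and are assumptions of this sketch, not the real implementation:

```rust
use std::sync::RwLock;

// Stand-ins so the sketch is self-contained; not the real types.
type TransactionError = String;

struct Scheduler;
impl Scheduler {
    // In this sketch scheduling always succeeds, mirroring the current WIP
    // impl where the Err branch below is unreachable.
    fn schedule_execution(&self, _tx: &str, _index: usize) -> Result<(), ()> {
        Ok(())
    }
    fn recover_error_after_abort(&mut self) -> TransactionError {
        "error of the first bad transaction".to_string()
    }
}

struct SchedulerInner {
    scheduler: RwLock<Option<Scheduler>>,
}

struct BankWithSchedulerSketch {
    inner: SchedulerInner,
}

impl BankWithSchedulerSketch {
    fn schedule_transaction_executions<'a>(
        &self,
        txs_with_indexes: impl Iterator<Item = (&'a str, usize)>,
    ) -> Result<(), TransactionError> {
        for (tx, index) in txs_with_indexes {
            // Fast path: a read lock is enough while the scheduler is healthy.
            let result = {
                let guard = self.inner.scheduler.read().unwrap();
                guard.as_ref().unwrap().schedule_execution(tx, index)
            };
            if result.is_err() {
                // Slow path: the scheduler aborted earlier; take the write
                // lock and surface the recovered error to the replay stage.
                let mut guard = self.inner.scheduler.write().unwrap();
                return Err(guard.as_mut().unwrap().recover_error_after_abort());
            }
        }
        Ok(())
    }
}
```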

@ryoqun ryoqun requested a review from apfitzge May 2, 2024 14:22

ryoqun commented May 2, 2024

I'll close this PR for now.

I changed my mind yet again. I reopened this PR and rebooted it for yet another code review.

Also, I started to draft up the rationale for this seemingly odd function signature here: #1122

This revival is mainly because the mother PR (#1122) got too big.

In what manner would we expect the scheduling of new tasks to fail? afaict this does not ever return an error in this PR? I thought the major issue here is that failed transaction results (not scheduling) do not get properly propagated back to the replay thread.

Again, thanks for raising a good question... That led me to rethink the impl to begin with (thus the delayed reply...). I'm reorganizing the PR queue; the renewed first prep PR is this: #1126

Hope I documented the context in detail in the source code this time, in this PR...

(also this PR contains a bunch of minor unrelated cleanups...)

I think these may be distracting me from the actual functional diff here.

Fair point. I think I stuffed too much into this prep PR...

I reverted them in this PR now.


apfitzge commented May 6, 2024

It's difficult to tell if this is the correct interface without seeing the implementation of how we check for errors.
Is it strictly related to sending txs via the channel, or are we checking some shared variable?

If the former, I can see how this interface makes sense. If we're checking some shared variable for an error, it seems it'd make sense to have separate calls: `schedule_batches`, `check_for_errors`.
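
Purely to make the two shapes being contrasted concrete (the `schedule_batches` / `check_for_errors` names come from the comment above; everything else here is hypothetical):

```rust
// Stand-in types so both sketches compile.
struct Task;
type TransactionError = String;

// Shape A: combined — scheduling itself is fallible, as proposed in this PR.
trait CombinedSketch {
    fn schedule_execution(&self, task: Task) -> Result<(), TransactionError>;
}

// Shape B: split — infallible scheduling plus a separate error poll, which
// would fit an implementation that inspects some shared error variable.
trait SplitSketch {
    fn schedule_batches(&self, tasks: Vec<Task>);
    fn check_for_errors(&self) -> Option<TransactionError>;
}
```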


ryoqun commented May 7, 2024

It's difficult to tell if this is the correct interface without seeing the implementation of ...

Thanks for trying to review this PR again. Seems I failed to start a constructive code-review session by teasing too much with this interface-only split PR...

Thanks for your patience, and let's pivot the review style. I created #1211 as a full-blown, review-ready PR, which contains this PR's changes plus the actual implementation.

... how we check for errors. Is it strictly related to sending txs via the channel, or are we checking some shared variable?

If the former, I can see how this interface makes sense. If we're checking some shared variable for an error, it seems it'd make sense to have separate calls: `schedule_batches`, `check_for_errors`.

The actual implementation is kind of a hybrid: the initial error-condition detection is piggybacked on sending txs via the channel, and the actual error retrieval (and internal thread joining) is done by checking (or memoizing) some shared variable in a separate call (`recover_error_after_abort()`). That said, this separation is rather quickly abstracted away at the immediately higher layer, `BankWithScheduler::schedule_transaction_executions()`, for runtime efficiency.
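
A minimal sketch of that hybrid, under assumed names (a std mpsc channel stands in for the scheduler's real task channel, and a memoized error slot stands in for the shared variable; none of this is the actual implementation):

```rust
use std::sync::{mpsc, Mutex};

type TransactionError = String; // stand-in for the real error type
struct Task;                    // stand-in for a scheduled transaction task

struct PooledSchedulerSketch {
    // Sending a task doubles as the cheap error-condition check: if the
    // internal scheduler thread has aborted and dropped its receiver, the
    // send fails and the caller sees an error without extra synchronization.
    task_sender: mpsc::Sender<Task>,
    // The concrete error of the first bad transaction is recovered (and the
    // internal threads joined) lazily, then memoized here for later calls.
    recovered_error: Mutex<Option<TransactionError>>,
}

impl PooledSchedulerSketch {
    fn schedule_execution(&self, task: Task) -> Result<(), ()> {
        // Initial detection piggybacked on the channel send itself.
        self.task_sender.send(task).map_err(|_| ())
    }

    fn recover_error_after_abort(&mut self) -> TransactionError {
        let mut slot = self.recovered_error.lock().unwrap();
        slot.get_or_insert_with(|| {
            // The real impl would join its handler threads and take the error
            // they reported; a placeholder stands in for that here.
            TransactionError::from("error of the first bad transaction")
        })
        .clone()
    }
}
```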

}

fn recover_error_after_abort(&mut self) -> TransactionError {
todo!("in later pr...");


I am not a fan of letting `todo!` into the master branch. I'd much rather see the error recovery code in this PR. OR in some initial PR with `dead_code`, and then this PR simply uses those changes.

It's too easy to forget a `todo!` when reviewing, since GitHub's view is often limited.

@ryoqun ryoqun May 9, 2024


I am not a fan of letting `todo!` into the master branch. I'd much rather see the error recovery code in this PR.

Hmm, how about closing this PR and switching to reviewing #1211? As that PR is a superset of this one, there's no `todo!()` there. Or, ...

OR in some initial PR with `dead_code`, and then this PR simply uses those changes.

... if the size of that PR isn't acceptable to review in one go for you, I can chunk the PR accordingly.

It's too easy to forget a `todo!` when reviewing, since GitHub's view is often limited.

I think we can leave some checkboxes in the PR description, if that works, so we don't forget.

IMO, an explicit `todo!()` isn't so different from the implicit (undocumented-but-definitely-existing) leak sources in master. And the remaining todos exist only in my mind at the moment... ;) Speaking of which, I can dump them somewhere and maintain the list if that's helpful.


Alright, that's fine. Let's just move to #1211.

/// previously-scheduled bad transaction, which terminates further block verification. So,
/// almost always, the returned error isn't due to the merely scheduling of the current
/// transaction itself. At this point, calling this does nothing anymore while it's still safe
/// to do. As soon as notified, callers is expected to stop processing upcoming transactions of

Suggested change
/// to do. As soon as notified, callers is expected to stop processing upcoming transactions of
/// to do. As soon as notified, callers are expected to stop processing upcoming transactions of

Member Author


Oops: d853f8f (contains bonus wording changes).

Also, this commit is cherry-picked for #1211.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 93.05556% with 5 lines in your changes missing coverage. Please review.

Project coverage is 82.1%. Comparing base (9403ca6) to head (d853f8f).
Report is 81 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master    #1071     +/-   ##
=========================================
- Coverage    82.1%    82.1%   -0.1%     
=========================================
  Files         880      880             
  Lines      235665   235714     +49     
=========================================
+ Hits       193716   193736     +20     
- Misses      41949    41978     +29     


ryoqun commented May 10, 2024

#1071 (comment):

Alright, that's fine. Let's just move to #1211.

Closing this PR in favor of #1211.
