Finish unified scheduler plumbing with min impl #34300

ryoqun · 2023-12-01T15:26:50Z

Problem

I'm quite high in the midnight, concluding all the plumbing efforts so far with this pr.

(translate: this is the last missing piece of code (trait impl boilerplate, cli integration) before landing the unified scheduler impl)

Summary of Remedies

get some sleep.

(translate: this pr adds the last plumbing with extensive unit tests, after this plumbing was confirmed to work at ryoqun#15. also, end-to-end integration niceties: the dreadful local-cluster test and quick-and-dirty run-sanity.sh tweaks).

Context

extracted from: #33070

also note that this is the next big chunk of code (contains the real code for unified scheduler): ryoqun#15 (a bit desynced from the tip of #33070 right now...)

ryoqun · 2023-12-01T15:35:18Z

unified-scheduler-logic/Cargo.toml

adding this crate isn't strictly needed for this pr. but I'm just piggybacking..

ryoqun · 2023-12-01T15:36:27Z

unified-scheduler-pool/Cargo.toml

this crate name and solana-unified-scheduler-logic are already reserved on crates.io:

https://crates.io/crates/solana-unified-scheduler-pool
https://crates.io/crates/solana-unified-scheduler-logic

ryoqun · 2023-12-01T15:43:28Z

unified-scheduler-pool/src/lib.rs

+}
+
+#[derive(Debug)]
+pub struct DefaultTaskRunner;


oops. leftover from rename.. read this as DefaultTaskHandler...

codecov · 2023-12-01T16:37:34Z

Codecov Report

Merging #34300 (1814e1a) into master (4181ea4) will increase coverage by 0.0%.
Report is 1 commits behind head on master.
The diff coverage is 94.0%.

Additional details and impacted files

@@           Coverage Diff            @@
##           master   #34300    +/-   ##
========================================
  Coverage    81.8%    81.8%            
========================================
  Files         820      822     +2     
  Lines      220790   221195   +405     
========================================
+ Hits       180679   181131   +452     
+ Misses      40111    40064    -47

apfitzge

Had a few initial comments, which I think should be discussed before continuing the review.

core/src/validator.rs

unified-scheduler-pool/src/lib.rs

apfitzge

Mostly happy with it, but a few concerns around context.

unified-scheduler-pool/src/lib.rs

core/src/validator.rs

unified-scheduler-pool/src/lib.rs

apfitzge · 2023-12-12T18:02:17Z

unified-scheduler-pool/src/lib.rs

+    }
+
+    fn context(&self) -> &SchedulingContext {
+        self.context.as_ref().expect("active context should exist")


Having a bit of trouble reasoning through the safety of this expect.

I can see when we create new banks in BankForks this context will always be Some if we have an installed scheduler pool.

context gets checked when we schedule txs, and when we return a scheduler to pool

When returning a scheduler:

wait_for_termination removes our context iff wait_reason is not paused

in InstalledSchedulerPool::wait_for_termination we call the scheduler wait_for_termination, then return_to_pool.

So if the wait reason is not paused, we call wait_for_termination which will take our context, then we call return_to_pool which will assert context is none.

Seems safe, but a little hard to track that down.
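The lifecycle traced above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this pr: `SchedulingContext`, `WaitReason` and `PooledScheduler` are simplified stand-ins, the variant names and `has_context()` are assumptions made for the example, and the pool is just a `Vec`.

```rust
/// Simplified stand-in for the real scheduling context (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct SchedulingContext {
    pub slot: u64,
}

/// Simplified stand-in for the wait reason; variant names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WaitReason {
    TerminatedToFreeze,
    PausedForRecentBlockhash,
}

impl WaitReason {
    pub fn is_paused(self) -> bool {
        matches!(self, WaitReason::PausedForRecentBlockhash)
    }
}

#[derive(Debug)]
pub struct PooledScheduler {
    // `Some` while associated with a bank, `None` once terminated.
    context: Option<SchedulingContext>,
}

impl PooledScheduler {
    pub fn new(context: SchedulingContext) -> Self {
        Self { context: Some(context) }
    }

    /// Panics if the scheduler was already terminated (the `expect` under discussion).
    pub fn context(&self) -> &SchedulingContext {
        self.context.as_ref().expect("active context should exist")
    }

    /// Hypothetical helper, used only to observe the state in this sketch.
    pub fn has_context(&self) -> bool {
        self.context.is_some()
    }

    /// Drops the context iff the wait reason is not a pause.
    pub fn wait_for_termination(&mut self, reason: WaitReason) {
        if !reason.is_paused() {
            let dropped = self.context.take();
            assert!(dropped.is_some(), "terminated twice without being paused");
        }
    }

    /// Consumes the scheduler; the context must already be gone.
    pub fn return_to_pool(self, pool: &mut Vec<PooledScheduler>) {
        assert!(self.context.is_none());
        pool.push(self);
    }
}
```

In this shape, the `.expect(...)` in `context()` only holds between creation and the first non-paused `wait_for_termination`, which is the window that is hard to verify from the call sites alone.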

When scheduling txs:

We are operating on some bank within BankForks, it's difficult for me to conceptually guarantee this has not had the return_to_pool called on it.

Actual Suggestions instead of rambling thoughts

Wishing I had caught this earlier because I'm rethinking the traits.
I wonder if we could possibly get type-safety around the context by distinctly separating the wait_for_termination for paused vs not paused.

It seems the scheduler is in one of two states:

In the pool, doing nothing, waiting for context

Associated with a bank/context, and being used for scheduling/execution

It seems like the pool could have some Box<dyn InstallableScheduler>, where InstallableScheduler has some fn to transition it to Box<dyn InstalledScheduler>.
And InstalledScheduler must also have a fn to transition back to InstallableScheduler before we can return to pool, via the wait for termination.

That way, we can guarantee that if we have an installed scheduler we always have a context.
This does make the type-system more complicated, but I'm leaning towards it being a complicated enough relationship since I'm dumb and cannot explicitly verify safety around it.
wdyt?
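A minimal sketch of the two-state idea, assuming hypothetical method names (`install`, `take_scheduler`, `idle_count`) and toy `Idle`/`Active` types that are not part of this pr. Each transition consumes the boxed scheduler, so an installed scheduler cannot exist without a context and `context()` needs no `expect`.

```rust
/// Simplified stand-in for the real scheduling context (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct SchedulingContext {
    pub slot: u64,
}

/// State 1: in the pool, doing nothing, waiting for a context.
pub trait InstallableScheduler {
    fn install(self: Box<Self>, context: SchedulingContext) -> Box<dyn InstalledScheduler>;
}

/// State 2: associated with a bank/context and usable for scheduling.
pub trait InstalledScheduler {
    fn context(&self) -> &SchedulingContext;
    /// Consumes the installed scheduler and hands back the idle form.
    fn wait_for_termination(self: Box<Self>) -> Box<dyn InstallableScheduler>;
}

struct Idle;

struct Active {
    context: SchedulingContext,
}

impl InstallableScheduler for Idle {
    fn install(self: Box<Self>, context: SchedulingContext) -> Box<dyn InstalledScheduler> {
        Box::new(Active { context })
    }
}

impl InstalledScheduler for Active {
    fn context(&self) -> &SchedulingContext {
        // Always present by construction; no Option, no expect.
        &self.context
    }

    fn wait_for_termination(self: Box<Self>) -> Box<dyn InstallableScheduler> {
        // The context is dropped here together with `Active`.
        Box::new(Idle)
    }
}

pub struct Pool {
    idle: Vec<Box<dyn InstallableScheduler>>,
}

impl Pool {
    pub fn new() -> Self {
        Self { idle: Vec::new() }
    }

    pub fn take_scheduler(&mut self, context: SchedulingContext) -> Box<dyn InstalledScheduler> {
        let scheduler = self
            .idle
            .pop()
            .unwrap_or_else(|| Box::new(Idle) as Box<dyn InstallableScheduler>);
        scheduler.install(context)
    }

    /// Only the idle form is accepted, so nothing with a live context can be returned.
    pub fn return_to_pool(&mut self, scheduler: Box<dyn InstallableScheduler>) {
        self.idle.push(scheduler);
    }

    pub fn idle_count(&self) -> usize {
        self.idle.len()
    }
}
```

The sketch leaves out the paused case entirely; a pause would presumably stay a `&mut self` method on the installed state, since it keeps the context.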

thanks for the suggestion. i don't like the .expect(...) either.. your suggestion will work for the current min impl. however, once multi-threaded, maintaining the type-safety is hard. also, ::context() isn't that important impl-wise.

so, i dcou-ed it: c3780a6

When scheduling txs:

We are operating on some bank within BankForks, it's difficult for me to conceptually guarantee this has not had the return_to_pool called on it.

btw, return_to_pool consumes self. so, any wild schedulers outside the pool should have been take_scheduler()-ed without return_to_pool being called yet.

Yeah, they will have been take_scheduler'ed (which guarantees they had a context at that point) at some prior point without return_to_pool.

What's more difficult to track is that wait_for_termination is what actually removes the context, and does not consume the boxed-self. If we removed the context within return_to_pool, that would make things more clear to me at least.
Is there a reason we cannot/should not do it there?

If we removed the context within return_to_pool, that would make things more clear to me at least.

thanks for the explanation. so, you'd prefer something like this, correct?: 60b1bd2

What's more difficult to track is that wait_for_termination is what actually removes the context, and does not consume the boxed-self.

there are 2 small reasons:

imo, resources should be disposed of as soon as they're no longer needed semantically. in this case, that's the context at the time of scheduler termination. so, I'd prefer the early dropping. also, mutating state (ie removing the context) when just returning to the pool sounds a bit unnatural to me.

i understand the concern around panic-safety in the early dropping. on the other hand, this clearly dictates that calling wait_for_termination twice for the same scheduler is a fatal invariant violation unless paused, and ensures this won't ever happen in production.

anyway, i'm not too obsessed with this. I'm just fine with 60b1bd2. just wanted to share some of my picky, opinionated coding practice. ;)

@apfitzge good news. i did the hassle: 85945d7. i think i did my best to apply your suggestion. and, my prev comment's semantic ramblings are upheld as well in the commit.

I've further pushed some cleaning-up commits.

also, i had to rebase this pr with latest changes at master: d660e42

unified-scheduler-pool/src/lib.rs

ryoqun · 2023-12-18T14:01:03Z

unified-scheduler-pool/src/lib.rs

+    fn schedule_execution(&self, &(transaction, index): &(&SanitizedTransaction, usize)) {
+        let (result, timings) = &mut *self.result_with_timings.lock().expect("not poisoned");
+        if result.is_err() {
+            // just bail out early to short-circuit the processing altogether


btw, I'll create a follow-up pr to make ::schedule_execution() return a result, so that the block can be marked as dead as early as possible... let this pr ship for now, please. lol
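A rough sketch of what such a follow-up could look like, assuming a toy `Scheduler` whose `schedule_execution()` returns the recorded error instead of silently bailing out. The `execute` helper, the `String` error type, the `executed` counter and the simplified signature are all hypothetical; the actual follow-up may look quite different.

```rust
use std::sync::Mutex;

/// Toy stand-in for the real timings type (assumption).
#[derive(Debug, Default, PartialEq)]
pub struct ExecuteTimings {
    pub executed: usize,
}

pub type ResultWithTimings = (Result<(), String>, ExecuteTimings);

pub struct Scheduler {
    result_with_timings: Mutex<ResultWithTimings>,
}

impl Scheduler {
    pub fn new() -> Self {
        Self {
            result_with_timings: Mutex::new((Ok(()), ExecuteTimings::default())),
        }
    }

    /// Returns `Err` as soon as any earlier transaction has failed, so the
    /// caller could mark the block as dead without scheduling the rest.
    pub fn schedule_execution(&self, transaction: &str, index: usize) -> Result<(), String> {
        let (result, timings) = &mut *self.result_with_timings.lock().expect("not poisoned");
        if result.is_err() {
            // Short-circuit: report the recorded failure and execute nothing.
            return result.clone();
        }
        *result = execute(transaction, index);
        timings.executed += 1;
        result.clone()
    }

    /// Hypothetical accessor, used only to observe the counter in this sketch.
    pub fn executed(&self) -> usize {
        self.result_with_timings.lock().expect("not poisoned").1.executed
    }
}

/// Hypothetical executor: an empty string stands for a failing transaction.
fn execute(transaction: &str, index: usize) -> Result<(), String> {
    if transaction.is_empty() {
        Err(format!("transaction {} failed", index))
    } else {
        Ok(())
    }
}
```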

This reverts commit 049a126.

This reverts commit 60b1bd2.

apfitzge

lgtm. Thanks for all the simplifications!

ryoqun · 2024-12-02T13:33:28Z

ledger-tool/src/ledger_utils.rs

+            info!("no scheduler pool is installed for block verification...");
+        }
+        BlockVerificationMethod::UnifiedScheduler => {
+            let no_transaction_status_sender = None;


this particular line introduced this bug...: anza-xyz#3861

ryoqun requested a review from apfitzge December 1, 2023 15:27

ryoqun commented Dec 1, 2023


apfitzge reviewed Dec 11, 2023


core/src/validator.rs (outdated, resolved)

unified-scheduler-pool/src/lib.rs (outdated, resolved)

unified-scheduler-pool/src/lib.rs (outdated, resolved)

unified-scheduler-pool/src/lib.rs (outdated, resolved)

ryoqun force-pushed the min-unified-scheduler branch from d538893 to 584bd35 Compare December 12, 2023 01:06

ryoqun requested a review from apfitzge December 12, 2023 01:16

ryoqun changed the title ~~Finalize unified scheduler plumbing with min impl~~ Finish unified scheduler plumbing with min impl Dec 12, 2023

apfitzge reviewed Dec 12, 2023


ryoqun requested a review from apfitzge December 13, 2023 01:49

ryoqun force-pushed the min-unified-scheduler branch 3 times, most recently from 647b86f to 049a126 Compare December 13, 2023 16:13

ryoqun commented Dec 14, 2023


unified-scheduler-pool/src/lib.rs (outdated, resolved)

ryoqun commented Dec 18, 2023


ryoqun added 14 commits December 18, 2023 23:05

Finalize unified scheduler plumbing with min impl

d433cff

Fix comment

8f424ae

Rename leftover type name...

b46c072

Make logging text less ambiguous

8240d2e

Make PhantomData simplyer without already used S

4e39d81

Make TaskHandler stateless again

abba35c

Introduce HandlerContext to simplify TaskHandler

ddf9d6a

Add comment for coexistence of Pool::{new,new_dyn}

fa03b65

Fix grammar

c8a7785

Remove confusing const for upcoming changes

f19627b

Demote InstalledScheduler::context() into dcou

d4ef83e

Delay drop of context up to return_to_pool()-ing

81b8d5f

Revert "Demote InstalledScheduler::context() into dcou"

d5ecd5a

This reverts commit 049a126.

Revert "Delay drop of context up to return_to_pool()-ing"

e0b5e35

This reverts commit 60b1bd2.

ryoqun added 12 commits December 18, 2023 23:05

Make context handling really type-safe

85945d7

Update comment

3373e6e

Fix grammar...

936b88c

Refine type aliases for boxed traits

2e94e59

Swap the tuple order for readability & semantics

95a2edd

Simplify PooledScheduler::result_with_timings type

183d9f1

Restore .in_sequence()

155e0d6

Use where for aesthetics

15a8484

Simplify if...

ea1083b

Fix typo...

567c2ad

Polish ::schedule_execution() a bit

01ec858

Fix rebase conflicts..

d660e42

ryoqun force-pushed the min-unified-scheduler branch from d4acb29 to d660e42 Compare December 18, 2023 14:17

ryoqun added 2 commits December 19, 2023 00:07

Make test more readable

c6c29ea

Fix test failures after rebase...

1814e1a

apfitzge approved these changes Dec 18, 2023


ryoqun merged commit d2b5afc into solana-labs:master Dec 19, 2023
46 checks passed

willhickey mentioned this pull request Mar 28, 2024

v1.18 commits - please ignore anza-xyz/agave#475

Closed

ryoqun commented Dec 2, 2024


ryoqun mentioned this pull request Dec 2, 2024

Make unified-scheduler use transaction_status_sender in ledger-tool anza-xyz/agave#3861

Merged
