
Finish unified scheduler plumbing with min impl #34300

Merged: 28 commits from min-unified-scheduler into solana-labs:master, Dec 19, 2023

Conversation

@ryoqun (Member) commented Dec 1, 2023

Problem

I'm quite high in the midnight, concluding all the plumbing efforts so far with this pr.

(translation: there's one last missing piece of code (trait impl boilerplate, cli integration) before landing the unified scheduler impl)

Summary of Remedies

get some sleep.

(translation: this pr adds the last of the plumbing, with extensive unit tests, after the plumbing was confirmed to work in ryoqun#15. also, end-to-end integration niceties: the dreadful local-cluster test and quick-and-dirty run-sanity.sh tweaks)

Context

extracted from: #33070

also note that this is the next big chunk of code (contains the real code for unified scheduler): ryoqun#15 (a bit desynced from the tip of #33070 right now...)

@ryoqun ryoqun requested a review from apfitzge December 1, 2023 15:27
ryoqun (Member Author):

Adding this crate isn't strictly needed for this pr, but I'm just piggybacking on it here.

ryoqun (Member Author):

This crate name and solana-unified-scheduler-logic are already reserved on crates.io:

https://crates.io/crates/solana-unified-scheduler-pool
https://crates.io/crates/solana-unified-scheduler-logic

}

#[derive(Debug)]
pub struct DefaultTaskRunner;
ryoqun (Member Author):

Oops, a leftover from the rename; read this as DefaultTaskHandler...

codecov bot commented Dec 1, 2023

Codecov Report

Merging #34300 (1814e1a) into master (4181ea4) will increase coverage by 0.0%.
Report is 1 commit behind head on master.
The diff coverage is 94.0%.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #34300    +/-   ##
========================================
  Coverage    81.8%    81.8%            
========================================
  Files         820      822     +2     
  Lines      220790   221195   +405     
========================================
+ Hits       180679   181131   +452     
+ Misses      40111    40064    -47     

@apfitzge (Contributor) left a comment:

Had a few initial comments, which I think should be discussed before continuing the review.

Review threads on core/src/validator.rs and unified-scheduler-pool/src/lib.rs (3 threads), all resolved.
@ryoqun ryoqun requested a review from apfitzge December 12, 2023 01:16
@ryoqun ryoqun changed the title Finalize unified scheduler plumbing with min impl Finish unified scheduler plumbing with min impl Dec 12, 2023
@apfitzge (Contributor) left a comment:

Mostly happy with it, but a few concerns around context.

Review threads on unified-scheduler-pool/src/lib.rs (3 threads) and core/src/validator.rs, all resolved.
}

fn context(&self) -> &SchedulingContext {
    self.context.as_ref().expect("active context should exist")
}
apfitzge (Contributor):

Having a bit of trouble reasoning through the safety of this expect.

  1. I can see when we create new banks in BankForks this context will always be Some if we have an installed scheduler pool.
  2. context gets checked when we schedule txs, and when we return a scheduler to pool

When returning a scheduler:

  1. wait_for_termination removes our context iff wait_reason is not paused
  2. in InstalledSchedulerPool::wait_for_termination we call the scheduler wait_for_termination, then return_to_pool.
  3. So if the wait reason is not paused, we call wait_for_termination which will take our context, then we call return_to_pool which will assert context is none.

Seems safe, but a little hard to track that down.

When scheduling txs:

  1. We are operating on some bank within BankForks, it's difficult for me to conceptually guarantee this has not had the return_to_pool called on it.

Actual Suggestions instead of rambling thoughts

Wishing I had caught this earlier because I'm rethinking the traits.
I wonder if we could possibly get type-safety around the context by distinctly separating the wait_for_termination for paused vs not paused.

It seems the scheduler is in one of two states:

  1. In the pool, doing nothing, waiting for context
  2. Associated with a bank/context, and being used for scheduling/execution

It seems like the pool could have some Box<dyn InstallableScheduler>, where InstallableScheduler has some fn to transition it to Box<dyn InstalledScheduler>.
And InstalledScheduler must also have a fn to transition back to InstallableScheduler before we can return to pool, via the wait for termination.

That way, we can guarantee that if we have an installed scheduler we always have a context.
This does make the type-system more complicated, but I'm leaning towards it being a complicated enough relationship since I'm dumb and cannot explicitly verify safety around it.
wdyt?
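The two-state transition described above could be sketched with type-states. This is a hypothetical illustration only, not the actual crate API: `SchedulingContext` is a stand-in, and the `install`/`wait_for_termination` names merely follow the suggestion in this comment.

```rust
// Stand-in for the real context type (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct SchedulingContext(pub u64);

// State 1: in the pool, doing nothing, waiting for a context.
#[derive(Debug)]
pub struct InstallableScheduler;

// State 2: associated with a context; the field is not an Option,
// so an installed scheduler always has a context by construction.
#[derive(Debug)]
pub struct InstalledScheduler {
    context: SchedulingContext,
}

impl InstallableScheduler {
    // Transition pool -> installed. Consuming `self` means a pooled
    // scheduler cannot be used for scheduling before it has a context.
    pub fn install(self, context: SchedulingContext) -> InstalledScheduler {
        InstalledScheduler { context }
    }
}

impl InstalledScheduler {
    // No `expect()` needed: the type guarantees the context exists.
    pub fn context(&self) -> &SchedulingContext {
        &self.context
    }

    // Transition installed -> pooled, dropping the context on the way out.
    pub fn wait_for_termination(self) -> InstallableScheduler {
        drop(self.context);
        InstallableScheduler
    }
}
```

With this shape, "installed scheduler without a context" is simply unrepresentable, which is the safety property being discussed.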

@ryoqun (Member Author) commented Dec 13, 2023:

Thanks for the suggestion. I don't like the .expect(...) either. Your suggestion would work for the current min impl; however, once multi-threaded, maintaining that type-safety is hard. Also, ::context() isn't that important impl-wise.

So, I dcou-ed it (gated it behind the dev-context-only-utils feature): c3780a6

@ryoqun (Member Author) commented Dec 13, 2023:

When scheduling txs:

1. We are operating on **some** bank within `BankForks`, it's difficult for me to conceptually guarantee this has **not** had the `return_to_pool` called on it.

Btw, return_to_pool consumes self, so any wild schedulers outside the pool must have been take_scheduler()-ed without return_to_pool having been called on them yet.
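A minimal sketch of why consuming self rules out a double return at compile time. The `Pool`/`Scheduler` types here are hypothetical stand-ins, not the real solana types.

```rust
struct Scheduler;

struct Pool {
    schedulers: Vec<Scheduler>,
}

impl Pool {
    // Hands ownership of a scheduler to the caller.
    fn take_scheduler(&mut self) -> Option<Scheduler> {
        self.schedulers.pop()
    }

    // `scheduler` is moved in here; after this call the caller no longer
    // owns it, so calling `return_to_pool` twice on the same value
    // simply won't compile.
    fn return_to_pool(&mut self, scheduler: Scheduler) {
        self.schedulers.push(scheduler);
    }
}
```

This is the ownership argument in the comment above: a "wild" scheduler exists exactly between a `take_scheduler` and the single `return_to_pool` that consumes it.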

apfitzge (Contributor):

Yeah, they will have been take_scheduler'ed (which guarantees they had a context at that point) at some prior point without return_to_pool.

What's more difficult to track is that wait_for_termination is what actually removes the context, and does not consume the boxed-self. If we removed the context within return_to_pool, that would make things more clear to me at least.
Is there a reason we cannot/should not do it there?

@ryoqun (Member Author) commented Dec 14, 2023:

If we removed the context within return_to_pool, that would make things more clear to me at least.

Thanks for the explanation. So you'd prefer something like this, correct?: 60b1bd2

What's more difficult to track is that wait_for_termination is what actually removes the context, and does not consume the boxed-self.

There are two small reasons:

  1. imo, resources should be disposed of as soon as they're semantically no longer needed; in this case, that's the context at the time of scheduler termination. So, I'd prefer the early dropping. Also, mutating state (ie removing the context) when merely returning to the pool feels a bit unnatural to me.
  2. I understand the concern around panic-safety in the early dropping. On the other hand, it clearly dictates that calling wait_for_termination twice on the same scheduler is a fatal invariant violation unless paused, and ensures this can never happen in production.

Anyway, I'm not too obsessed with this. I'm just fine with 60b1bd2; just wanted to share some of my picky, opinionated coding practices. ;)
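The early-dropping argument could be illustrated like this. A simplified, hypothetical sketch; the real types and the `paused` plumbing differ.

```rust
struct Scheduler {
    // Stand-in for the real SchedulingContext.
    context: Option<u64>,
}

impl Scheduler {
    fn wait_for_termination(&mut self, is_paused: bool) {
        if !is_paused {
            // Dispose of the context as soon as it's semantically dead.
            // A second non-paused call finds `None` and panics here,
            // surfacing the invariant violation immediately rather than
            // letting a stale context linger until return_to_pool.
            self.context
                .take()
                .expect("wait_for_termination called twice without pause");
        }
    }
}
```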

@ryoqun (Member Author) commented Dec 14, 2023:

@apfitzge good news: I went through the hassle: 85945d7. I think I did my best to apply your suggestion, and my previous comment's semantic ramblings are upheld in that commit as well.

ryoqun (Member Author):

I've pushed some further cleanup commits.

Also, I had to rebase this pr onto the latest master: d660e42

@ryoqun ryoqun requested a review from apfitzge December 13, 2023 01:49
@ryoqun ryoqun force-pushed the min-unified-scheduler branch 3 times, most recently from 647b86f to 049a126 Compare December 13, 2023 16:13
fn schedule_execution(&self, &(transaction, index): &(&SanitizedTransaction, usize)) {
    let (result, timings) = &mut *self.result_with_timings.lock().expect("not poisoned");
    if result.is_err() {
        // just bail out early to short-circuit the processing altogether
ryoqun (Member Author):

btw, I'll create a follow-up pr to make ::schedule_execution() return a result, so a block can be marked dead as early as possible... let this pr ship for now, please. lol
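The follow-up idea might look roughly like this. A hedged sketch with stand-in types: `result_with_timings` is simplified to just a result, and this is not the actual pr code.

```rust
use std::sync::Mutex;

struct PooledScheduler {
    // Stand-in for the pr's result_with_timings; timings omitted.
    result: Mutex<Result<(), String>>,
}

impl PooledScheduler {
    // Returning the stored error instead of silently short-circuiting
    // lets the caller mark the block dead as soon as any prior
    // transaction has failed.
    fn schedule_execution(&self, _transaction_index: usize) -> Result<(), String> {
        let result = self.result.lock().expect("not poisoned");
        result.clone()
    }
}
```

The caller would then stop feeding transactions on the first `Err`, rather than scheduling the rest of the block into a scheduler that is already doomed.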

@apfitzge (Contributor) left a comment:

lgtm. Thanks for all the simplifications!

@ryoqun ryoqun merged commit d2b5afc into solana-labs:master Dec 19, 2023
46 checks passed
            info!("no scheduler pool is installed for block verification...");
        }
        BlockVerificationMethod::UnifiedScheduler => {
            let no_transaction_status_sender = None;
ryoqun (Member Author):

this particular line introduced this bug...: anza-xyz#3861
