-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Finish unified scheduler plumbing with min impl #34300
Conversation
unified-scheduler-logic/Cargo.toml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding this crate isn't strictly needed for this pr. but I'm just piggybacking..
unified-scheduler-pool/Cargo.toml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this crate name and solana-unified-scheduler-logic
is already reserved on crates.io:
https://crates.io/crates/solana-unified-scheduler-pool
https://crates.io/crates/solana-unified-scheduler-logic
unified-scheduler-pool/src/lib.rs
Outdated
} | ||
|
||
#[derive(Debug)] | ||
pub struct DefaultTaskRunner; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops. leftover from rename.. read this as DefaultTaskHandler
...
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #34300 +/- ##
========================================
Coverage 81.8% 81.8%
========================================
Files 820 822 +2
Lines 220790 221195 +405
========================================
+ Hits 180679 181131 +452
+ Misses 40111 40064 -47 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a few initial comments, which I think should be discussed before continueing the review.
d538893
to
584bd35
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly happy with it, but a few concerns around context
.
unified-scheduler-pool/src/lib.rs
Outdated
} | ||
|
||
fn context(&self) -> &SchedulingContext { | ||
self.context.as_ref().expect("active context should exist") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a bit of trouble reasoning through the safety of this expect
.
- I can see when we create new banks in
BankForks
this context will always beSome
if we have an installed scheduler pool. context
gets checked when we schedule txs, and when we return a scheduler to pool
When returning a scheduler:
wait_for_termination
removes our context iffwait_reason
is not paused- in
InstalledSchedulerPool::wait_for_termination
we call the schedulerwait_for_termination
, thenreturn_to_pool
. - So if the wait reason is not paused, we call
wait_for_termination
which will take our context, then we callreturn_to_pool
which will assert context is none.
Seems safe, but a little hard to track that down.
When scheduling txs:
- We are operating on some bank within
BankForks
, it's difficult for me to conceptually guarantee this has not had thereturn_to_pool
called on it.
Actual Suggestions instead of rambling thoughts
Wishing I had caught this earlier because I'm rethinking the traits.
I wonder if we could possibly get type-safety around the context by distinctly separating the wait_for_termination
for paused vs not paused.
It seems the scheduler is in one of two states:
- In the pool, doing nothing, waiting for context
- Associated with a bank/context, and being used for scheduling/execution
It seems like the pool could have some Box<dyn InstallableScheduler>
, where InstallableScheduler
has some fn to transition it to Box<dyn InstalledScheduler>
.
And InstalledScheduler
must also have a fn to transition back to InstallableScheduler
before we can return to pool, via the wait for termination.
That way, we can guarantee that if we have an installed scheduler we always have a context.
This does make the type-system more complicated, but I'm leaning towards it being a complicated enough relationship since I'm dumb and cannot explicitly verify safety around it.
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for suggestion. i don't like the .expect(...)
either.. your suggestion will work for the current min impl. however, once multi-threaded, maintaining the type-safety is hard. also, ::context()
isn't that important for impl-wise.
so, i dcou-ed it: c3780a6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When scheduling txs:
1. We are operating on **some** bank within `BankForks`, it's difficult for me to conceptually guarantee this has **not** had the `return_to_pool` called on it.
btw, return_to_pool
consumes self
. so, any wild schedulers outside the pool should have been take_scheduler()
-ed without return_to_pool
being called yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, they will have been take_scheduler
'ed (which guarantees they had a context at that point) at some prior point without return_to_pool
.
What's more difficult to track is that wait_for_termination
is what actually removes the context, and does not consume the boxed-self. If we removed the context within return_to_pool
, that would make things more clear to me at least.
Is there a reason we cannot/should not do it there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we removed the context within return_to_pool, that would make things more clear to me at least.
thanks for explanation. so, you prefer like this, correct?: 60b1bd2
What's more difficult to track is that wait_for_termination is what actually removes the context, and does not consume the boxed-self.
there's 2 small reasons:
- imo, resources should be disposed as soon as it's no longer needed semantically. in this case, it's the context at the time of scheduler termination. so, I'd prefer the early dropping. also, mutating state (ie removing the context) when just returning to the pool sounds a bit unnatural for me.
- i understand concern around panic-safety in the early dropping. on the other hand, this clearly dictates that calling
wait_for_termination
twice is a fatal invariant violation for the same scheduler unless paused. and ensures this won't happen on production ever.
anyway, i'm not too obsessed with this. I'm just fine with 60b1bd2. just wanted to share some my picky opinionated codding practice. ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've further pushed some cleaning-up commits.
also, i had to rebase this pr with latest changes at master: d660e42
647b86f
to
049a126
Compare
fn schedule_execution(&self, &(transaction, index): &(&SanitizedTransaction, usize)) { | ||
let (result, timings) = &mut *self.result_with_timings.lock().expect("not poisoned"); | ||
if result.is_err() { | ||
// just bail out early to short-circuit the processing altogether |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, I'll create a follow-up pr to ::schedule_execution()
return a result to mark block as dead as early as possible... let this pr ship for now, please. lol
d4acb29
to
d660e42
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. Thanks for all the simplifications!
info!("no scheduler pool is installed for block verification..."); | ||
} | ||
BlockVerificationMethod::UnifiedScheduler => { | ||
let no_transaction_status_sender = None; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this particular line introduced this bug...: anza-xyz#3861
Problem
I'm quite high in the midnight, concluding all the plumbing efforts so far with this pr.
(translate: there's the last missing piece of code (trait impl boilerplates, cli integration), before landing the unified scheduler impl)
Summary of Remedies
get some sleep.
(translate: this pr added the last plumbing with extensive unit tests after these plumbing is confirmed to work at ryoqun#15. also, end to end integration niceties: the dreadful local-cluster test and quick-and-dirty
run-sanity.sh
tweaks).Context
extracted from: #33070
also note that this is the next big chunk of code (contains the real code for unified scheduler): ryoqun#15 (a bit desynced from the tip of #33070 right now...)