
Return back stale out-of-pool scheduler by timeout #1690

Merged: 5 commits merged into anza-xyz:master from the stale-taken-scheduler-returning branch on Jun 13, 2024

Conversation

@ryoqun (Member) commented Jun 11, 2024

Problem

From #1211:

Specifically, there are the following termination conditions, which it must handle respectively:

...
5. There could be many active (= taken-out-of-the-pool) unified schedulers under very forky network conditions for an extended duration without rooting. In that case, many native OS threads will be created just to sit idle collectively, because each unified scheduler instance separately creates and manages its own set of threads (for perf reasons). So, idling schedulers should be returned to the pool. Then, those pooled-back schedulers will eventually be retired according to (3) after the forky situation is resolved.

Summary of Changes

This PR addresses (5).

The implementation is a bit involved.

In a nutshell, BankWithScheduler now registers a callback (TimeoutListener) with solScCleaner so that its own stalemate can be detected from the outside. To that end, solScCleaner unconditionally invokes the callback after a fixed duration (>> 1 slot = 400ms). The callback then returns the scheduler to the pool if the bank isn't frozen yet. So there's a new state of BankWithScheduler in which it holds a stale scheduler; that's modeled with the new tri-state SchedulerStatus enum, which replaces an Option.
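
To illustrate the idea, here is a minimal, self-contained Rust sketch of this out-of-band callback flow. The `spawn_cleaner` helper and the boxed-closure `TimeoutListener` alias are hypothetical stand-ins; the real solScCleaner thread and TimeoutListener type are more involved than this.

```rust
use std::{thread, time::Duration};

// Hypothetical stand-in for the real TimeoutListener type.
type TimeoutListener = Box<dyn FnOnce() + Send>;

// A background cleaner thread that unconditionally fires each registered
// listener after a fixed duration, mirroring the description above.
fn spawn_cleaner(listeners: Vec<TimeoutListener>, timeout: Duration) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        thread::sleep(timeout); // fixed duration, much longer than 1 slot (400ms)
        for listener in listeners {
            listener(); // invoked unconditionally; the callback decides whether to act
        }
    })
}

fn main() {
    let bank_is_frozen = false; // assumption for the example
    let listener: TimeoutListener = Box::new(move || {
        if !bank_is_frozen {
            // In the real code: transition the scheduler to the stale state and
            // return it (and its idle OS threads) to the scheduler pool.
            println!("returning stale scheduler to the pool");
        }
    });
    spawn_cleaner(vec![listener], Duration::from_millis(1200))
        .join()
        .unwrap();
}
```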

This whole setup is needed because returning stale schedulers must be done transparently from the viewpoint of transaction scheduling. Accordingly, the scheduling code path is adjusted so that when it detects the stale state, it revives it by re-taking a scheduler from the pool (called resuming in the code) with the last-saved ResultWithTimings from SchedulerStatus::Stale, as sketched below.
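
Here is a hedged sketch of the tri-state status and the resuming path, with placeholder Scheduler/Pool types and a simplified ResultWithTimings tuple standing in for the actual definitions in runtime/src/installed_scheduler_pool.rs:

```rust
// Placeholder types standing in for the real handles.
struct Scheduler;
struct Pool;
type ResultWithTimings = (Result<(), String>, u64);

impl Pool {
    fn take_scheduler(&self) -> Scheduler {
        Scheduler
    }
}

enum SchedulerStatus {
    // A scheduler is installed and usable for transaction scheduling.
    Active(Scheduler),
    // The timeout listener returned the scheduler to the pool; the partial
    // ResultWithTimings is saved so it can be carried over on resume.
    Stale(Pool, ResultWithTimings),
    // Disabled, or consumed by wait_for_termination().
    Unavailable,
}

impl SchedulerStatus {
    // Called on the scheduling path: if the status is Stale, transparently
    // re-take a scheduler from the pool (resuming) and keep the saved
    // ResultWithTimings across the active => stale => active transition.
    fn maybe_resume(&mut self) {
        if matches!(self, SchedulerStatus::Stale(..)) {
            let SchedulerStatus::Stale(pool, result_with_timings) =
                std::mem::replace(self, SchedulerStatus::Unavailable)
            else {
                unreachable!()
            };
            let scheduler = pool.take_scheduler();
            // In the real code, result_with_timings is re-attached to the new
            // scheduler session; here we just keep it around for illustration.
            let _carried_over: ResultWithTimings = result_with_timings;
            *self = SchedulerStatus::Active(scheduler);
        }
    }
}

fn main() {
    let mut status = SchedulerStatus::Stale(Pool, (Ok(()), 246));
    status.maybe_resume();
    assert!(matches!(status, SchedulerStatus::Active(_)));
}
```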

Thanks to the out-of-band nature of this mechanism, there's no additional overhead on the latency-sensitive code path, and it works even if no new transactions are arriving at all.

Lastly, note that encapsulating this inside solana-unified-scheduler would necessitate introducing some kind of synchronization mechanism. This is also why I used the term session in that crate: to avoid associating slots and sessions 1-to-1. The relation is now 1-to-N (most of the time it's still practically 1-to-1, though).

@ryoqun force-pushed the stale-taken-scheduler-returning branch from 8e84265 to 23f8c65 on June 11, 2024 15:16
@ryoqun ryoqun requested a review from apfitzge June 11, 2024 15:25
@ryoqun (Member Author) commented Jun 11, 2024

I still need to add tests and write some comments, but I think this is ready for review implementation-wise...

runtime/src/installed_scheduler_pool.rs (outdated, resolved)

self.with_active_scheduler(f)
}
SchedulerStatus::Unavailable => panic!(),

apfitzge:

Again, this should have an error message here.

Unavailable seems to be used for more than just transitions (at first glance), so I am still trying to convince myself this cannot happen.

ryoqun (Member Author):

done: 06b7bff

runtime/src/installed_scheduler_pool.rs (outdated, resolved)
Comment on lines 535 to 540
scheduler.maybe_transition_from_active_to_stale(|scheduler| {
let (result_with_timings, uninstalled_scheduler) =
scheduler.wait_for_termination(false);
uninstalled_scheduler.return_to_pool();
(scheduler_pool, result_with_timings)
})

apfitzge:

I want to make sure I understand this correctly.
The closure here is called before we transition to Stale. So we wait for termination while the scheduler is still in the active state.

I guess I'm not sure what happens here; it seems this is intended to kill an extremely long-running scheduler. But the description of wait_for_termination says:

    /// This function blocks the current thread while waiting for the scheduler to complete all of
    /// the executions for the scheduled transactions and to return the finalized
    /// `ResultWithTimings`. This function still blocks for short period of time even in the case
    /// of aborted schedulers to gracefully shutdown the scheduler (like thread joining).

So if it's extremely long-running, aren't we still blocked here? I assume I am missing something.

ryoqun (Member Author):

> The closure here is called before we transition to Stale. So we wait for termination while the scheduler is still in active state.

Yes, this is true.

> it seems this intended to kill an extremely long running scheduler.

Well, this isn't true. The intention here is to gracefully terminate an extremely long-idling scheduler. This is to reclaim the idling native OS threads by returning them to the scheduler pool.

> So if it's extremely long-running, aren't we still blocked here? I assume I am missing something.

This is correct. The situation has been the same for both the blockstore processor and the unified scheduler from the very beginning.

@ryoqun (Member Author) commented Jun 12, 2024:

Ah, also note that even if the scheduler is constantly running with new transactions arriving at the time the TimeoutListener triggers, this wait_for_termination() will eventually return. That's because we're taking the write lock, which stops newer transactions from being entered into the scheduler. So it's guaranteed that there's only a bounded amount of work left before wait_for_termination() completes at this point, assuming there's no infinite-loop bug in the SVM. (See the sketch below.)
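
A minimal sketch of that reasoning, using illustrative names (SchedulerSlot, schedule_transaction, timeout_and_return_to_pool) rather than the actual BankWithScheduler API: the scheduling path only takes the read lock, so once the timeout path holds the write lock, no new transactions can enter and the remaining work is bounded.

```rust
use std::sync::RwLock;

// Illustrative stand-in for the bank's scheduler slot guarded by an RwLock.
struct SchedulerSlot {
    inner: RwLock<()>,
}

impl SchedulerSlot {
    // Latency-sensitive path: takes only the (shared) read lock per transaction.
    fn schedule_transaction(&self, _tx: &str) {
        let _shared = self.inner.read().unwrap();
        // ... hand the transaction off to the scheduler's own threads ...
    }

    // Timeout path: the exclusive write lock blocks schedule_transaction(), so
    // only already-scheduled transactions remain to drain; wait_for_termination()
    // therefore has a bounded amount of work left.
    fn timeout_and_return_to_pool(&self) {
        let _exclusive = self.inner.write().unwrap();
        // wait_for_termination(false) + return_to_pool() would run here.
    }
}

fn main() {
    let slot = SchedulerSlot { inner: RwLock::new(()) };
    slot.schedule_transaction("tx1");
    slot.timeout_and_return_to_pool();
}
```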

ryoqun (Member Author):

Also, I added some explanation from this conversation: 06b7bff

@ryoqun ryoqun requested a review from apfitzge June 12, 2024 06:52
Comment on lines 548 to 550
// Re-register a new timeout listener only after acquiring the read lock;
// Otherwise, the listener would again put scheduler into Stale before the read
// lock, causing panic below.
ryoqun (Member Author):

Oops, I forgot to mention that this is an extremely rare race condition...


let (result, timings) = bank.wait_for_completed_scheduler().unwrap();
assert_matches!(result, Ok(()));
assert_eq!(timings.metrics[ExecuteTimingType::CheckUs], 246);
ryoqun (Member Author):

This asserts that the ResultWithTimings is correctly carried over across active => stale => active transitions.

@apfitzge left a comment:

Small nit on comments, and maybe a clone we can remove; otherwise happy with the PR. Feel free to merge and do a small follow-up PR for those items if we agree.

@@ -290,8 +290,15 @@ impl WaitReason {
#[allow(clippy::large_enum_variant)]
#[derive(Debug)]
pub enum SchedulerStatus {
/// Unified scheduler is disabled or installed scheduler is consumed by wait_for_termination().
/// Note that transition to Unavailable from {Active, Stale} is one-way (i.e. one-time).

apfitzge:

This doesn't seem strictly true, because it seems to also be used as an intermediate state when transitioning between active <-> stale?

ryoqun (Member Author):

Wow, nice catch for spotting this. I omitted it here for simplicity. I will do a follow-up for completeness.

ryoqun (Member Author):

This will be addressed by #1797.

@@ -519,15 +526,15 @@ impl BankWithSchedulerInner {
);
Err(SchedulerAborted)
}
SchedulerStatus::Stale(_pool, _result_with_timings) => {
SchedulerStatus::Stale(pool, _result_with_timings) => {
let pool = pool.clone();

apfitzge:

Why do we need to clone the pool?

ryoqun (Member Author):

Because this pool is borrowed and we need to re-lock the scheduler with write access:

error[E0505]: cannot move out of `scheduler` because it is borrowed
   --> runtime/src/installed_scheduler_pool.rs:532:22
    |
517 |         let scheduler = self.scheduler.read().unwrap();
    |             --------- binding `scheduler` declared here
518 |         match &*scheduler {
    |                 --------- borrow of `scheduler` occurs here
...
532 |                 drop(scheduler);
    |                      ^^^^^^^^^ move out of `scheduler` occurs here
...
552 |                 pool.register_timeout_listener(self.do_create_timeout_listener());
    |                 ---- borrow later used here

For more information about this error, try `rustc --explain E0505`.
error: could not compile `solana-runtime` (lib) due to 1 previous error
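
For illustration, here is a minimal, self-contained reproduction of why the clone is needed, with simplified stand-ins (an Option instead of SchedulerStatus, a bare Pool type, a hypothetical maybe_resume function) rather than the actual BankWithSchedulerInner fields: cloning the Arc ends the borrow of the read guard, so drop(scheduler) no longer moves a borrowed value and the pool handle can still be used afterwards.

```rust
use std::sync::{Arc, RwLock};

struct Pool;

impl Pool {
    fn register_timeout_listener(&self) {}
}

struct Inner {
    // Simplified: the real field holds a SchedulerStatus, not an Option.
    scheduler: RwLock<Option<Arc<Pool>>>,
}

fn maybe_resume(inner: &Inner) {
    let scheduler = inner.scheduler.read().unwrap();
    if let Some(pool) = &*scheduler {
        // Without this clone, `pool` would keep borrowing `scheduler`, and the
        // `drop(scheduler)` below would move a borrowed value (rustc E0505).
        let pool = Arc::clone(pool);
        drop(scheduler); // release the read lock so it can be re-taken for write
        let _write = inner.scheduler.write().unwrap();
        pool.register_timeout_listener();
    }
}

fn main() {
    let inner = Inner {
        scheduler: RwLock::new(Some(Arc::new(Pool))),
    };
    maybe_resume(&inner);
}
```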

@ryoqun ryoqun merged commit 3342215 into anza-xyz:master Jun 13, 2024
40 checks passed
samkim-crypto pushed a commit to samkim-crypto/agave that referenced this pull request Jul 31, 2024
* Return back stale out-of-pool scheduler by timeout

* Add tests

* Document and proper panic messages

* Avoid recursion

* Tweak comments