PVF: Remove `rayon` and some uses of `tokio`
#7153
Conversation
1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.
2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio` (see the sketch below).[^1]
3. Since the `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.
4. The order of thread selection was flipped to make (3) sound (see note in code).

I left some TODO's related to panics which I'm going to address soon as part of #7045.

[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.
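For point (2), roughly the shape of the change (a minimal sketch with illustrative names and sizes, not the exact code in this PR): `std::thread::Builder` lets us set the stack size per spawned thread.

```rust
use std::thread;

// Illustrative value; the stack size actually used by the workers may differ.
const EXECUTE_THREAD_STACK_SIZE: usize = 2 * 1024 * 1024;

fn spawn_execute_thread() -> std::io::Result<thread::JoinHandle<()>> {
    thread::Builder::new()
        .name("execute-worker".into())
        // Per-thread stack size, which rayon only allowed per pool and tokio per runtime.
        .stack_size(EXECUTE_THREAD_STACK_SIZE)
        .spawn(|| {
            // ... run the execution job here ...
        })
}
```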
Addresses a couple of follow-up TODOs from #7153.
@@ -68,13 +65,12 @@ pub fn worker_event_loop<F, Fut>(

	// Run the main worker loop.
	let rt = Runtime::new().expect("Creates tokio runtime. If this panics the worker will die and the host will detect that and deal with it.");
	let handle = rt.handle();
	let err = rt
Why do we need `tokio` and `futures` if everything async-related is already purged? Filesystem interactions are sync in nature, and for the worker, reading from the socket is blocking too.
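For illustration (a sketch, not the worker's actual code): a blocking, std-only read of a length-prefixed message from the socket needs no runtime at all.

```rust
use std::io::Read;
use std::os::unix::net::UnixStream;

// Read a little-endian length prefix, then the payload, blocking the calling thread.
fn read_message(stream: &mut UnixStream) -> std::io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    stream.read_exact(&mut len_buf)?;
    let len = u32::from_le_bytes(len_buf) as usize;
    let mut buf = vec![0u8; len];
    stream.read_exact(&mut buf)?;
    Ok(buf)
}
```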
You're right! I just didn't remove the rest of async to keep this PR focused. But I can do it here if you want.
Note that we still need to remove the dependencies on `polkadot-node-core-pvf` and `tracing-gum` to fully remove the dependency on `tokio`.
How's `tracing-gum` related to `tokio`?
I just ran `cargo tree -e normal` in the crate and saw `tokio` several times in the output, e.g. under the `sc-network` and `libp2p` crates. I have no idea how `tracing-gum` works though.
It's jaeger, even though gum only uses hashing from it.
> You're right! I just didn't remove the rest of async to keep this PR focused. But I can do it here if you want.
Better to polish the rest of the code and properly synchronize threads, and take care of removing `tokio` later (if it's possible).
- Measure the CPU time in the prepare thread, so the observed time is not affected by any delays in joining on the thread (see the sketch below).
- Measure the full CPU time in the execute thread.
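Roughly the idea, as a sketch (assuming Linux and the `libc` crate for `CLOCK_THREAD_CPUTIME_ID`; `do_prepare_job` is a stand-in for the real work, not this PR's actual API): the job thread measures its own CPU time and returns it with the result, so delays in joining on the thread don't inflate the number.

```rust
use std::time::Duration;

// Read this thread's CPU time via the per-thread CPU clock.
fn thread_cpu_time() -> Duration {
    // SAFETY: a zeroed timespec is a valid value; clock_gettime fills it in.
    let mut ts: libc::timespec = unsafe { std::mem::zeroed() };
    unsafe { libc::clock_gettime(libc::CLOCK_THREAD_CPUTIME_ID, &mut ts) };
    Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32)
}

// Stand-in for the actual preparation work.
fn do_prepare_job() -> Vec<u8> {
    Vec::new()
}

fn prepare_thread_entry() -> (Vec<u8>, Duration) {
    let start = thread_cpu_time();
    let artifact = do_prepare_job();
    // Measured inside the thread itself, not by whoever joins on it later.
    let cpu_time_taken = thread_cpu_time().saturating_sub(start);
    (artifact, cpu_time_taken)
}
```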
Use condvars, i.e. `Arc::new((Mutex::new(true), Condvar::new()))`, as per the std docs. Considered also using a condvar to signal the CPU thread to end, in place of an mpsc channel. This was not done because `Condvar::wait_timeout_while` is documented as being imprecise, and `mpsc::Receiver::recv_timeout` is not documented as such. Also, we would need a separate condvar, to avoid this case: the worker thread finishes its job, notifies the condvar, the CPU thread returns first, and we join on it and not the worker thread. So it was simpler to leave this part as is.
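The basic pattern, sketched after the std docs (with a plain bool flag here; the PR later switched to a `WaitOutcome` enum): the job thread flips the flag and notifies, the outer thread blocks on `wait_while` until it does.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

fn main() {
    let pair = Arc::new((Mutex::new(false), Condvar::new()));
    let pair2 = Arc::clone(&pair);

    // Job thread: do the work, then flip the flag and wake the waiter.
    thread::spawn(move || {
        // ... do the actual job here ...
        let (lock, cvar) = &*pair2;
        *lock.lock().unwrap() = true;
        cvar.notify_one();
    });

    // Outer thread: block until the job thread signals completion.
    let (lock, cvar) = &*pair;
    let _guard = cvar
        .wait_while(lock.lock().unwrap(), |finished| !*finished)
        .unwrap();
}
```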
Nice, almost there
@slumber I pushed another change, can you please take a look? Thank you for helping!
node/core/pvf/worker/src/common.rs
Outdated
#[derive(Clone, Copy)]
pub enum WaitOutcome {
	JobFinished,
	CpuTimedOut,
nit: Wouldn't `JobPending` and `JobTimedOut` sound better?
Or remove the prefix completely: `Finished`, `TimedOut`, `Pending`.
WaitOutcome::JobFinished => {
	let _ = cpu_time_monitor_tx.send(());
	execute_thread.join().unwrap_or_else(|e| {
		// TODO: Use `Panic` error once that is implemented.
Maybe have an issue for this, rather than a TODO in the code?
I already addressed it here, it's approved so I'll merge it right after this PR. (I had planned to do these changes right after one another, so I left the TODO as a marker for myself, did the change, and set #7155's merge target to this branch. Will do issues instead in the future. 👍)
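For context, a panic payload coming back from `JoinHandle::join` can be turned into a printable message roughly like this (a sketch; the helper name `stringify_panic_payload` is just illustrative):

```rust
use std::any::Any;

// Best-effort conversion of a thread's panic payload into a readable string.
fn stringify_panic_payload(payload: Box<dyn Any + Send + 'static>) -> String {
    match payload.downcast::<&'static str>() {
        Ok(msg) => msg.to_string(),
        Err(payload) => match payload.downcast::<String>() {
            Ok(msg) => *msg,
            // The panic payload is not a string; nothing more we can report.
            Err(_) => "unknown panic payload".to_string(),
        },
    }
}
```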
None
},
}
// Join on the thread handle.
nit: comment seems superfluous
node/core/pvf/worker/src/prepare.rs
Outdated
Err(PrepareError::TimedOut)
},
Ok(None) => Err(PrepareError::IoErr(
	"error communicating over finished channel".into(),
"error communicating over finished channel".into(), | |
"error communicating over closed channel".into(), |
/// Block the thread while it waits on the condvar or on a timeout. If the timeout is hit,
/// returns `None`.
#[cfg_attr(not(any(target_os = "linux", feature = "jemalloc-allocator")), allow(dead_code))]
pub fn wait_for_threads_with_timeout(cond: &Cond, dur: Duration) -> Option<WaitOutcome> {
Could we have an extra variant in `WaitOutcome` instead of the `None`? It would then have the same return type as `wait_for_threads`.
I think it is better the way it is; the variant wouldn't be applicable in a couple of places, so we'd need extra `unreachable!`s. And if we set it to a different variant here, it would trigger the condvar not being pending anymore.
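For reference, roughly what the timeout variant looks like under that decision (a simplified sketch with the condvar types spelled out and variant names following the suggestion above, not the exact code): the pending state is held in the mutex, and a timeout maps to `None` rather than an extra enum variant.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum WaitOutcome {
    Pending,
    Finished,
    TimedOut,
}

pub type Cond = Arc<(Mutex<WaitOutcome>, Condvar)>;

pub fn wait_for_threads_with_timeout(cond: &Cond, dur: Duration) -> Option<WaitOutcome> {
    let (lock, cvar) = &**cond;
    let (guard, timeout_result) = cvar
        .wait_timeout_while(lock.lock().unwrap(), dur, |outcome| {
            *outcome == WaitOutcome::Pending
        })
        .unwrap();
    if timeout_result.timed_out() {
        // The condvar is still pending; the caller decides what that means.
        None
    } else {
        Some(*guard)
    }
}
```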
* PVF: Remove `rayon` and some uses of `tokio`

  1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.
  2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]
  3. Since `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.
  4. The order of thread selection was flipped to make (3) sound (see note in code).

  I left some TODO's related to panics which I'm going to address soon as part of #7045.

  [^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.

* PVF: Vote invalid on panics in execution thread (after a retry)

  Also make sure we kill the worker process on panic errors and internal errors to potentially clear any error states independent of the candidate.

* Address a couple of TODOs

  Addresses a couple of follow-up TODOs from #7153.

* Add some documentation to implementer's guide
* Fix compile error
* Fix compile errors
* Fix compile error
* Update roadmap/implementers-guide/src/node/utility/candidate-validation.md

  Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>

* Address comments + couple other changes (see message)

  - Measure the CPU time in the prepare thread, so the observed time is not affected by any delays in joining on the thread.
  - Measure the full CPU time in the execute thread.

* Implement proper thread synchronization

  Use condvars i.e. `Arc::new((Mutex::new(true), Condvar::new()))` as per the std docs. Considered also using a condvar to signal the CPU thread to end, in place of an mpsc channel. This was not done because `Condvar::wait_timeout_while` is documented as being imprecise, and `mpsc::Receiver::recv_timeout` is not documented as such. Also, we would need a separate condvar, to avoid this case: the worker thread finishes its job, notifies the condvar, the CPU thread returns first, and we join on it and not the worker thread. So it was simpler to leave this part as is.

* Catch panics in threads so we always notify condvar
* Use `WaitOutcome` enum instead of bool condition variable
* Fix retry timeouts to depend on exec timeout kind
* Address review comments
* Make the API for condvars in workers nicer
* Add a doc
* Use condvar for memory stats thread
* Small refactor
* Enumerate internal validation errors in an enum
* Fix comment
* Add a log
* Fix test
* Update variant naming
* Address a missed TODO

---------

Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
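One of the items above, "Catch panics in threads so we always notify condvar", in sketch form (simplified signatures, not the PR's exact API): the thread entry point runs the job under `catch_unwind` and signals the waiter no matter how the job ended, handing any panic back through the join handle.

```rust
use std::{
    panic::{self, UnwindSafe},
    sync::{Arc, Condvar, Mutex},
    thread,
};

// `None` means "still pending"; `Some(outcome)` means the job concluded somehow.
type Cond<T> = Arc<(Mutex<Option<T>>, Condvar)>;

fn spawn_job<F, R, T>(f: F, cond: Cond<T>, outcome: T) -> thread::JoinHandle<thread::Result<R>>
where
    F: FnOnce() -> R + UnwindSafe + Send + 'static,
    R: Send + 'static,
    T: Send + 'static,
{
    thread::spawn(move || {
        // Catch a panic so the waiting thread is never left blocked forever.
        let result = panic::catch_unwind(f);
        let (lock, cvar) = &*cond;
        *lock.lock().unwrap() = Some(outcome);
        cvar.notify_all();
        // Hand the panic (if any) back through the join handle.
        result
    })
}
```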
* master: (60 commits)
  Ensure all `StorageVersion`s on Rococo/Westend are correct and migration hooks pass (#7251)
  Try-runtime proper return types (#7146)
  Have OCW mined election once a week on Westend (#7248)
  Bump enumn from 0.1.5 to 0.1.8 (#7226)
  Companion to #14183: FRAME: Allow message ID to be mutated in `ProcessMessage` (#7262)
  Remove TODO comment (#7260)
  Fix build (#7261)
  Update syn (#7258)
  Use Message Queue pallet for UMP dispatch (#6271)
  Freeze chain if there are byzantine threshold + 1 invalid votes against a local candidate (#7225)
  Revert chain if at least f+1 validators voted against a candidate (#7151)
  Ensure all `StorageVersion`s on Polkadot/Kusama are correct (#7199)
  Forgotten pub reexport for `GlobalConsensusParachainConvertsFor` (#7238)
  PVF: Vote invalid on panics in execution thread (after a retry) (#7155)
  PVF: Remove `rayon` and some uses of `tokio` (#7153)
  [xcm] Foreign global consensus parachain LocationToAccountId converter (#7016)
  Update docs (#7230)
  Bump parity-db to 0.4.8 (#7231)
  Merge branch 'master' of https://github.com/paritytech/polkadot (#7224)
  Relax the watermark rule in the runtime (#7188)
  ...
…slashing-client

* ao-past-session-slashing-runtime: (61 commits)
  Ensure all `StorageVersion`s on Rococo/Westend are correct and migration hooks pass (#7251)
  Try-runtime proper return types (#7146)
  Have OCW mined election once a week on Westend (#7248)
  Bump enumn from 0.1.5 to 0.1.8 (#7226)
  Companion to #14183: FRAME: Allow message ID to be mutated in `ProcessMessage` (#7262)
  Remove TODO comment (#7260)
  Fix build (#7261)
  Update syn (#7258)
  Use Message Queue pallet for UMP dispatch (#6271)
  Freeze chain if there are byzantine threshold + 1 invalid votes against a local candidate (#7225)
  Revert chain if at least f+1 validators voted against a candidate (#7151)
  Ensure all `StorageVersion`s on Polkadot/Kusama are correct (#7199)
  Forgotten pub reexport for `GlobalConsensusParachainConvertsFor` (#7238)
  PVF: Vote invalid on panics in execution thread (after a retry) (#7155)
  PVF: Remove `rayon` and some uses of `tokio` (#7153)
  [xcm] Foreign global consensus parachain LocationToAccountId converter (#7016)
  Update docs (#7230)
  Bump parity-db to 0.4.8 (#7231)
  Merge branch 'master' of https://github.com/paritytech/polkadot (#7224)
  Relax the watermark rule in the runtime (#7188)
  ...
PULL REQUEST
Overview
1. We were using `rayon` to spawn a superfluous threadpool and thread to do execution, so it was removed.
2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible per-runtime but not per-thread). Since we want to remove `tokio` from the workers anyway, I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]
3. Since the `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was removed in favor of non-async coordination using sync primitives.

I left some TODO's related to panics which I'm going to address soon as part of #7045.
Related issues
Gets us closer to paritytech/polkadot-sdk#649.
Footnotes
[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.