This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

PVF: Remove rayon and some uses of tokio #7153

Merged

mrcnski merged 16 commits into master from mrcnski/pvf-remove-rayon-tokio on May 16, 2023

Conversation

mrcnski
Contributor

@mrcnski mrcnski commented May 1, 2023

PULL REQUEST

Overview

1. We were using `rayon` to spawn a superfluous threadpool and thread to do execution, so it was removed.

2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio` (a rough sketch follows this list).[^1]

3. Since the `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was removed in favor of non-async coordination using sync primitives.

I left some TODOs related to panics which I'm going to address soon as part of #7045.
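
For illustration, a minimal sketch of the `std::thread` approach from point 2; the stack-size constant, thread name and closure body are placeholders, not the PR's actual code:

```rust
use std::thread;

// Placeholder stack size for illustration; the real worker defines its own constant.
const WORKER_STACK_SIZE: usize = 2 * 1024 * 1024;

fn spawn_execute_thread() -> std::io::Result<thread::JoinHandle<()>> {
    // `std::thread::Builder` supports a per-thread stack size, which `tokio`
    // only exposes per-runtime and `rayon` only per-threadpool.
    thread::Builder::new()
        .name("execute-worker".into())
        .stack_size(WORKER_STACK_SIZE)
        .spawn(|| {
            // ... run the execution job here ...
        })
}
```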

Related issues

Gets us closer to paritytech/polkadot-sdk#649.

Footnotes

[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.

@mrcnski mrcnski added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. labels May 1, 2023
mrcnski added a commit that referenced this pull request May 1, 2023
Addresses a couple of follow-up TODOs from
#7153.
node/core/pvf/worker/src/prepare.rs (outdated review thread, resolved)
node/core/pvf/worker/src/executor_intf.rs (outdated review thread, resolved)
@@ -68,13 +65,12 @@ pub fn worker_event_loop<F, Fut>(

// Run the main worker loop.
let rt = Runtime::new().expect("Creates tokio runtime. If this panics the worker will die and the host will detect that and deal with it.");
let handle = rt.handle();
let err = rt
Contributor

Why do we need `tokio` and `futures` if everything async-related is already purged? Filesystem interactions are sync in nature, and for the worker, reading from the socket is blocking too.

Contributor Author

You're right! I just didn't remove the rest of async to keep this PR focused. But I can do it here if you want.

Note that we still need to remove the dependencies on polkadot-node-core-pvf and tracing-gum to fully remove the dependency on tokio.

Contributor

How's tracing-gum related to tokio?

Contributor Author

I just ran `cargo tree -e normal` in the crate and saw `tokio` several times in the output, e.g. under the `sc-network` and `libp2p` crates. I have no idea how `tracing-gum` works though.

Contributor

It's jaeger, even though gum only uses hashing from it.

> You're right! I just didn't remove the rest of async to keep this PR focused. But I can do it here if you want.

Better to polish the rest of the code and properly synchronize threads, and take care of removing tokio later (if it's possible).

node/core/pvf/worker/src/execute.rs (outdated review thread, resolved)
- Measure the CPU time in the prepare thread, so the observed time is not
  affected by any delays in joining on the thread.

- Measure the full CPU time in the execute thread.
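
To illustrate the first bullet (this is not the PR's actual code): the idea is to read the CPU clock inside the job thread itself, before anyone joins on it, e.g. via the `libc` crate's `clock_gettime(CLOCK_THREAD_CPUTIME_ID)` on Linux, so delays between the job finishing and the thread being joined don't inflate the measurement:

```rust
use std::mem::MaybeUninit;
use std::time::Duration;

/// Returns the CPU time consumed by the calling thread (Linux-specific sketch).
fn thread_cpu_time() -> Duration {
    let mut ts = MaybeUninit::<libc::timespec>::uninit();
    // SAFETY: we pass a valid clock id and a valid pointer to a timespec.
    let ret = unsafe { libc::clock_gettime(libc::CLOCK_THREAD_CPUTIME_ID, ts.as_mut_ptr()) };
    assert_eq!(ret, 0, "clock_gettime failed");
    let ts = unsafe { ts.assume_init() };
    Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32)
}

fn job_thread() -> Duration {
    // ... do the preparation/execution work ...
    // Measure before returning, i.e. before the host joins on this thread.
    thread_cpu_time()
}
```
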
Use condvars i.e. `Arc::new((Mutex::new(true), Condvar::new()))` as per the std
docs.

Considered also using a condvar to signal the CPU thread to end, in place of an
mpsc channel. This was not done because `Condvar::wait_timeout_while` is
documented as being imprecise, and `mpsc::Receiver::recv_timeout` is not
documented as such. Also, we would need a separate condvar, to avoid this case:
the worker thread finishes its job, notifies the condvar, the CPU thread returns
first, and we join on it and not the worker thread. So it was simpler to leave
this part as is.
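
A minimal sketch of the condvar pattern described above, following the `std::sync::Condvar` docs; the flag semantics (`true` = job still pending) and thread body are illustrative only, not the PR's actual code:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

fn main() {
    // `true` means the job is still pending, as in `Mutex::new(true)` above.
    let pair = Arc::new((Mutex::new(true), Condvar::new()));
    let pair_for_job = Arc::clone(&pair);

    thread::spawn(move || {
        // ... do the actual job here ...
        let (lock, cvar) = &*pair_for_job;
        *lock.lock().unwrap() = false;
        // Wake up whoever is waiting for the job to finish.
        cvar.notify_all();
    });

    // Block the coordinating thread until the job thread flips the flag.
    let (lock, cvar) = &*pair;
    let _guard = cvar
        .wait_while(lock.lock().unwrap(), |pending| *pending)
        .unwrap();
}
```
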
@mrcnski mrcnski requested review from slumber and bkchr May 2, 2023 09:54
node/core/pvf/worker/src/execute.rs (outdated review thread, resolved)
node/core/pvf/worker/src/common.rs (outdated review thread, resolved)
Contributor

@slumber slumber left a comment

Nice, almost there

node/core/pvf/worker/src/common.rs (outdated review thread, resolved)
node/core/pvf/worker/src/execute.rs (outdated review thread, resolved)
node/core/pvf/worker/src/execute.rs (outdated review thread, resolved)
@mrcnski mrcnski requested a review from slumber May 2, 2023 17:30
@mrcnski
Contributor Author

mrcnski commented May 3, 2023

@slumber I pushed another change, can you please take a look? Thank you for helping!

node/core/pvf/worker/src/common.rs (outdated review thread, resolved)
node/core/pvf/worker/src/common.rs (outdated review thread, resolved)
@mrcnski mrcnski self-assigned this May 15, 2023
#[derive(Clone, Copy)]
pub enum WaitOutcome {
    JobFinished,
    CpuTimedOut,
Contributor

nit: Wouldn't `JobPending` and `JobTimedOut` sound better?

Contributor

Or remove the prefix completely: `Finished`, `TimedOut`, `Pending`.

WaitOutcome::JobFinished => {
    let _ = cpu_time_monitor_tx.send(());
    execute_thread.join().unwrap_or_else(|e| {
        // TODO: Use `Panic` error once that is implemented.
Contributor

Maybe have an issue for this, rather than a TODO in the code?

Contributor Author

I already addressed it here; it's approved, so I'll merge it right after this PR. (I had planned to do these changes right after one another, so I left the TODO as a marker for myself, did the change, and set #7155's merge target to this branch. Will file issues instead in the future. 👍)

None
},
}
// Join on the thread handle.
Contributor

nit: comment seems superfluous

Err(PrepareError::TimedOut)
},
Ok(None) => Err(PrepareError::IoErr(
"error communicating over finished channel".into(),
Contributor

Suggested change:
- "error communicating over finished channel".into(),
+ "error communicating over closed channel".into(),

/// Block the thread while it waits on the condvar or on a timeout. If the timeout is hit,
/// returns `None`.
#[cfg_attr(not(any(target_os = "linux", feature = "jemalloc-allocator")), allow(dead_code))]
pub fn wait_for_threads_with_timeout(cond: &Cond, dur: Duration) -> Option<WaitOutcome> {
Contributor

Could we have an extra variant in `WaitOutcome` instead of the `None`? It would then have the same return type as `wait_for_threads`.

Contributor Author

I think it is better the way it is: the variant wouldn't be applicable in a couple of places, so we'd need extra `unreachable!`s. And if we set it to a different variant here, it would make the condvar state no longer appear pending.
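
A sketch of how a helper with the signature above could be written, assuming `Cond` is an `Arc<(Mutex<WaitOutcome>, Condvar)>` and that the enum has a `Pending` variant; this is an illustration, not the PR's exact code:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

// Illustrative definitions; the PR's actual `WaitOutcome` and `Cond` may differ.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum WaitOutcome {
    JobFinished,
    CpuTimedOut,
    Pending,
}

pub type Cond = Arc<(Mutex<WaitOutcome>, Condvar)>;

/// Block until some thread flips the outcome away from `Pending`, or until `dur`
/// elapses. Returns `None` on timeout.
pub fn wait_for_threads_with_timeout(cond: &Cond, dur: Duration) -> Option<WaitOutcome> {
    let (lock, cvar) = &**cond;
    let (guard, timeout_result) = cvar
        .wait_timeout_while(lock.lock().unwrap(), dur, |outcome| {
            *outcome == WaitOutcome::Pending
        })
        .unwrap();
    if timeout_result.timed_out() {
        None
    } else {
        Some(*guard)
    }
}
```

Returning `None` (rather than writing a dedicated timeout variant into the shared state) leaves the condvar state untouched on timeout, which matches the reasoning above.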

@mrcnski mrcnski merged commit 973da0e into master May 16, 2023
@mrcnski mrcnski deleted the mrcnski/pvf-remove-rayon-tokio branch May 16, 2023 18:35
paritytech-processbot bot pushed a commit that referenced this pull request May 16, 2023
* PVF: Remove `rayon` and some uses of `tokio`

1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.

2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]

[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.

3. Since `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.

4. The order of thread selection was flipped to make (3) sound (see note in code).

I left some TODO's related to panics which I'm going to address soon as part of #7045.

* PVF: Vote invalid on panics in execution thread (after a retry)

Also make sure we kill the worker process on panic errors and internal errors to
potentially clear any error states independent of the candidate.

* Address a couple of TODOs

Addresses a couple of follow-up TODOs from
#7153.

* Add some documentation to implementer's guide

* Fix compile error

* Fix compile errors

* Fix compile error

* Update roadmap/implementers-guide/src/node/utility/candidate-validation.md

Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>

* Address comments + couple other changes (see message)

- Measure the CPU time in the prepare thread, so the observed time is not
  affected by any delays in joining on the thread.

- Measure the full CPU time in the execute thread.

* Implement proper thread synchronization

Use condvars i.e. `Arc::new((Mutex::new(true), Condvar::new()))` as per the std
docs.

Considered also using a condvar to signal the CPU thread to end, in place of an
mpsc channel. This was not done because `Condvar::wait_timeout_while` is
documented as being imprecise, and `mpsc::Receiver::recv_timeout` is not
documented as such. Also, we would need a separate condvar, to avoid this case:
the worker thread finishes its job, notifies the condvar, the CPU thread returns
first, and we join on it and not the worker thread. So it was simpler to leave
this part as is.

* Catch panics in threads so we always notify condvar

* Use `WaitOutcome` enum instead of bool condition variable

* Fix retry timeouts to depend on exec timeout kind

* Address review comments

* Make the API for condvars in workers nicer

* Add a doc

* Use condvar for memory stats thread

* Small refactor

* Enumerate internal validation errors in an enum

* Fix comment

* Add a log

* Fix test

* Update variant naming

* Address a missed TODO

---------

Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
ordian added a commit that referenced this pull request May 23, 2023
* master: (60 commits)
  Ensure all `StorageVersion`s on Rococo/Westend are correct and migration hooks pass (#7251)
  Try-runtime proper return types (#7146)
  Have OCW mined election once a week on Westend (#7248)
  Bump enumn from 0.1.5 to 0.1.8 (#7226)
  Companion to #14183: FRAME: Allow message ID to be mutated in `ProcessMessage` (#7262)
  Remove TODO comment (#7260)
  Fix build (#7261)
  Update syn (#7258)
  Use Message Queue pallet for UMP dispatch (#6271)
  Freeze chain if there are byzantine threshold + 1 invalid votes against a local candidate (#7225)
  Revert chain if at least f+1 validators voted against a candidate (#7151)
  Ensure all `StorageVersion`s on Polkadot/Kusama are correct (#7199)
  Forgotten pub reexport for `GlobalConsensusParachainConvertsFor` (#7238)
  PVF: Vote invalid on panics in execution thread (after a retry) (#7155)
  PVF: Remove `rayon` and some uses of `tokio` (#7153)
  [xcm] Foreign global consensus parachain LocationToAccountId converter (#7016)
  Update docs (#7230)
  Bump parity-db to 0.4.8 (#7231)
  Merge branch 'master' of https://github.com/paritytech/polkadot (#7224)
  Relax the watermark rule in the runtime (#7188)
  ...
ordian added a commit that referenced this pull request May 23, 2023
…slashing-client

* ao-past-session-slashing-runtime: (61 commits)
  Ensure all `StorageVersion`s on Rococo/Westend are correct and migration hooks pass (#7251)
  Try-runtime proper return types (#7146)
  Have OCW mined election once a week on Westend (#7248)
  Bump enumn from 0.1.5 to 0.1.8 (#7226)
  Companion to #14183: FRAME: Allow message ID to be mutated in `ProcessMessage` (#7262)
  Remove TODO comment (#7260)
  Fix build (#7261)
  Update syn (#7258)
  Use Message Queue pallet for UMP dispatch (#6271)
  Freeze chain if there are byzantine threshold + 1 invalid votes against a local candidate (#7225)
  Revert chain if at least f+1 validators voted against a candidate (#7151)
  Ensure all `StorageVersion`s on Polkadot/Kusama are correct (#7199)
  Forgotten pub reexport for `GlobalConsensusParachainConvertsFor` (#7238)
  PVF: Vote invalid on panics in execution thread (after a retry) (#7155)
  PVF: Remove `rayon` and some uses of `tokio` (#7153)
  [xcm] Foreign global consensus parachain LocationToAccountId converter (#7016)
  Update docs (#7230)
  Bump parity-db to 0.4.8 (#7231)
  Merge branch 'master' of https://github.com/paritytech/polkadot (#7224)
  Relax the watermark rule in the runtime (#7188)
  ...