This repository has been archived by the owner on Nov 15, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
PVF: Vote invalid on panics in execution thread #7045
Labels
J7-mentor
An easy task where a mentor is available. Please indicate in the issue who the mentor could be.
T4-parachains_engineering
This PR/Issue is related to Parachains performance, stability, maintenance.
U1-asap
No need to stop dead in your tracks, however issue should be addressed as soon as possible.
Comments
mrcnski
added
T4-parachains_engineering
This PR/Issue is related to Parachains performance, stability, maintenance.
J7-mentor
An easy task where a mentor is available. Please indicate in the issue who the mentor could be.
labels
Apr 11, 2023
mrcnski
added
the
U1-asap
No need to stop dead in your tracks, however issue should be addressed as soon as possible.
label
Apr 12, 2023
Setting as high priority (required before parathreads). If someone figures out how to trigger a panic in execution they could stall the chain. Fortunately it's a quick fix. |
Assuming somebody figures that out, wouldn’t this panic on the collator as well? If they are malicious It would then panic in backing and if backers are malicious, that is the definition of a garbage candidate. Wdyt? |
Ok, I just re-read, disregard the above comment as it doesn't make sense |
mrcnski
added a commit
that referenced
this issue
May 1, 2023
1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed. 2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1] [^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet. 3. Since `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop. 4. The order of thread selection was flipped to make (3) sound (see note in code). I left some TODO's related to panics which I'm going to address soon as part of #7045.
This was referenced May 1, 2023
paritytech-processbot bot
pushed a commit
that referenced
this issue
May 16, 2023
* PVF: Remove `rayon` and some uses of `tokio` 1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed. 2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1] [^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet. 3. Since `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop. 4. The order of thread selection was flipped to make (3) sound (see note in code). I left some TODO's related to panics which I'm going to address soon as part of #7045. * PVF: Vote invalid on panics in execution thread (after a retry) Also make sure we kill the worker process on panic errors and internal errors to potentially clear any error states independent of the candidate. * Address a couple of TODOs Addresses a couple of follow-up TODOs from #7153. * Add some documentation to implementer's guide * Fix compile error * Fix compile errors * Fix compile error * Update roadmap/implementers-guide/src/node/utility/candidate-validation.md Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com> * Address comments + couple other changes (see message) - Measure the CPU time in the prepare thread, so the observed time is not affected by any delays in joining on the thread. - Measure the full CPU time in the execute thread. * Implement proper thread synchronization Use condvars i.e. `Arc::new((Mutex::new(true), Condvar::new()))` as per the std docs. Considered also using a condvar to signal the CPU thread to end, in place of an mpsc channel. This was not done because `Condvar::wait_timeout_while` is documented as being imprecise, and `mpsc::Receiver::recv_timeout` is not documented as such. Also, we would need a separate condvar, to avoid this case: the worker thread finishes its job, notifies the condvar, the CPU thread returns first, and we join on it and not the worker thread. So it was simpler to leave this part as is. * Catch panics in threads so we always notify condvar * Use `WaitOutcome` enum instead of bool condition variable * Fix retry timeouts to depend on exec timeout kind * Address review comments * Make the API for condvars in workers nicer * Add a doc * Use condvar for memory stats thread * Small refactor * Enumerate internal validation errors in an enum * Fix comment * Add a log * Fix test * Update variant naming * Address a missed TODO --------- Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
github-project-automation
bot
moved this from In Progress
to Done
in PVF Security Hardening
May 16, 2023
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Labels
J7-mentor
An easy task where a mentor is available. Please indicate in the issue who the mentor could be.
T4-parachains_engineering
This PR/Issue is related to Parachains performance, stability, maintenance.
U1-asap
No need to stop dead in your tracks, however issue should be addressed as soon as possible.
ISSUE
Overview
We have two criteria for internal errors (errors not raising a dispute):
See the note on paritytech/polkadot-sdk#661 for a full explanation.
In execution, we currently treat panics as internal errors1. But I would say that panics do not meet the above criteria - panics can happen for any reason and are not 100% a local issue. For example, someone could find some bug in wasmtime that causes a panic and craft a PVF to exploit it. In such a case it would be better to vote against (invalid candidate).
Proposal
Footnotes
See https://github.com/paritytech/polkadot/blob/dbae30efe0/node/core/pvf/src/execute/worker.rs#L332. It's kind of implicit and easy to miss, because panics in the execution thread are returned as errors when we call
join
.2 ↩The text was updated successfully, but these errors were encountered: