This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Add PVF module documentation #6293

Merged
merged 9 commits into from
Nov 23, 2022
Changes from 6 commits
2 changes: 1 addition & 1 deletion node/core/pvf/src/executor_intf.rs
@@ -96,7 +96,7 @@ pub fn prevalidate(code: &[u8]) -> Result<RuntimeBlob, sc_executor_common::error
}

/// Runs preparation on the given runtime blob. If successful, it returns a serialized compiled
/// artifact which can then be used to pass into [`execute`] after writing it to the disk.
/// artifact which can then be used to pass into `Executor::execute` after writing it to the disk.
pub fn prepare(blob: RuntimeBlob) -> Result<Vec<u8>, sc_executor_common::error::WasmError> {
sc_executor_wasmtime::prepare_runtime_artifact(blob, &CONFIG.semantics)
}
59 changes: 2 additions & 57 deletions node/core/pvf/src/lib.rs
@@ -16,64 +16,9 @@

#![warn(missing_docs)]

//! A crate that implements PVF validation host.
//! A crate that implements the PVF validation host.
//!
//! This crate provides a simple API. You first [`start`] the validation host, which gives you the
//! [handle][`ValidationHost`] and the future you need to poll.
//!
//! Then using the handle the client can send two types of requests:
//!
//! (a) PVF execution. This accepts the PVF [`params`][`polkadot_parachain::primitives::ValidationParams`]
//! and the PVF [code][`Pvf`], prepares (verifies and compiles) the code, and then executes PVF
//! with the `params`.
//!
//! (b) Heads up. This request allows to signal that the given PVF may be needed soon and that it
//! should be prepared for execution.
//!
//! The preparation results are cached for some time after they either used or was signaled in heads up.
//! All requests that depends on preparation of the same PVF are bundled together and will be executed
//! as soon as the artifact is prepared.
//!
//! # Priority
//!
//! PVF execution requests can specify the [priority][`Priority`] with which the given request should
//! be handled. Different priority levels have different effects. This is discussed below.
//!
//! Preparation started by a heads up signal always starts in with the background priority. If there
//! is already a request for that PVF preparation under way the priority is inherited. If after heads
//! up, a new PVF execution request comes in with a higher priority, then the original task's priority
//! will be adjusted to match the new one if it's larger.
//!
//! Priority can never go down, only up.
//!
//! # Under the hood
//!
//! Under the hood, the validation host is built using a bunch of communicating processes, not
//! dissimilar to actors. Each of such "processes" is a future task that contains an event loop that
//! processes incoming messages, potentially delegating sub-tasks to other "processes".
//!
//! Two of these processes are queues. The first one is for preparation jobs and the second one is for
//! execution. Both of the queues are backed by separate pools of workers of different kind.
//!
//! Preparation workers handle preparation requests by preverifying and instrumenting PVF wasm code,
//! and then passing it into the compiler, to prepare the artifact.
//!
//! Artifact is a final product of preparation. If the preparation succeeded, then the artifact will
//! contain the compiled code usable for quick execution by a worker later on.
//!
//! If the preparation failed, then the worker will still write the artifact with the error message.
//! We save the artifact with the error so that we don't try to prepare the artifacts that are broken
//! repeatedly.
//!
//! The artifact is saved on disk and is also tracked by an in memory table. This in memory table
//! doesn't contain the artifact contents though, only a flag that the given artifact is compiled.
//!
//! The execute workers will be fed by the requests from the execution queue, which is basically a
//! combination of a path to the compiled artifact and the
//! [`params`][`polkadot_parachain::primitives::ValidationParams`].
//!
//! Each fixed interval of time a pruning task will run. This task will remove all artifacts that
//! weren't used or received a heads up signal for a while.
//! This is responsible for handling requests to prepare and execute PVF code blobs.

mod artifacts;
mod error;
2 changes: 1 addition & 1 deletion node/core/pvf/src/priority.rs
@@ -24,7 +24,7 @@ pub enum Priority {
Normal,
/// This priority is used for requests that are required to be processed as soon as possible.
///
/// For example, backing is on critical path and require execution as soon as possible.
/// For example, backing is on a critical path and requires execution as soon as possible.
Critical,
}

1 change: 0 additions & 1 deletion roadmap/implementers-guide/README.md
@@ -22,7 +22,6 @@ Then install and build the book:
```sh
cargo install mdbook mdbook-linkcheck mdbook-graphviz mdbook-mermaid mdbook-last-changed
mdbook serve roadmap/implementers-guide
open http://localhost:3000
```

## Specification
2 changes: 2 additions & 0 deletions roadmap/implementers-guide/src/SUMMARY.md
@@ -60,6 +60,7 @@
- [Utility Subsystems](node/utility/README.md)
- [Availability Store](node/utility/availability-store.md)
- [Candidate Validation](node/utility/candidate-validation.md)
- [PVF](node/utility/pvf.md)
- [Provisioner](node/utility/provisioner.md)
- [Network Bridge](node/utility/network-bridge.md)
- [Gossip Support](node/utility/gossip-support.md)
@@ -71,6 +72,7 @@
- [PVF Pre-Checking](node/utility/pvf-prechecker.md)
- [Data Structures and Types](types/README.md)
- [Candidate](types/candidate.md)
- [PVF](types/pvf.md)
- [Backing](types/backing.md)
- [Availability](types/availability.md)
- [Overseer and Subsystem Protocol](types/overseer-protocol.md)
1 change: 0 additions & 1 deletion roadmap/implementers-guide/src/glossary.md
@@ -48,4 +48,3 @@ exactly one downward message queue.
Also of use is the [Substrate Glossary](https://substrate.dev/docs/en/knowledgebase/getting-started/glossary).

[0]: https://wiki.polkadot.network/docs/learn-consensus
[1]: #pvf
6 changes: 3 additions & 3 deletions roadmap/implementers-guide/src/node/utility/pvf-prechecker.md
@@ -12,11 +12,11 @@ This subsytem does not produce any output messages either. The subsystem will, h

If the node is running in a collator mode, this subsystem will be disabled. The PVF pre-checker subsystem keeps track of the PVFs that are relevant for the subsystem.

To be relevant for the subsystem, a PVF must be returned by `pvfs_require_precheck` [`pvfs_require_precheck` runtime API][PVF pre-checking runtime API] in any of the active leaves. If the PVF is not present in any of the active leaves, it ceases to be relevant.
To be relevant for the subsystem, a PVF must be returned by the [`pvfs_require_precheck` runtime API][PVF pre-checking runtime API] in any of the active leaves. If the PVF is not present in any of the active leaves, it ceases to be relevant.

When a PVF just becomes relevant, the subsystem will send a message to the [Candidate Validation] subsystem asking for the pre-check.

Upon receving a message from the candidate-validation subsystem, the pre-checker will note down that the PVF has its judgement and will also sign and submit a [`PvfCheckStatement`] via the [`submit_pvf_check_statement` runtime API][PVF pre-checking runtime API]. In case, a judgement was received for a PVF that is no longer in view it is ignored. It is possible that the candidate validation was not able to check the PVF. In that case, the PVF pre-checker will abstain and won't submit any check statements.
Upon receiving a message from the candidate-validation subsystem, the pre-checker will note down that the PVF has its judgement and will also sign and submit a [`PvfCheckStatement`][PvfCheckStatement] via the [`submit_pvf_check_statement` runtime API][PVF pre-checking runtime API]. If a judgement is received for a PVF that is no longer in view, it is ignored. It is possible that the candidate validation was not able to check the PVF. In that case, the PVF pre-checker will abstain and won't submit any check statements.

Since a vote is only valid during [one session][overview], the subsystem will have to re-sign and submit the statements for the new session. The new session is assumed to have started if at least one of the leaves has a greater session index than was previously observed in any of the leaves.

@@ -28,4 +28,4 @@ If the node is not in the active validator set, it will still perform all the ch
[Runtime API]: runtime-api.md
[PVF pre-checking runtime API]: ../../runtime-api/pvf-prechecking.md
[Candidate Validation]: candidate-validation.md
[`PvfCheckStatement`]: ../../types/pvf-prechecking.md
[PvfCheckStatement]: ../../types/pvf-prechecking.md#pvfcheckstatement
114 changes: 114 additions & 0 deletions roadmap/implementers-guide/src/node/utility/pvf.md
@@ -0,0 +1,114 @@
# PVF

The `pvf` module is responsible for handling preparation and execution subtasks
for PVF code blobs.

## Entrypoint

This crate provides a simple API. You first `start` the validation host, which
gives you the [handle][ValidationHost] and the future you need to poll.

Then, using the handle, the client can send three types of requests:

(a) PVF pre-checking. This takes the PVF [code][Pvf] and tries to prepare it
(verify and compile) in order to pre-check its validity.

(b) PVF execution. This accepts the PVF [`params`][ValidationParams] and the PVF
[code][Pvf], prepares (verifies and compiles) the code, and then executes PVF
with the `params`.

(c) Heads up. This request signals that the given PVF may be needed soon and
that it should be prepared for execution.

The preparation results are cached for some time after they were either used or
signaled in a heads up. All requests that depend on preparation of the same PVF
are bundled together and will be executed as soon as the artifact is prepared.
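
The three request types above can be pictured as messages sent through the
handle. The following is a minimal sketch only: `ToHost` and `to_host_tx` appear
in the type definitions of this PR, but the variant names, the fields, the
`HostHandle` wrapper, and the use of a synchronous `std::sync::mpsc` channel
(instead of the async channel and polled future used by the crate) are
assumptions made for illustration.

```rust
use std::sync::mpsc;

/// Hypothetical request type; one variant per request kind described above.
#[derive(Debug, PartialEq)]
pub enum ToHost {
    /// (a) Prepare the PVF only to check that it is valid.
    Precheck { code: Vec<u8> },
    /// (b) Prepare the PVF if needed, then execute it with `params`.
    Execute { code: Vec<u8>, params: Vec<u8> },
    /// (c) Hint that these PVFs may be needed soon.
    HeadsUp { pvfs: Vec<Vec<u8>> },
}

/// Hypothetical handle: a thin wrapper around the sending half of a channel.
#[derive(Clone)]
pub struct HostHandle {
    to_host_tx: mpsc::Sender<ToHost>,
}

impl HostHandle {
    /// Sends a request to the host; fails if the host side has shut down.
    pub fn send(&self, request: ToHost) -> Result<(), String> {
        self.to_host_tx
            .send(request)
            .map_err(|_| "the validation host is gone".to_string())
    }
}

/// Creates the handle plus the receiving end that the host's event loop
/// would drain (a stand-in for the future that the real `start` returns).
pub fn start() -> (HostHandle, mpsc::Receiver<ToHost>) {
    let (to_host_tx, to_host_rx) = mpsc::channel();
    (HostHandle { to_host_tx }, to_host_rx)
}
```

In this sketch, requests arrive at the host in the order they were sent, and
sending fails once the receiving side has been dropped.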

## Priority

PVF execution requests can specify the [priority][Priority] with which the
given request should be handled. Different priority levels have different
effects. This is discussed below.

Preparation started by a heads up signal always starts with the background
priority. If a preparation request for that PVF is already under way, its
priority is inherited. If, after a heads up, a new PVF execution request comes
in with a higher priority, then the original task's priority will be raised to
match the new one.

Priority can never go down, only up.
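
The "only up" rule can be sketched as taking the maximum of the current and the
requested priority. This is an illustration under assumptions: the `Background`
variant is inferred from the prose above (the type listing in this PR shows only
`Normal` and `Critical`), and `PrepareJob`, `from_heads_up`, and `bump` are
hypothetical names.

```rust
/// Hypothetical priority levels, declared from lowest to highest so that the
/// derived ordering matches the intended urgency.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Priority {
    Background,
    Normal,
    Critical,
}

/// Hypothetical record of a preparation job that is already under way.
pub struct PrepareJob {
    pub priority: Priority,
}

impl PrepareJob {
    /// A heads-up signal starts preparation at the lowest priority.
    pub fn from_heads_up() -> Self {
        PrepareJob { priority: Priority::Background }
    }

    /// A later request may raise the priority but never lowers it.
    pub fn bump(&mut self, requested: Priority) {
        self.priority = self.priority.max(requested);
    }
}
```

With this shape, a `Normal` request arriving after a `Critical` one leaves the
job at `Critical`.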

## Mitigating disputes

### Retrying execution requests

If the execution request fails during **preparation**, we will retry if it is
possible that the preparation error was transient (i.e. it was of type
`PrepareError::Panic`, `PrepareError::TimedOut`, or
`PrepareError::DidNotMakeIt`). We will only retry preparation if another
request comes in after 15 minutes, to ensure any potential transient conditions
have had time to be resolved. We will retry up to 5 times. See
`can_retry_prepare_after_failure`.
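
The retry rule above can be sketched as a pure predicate. The signature of the
real `can_retry_prepare_after_failure` is not shown in this PR, so the function
below uses a different, hypothetical name (`may_retry_prepare`) and assumed
parameters. The `Deterministic` variant and the payload on `Panic` are likewise
placeholders; only the three transient variant names and the two limits come
from the text.

```rust
use std::time::Duration;

/// The three transient error kinds named above, plus one hypothetical
/// stand-in for failures that will never succeed on retry.
#[derive(Debug, Clone, PartialEq)]
pub enum PrepareError {
    Panic(String),
    TimedOut,
    DidNotMakeIt,
    /// Assumption: placeholder for any deterministic failure.
    Deterministic(String),
}

/// Minimum time since the last failure before a retry is considered.
pub const PREPARE_FAILURE_COOLDOWN: Duration = Duration::from_secs(15 * 60);
/// Maximum number of retries.
pub const NUM_PREPARE_RETRIES: u32 = 5;

/// Returns `true` if a new request may trigger another preparation attempt.
///
/// `num_failures` counts the failures so far, so allowing values up to
/// `NUM_PREPARE_RETRIES` yields at most five retries after the first failure.
pub fn may_retry_prepare(
    since_last_failure: Duration,
    num_failures: u32,
    error: &PrepareError,
) -> bool {
    let transient = matches!(
        error,
        PrepareError::Panic(_) | PrepareError::TimedOut | PrepareError::DidNotMakeIt
    );
    transient
        && since_last_failure >= PREPARE_FAILURE_COOLDOWN
        && num_failures <= NUM_PREPARE_RETRIES
}
```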

If the actual **execution** of the artifact fails, we will retry once if it was
an `InvalidCandidate::AmbiguousWorkerDeath` error, after a 1 second delay to
Review comment (Member): Did we really go with 1 second? Andronik suggested 3 seconds back then.

Reply (Contributor Author): Oh oops, I saw that this was resolved so I didn't make the change. I should have read more closely, because it sounds like we do want to go with 3 seconds. I can raise a separate PR, LMK.

allow any potential transient conditions to clear. This occurs outside this
module, in the Candidate Validation subsystem.
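
A single retry on an ambiguous worker death could look roughly like the sketch
below. It is not the Candidate Validation subsystem's code: the helper name, the
closure-based shape, the blocking `thread::sleep`, and the `Other` variant are
assumptions. The delay is left as a parameter rather than hard-coded.

```rust
use std::{thread, time::Duration};

/// `AmbiguousWorkerDeath` is named in the text; `Other` is a hypothetical
/// stand-in for every failure that should not be retried.
#[derive(Debug, Clone, PartialEq)]
pub enum InvalidCandidate {
    AmbiguousWorkerDeath,
    Other(String),
}

/// Runs `execute` once and, only if the worker died ambiguously, waits for
/// `retry_delay` and runs it exactly one more time.
pub fn execute_with_one_retry<F>(
    mut execute: F,
    retry_delay: Duration,
) -> Result<Vec<u8>, InvalidCandidate>
where
    F: FnMut() -> Result<Vec<u8>, InvalidCandidate>,
{
    match execute() {
        Err(InvalidCandidate::AmbiguousWorkerDeath) => {
            // Give transient conditions a moment to clear before retrying.
            thread::sleep(retry_delay);
            execute()
        },
        other => other,
    }
}
```

Any other error, and any result of the second attempt, is returned to the
caller unchanged.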

### Preparation timeouts

We use a timeout for preparation to limit the amount of time it can take. As the
time for preparation can vary depending on the machine and load on the machine,
this can potentially lead to disputes where some validators are able to execute
a PVF and others aren't.

One mitigation we have in place is a more lenient timeout for preparation during
execution than during pre-checking. The rationale is that the PVF has already
passed pre-checking, so we know it should be valid, and we allow it to take
longer than expected as this is likely due to an issue with the machine and not
the PVF.
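
One way to express the two timeouts is to select the limit from the reason for
preparing. In this sketch the enum, the constant names, and especially the
durations are illustrative placeholders that do not come from this document;
the only property taken from the text is that the execution-time limit is the
more lenient one.

```rust
use std::time::Duration;

/// Hypothetical tag for why a PVF is being prepared.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PrepareReason {
    /// Preparation as part of pre-checking.
    Precheck,
    /// Preparation needed by an execution request.
    Execute,
}

/// Placeholder value, chosen only for illustration.
pub const PRECHECK_PREPARATION_TIMEOUT: Duration = Duration::from_secs(60);
/// Placeholder value, chosen only to be larger than the pre-checking limit.
pub const LENIENT_PREPARATION_TIMEOUT: Duration = Duration::from_secs(360);

/// Picks the preparation timeout: strict for pre-checking, lenient when the
/// PVF has already passed pre-checking and is being prepared for execution.
pub fn preparation_timeout(reason: PrepareReason) -> Duration {
    match reason {
        PrepareReason::Precheck => PRECHECK_PREPARATION_TIMEOUT,
        PrepareReason::Execute => LENIENT_PREPARATION_TIMEOUT,
    }
}
```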

## Under the hood

### The flow

Under the hood, the validation host is built using a bunch of communicating
processes, not dissimilar to actors. Each such "process" is a future task
that contains an event loop that processes incoming messages, potentially
delegating sub-tasks to other "processes".

Two of these processes are queues. The first one is for preparation jobs and the
second one is for execution. Both of the queues are backed by separate pools of
workers of different kinds.

Preparation workers handle preparation requests by prevalidating and
instrumenting PVF wasm code, and then passing it into the compiler, to prepare
the artifact.

### Artifacts

An artifact is the final product of preparation. If the preparation succeeded,
then the artifact will contain the compiled code usable for quick execution by a
worker later on.

If the preparation failed, then the worker will still write the artifact with
the error message. We save the artifact with the error so that we don't
repeatedly try to prepare artifacts that are broken.

The artifact is saved on disk and is also tracked by an in-memory table. This
in-memory table doesn't contain the artifact contents though, only a flag that
the given artifact is compiled.

A pruning task will run at a fixed interval of time. This task will remove all
artifacts that have not been used or received a heads up signal for a while.
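
The in-memory table and the pruning pass can be sketched as a map from artifact
ID to a small state value. All names here (`Artifacts`, `ArtifactState`, its
variants, `mark_needed`, `prune`) and the string IDs are assumptions for
illustration, and the sketch tracks only in-memory state; removing the files on
disk is left out.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-artifact state. The table stores no artifact contents.
#[derive(Debug, Clone, PartialEq)]
pub enum ArtifactState {
    /// Preparation is in progress.
    Preparing,
    /// The artifact is compiled; remember when it was last used or signaled.
    Prepared { last_time_needed: Instant },
    /// Preparation failed; kept so broken code is not prepared repeatedly.
    FailedToProcess { error: String },
}

/// Hypothetical in-memory table keyed by an artifact ID.
#[derive(Default)]
pub struct Artifacts {
    table: HashMap<String, ArtifactState>,
}

impl Artifacts {
    pub fn insert(&mut self, id: &str, state: ArtifactState) {
        self.table.insert(id.to_string(), state);
    }

    pub fn contains(&self, id: &str) -> bool {
        self.table.contains_key(id)
    }

    /// Records a use or heads-up signal for a prepared artifact.
    pub fn mark_needed(&mut self, id: &str, now: Instant) {
        if let Some(ArtifactState::Prepared { last_time_needed }) = self.table.get_mut(id) {
            *last_time_needed = now;
        }
    }

    /// Drops prepared artifacts that were not needed for longer than `ttl`
    /// and returns their IDs so the caller can delete the files on disk.
    pub fn prune(&mut self, now: Instant, ttl: Duration) -> Vec<String> {
        let mut pruned = Vec::new();
        self.table.retain(|id, state| {
            let expired = match state {
                ArtifactState::Prepared { last_time_needed } =>
                    now.saturating_duration_since(*last_time_needed) > ttl,
                _ => false,
            };
            if expired {
                pruned.push(id.clone());
            }
            !expired
        });
        pruned
    }
}
```

In this sketch only prepared artifacts expire; whether failed entries are also
pruned is not stated in the text, so they are kept here.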

### Execution

The execute workers are fed requests from the execution queue. Each request is
basically a combination of a path to the compiled artifact and the
[`params`][ValidationParams].

[ValidationHost]: ../../types/pvf.md#validationhost
[Pvf]: ../../types/pvf.md#pvf
[ValidationParams]: ../../types/candidate.md#validationparams
[Priority]: ../../types/pvf.md#priority
16 changes: 16 additions & 0 deletions roadmap/implementers-guide/src/types/candidate.md
@@ -92,6 +92,22 @@ struct CandidateDescriptor {
}
```

## `ValidationParams`

```rust
/// Validation parameters for evaluating the parachain validity function.
pub struct ValidationParams {
/// Previous head-data.
pub parent_head: HeadData,
/// The collation body.
pub block_data: BlockData,
/// The current relay-chain block number.
pub relay_parent_number: RelayChainBlockNumber,
/// The relay-chain block's storage root.
pub relay_parent_storage_root: Hash,
}
```

## `PersistedValidationData`

The validation data provides information about how to create the inputs for validation of a candidate. This information is derived from the chain state and will vary from para to para, although some of the fields may be the same for every para.
2 changes: 2 additions & 0 deletions roadmap/implementers-guide/src/types/chain.md
@@ -2,6 +2,8 @@

Types pertaining to the relay-chain - events, structures, etc.

TODO: These no longer exist.
Review comment (Contributor): You can go ahead and drop it then


## Block Import Event

```rust
4 changes: 1 addition & 3 deletions roadmap/implementers-guide/src/types/overseer-protocol.md
@@ -681,9 +681,7 @@ enum ProvisionerMessage {

The Runtime API subsystem is responsible for providing an interface to the state of the chain's runtime.

This is fueled by an auxiliary type encapsulating all request types defined in the Runtime API section of the guide.

> To do: link to the Runtime API section. Not possible currently because of https://github.com/Michael-F-Bryan/mdbook-linkcheck/issues/25. Once v0.7.1 is released it will work.
This is fueled by an auxiliary type encapsulating all request types defined in the [Runtime API section](../runtime-api) of the guide.

```rust
enum RuntimeApiRequest {
2 changes: 2 additions & 0 deletions roadmap/implementers-guide/src/types/pvf-prechecking.md
@@ -1,5 +1,7 @@
# PVF Pre-checking types

## `PvfCheckStatement`

> ⚠️ This type was added in v2.

One of the main units of information on which PVF pre-checking voting is built is the `PvfCheckStatement`.
37 changes: 37 additions & 0 deletions roadmap/implementers-guide/src/types/pvf.md
@@ -0,0 +1,37 @@
# PVF Types

## `ValidationHost`

```rust
/// A handle to the async process serving the validation host requests.
pub struct ValidationHost {
to_host_tx: mpsc::Sender<ToHost>,
}
```

## `Pvf`

```rust
/// A struct that carries code of a parachain validation function and its hash.
pub struct Pvf {
pub(crate) code: Arc<Vec<u8>>,
pub(crate) code_hash: ValidationCodeHash,
}
```

## `Priority`

```rust
/// A priority assigned to execution of a PVF.
pub enum Priority {
/// Normal priority for things that do not require immediate response, but still need to be
/// done pretty quick.
///
/// Approvals and disputes fall into this category.
Normal,
/// This priority is used for requests that are required to be processed as soon as possible.
///
/// For example, backing is on a critical path and requires execution as soon as possible.
Critical,
}
```