Add PVF execution priority #4837

Merged: 59 commits into master from AndreiEres/pvf-execution-priority, Oct 9, 2024

Conversation

@AndreiEres (Contributor) commented Jun 19, 2024

Resolves #4632

The new logic optimizes the distribution of execution jobs across disputes, approvals, and backings. Testing shows reduced finality lag and candidate checking times, especially under heavy network load.

Approach

This update adds prioritization to the PVF execution queue. The logic partially implements the suggestions from #4632 (comment).

We use thresholds to determine how much a current priority can "steal" from lower ones:

  • Disputes: 70%
  • Approvals: 80%
  • Backing System Parachains: 100%
  • Backing: 100%

Each threshold is the portion of the capacity left over by higher priorities that the given priority may claim.

For example:

  • Disputes take 70%, leaving 30% for approvals and all backings.
  • 80% of the remaining goes to approvals, which is 30% * 80% = 24% of the original 100%.
  • If we used parts of the original 100%, approvals couldn't take more than 24%, even if there are no disputes.

Assuming a maximum of 12 executions per block, with a 6-second window, 2 CPU cores, and a 2-second run time, we get these distributions:

  • With disputes: 8 disputes, 3 approvals, 1 backing
  • Without disputes: 9 approvals, 3 backings

It's worth noting that when there are no disputes and there is only one backing job, we continue processing approvals regardless of whether they have reached their threshold.
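
For illustration, here is a minimal Rust sketch of the arithmetic above. The split_budget helper and its names are hypothetical, not the actual queue implementation; it simply gives each priority its threshold percentage of whatever capacity higher priorities left unclaimed, rounding down, and skips priorities with nothing queued.

```rust
/// Hypothetical helper reproducing the threshold arithmetic from the description.
/// Each priority that has queued work takes its threshold percentage of whatever
/// capacity higher priorities left unclaimed.
fn split_budget(total: usize, thresholds: &[(&str, usize)], queued: &[&str]) -> Vec<(String, usize)> {
    let mut remaining = total;
    let mut shares = Vec::new();
    for &(priority, percent) in thresholds {
        // A priority with no queued jobs claims nothing, so its share stays available
        // for lower priorities. This is why approvals can reach 80% of the whole
        // budget when there are no disputes.
        if !queued.contains(&priority) {
            continue;
        }
        let share = remaining * percent / 100; // integer division rounds down
        shares.push((priority.to_string(), share));
        remaining -= share;
    }
    shares
}

fn main() {
    let thresholds = [("dispute", 70), ("approval", 80), ("backing_system", 100), ("backing", 100)];

    // With disputes queued: [("dispute", 8), ("approval", 3), ("backing_system", 1), ("backing", 0)]
    println!("{:?}", split_budget(12, &thresholds, &["dispute", "approval", "backing_system", "backing"]));

    // Without disputes: [("approval", 9), ("backing_system", 3), ("backing", 0)]
    println!("{:?}", split_budget(12, &thresholds, &["approval", "backing_system", "backing"]));
}
```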

Versi Testing 40/20

Testing showed a slight improvement in finality lag and candidate checking time for this pull request compared to its base on the master branch. The more loaded the network, the greater the observed difference.

Testing Parameters:

  • 40 validators (4 malicious)
  • 20 gluttons with 2 seconds of PVF execution time
  • 6 VRF modulo samples
  • 12 required approvals


Versi Testing 80/40

For this test, we compared the master branch with the branch from #5616. The second branch is based on the current one but removes backing jobs that have exceeded their time limits. We excluded malicious nodes to reduce noise from disputing and banning validators. The results show that, under the same load, nodes experience less finality lag and reduced recovery and check times. Parachains also run with a shorter block time, although it remains over 6 seconds.

Testing Parameters:

  • 80 validators (0 malicious)
  • 40 gluttons with 2 seconds of PVF execution time
  • 6 VRF modulo samples
  • 30 required approvals


@AndreiEres AndreiEres changed the title [WIP] PVF execution priority [WIP] Add PVF execution priority Jun 19, 2024
@AndreiEres AndreiEres added the T0-node and T8-polkadot labels Jun 19, 2024
@AndreiEres AndreiEres changed the title [WIP] Add PVF execution priority Add PVF execution priority Jun 20, 2024
@AndreiEres AndreiEres marked this pull request as ready for review June 20, 2024 10:32
Resolved review threads (outdated):

  • polkadot/node/subsystem-types/src/messages.rs
  • polkadot/node/core/pvf/src/priority.rs
  • polkadot/node/core/pvf/src/execute/queue.rs
AndreiEres and others added 6 commits June 26, 2024 11:55
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
@AndreiEres (Contributor Author) commented:

Updated the logic according to #4632 (comment), which looks reasonable.

@alexggh (Contributor) left a comment:

First pass: good job, I like the simplicity. I left you some comments, nothing major.

As we discussed online, it would be nice if we could run some integration tests to understand how the system behaves under limit conditions. I think we can simulate that by making the PVF execution longer, so that we fall behind on work.

Resolved review threads (outdated):

  • polkadot/node/subsystem-types/src/messages.rs (4 threads)
  • polkadot/node/core/pvf/src/execute/queue.rs (4 threads)
@AndreiEres (Contributor Author) commented:

Testing results added to the description

@sandreim (Contributor) left a comment:

Thank you @AndreiEres. This is very good work and the test results show it. Just a few more nits here and there, but otherwise good to go from my side.

@burdges any concerns with the percentages here:

// If we used parts of the original 100%, approvals can't take more than 24%,
?

)
}

fn is_fulfilled(&self, priority: &PvfExecPriority) -> bool {
Contributor:

This name is a bit confusing in the absence of comments. Why not actually return whether we can pick up a work item in the specified priority queue?

Contributor Author:

Changed the name; we have to check whether we reached a threshold for the specified priority.

Resolved review thread (outdated): polkadot/node/core/pvf/src/execute/queue.rs
// A threshold in percentages, the portion a current priority can "steal" from lower ones.
// For example:
// Disputes take 70%, leaving 30% for approvals and all backings.
// 80% of the remaining goes to approvals, which is 30% * 80% = 24% of the original 100%.
Contributor:

This way of expressing the percentages is not easy. You have to do this math every time. Can it be made simpler, to not require the reader to do much math?

Contributor Author:

I tried to update the docs to make it clearer. Unfortunately I didn't come up with a better solution.

Resolved review threads (outdated):

  • polkadot/node/core/pvf/src/execute/queue.rs
  • polkadot/node/core/candidate-validation/src/lib.rs
///
/// This system might seem complex, but we operate with the remaining percentages because:
/// - Not all job types are present in each block. If we used parts of the original 100%,
/// approvals could not exceed 24%, even if there are no disputes.
Comment:

I have trouble parsing "even if there are no disputes" here. If there are no backing jobs, then approvals take like 80% of the total, right?

Contributor Author:

Do you mean that we will skip jobs if we only have approvals in the queue and they have reached 80%? If every current priority has reached its limit, we will select the highest one to execute. In that case, we will continue picking approvals in the absence of other jobs.
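
For illustration, the selection rule described in this reply could be sketched roughly as follows; pick_priority and the names here are hypothetical, not the actual queue code.

```rust
// Rough sketch of the selection rule described above (hypothetical names, not the
// actual queue code). `queued` is ordered from highest to lowest priority.
fn pick_priority<'a>(
    queued: &[(&'a str, usize)],              // (priority, number of queued jobs)
    reached_threshold: impl Fn(&str) -> bool, // has this priority already used its share?
) -> Option<&'a str> {
    // Prefer the highest priority that has work and is still under its threshold.
    for &(priority, jobs) in queued {
        if jobs > 0 && !reached_threshold(priority) {
            return Some(priority);
        }
    }
    // If every non-empty priority has reached its threshold, fall back to the highest
    // non-empty one, so e.g. approvals keep executing when nothing else is queued.
    for &(priority, jobs) in queued {
        if jobs > 0 {
            return Some(priority);
        }
    }
    None
}
```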

/// - Not all job types are present in each block. If we used parts of the original 100%,
/// approvals could not exceed 24%, even if there are no disputes.
/// - We cannot fully prioritize backing system parachains over backing other parachains based
/// on the distribution of the original 100%.
Comment:

Yeah, we might have some gradations among the different system parachains, so do whatever you think is best for now. AssetHub could maybe halt without killing anything, although this depends upon how it's used. Among the chains that must run "enough" during an epoch:

  • Sassafras tickets collection, if we ever moved it onto a parachain (maybe never).
  • Validator elections should maintain a fresh validator set, even if we've lost many for some reason.
  • If validator elections runs then we must run bridgehub enough, and any future DKG chains.
  • Some governance chain without smart contracts (collectives?).

Contributor:

Yeah, this is just a first step. I agree it needs to be reconsidered once we start to move governance and staking out of the relay chain.

However, I think a better algorithm would actually target specific block times for the system parachains in times of overload.

Comment:

Afaik, critical system parachains should typically not require numerous blocks, specific block times, etc., but some require roughly one block from each validator during the epoch, including enough slack so spam cannot censor them.

This assumes the critical system parachains cannot be tricked into wasting their own blocks, which likely becomes a risk only if the d-day governance supports smart contracts.

@s0me0ne-unkn0wn (Contributor) left a comment:

Looks very good, thank you! 👍

Comment on lines 686 to 691
const PRIORITY_ALLOCATION_THRESHOLDS: &'static [(PvfExecPriority, usize)] = &[
(PvfExecPriority::Dispute, 70),
(PvfExecPriority::Approval, 80),
(PvfExecPriority::BackingSystemParas, 100),
(PvfExecPriority::Backing, 100),
];
Contributor:

How cool would it be to put that into the parachain host config? (Not in this PR ofc, just an idea)

@AndreiEres (Contributor Author) commented Sep 25, 2024:

That's a good question. I believe we'll have more information to make a decision after using it in production for a while. @sandreim WDYT?

/// to separate and prioritize execution jobs by request type.
/// The order is important, because we iterate through the values and assume they go from highest
/// to lowest priority.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, EnumIter)]
Member:

Given that we are already leaking to validation what we want to have validated here, we should likely keep the old name PvfExecKind and infer "priority" from this, together with other information. E.g. whether a validation is still worthwhile (ttl PR based on this one).

@AndreiEres AndreiEres added this pull request to the merge queue Oct 9, 2024
Merged via the queue into master with commit e294d62 Oct 9, 2024
190 of 201 checks passed
@AndreiEres AndreiEres deleted the AndreiEres/pvf-execution-priority branch October 9, 2024 17:44
ordian added a commit that referenced this pull request Oct 11, 2024
* master: (28 commits)
  `substrate-node`: removed excessive polkadot-sdk features (#5925)
  Rename QueueEvent::StartWork (#6015)
  [ci] Remove quick-benchmarks-omni from GitLab (#6014)
  Set larger timeout for cmd.yml (#6006)
  Fix `0003-beefy-and-mmr` test (#6003)
  Remove redundant XCMs from dry run's forwarded xcms (#5913)
  Add RadiumBlock bootnodes to Coretime Polkadot Chain spec (#5967)
  Bump strum from 0.26.2 to 0.26.3 (#5943)
  Add PVF execution priority (#4837)
  Snowbridge V2 docs (#5902)
  Fix u256 conversion in BABE (#5994)
  [ci] Move test-linux-stable-no-try-runtime to GHA (#5979)
  Bump PoV request timeout (#5924)
  [Release/CI] Github flow to build `polkadot`/`polkadot-parachain` rc binaries and deb package (#5963)
  [ci] Remove short-benchmarks from Gitlab (#5988)
  Disable flaky tests reported in 5972/5973/5974 (#5976)
  Bump some dependencies (#5886)
  bump zombienet version and set request for k8s (#5968)
  [omni-bencher] Make all runtimes work (#5872)
  Omni-Node renamings (#5915)
  ...
Successfully merging this pull request may close these issues:

  • PVF: prioritise execution depending on context (#4632)