Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[5 / 5] Introduce approval-voting-parallel #4849

Merged
merged 73 commits into from
Sep 26, 2024

Conversation

alexggh
Copy link
Contributor

@alexggh alexggh commented Jun 20, 2024

This is the implementation of the approach described here: #1617 (comment) & #1617 (comment) & #1617 (comment).

Description of changes

The end goal is to have an architecture where we have single subsystem(approval-voting-parallel) and multiple worker types that would full-fill the work that currently is fulfilled by the approval-distribution and approval-voting subsystems. The main loop of the new subsystem would do just the distribution of work to the workers.

The new subsystem will have:

  • N approval-distribution workers: This would do the work that is currently being done by the approval-distribution subsystem and in addition to that will also perform the crypto-checks that an assignment is valid and that a vote is correctly signed. Work is assigned via the following formula: worker_index = msg.validator % WORKER_COUNT, this guarantees that all assignments and approvals from the same validator reach the same worker.
  • 1 approval-voting worker: This would receive an already valid message and do everything the approval-voting currently does, except the crypto-checking that has been moved already to the approval-distribution worker.

On the hot path of processing messages no synchronisation and waiting is needed between approval-distribution and approval-voting workers.

Screenshot 2024-06-07 at 11 28 08

Guidelines for reading

The full implementation is broken in 5 PRs and all of them are self-contained and improve things incrementally even without the parallelisation being implemented/enabled, the reason this approach was taken instead of a big-bang PR, is to make things easier to review and reduced the risk of breaking this critical subsystems.

After reading the full description of this PR, the changes should be read in the following order:

  1. [1 / 5] Optimize logic for gossiping assignments #4848, some other micro-optimizations for networks with a high number of validators. This change gives us a speed up by itself without any other changes.
  2. [2 / 5] Make approval-distribution logic runnable on a separate thread #4845 , this contains only interface changes to decouple the subsystem from the Context and be able to run multiple instances of the subsystem on different threads. No functional changes
  3. [3 / 5] Move crypto checks in the approval-distribution #4928, moving of the crypto checks from approval-voting in approval-distribution, so that the approval-distribution has no reason to wait after approval-voting anymore. This change gives us a speed up by itself without any other changes.
  4. [4 / 5] Make approval-voting runnable on a worker thread #4846, interface changes to make approval-voting runnable on a separate thread. No functional changes
  5. This PR, where we instantiate an approval-voting-parallel subsystem that runs on different workers the logic currently in approval-distribution and approval-voting.
  6. The next step after this changes get merged and deploy would be to bring all the files from approval-distribution, approval-voting, approval-voting-parallel into a single rust crate, to make it easier to maintain and understand the structure.

Results

Running subsystem-benchmarks with 1000 validators 100 fully ocuppied cores and triggering all assignments and approvals for all tranches

Approval does not lags behind.

Master

Chain selection approved  after 72500 ms hash=0x0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a

With this PoC

Chain selection approved  after 3500 ms hash=0x0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a

Gathering enough assignments

Enough assignments are gathered in less than 500ms, so that gives un a guarantee that un-necessary work does not get triggered, on master on the same benchmark because the subsystems fall behind on work, that number goes above 32 seconds on master.

Screenshot 2024-06-20 at 15 48 22

Cpu usage:

Master

CPU usage, seconds                     total   per block
approval-distribution                96.9436      9.6944
approval-voting                     117.4676     11.7468
test-environment                     44.0092      4.4009

With this PoC

CPU usage, seconds                     total   per block
approval-distribution                 0.0014      0.0001 --- unused
approval-voting                       0.0437      0.0044.  --- unused
approval-voting-parallel              5.9560      0.5956
approval-voting-parallel-0           22.9073      2.2907
approval-voting-parallel-1           23.0417      2.3042
approval-voting-parallel-2           22.0445      2.2045
approval-voting-parallel-3           22.7234      2.2723
approval-voting-parallel-4           21.9788      2.1979
approval-voting-parallel-5           23.0601      2.3060
approval-voting-parallel-6           22.4805      2.2481
approval-voting-parallel-7           21.8330      2.1833
approval-voting-parallel-db          37.1954      3.7195.  --- the approval-voting thread.

Enablement strategy

Because just some trivial plumbing is needed in approval-distribution and approval-voting to be able to run things in parallel and because this subsystems plays a critical part in the system this PR proposes that we keep both ways of running the approval work, as separated subsystems and just a single subsystem(approval-voting-parallel) which has multiple workers for the distribution work and one worker for the approval-voting work and switch between them with a comandline flag.

The benefits for this is twofold.

  1. With the same polkadot binary we can easily switch just a few validators to use the parallel approach and gradually make this the default way of running, if now issues arise.
  2. In the worst case scenario were it becomes the default way of running things, but we discover there are critical issues with it we have the path to quickly disable it by asking validators to adjust their command line flags.

Next steps

  • Make sure through various testing we are not missing anything
  • Polish the implementations to make them production ready
  • Add Unittest Tests for approval-voting-parallel.
  • Define and implement the strategy for rolling this change, so that the blast radius is minimal(single validator) in case there are problems with the implementation.
  • Versi long running tests.
  • Add relevant metrics.

@ordian @eskimor @sandreim @AndreiEres, let me know what you think.

@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-5-5 branch from a591635 to bd5529d Compare June 20, 2024 12:02
@alexggh alexggh changed the title Introduce approval-voting-parallel [5 / 5] Introduce approval-voting-parallel Jun 20, 2024
@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-4-5 branch from cb57906 to 4b3f489 Compare July 2, 2024 11:40
@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-5-5 branch from bd5529d to 5da132d Compare July 2, 2024 11:40
@alexggh alexggh changed the base branch from alexaggh/approval-voting-parallel-4-5 to alexaggh/approval-voting-parallel-2-5 July 2, 2024 11:46
@alexggh alexggh changed the base branch from alexaggh/approval-voting-parallel-2-5 to master July 2, 2024 11:51
@alexggh alexggh changed the base branch from master to alexaggh/approval-voting-parallel-2-5 July 2, 2024 12:04
@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-2-5 branch from 7add070 to 3a0ba90 Compare July 3, 2024 12:16
@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-5-5 branch from 5da132d to 1592d0b Compare July 3, 2024 12:16
@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-2-5 branch from 3a0ba90 to 78bb23d Compare July 5, 2024 11:36
@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-5-5 branch from 1592d0b to 0b5e321 Compare July 5, 2024 11:36
@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-2-5 branch from 78bb23d to daf343f Compare July 5, 2024 13:13
@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-5-5 branch from 0b5e321 to 49a0312 Compare July 5, 2024 13:13
@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-2-5 branch from daf343f to 25f1f0a Compare July 10, 2024 14:42
@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-5-5 branch from 49a0312 to 0746ba3 Compare July 10, 2024 14:42
@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-2-5 branch from 25f1f0a to d13e1c8 Compare July 12, 2024 07:04
@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-5-5 branch from 0746ba3 to cd2d303 Compare July 12, 2024 07:04
@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-5-5 branch from cd2d303 to aa42eff Compare July 12, 2024 07:15
@alexggh alexggh marked this pull request as ready for review July 12, 2024 07:22
github-merge-queue bot pushed a commit that referenced this pull request Jul 16, 2024
This is part of the work to further optimize the approval subsystems, if
you want to understand the full context start with reading
#4849 (comment),
however that's not necessary, as this change is self-contained and nodes
would benefit from it regardless of subsequent changes landing or not.

While testing with 1000 validators I found out that the logic for
determining the validators an assignment should be gossiped to is taking
a lot of time, because it always iterated through all the peers, to
determine which are X and Y neighbours and to which we should randomly
gossip(4 samples).

This could be actually optimised, so we don't have to iterate through
all peers for each new assignment, by fetching the list of X and Y peer
ids from the topology first and then stopping the loop once we took the
4 random samples.

With this improvements we reduce the total CPU time spent in
approval-distribution with 15% on networks with 500 validators and 20%
on networks with 1000 validators.

## Test coverage:

`propagates_assignments_along_unshared_dimension` and
`propagates_locally_generated_assignment_to_both_dimensions` cover
already logic and they passed, confirm that there is no breaking change.

Additionally, the approval voting benchmark measure the traffic sent to
other peers, so I confirmed that for various network size there is no
difference in the size of the traffic sent to other peers.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 25, 2024
@alexggh alexggh added this pull request to the merge queue Sep 25, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Sep 25, 2024
@alexggh alexggh added this pull request to the merge queue Sep 25, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 25, 2024
@alexggh alexggh added this pull request to the merge queue Sep 26, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 26, 2024
@alexggh alexggh added this pull request to the merge queue Sep 26, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 26, 2024
@alexggh alexggh added this pull request to the merge queue Sep 26, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 26, 2024
@alexggh alexggh added this pull request to the merge queue Sep 26, 2024
Merged via the queue into master with commit b16237a Sep 26, 2024
221 checks passed
@alexggh alexggh deleted the alexaggh/approval-voting-parallel-5-5 branch September 26, 2024 13:08
github-merge-queue bot pushed a commit that referenced this pull request Oct 24, 2024
The approval-voting-parallel introduced with
#4849 has been tested on
`versi` and approximately 3 weeks on parity's existing kusama nodes
paritytech/devops#3583, things worked as
expected, so enable it by default on all kusama nodes in the next
release.

The next step will be enabling by default on polkadot if no issue
arrises while running on kusama.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
mordamax pushed a commit to paritytech-stg/polkadot-sdk that referenced this pull request Oct 25, 2024
The approval-voting-parallel introduced with
paritytech#4849 has been tested on
`versi` and approximately 3 weeks on parity's existing kusama nodes
paritytech/devops#3583, things worked as
expected, so enable it by default on all kusama nodes in the next
release.

The next step will be enabling by default on polkadot if no issue
arrises while running on kusama.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
mordamax pushed a commit to paritytech-stg/polkadot-sdk that referenced this pull request Oct 25, 2024
The approval-voting-parallel introduced with
paritytech#4849 has been tested on
`versi` and approximately 3 weeks on parity's existing kusama nodes
paritytech/devops#3583, things worked as
expected, so enable it by default on all kusama nodes in the next
release.

The next step will be enabling by default on polkadot if no issue
arrises while running on kusama.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
mordamax pushed a commit to paritytech-stg/polkadot-sdk that referenced this pull request Oct 25, 2024
The approval-voting-parallel introduced with
paritytech#4849 has been tested on
`versi` and approximately 3 weeks on parity's existing kusama nodes
paritytech/devops#3583, things worked as
expected, so enable it by default on all kusama nodes in the next
release.

The next step will be enabling by default on polkadot if no issue
arrises while running on kusama.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
mordamax pushed a commit to paritytech-stg/polkadot-sdk that referenced this pull request Oct 25, 2024
The approval-voting-parallel introduced with
paritytech#4849 has been tested on
`versi` and approximately 3 weeks on parity's existing kusama nodes
paritytech/devops#3583, things worked as
expected, so enable it by default on all kusama nodes in the next
release.

The next step will be enabling by default on polkadot if no issue
arrises while running on kusama.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
mordamax pushed a commit to paritytech-stg/polkadot-sdk that referenced this pull request Oct 25, 2024
The approval-voting-parallel introduced with
paritytech#4849 has been tested on
`versi` and approximately 3 weeks on parity's existing kusama nodes
paritytech/devops#3583, things worked as
expected, so enable it by default on all kusama nodes in the next
release.

The next step will be enabling by default on polkadot if no issue
arrises while running on kusama.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
mordamax pushed a commit to paritytech-stg/polkadot-sdk that referenced this pull request Oct 29, 2024
The approval-voting-parallel introduced with
paritytech#4849 has been tested on
`versi` and approximately 3 weeks on parity's existing kusama nodes
paritytech/devops#3583, things worked as
expected, so enable it by default on all kusama nodes in the next
release.

The next step will be enabling by default on polkadot if no issue
arrises while running on kusama.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
mordamax pushed a commit to paritytech-stg/polkadot-sdk that referenced this pull request Oct 29, 2024
The approval-voting-parallel introduced with
paritytech#4849 has been tested on
`versi` and approximately 3 weeks on parity's existing kusama nodes
paritytech/devops#3583, things worked as
expected, so enable it by default on all kusama nodes in the next
release.

The next step will be enabling by default on polkadot if no issue
arrises while running on kusama.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network.
Projects
Status: Completed
Development

Successfully merging this pull request may close these issues.

5 participants