
Determine the sweet spot for execute_workers_max_num & prepare_workers_max_num #4126

Open

alexggh opened this issue Apr 15, 2024 · 16 comments

@alexggh
Contributor

alexggh commented Apr 15, 2024

With the upcoming async backing changes, where the parachain authoring time increases to 2 seconds instead of the previous maximum of 500ms, 2 execution workers might prove not to be enough: when multiple parachains produce blocks that take 2 seconds to verify, we will very easily build up a backlog of candidates we need to verify.

In the best-case scenario a validator has to verify at least 7 candidates: the 6 tranche0 assignments and the candidate it helps with backing. If all of them take 2 seconds in the worst case, you end up needing 14s of execution time each block; splitting that between two workers means 7s of execution every 6s, which gets us into a situation where the PVF execution workers become the bottleneck of the system.
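A back-of-the-envelope check of that arithmetic, as a small sketch (the 7-candidate, 2-second and 2-worker figures are the assumptions stated above):

```rust
fn main() {
    let candidates_per_block = 7.0; // 6 tranche0 assignments + 1 backed candidate (assumption above)
    let worst_case_execution_s = 2.0; // async-backing worst-case verification time per candidate
    let execute_workers = 2.0; // current execute_workers_max_num
    let relay_block_time_s = 6.0;

    let needed_cpu_s = candidates_per_block * worst_case_execution_s; // 14s of execution per block
    let per_worker_s = needed_cpu_s / execute_workers; // 7s of work per worker every 6s

    println!("need {needed_cpu_s}s per block, {per_worker_s}s per worker every {relay_block_time_s}s");
    assert!(per_worker_s > relay_block_time_s); // the two workers cannot keep up
}
```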

Old PR where we changed this in the past: paritytech/polkadot#4273

Remaining work

@eskimor
Member

eskimor commented Apr 15, 2024

Thanks @alexggh and @s0me0ne-unkn0wn for raising the issue.

Let's double the workers and go for Kusama first.

Other things to consider:

  1. The new networking stack has been merged, and if it is successfully enabled on Parity Kusama validators, we should make it the default soon.
  2. Together with our optimizations in approval voting (and more to come), we should be able to scale up the number of validators. It would be good to reach 1000 this year; this also reduces the amount of validation work needed per validator by a factor (see the rough sketch below).
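To illustrate the scaling point: the expected number of approval checks per validator per relay-chain block is roughly cores × needed_approvals / validators. The sketch below uses this simplified model (it ignores no-shows and backing work, and the parameter values are only illustrative):

```rust
// Simplified model: every occupied core needs `needed_approvals` checks per block,
// and assignments are spread uniformly across all validators.
fn checks_per_validator(cores: f64, needed_approvals: f64, validators: f64) -> f64 {
    cores * needed_approvals / validators
}

fn main() {
    // Illustrative parameters only.
    println!("{:.1}", checks_per_validator(100.0, 30.0, 500.0)); // ~6.0 checks per validator per block
    println!("{:.1}", checks_per_validator(100.0, 30.0, 1000.0)); // ~3.0 with twice the validators
}
```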

@alexggh
Contributor Author

alexggh commented Apr 15, 2024

Let's double the workers and go for Kusama first.

Unfortunately, the number of workers is hardcoded in the node executable,

execute_workers_max_num: 2,
so we can't really deploy it just for Kusama, since it would go in a node release.

However, I do think increasing the number of execution workers from 2 to 4 should be uber low risk.

Here we can see at https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#reference-hardware that our recommended HW spec is 4 hardware cores and 32 GiB of RAM, so we definitely should have space for 4 executions in parallel.

@sandreim
Contributor

sandreim commented Apr 17, 2024

Let's double the workers and go for Kusama first.

Unfortunately, the number of workers is hardcoded in the node executable,

execute_workers_max_num: 2,

so we can't really deploy it just for Kusama, since it would go in a node release.

We can do this for Kusama only, see run_inner_node. It is just a bit of plumbing work to do.
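A rough sketch of the kind of plumbing this could involve (field and helper names below are illustrative, not the actual run_inner_node / service API):

```rust
// Hypothetical sketch: choose the PVF worker counts based on the chain identity.
struct PvfWorkerConfig {
    execute_workers_max_num: usize,
    prepare_workers_hard_max_num: usize,
}

fn pvf_worker_config_for(chain_id: &str) -> PvfWorkerConfig {
    match chain_id {
        // Roll the larger execution pool out on test networks / Kusama first.
        "rococo" | "westend" | "kusama" => PvfWorkerConfig {
            execute_workers_max_num: 4,
            prepare_workers_hard_max_num: 1,
        },
        // Keep the conservative default on Polkadot until more data is gathered.
        _ => PvfWorkerConfig {
            execute_workers_max_num: 2,
            prepare_workers_hard_max_num: 1,
        },
    }
}

fn main() {
    assert_eq!(pvf_worker_config_for("kusama").execute_workers_max_num, 4);
    assert_eq!(pvf_worker_config_for("polkadot").execute_workers_max_num, 2);
}
```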

However, I do think increasing the number of execution workers from 2 to 4 should be uber low risk.

Here we can see at https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#reference-hardware that our recommended HW spec is 4 hardware cores and 32 GiB of RAM, so we definitely should have space for 4 executions in parallel.

That is a concern for me: if we want to allow 4 executions in parallel, it means we need more resources for all the other stuff the node is doing: building/importing relay chain blocks, networking, parachain consensus, etc.

@alexggh
Contributor Author

alexggh commented Apr 17, 2024

That is a concern for me: if we want to allow 4 executions in parallel, it means we need more resources for all the other stuff the node is doing: building/importing relay chain blocks, networking, parachain consensus, etc.

Not sure how that correlates; increasing this to 4 PVF executions would speed up the time it takes us to approve and back candidates, which are things we want to do as fast as we can.

I agree that those 2 extra threads would increase the total resource consumption in the system. On the CPU side, 2 extra threads shouldn't be the tipping point, since we already have plenty of threads spawned. On the memory side, PVF execution seems to be limited by const EXTRA_HEAP_PAGES: u32 = 2048 and pub const DEFAULT_NATIVE_STACK_MAX: u32 = 256 * 1024 * 1024; so that is around 256 MiB per worker, a theoretical maximum of an extra 512 MiB needed. With 32 GiB recommended, that means just 1.6% of the total recommended memory.

Is that what you are referring to?
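A quick check of that memory estimate as a sketch (using the ~256 MiB per-worker figure and the 32 GiB reference spec quoted above):

```rust
fn main() {
    let extra_workers: u64 = 2; // going from 2 to 4 execution workers
    let per_worker_bytes: u64 = 256 * 1024 * 1024; // DEFAULT_NATIVE_STACK_MAX, ~256 MiB
    let recommended_ram_bytes: u64 = 32 * 1024 * 1024 * 1024; // 32 GiB reference hardware

    let extra_bytes = extra_workers * per_worker_bytes; // 512 MiB theoretical maximum
    let fraction = extra_bytes as f64 / recommended_ram_bytes as f64;
    println!("extra: {} MiB (~{:.1}% of 32 GiB)", extra_bytes / (1024 * 1024), fraction * 100.0);
}
```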

@alexggh
Contributor Author

alexggh commented Apr 17, 2024

However, with the blunder we did yesterday on Polkadot, I agree we should actually tread carefully here, so I will invest some time in the extra plumbing to enable it just on Kusama first.

@sandreim
Contributor

That is a concern for me: if we want to allow 4 executions in parallel, it means we need more resources for all the other stuff the node is doing: building/importing relay chain blocks, networking, parachain consensus, etc.

Not sure how that correlates; increasing this to 4 PVF executions would speed up the time it takes us to approve and back candidates, which are things we want to do as fast as we can.

I agree that 4 PVFs executed in parallel would speed things up, but it would eat into the CPU resources of the approval subsystems, for example, so we need to think in terms of total resource consumption of the node and manage the load such that we don't get additional PVFs to execute when the system is loaded.

I agree that those 2 extra threads would increase the total resource consumption in the system. On the CPU side, 2 extra threads shouldn't be the tipping point, since we already have plenty of threads spawned. On the memory side, PVF execution seems to be limited by const EXTRA_HEAP_PAGES: u32 = 2048 and pub const DEFAULT_NATIVE_STACK_MAX: u32 = 256 * 1024 * 1024; so that is around 256 MiB per worker, a theoretical maximum of an extra 512 MiB needed. With 32 GiB recommended, that means just 1.6% of the total recommended memory.

Is that what you are referring to?

In terms of memory we should be fine, yeah.

I am more concerned about the situation when we have longer PVF execution times. Ideally, at least 75% of the node's CPU should be spent on executing PVFs, but as we've seen this is not the case.

What I propose to do instead of determining the sweet spot is a dynamic way of allocating CPU resources to PVF compilation and execution.

We reserve a pool of 4 workers (4 CPUs) dedicated to PVF work. We then implement priorities for dispatching work to this pool. The goal should be to prioritise finality above liveness of parachains.

  1. dispute PVF execution
  2. approval PVF execution
  3. backing PVF execution
  4. PVF compilation

We can cap ongoing PVF work to 1-2 jobs at a time, but since PVF compilation takes a lot of time compared to execution, we can choose to kill an ongoing PVF compilation if the CPU resources are required for disputes and there is no free worker. This should be a rare event.
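A minimal sketch of how such a priority ordering could be expressed (hypothetical types; this is not the actual candidate-validation / PVF host API):

```rust
/// Hypothetical priority for jobs dispatched to a shared PVF worker pool.
/// Lower discriminant = dispatched first, matching the list above.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum PvfJobPriority {
    DisputeExecution = 0,
    ApprovalExecution = 1,
    BackingExecution = 2,
    Compilation = 3,
}

/// Pick the next job from the pending queue, highest priority first.
fn next_job(pending: &mut Vec<(PvfJobPriority, u64)>) -> Option<(PvfJobPriority, u64)> {
    pending.sort_by_key(|&(prio, _)| prio);
    if pending.is_empty() { None } else { Some(pending.remove(0)) }
}

fn main() {
    let mut pending = vec![
        (PvfJobPriority::BackingExecution, 1),
        (PvfJobPriority::Compilation, 2),
        (PvfJobPriority::ApprovalExecution, 3),
    ];
    // Approval execution is dispatched before backing and compilation.
    assert_eq!(next_job(&mut pending).unwrap().0, PvfJobPriority::ApprovalExecution);
}
```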

This however doesn't solve what we observed on Kusama/Polkadot, but it should reduce the amount of new work created when finality is lagging.

@sandreim
Contributor

However, with the blunder we did yesterday on Polkadot, I agree we should actually tread carefully here, so I will invest some time in the extra plumbing to enable it just on Kusama first.

IMO it was not really a blunder. The system worked as expected in the end, but we had different expectations about the duration and magnitude of the event.

@s0me0ne-unkn0wn
Contributor

  • dispute PVF execution
  • approval PVF execution
  • backing PVF execution
  • PVF compilation

Compilation/preparation and execution pipelines are separate and use different workers. The preparation pipeline is prioritized, and the execution queue uses a best-effort approach (nearly FIFO in most situations, but that changes if executor parameters change on a session boundary or if a candidate from a previous session with different executor parameters has to be validated).

It makes sense to work on execution queue prioritization, considering the prolonged execution times. But I'm not sure we'd benefit from killing preparation workers (more precisely, the preparation worker, as we only have one) to give resources to the execution workers. The preparation worker only occupies a single CPU core. To me, it makes more sense to bump hardware requirements than to develop non-trivial algorithms trying to manage node resources in software.

@sandreim
Contributor

sandreim commented Apr 17, 2024

Compilation/preparation and execution pipelines are separate and use different workers. The preparation pipeline is prioritized, and the execution queue uses a best-effort approach (nearly FIFO in most situations, but that changes if executor parameters change on a session boundary or if a candidate from a previous session with different executor parameters has to be validated).

Even if there are separate pipelines with different workers, they still use the same physical CPUs, so from that perspective I think it is a good idea to not start preparation/compilation if we have a large backlog of PVF executions due to high node/system load.

But I'm not sure we'd benefit from killing preparation workers (more precisely, the preparation worker, as we only have one) to give resources to the execution workers.

In the context of what happened yesterday it doesn't really make sense to kill it. Maybe we should still allow or even prioritise it if we need to do it as part of participating in a dispute or approving a candidate deep in the unfinalized chain.

The preparation worker only occupies a single CPU core. To me, it makes more sense to bump hardware requirements than to develop non-trivial algorithms trying to manage node resources in software.

Yes, bumping hardware requirements is needed, but we need some numbers that justify the increase. With more validators we should be doing less work in terms of PVF executions, but likely more work in approval signature checking, for example.

@alexggh
Contributor Author

alexggh commented Apr 17, 2024

This however doesn't solve what we observed on Kusama/Polkadot, but it should reduce the amount of new work created when finality is lagging.

This is not triggered by that.

We can cap ongoing PVF work to 1-2 at a time, but since PVF compilation takes a lot of time compared to execution we can

That's what we actually have right now: we have 1 worker for compilation and 2 workers for execution, in disjoint worker pools.

dispute PVF execution
approval PVF execution
backing PVF execution

Putting backing at the end would actually affect liveness of the parachains, because when you have several PVF executions taking around 2 seconds, you can easily build a backlog where candidates don't get backed in time.

Now, the way I think we should approach this problem:

  1. Can we build a more clever scheduling algorithm for execution and preparation that takes advantage of the node HW? Obviously yes, but I do think that will easily become complex, since scheduling is never an easy problem.

  2. Are the 2 PVF execution workers the best number we can safely have in our system right now (its only merit is that it works)? I tend to think not; it is just a number pulled out of a hat, and I think we can and should increase it a bit given our recommended hardware specifications.

So, what do you think about me investing the time in safely rolling out the increase of PVF execution workers to 4 on Kusama and then on Polkadot, while in parallel we also keep a backlog ticket to build a dynamic scheduler for this, which I think would take longer to implement and properly validate?

Or do you think we should invest from the beginning in building a dynamic scheduler?

@sandreim
Contributor

sandreim commented Apr 17, 2024

Putting backing at the end would actually affect liveness of the parachains, because when you have several PVF executions taking around 2 seconds, you can easily build a backlog where candidates don't get backed in time.

Yes, we'd want to backpressure backing if there is a lot of load in approval voting. If we don't, it will soon create even more work for approval voting, which eventually leads to slow finality and slower block production if we have authoring backoff; if we don't have backoff, it could lead to OOM. I would frame it as a producer/consumer problem: we shouldn't produce more work if consumption doesn't keep up.
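A sketch of the producer/consumer idea (names and the backlog threshold below are hypothetical; the point is only that no new backing work is produced while the approval/dispute backlog is above a limit):

```rust
/// Hypothetical snapshot of queued PVF validation work on the node.
struct LoadSnapshot {
    queued_approval_executions: usize,
    queued_dispute_executions: usize,
}

/// Backpressure backing: refuse to produce new backing work while the
/// consumer side (approvals and disputes) has a large backlog.
fn should_accept_backing_work(load: &LoadSnapshot, max_backlog: usize) -> bool {
    load.queued_approval_executions + load.queued_dispute_executions <= max_backlog
}

fn main() {
    let load = LoadSnapshot { queued_approval_executions: 12, queued_dispute_executions: 1 };
    // With an illustrative backlog limit of 8, backing is throttled here.
    assert!(!should_accept_backing_work(&load, 8));
}
```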

Now, the way I think we should approach this problem:

  1. Can we build a more clever scheduling algorithm for execution and preparation that takes advantage of the node HW? Obviously yes, but I do think that will easily become complex, since scheduling is never an easy problem.
  2. Are the 2 PVF execution workers the best number we can safely have in our system right now (its only merit is that it works)? I tend to think not; it is just a number pulled out of a hat, and I think we can and should increase it a bit given our recommended hardware specifications.

So, what do you think about me investing the time in safely rolling out the increase of PVF execution workers to 4 on Kusama and then on Polkadot, while in parallel we also keep a backlog ticket to build a dynamic scheduler for this, which I think would take longer to implement and properly validate?

👍🏼 and at the same time we have to consider raising HW specs based on the data. We could run some gluttons and see what the impact of 100% blockspace utilization with 2 workers vs 4 is.

Or do you think we should invest from the beginning in building a dynamic scheduler.

This can happen later if we bump specs anyway and should be driven by higher block space utilization trends.

github-merge-queue bot pushed a commit that referenced this issue Apr 19, 2024
…4172)

Related to #4126
discussion

Currently all preparations have the same priority, and this is not ideal in all cases. This change should improve the finality time in the context of on-demand parachains and when `ExecutorParams` are updated on-chain and a rebuild of all artifacts is required. The desired effect is to speed up approval and dispute PVF executions which require preparation and to delay backing executions which require preparation.

---------

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
alexggh added a commit that referenced this issue Apr 23, 2024
Part of #4126: we want to safely increase execute_workers_max_num gradually from chain to chain and assess whether there are any negative impacts.

This PR performs the necessary plumbing to be able to increase it based on the chain ID; it increases the number of execution workers from 2 to 4 on test networks but leaves Kusama and Polkadot unchanged until we gather more data.

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh
Contributor Author

alexggh commented Apr 23, 2024

Plumbing PR #4252 makes it possible to increase the number of execution workers based on the chain ID. It will take a few releases until an increase reaches Polkadot, but I don't think we have any reason to rush this, so it should be fine to move slowly here.

github-merge-queue bot pushed a commit that referenced this issue Apr 23, 2024
Add a metric to be able to understand how long jobs wait in the execution queue for an available worker.
#4126

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
github-merge-queue bot pushed a commit that referenced this issue Apr 24, 2024
Part of #4126: we want to safely increase execute_workers_max_num gradually from chain to chain and assess whether there are any negative impacts.

This PR performs the necessary plumbing to be able to increase it based on the chain ID; it increases the number of execution workers from 2 to 4 on test networks but leaves Kusama and Polkadot unchanged until we gather more data.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh
Contributor Author

alexggh commented Apr 30, 2024

Did some simulations to estimate the CPU HW needs for a network with 500 validators and 100 cores in normal conditions; the configuration is here and a run of it here.

Reference hardware: https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#reference-hardware

Estimated CPU usage with benchmarks

CPU usage, seconds                                     per block
# Approval usage with 5 no-shows per candidate
approval-distribution                                     0.7141
approval-voting                                           1.0567
# Availability distribution usage
availability-distribution                                 0.0250
availability-store                                        0.1670
bitfield-distribution                                     0.0246
# Availability recovery usage
availability-recovery                                     2.7548

Adding all that up, it would consume ~4.7s of a single CPU core. These are the subsystems that we know consume most of our CPU time, but we are still missing a lot of other subsystems; a safety margin would be to double that, so let's assume everything besides PVF execution consumes 9s of CPU time per block.

non_pvf_subsystems_cpu_time_per_block = 9s
reference_hw_cpu_count = 4 cores
reference_hw_total_available_cpu_time_per_block = 4 * 6s = 24s
current_pvf_execution_time_allocated_per_block = 2 * 6s = 12s # 50%
# reference_hw_total_available_cpu_time_per_block - non_pvf_subsystems_cpu_time_per_block - current_pvf_execution_time_allocated_per_block
spare_cpu_time_on_reference_hardware_per_block = 3s 

What average parachain execution time could we support with 2 execution threads?

Validators would need to verify at least 7 candidates per block (6 random VRF assignments and 1 backing candidate). With 2 execution threads we have a maximum of 12s of CPU time per block, so the theoretical maximum parachain execution time cannot go beyond ~1.7 seconds per candidate.

The current average on Kusama is around 200ms, so it would take roughly a 10x increase in the execution time of all parachains to reach that maximum throughput.
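The same budget written out as a small check (a sketch using exactly the figures above):

```rust
fn main() {
    let relay_block_time_s = 6.0;
    let reference_hw_cores = 4.0;
    let non_pvf_cpu_s = 9.0; // assumed non-PVF subsystem usage per block, incl. safety margin
    let execute_workers = 2.0;
    let candidates_per_block = 7.0; // 6 tranche0 assignments + 1 backing candidate

    let total_budget_s = reference_hw_cores * relay_block_time_s; // 24s
    let pvf_budget_s = execute_workers * relay_block_time_s; // 12s (50% of the total)
    let spare_s = total_budget_s - non_pvf_cpu_s - pvf_budget_s; // 3s
    let max_avg_execution_s = pvf_budget_s / candidates_per_block; // ~1.7s per candidate

    println!("spare: {spare_s}s, max average PVF execution: {max_avg_execution_s:.1}s");
}
```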

My conclusions

With these numbers, we can say there isn't much spare CPU time, so I would concur that 2 PVF execution threads is actually the safe choice for reference HW with 4 CPU cores, because it hard-caps PVF execution at a maximum of 50% of the available CPU time.

Going to 4 PVF execution threads would increase the available PVF execution time, but it has the downside that, at the theoretical limit, it could steal valuable CPU time from other mission-critical subsystems. With just 2 execution workers, if a lot of work gets queued for PVF execution we will be slow on backing and approvals; but since no-show approvals are accepted late and backing has to happen within a fixed window of time, we actually end up in a situation where we don't back new candidates and give the network time to catch up with the approval work.

2 PVF execution threads do not properly take advantage of validators having way more than 4 HW cores, but building dynamic scheduling based on the HW the node is running on would introduce a source of nondeterminism in the network, since there is no guarantee other nodes have the same HW underneath them.

@sandreim
Contributor

sandreim commented May 2, 2024

Thanks @alexggh for coming up with this neat analysis.

In order to get the full picture, we also need to consider statement-distribution load, which I've seen to be around 15-30% of a CPU on Kusama at the current scale, and also keep in mind that the libp2p-based networking stack is usually 50% of the total node-side CPU usage (modulo PVF executions) in the tests we've run so far at maximum scale. Issue being tracked here: #702.

@sandreim
Contributor

sandreim commented May 2, 2024

However, the new litep2p stack should alleviate the issue mentioned above.

@alexggh alexggh moved this from Backlog to In Progress in parachains team board May 9, 2024
@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/rfc-increasing-recommended-minimum-core-count-for-reference-hardware/8156/1
