
Determine the sweet spot for execute_workers_max_num & prepare_workers_max_num #4126

Open

alexggh opened this issue Apr 15, 2024 · 16 comments

@alexggh
Contributor

alexggh commented Apr 15, 2024

With the upcoming async backing changes, where the parachain authoring time increases to 2 seconds instead of the previous maximum of 500ms, 2 execution workers might prove not to be enough: when multiple parachains produce blocks that take 2 seconds to verify, we will very easily build up a backlog of candidates we need to verify.

In the best-case scenario a validator has to verify at least 7 candidates: the 6 tranche0 assignments and the candidate it helps with backing. If all of them take 2 seconds in the worst case, you end up needing 14s of execution time each block; splitting that between two workers means 7s of execution every 6s, which gets us into a situation where the PVF execution workers become the bottleneck of the system.
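A back-of-the-envelope check of that arithmetic, as a small sketch (the 7-candidate, 2-second and 2-worker figures are the assumptions stated above):

```rust
fn main() {
    let candidates_per_block = 7.0; // 6 tranche0 assignments + 1 backed candidate (assumption above)
    let worst_case_execution_s = 2.0; // async-backing worst-case verification time per candidate
    let execute_workers = 2.0; // current execute_workers_max_num
    let relay_block_time_s = 6.0;

    let needed_cpu_s = candidates_per_block * worst_case_execution_s; // 14s of execution per block
    let per_worker_s = needed_cpu_s / execute_workers; // 7s of work per worker every 6s

    println!("need {needed_cpu_s}s per block, {per_worker_s}s per worker every {relay_block_time_s}s");
    assert!(per_worker_s > relay_block_time_s); // the two workers cannot keep up
}
```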

Old PR where we changed this in the past: paritytech/polkadot#4273

Remaining work

@eskimor
Member

eskimor commented Apr 15, 2024

Thanks @alexggh and @s0me0ne-unkn0wn for raising the issue.

Let's double the workers and go for Kusama first.

Other things to consider:

  1. The new networking stack has been merged, and if it is successfully enabled on Parity Kusama validators, we should make it the default soon.
  2. Together with our optimizations in approval voting (and more to come), we should be able to scale up the number of validators. It would be good to reach 1000 this year; this also reduces the amount of validation work needed per validator by a factor (see the rough sketch below).
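To illustrate the scaling point: the expected number of approval checks per validator per relay-chain block is roughly cores × needed_approvals / validators. The sketch below uses this simplified model (it ignores no-shows and backing work, and the parameter values are only illustrative):

```rust
// Simplified model: every occupied core needs `needed_approvals` checks per block,
// and assignments are spread uniformly across all validators.
fn checks_per_validator(cores: f64, needed_approvals: f64, validators: f64) -> f64 {
    cores * needed_approvals / validators
}

fn main() {
    // Illustrative parameters only.
    println!("{:.1}", checks_per_validator(100.0, 30.0, 500.0)); // ~6.0 checks per validator per block
    println!("{:.1}", checks_per_validator(100.0, 30.0, 1000.0)); // ~3.0 with twice the validators
}
```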

@alexggh
Contributor Author

alexggh commented Apr 15, 2024

Let's double the workers and go for Kusama first.

Unfortunately, the number of workers is hardcoded in the node executable,

execute_workers_max_num: 2,
so we can't really deploy it just for Kusama, since it would go in a node release.

However, I do think increasing the number of execution workers from 2 to 4 should be uber low risk.

Here we can see at https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#reference-hardware that our recommended HW spec is 4 hardware cores and 32 GiB of RAM, so we definitely should have space for 4 executions in parallel.

@sandreim
Contributor

sandreim commented Apr 17, 2024

Let's double the workers and go for Kusama first.

Unfortunately, the number of workers is hardcoded in the node executable,

execute_workers_max_num: 2,

so we can't really deploy it just for Kusama, since it would go in a node release.

We can do this for Kusama only, see run_inner_node. It is just a bit of plumbing work to do.
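A rough sketch of the kind of plumbing this could involve (field and helper names below are illustrative, not the actual run_inner_node / service API):

```rust
// Hypothetical sketch: choose the PVF worker counts based on the chain identity.
struct PvfWorkerConfig {
    execute_workers_max_num: usize,
    prepare_workers_hard_max_num: usize,
}

fn pvf_worker_config_for(chain_id: &str) -> PvfWorkerConfig {
    match chain_id {
        // Roll the larger execution pool out on test networks / Kusama first.
        "rococo" | "westend" | "kusama" => PvfWorkerConfig {
            execute_workers_max_num: 4,
            prepare_workers_hard_max_num: 1,
        },
        // Keep the conservative default on Polkadot until more data is gathered.
        _ => PvfWorkerConfig {
            execute_workers_max_num: 2,
            prepare_workers_hard_max_num: 1,
        },
    }
}

fn main() {
    assert_eq!(pvf_worker_config_for("kusama").execute_workers_max_num, 4);
    assert_eq!(pvf_worker_config_for("polkadot").execute_workers_max_num, 2);
}
```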

However, I do think increasing the number of execution workers from 2 to 4 should be uber low risk.

Here we can see at https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#reference-hardware that our recommended HW spec is 4 hardware cores and 32 GiB of RAM, so we definitely should have space for 4 executions in parallel.

That is a concern for me: if we want to allow 4 executions in parallel, it means we need more resources for all the other stuff the node is doing: building/importing relay chain blocks, networking, parachain consensus, etc.

@alexggh
Contributor Author

alexggh commented Apr 17, 2024

That is a concern for me: if we want to allow 4 executions in parallel, it means we need more resources for all the other stuff the node is doing: building/importing relay chain blocks, networking, parachain consensus, etc.

Not sure how that correlates; increasing this to 4 PVF executions would speed up the time it takes us to approve and back candidates, which are things we want to do as fast as we can.

I agree that those 2 extra threads would increase the total resource consumption in the system. On the CPU side, 2 extra threads shouldn't be the tipping point, since we already have plenty of threads spawned. On the memory side, PVF execution seems to be limited by const EXTRA_HEAP_PAGES: u32 = 2048 and pub const DEFAULT_NATIVE_STACK_MAX: u32 = 256 * 1024 * 1024; so that is around 256 MiB per worker, a theoretical maximum of an extra 512 MiB needed. With 32 GiB recommended, that means just 1.6% of the total recommended memory.

Is that what you are referring to?
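A quick check of that memory estimate as a sketch (using the ~256 MiB per-worker figure and the 32 GiB reference spec quoted above):

```rust
fn main() {
    let extra_workers: u64 = 2; // going from 2 to 4 execution workers
    let per_worker_bytes: u64 = 256 * 1024 * 1024; // DEFAULT_NATIVE_STACK_MAX, ~256 MiB
    let recommended_ram_bytes: u64 = 32 * 1024 * 1024 * 1024; // 32 GiB reference hardware

    let extra_bytes = extra_workers * per_worker_bytes; // 512 MiB theoretical maximum
    let fraction = extra_bytes as f64 / recommended_ram_bytes as f64;
    println!("extra: {} MiB (~{:.1}% of 32 GiB)", extra_bytes / (1024 * 1024), fraction * 100.0);
}
```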

@alexggh
Contributor Author

alexggh commented Apr 17, 2024

However, with the blunder we did yesterday on Polkadot, I agree we should actually tread carefully here, so I will invest some time in the extra plumbing to enable it just on Kusama first.

@sandreim
Contributor

That is a concern for me: if we want to allow 4 executions in parallel, it means we need more resources for all the other stuff the node is doing: building/importing relay chain blocks, networking, parachain consensus, etc.

Not sure how that correlates; increasing this to 4 PVF executions would speed up the time it takes us to approve and back candidates, which are things we want to do as fast as we can.

I agree that 4 PVFs executed in parallel would speed things up, but it would eat into the CPU resources of the approval subsystems, for example, so we need to think in terms of total resource consumption of the node and manage the load such that we don't get additional PVFs to execute when the system is loaded.

I agree that those 2 extra threads would increase the total resource consumption in the system. On the CPU side, 2 extra threads shouldn't be the tipping point, since we already have plenty of threads spawned. On the memory side, PVF execution seems to be limited by const EXTRA_HEAP_PAGES: u32 = 2048 and pub const DEFAULT_NATIVE_STACK_MAX: u32 = 256 * 1024 * 1024; so that is around 256 MiB per worker, a theoretical maximum of an extra 512 MiB needed. With 32 GiB recommended, that means just 1.6% of the total recommended memory.

Is that what you are referring to?

In terms of memory we should be fine, yeah.

I am more concerned about the situation when we have longer PVF execution times. Ideally, at least 75% of the node's CPU should be spent on executing PVFs, but as we've seen this is not the case.

What I propose to do instead of determining the sweet spot is a dynamic way of allocating CPU resources to PVF compilation and execution.

We reserve a pool of 4 workers (4 CPUs) dedicated to PVF work. We then implement priorities for dispatching work to this pool. The goal should be to prioritise finality above liveness of parachains.

  1. dispute PVF execution
  2. approval PVF execution
  3. backing PVF execution
  4. PVF compilation

We can cap ongoing PVF work to 1-2 jobs at a time, but since PVF compilation takes a lot of time compared to execution, we can choose to kill an ongoing PVF compilation if the CPU resources are required for disputes and there is no free worker. This should be a rare event.
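A minimal sketch of how such a priority ordering could be expressed (hypothetical types; this is not the actual candidate-validation / PVF host API):

```rust
/// Hypothetical priority for jobs dispatched to a shared PVF worker pool.
/// Lower discriminant = dispatched first, matching the list above.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum PvfJobPriority {
    DisputeExecution = 0,
    ApprovalExecution = 1,
    BackingExecution = 2,
    Compilation = 3,
}

/// Pick the next job from the pending queue, highest priority first.
fn next_job(pending: &mut Vec<(PvfJobPriority, u64)>) -> Option<(PvfJobPriority, u64)> {
    pending.sort_by_key(|&(prio, _)| prio);
    if pending.is_empty() { None } else { Some(pending.remove(0)) }
}

fn main() {
    let mut pending = vec![
        (PvfJobPriority::BackingExecution, 1),
        (PvfJobPriority::Compilation, 2),
        (PvfJobPriority::ApprovalExecution, 3),
    ];
    // Approval execution is dispatched before backing and compilation.
    assert_eq!(next_job(&mut pending).unwrap().0, PvfJobPriority::ApprovalExecution);
}
```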

This however doesn't solve what we observed on Kusama/Polkadot, but it should reduce the amount of new work created when finality is lagging.

@sandreim
Contributor

However, with the blunder we did yesterday on Polkadot, I agree we should actually tread carefully here, so I will invest some time in the extra plumbing to enable it just on Kusama first.

IMO it was not really a blunder. The system worked as expected in the end, but we had different expectations about the duration and magnitude of the event.

@s0me0ne-unkn0wn
Contributor

  • dispute PVF execution
  • approval PVF execution
  • backing PVF execution
  • PVF compilation

Compilation/preparation and execution pipelines are separate and use different workers. The preparation pipeline is prioritized, and the execution queue uses a best-effort approach (nearly FIFO in most situations, but that changes if executor parameters change on a session boundary or if a candidate from a previous session with different executor parameters has to be validated).

It makes sense to work on execution queue prioritization, considering the prolonged execution times. But I'm not sure we'd benefit from killing preparation workers (more precisely, the preparation worker, as we only have one) to give resources to the execution workers. The preparation worker only occupies a single CPU core. To me, it makes more sense to bump hardware requirements than to develop non-trivial algorithms trying to manage node resources in software.

@sandreim
Contributor

sandreim commented Apr 17, 2024

Compilation/preparation and execution pipelines are separate and use different workers. The preparation pipeline is prioritized, and the execution queue uses a best-effort approach (nearly FIFO in most situations, but that changes if executor parameters change on a session boundary or if a candidate from a previous session with different executor parameters has to be validated).

Even if there are separate pipelines with different workers, they still use the same physical CPUs, so from that perspective I think it is a good idea to not start preparation/compilation if we have a large backlog of PVF executions due to high node/system load.

But I'm not sure we'd benefit from killing preparation workers (more precisely, the preparation worker, as we only have one) to give resources to the execution workers.

In the context of what happened yesterday it doesn't really make sense to kill it. Maybe we should still allow or even prioritise it if we need to do it as part of participating in a dispute or approving a candidate deep in the unfinalized chain.

The preparation worker only occupies a single CPU core. To me, it makes more sense to bump hardware requirements than to develop non-trivial algorithms trying to manage node resources in software.

Yes, bumping hardware requirements is needed, but we need some numbers that justify the increase. With more validators we should be doing less work in terms of PVF executions, but likely more work in approval signature checking, for example.

@alexggh
Contributor Author

alexggh commented Apr 17, 2024

This however doesn't solve what we observed on Kusama/Polkadot, but it should reduce the amount of new work created when finality is lagging.

This is not triggered by that.

We can cap ongoing PVF work to 1-2 at a time, but since PVF compilation takes a lot of time compared to execution we can

That's what we actually have right now: we have 1 worker for compilation and 2 workers for execution, in disjoint worker pools.

dispute PVF execution
approval PVF execution
backing PVF execution

Putting backing at the end would actually affect liveness of the parachains, because when you have several PVF executions taking around 2 seconds, you can easily build a backlog where candidates don't get backed in time.

Now, the way I think we should approach this problem:

  1. Can we build a more clever scheduling algorithm for execution and preparation that takes advantage of the node HW? Obviously yes, but I do think that will easily become complex, since scheduling is never an easy problem.

  2. Are the 2 PVF execution workers the best number we can safely have in our system right now (its only merit is that it works)? I tend to think not; it is just a number pulled out of a hat, and I think we can and should increase it a bit given our recommended hardware specifications.

So, what do you think about me investing the time in safely rolling out the increase of PVF execution workers to 4 on Kusama and then on Polkadot, while in parallel we also keep a backlog ticket to build a dynamic scheduler for this, which I think would take longer to implement and properly validate?

Or do you think we should invest from the beginning in building a dynamic scheduler?

@sandreim
Contributor

sandreim commented Apr 17, 2024

Putting backing at the end would actually affect liveness of the parachains, because when you have several PVF executions taking around 2 seconds, you can easily build a backlog where candidates don't get backed in time.

Yes, we'd want to backpressure backing if there is a lot of load in approval voting. If we don't, it will soon create even more work for approval voting, which eventually leads to slow finality and slower block production if we have authoring backoff; if we don't have backoff, it could lead to OOM. I would frame it as a producer/consumer problem: we shouldn't produce more work if consumption doesn't keep up.
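A sketch of the producer/consumer idea (names and the backlog threshold below are hypothetical; the point is only that no new backing work is produced while the approval/dispute backlog is above a limit):

```rust
/// Hypothetical snapshot of queued PVF validation work on the node.
struct LoadSnapshot {
    queued_approval_executions: usize,
    queued_dispute_executions: usize,
}

/// Backpressure backing: refuse to produce new backing work while the
/// consumer side (approvals and disputes) has a large backlog.
fn should_accept_backing_work(load: &LoadSnapshot, max_backlog: usize) -> bool {
    load.queued_approval_executions + load.queued_dispute_executions <= max_backlog
}

fn main() {
    let load = LoadSnapshot { queued_approval_executions: 12, queued_dispute_executions: 1 };
    // With an illustrative backlog limit of 8, backing is throttled here.
    assert!(!should_accept_backing_work(&load, 8));
}
```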

Now, the way I think we should approach this problem:

  1. Can we build a more clever scheduling algorithm for execution and preparation that takes advantage of the node HW? Obviously yes, but I do think that will easily become complex, since scheduling is never an easy problem.
  2. Are the 2 PVF execution workers the best number we can safely have in our system right now (its only merit is that it works)? I tend to think not; it is just a number pulled out of a hat, and I think we can and should increase it a bit given our recommended hardware specifications.

So, what do you think about me investing the time in safely rolling out the increase of PVF execution workers to 4 on Kusama and then on Polkadot, while in parallel we also keep a backlog ticket to build a dynamic scheduler for this, which I think would take longer to implement and properly validate?

👍🏼 and at the same time we have to consider raising HW specs based on the data. We could run some gluttons and see what the impact of 100% blockspace utilization with 2 workers vs 4 is.

Or do you think we should invest from the beginning in building a dynamic scheduler.

This can happen later if we bump specs anyway and should be driven by higher block space utilization trends.

github-merge-queue bot pushed a commit that referenced this issue Apr 19, 2024
…4172)

Related to #4126
discussion

Currently all preparations have the same priority, and this is not ideal in all cases. This change should improve the finality time in the context of on-demand parachains and when `ExecutorParams` are updated on-chain and a rebuild of all artifacts is required. The desired effect is to speed up approval and dispute PVF executions which require preparation and to delay backing executions which require preparation.

---------

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
alexggh added a commit that referenced this issue Apr 23, 2024
Part of #4126: we want to safely increase execute_workers_max_num gradually from chain to chain and assess whether there are any negative impacts.

This PR performs the necessary plumbing to be able to increase it based on the chain ID; it increases the number of execution workers from 2 to 4 on test networks but leaves Kusama and Polkadot unchanged until we gather more data.

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh
Contributor Author

alexggh commented Apr 23, 2024

Plumbing PR #4252 makes it possible to increase the number of execution workers based on the chain ID. It will take a few releases until an increase reaches Polkadot, but I don't think we have any reason to rush this, so it should be fine to move slowly here.

github-merge-queue bot pushed a commit that referenced this issue Apr 23, 2024
Add a metric to be able to understand how long jobs wait in the execution queue for an available worker.
#4126

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
github-merge-queue bot pushed a commit that referenced this issue Apr 24, 2024
Part of #4126: we want to safely increase execute_workers_max_num gradually from chain to chain and assess whether there are any negative impacts.

This PR performs the necessary plumbing to be able to increase it based on the chain ID; it increases the number of execution workers from 2 to 4 on test networks but leaves Kusama and Polkadot unchanged until we gather more data.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh
Contributor Author

alexggh commented Apr 30, 2024

Did some simulations to estimate the CPU HW needs for a network with 500 validators and 100 cores in normal conditions; the configuration is here and a run of it here.

Reference hardware: https://wiki.polkadot.network/docs/maintain-guides-how-to-validate-polkadot#reference-hardware

Estimated CPU usage with benchmarks

CPU usage, seconds                                     per block
# Approval usage with 5 no-shows per candidate
approval-distribution                                     0.7141
approval-voting                                           1.0567
# Availability distribution usage
availability-distribution                                 0.0250
availability-store                                        0.1670
bitfield-distribution                                     0.0246
# Availability recovery usage
availability-recovery                                     2.7548

Adding all that up, it would consume ~4.7s of a single CPU core. These are the subsystems that we know consume most of our CPU time, but we are still missing a lot of other subsystems; a safety margin would be to double that, so let's assume everything besides PVF execution consumes 9s of CPU time per block.

non_pvf_subsystems_cpu_time_per_block = 9s
reference_hw_cpu_count = 4 cores
reference_hw_total_available_cpu_time_per_block = 4 * 6s = 24s
current_pvf_execution_time_allocated_per_block = 2 * 6s = 12s # 50%
# reference_hw_total_available_cpu_time_per_block - non_pvf_subsystems_cpu_time_per_block - current_pvf_execution_time_allocated_per_block
spare_cpu_time_on_reference_hardware_per_block = 3s 

What average parachain execution time could we support with 2 execution threads?

Validators would need to verify at least 7 candidates per block (6 random VRF assignments and 1 backing candidate). With 2 execution threads we have a maximum of 12s of CPU time per block, so the theoretical maximum parachain execution time cannot go beyond ~1.7 seconds per candidate.

The current average on Kusama is around 200ms, so it would take roughly a 10x increase in the execution time of all parachains to reach that maximum throughput.
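The same budget written out as a small check (a sketch using exactly the figures above):

```rust
fn main() {
    let relay_block_time_s = 6.0;
    let reference_hw_cores = 4.0;
    let non_pvf_cpu_s = 9.0; // assumed non-PVF subsystem usage per block, incl. safety margin
    let execute_workers = 2.0;
    let candidates_per_block = 7.0; // 6 tranche0 assignments + 1 backing candidate

    let total_budget_s = reference_hw_cores * relay_block_time_s; // 24s
    let pvf_budget_s = execute_workers * relay_block_time_s; // 12s (50% of the total)
    let spare_s = total_budget_s - non_pvf_cpu_s - pvf_budget_s; // 3s
    let max_avg_execution_s = pvf_budget_s / candidates_per_block; // ~1.7s per candidate

    println!("spare: {spare_s}s, max average PVF execution: {max_avg_execution_s:.1}s");
}
```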

My conclusions

With these numbers, we can say there isn't much spare CPU time, so I would concur that 2 PVF execution threads is actually the safe choice for reference HW with 4 CPU cores, because it hard-caps PVF execution at a maximum of 50% of the available CPU time.

Going to 4 PVF execution threads would increase the available PVF execution time, but it has the downside that, at the theoretical limit, it could steal valuable CPU time from other mission-critical subsystems. With just 2 execution workers, if a lot of work gets queued for PVF execution we will be slow on backing and approvals; but since no-show approvals are accepted late and backing has to happen within a fixed window of time, we actually end up in a situation where we don't back new candidates and give the network time to catch up with the approval work.

2 PVF execution threads do not properly take advantage of validators having way more than 4 HW cores, but building dynamic scheduling based on the HW the node is running on would introduce a source of nondeterminism in the network, since there is no guarantee other nodes have the same HW underneath them.

@sandreim
Contributor

sandreim commented May 2, 2024

Thanks @alexggh for coming up with this neat analysis.

In order to get the full picture, we also need to consider statement-distribution load, which I've seen to be around 15-30% of a CPU on Kusama at the current scale, and also keep in mind that the libp2p-based networking stack is usually 50% of the total node-side CPU usage (modulo PVF executions) in the tests we've run so far at maximum scale. Issue being tracked here: #702.

@sandreim
Contributor

sandreim commented May 2, 2024

However, the new litep2p stack should alleviate the issue mentioned above.

@alexggh alexggh moved this from Backlog to In Progress in parachains team board May 9, 2024
@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/rfc-increasing-recommended-minimum-core-count-for-reference-hardware/8156/1
