candidate-validation subsystem makes bounded channels pointless #708

eskimor · 2023-02-22T13:06:54Z

candidate-validation unconditionally spawns tasks on each new message.

This is bad for at least 3 reasons:

It defeats the point of the bounded message channel, which is to back-pressure on the sending subsystem if necessary.
Related to 1, this unconditional spawning is essentially unbounded - hence we might run into situations where we run out of memory.
Due to the unboundedness also latency is unbounded.
It weakens our understanding of the overall system. The bounded channel size to candidate-validation is always 0, although it might already be severely loaded.

Superficially everything seems to work as is, but some checking on these unbounded queuing (there is more than one, if I remember correctly) is in order. It might be causing hidden issues.

How it should work: Indeed spawn a couple of workers in parallel, but track how many are active - once too many, start blocking on incoming messages.

sandreim · 2023-02-22T13:38:20Z

Isn't this still bounded by the maximum amount of cores x forks ? IIRC we can only run 2 PVFs in parallel (this means we cannot get severely loaded), but there is no backpressure as you pointed out.

Indeed we could just stop processing messages based on the validation backend load. We could implement this using futures unordered instead of tasks so it's easy to add bounds to parallel work.

rphmeier · 2023-02-22T20:58:13Z

I think we do more than 2 PVFs in parallel, based on the number of worker child processes the node is configured to have.

Worth noting that candidate-backing, dispute-coordinator, and approval-voting all spawn background tasks to handle the message send and wait for the validation result. This offloads the unboundedness to the sending side, as well. Although as Andrei points out, it's still bounded by the amount of work those subsystems have to do themselves.

s0me0ne-unkn0wn · 2023-02-27T12:57:11Z

I think we do more than 2 PVFs in parallel, based on the number of worker child processes the node is configured to have.

Currently, the PVF host config limits queues to 2 execution workers and one preparation worker.

I agree it should be checked. The candidate validation subsystem serves as an aggregation point for validation requests from different subsystems to pass them to the PVF hosts, so it should implement proper backpressure and buffering.

rphmeier · 2023-02-28T04:57:29Z

off topic side note, but I don't think that 2 execution workers will be enough. Probably need 6 or 8

s0me0ne-unkn0wn · 2023-02-28T11:01:13Z

Probably need 6 or 8

It was the case one day (until November 2021, according to commit history), but it turned out (#4273) that the more workers we have, the less deterministically the PVF host behaves. All the validators feature different hardware specs and different background loads and executing a lot of PVFs in parallel results in timeouts and OOM conditions on the part of them. Also, it's rather hard to test in Versi environment where all the nodes run in nearly identical conditions.

Now we switched from measuring wallclock timeouts to measuring CPU time in preparation jobs, and we're going to introduce preparation worker memory limit (#6687). That would allow us to make preparation more deterministic and to lift those restrictions on the number of running workers at least partially. But more needs to be done for execution workers.

eskimor · 2023-02-28T16:39:20Z

Indeed, quite likely that we need to bump up minimum hardware spec requirements for this.

Polkadot-Forum · 2023-04-06T09:55:34Z

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/ux-implications-of-pvf-executor-environment-versioning/2519/3

* Starts with upgrade. * First pass at upgrade. * Fix jsonrpsee dependencies and command file. * Updates test-parachain * Snowblink and snowbridge runtime upgrades. * Updates readme with Polkadot version. * Fix weights after merge. Co-authored-by: claravanstaden <Cats 4 life!>

This PR aims to channel the backpressure of the PVF host's preparation and execution queues to the candidate validation subsystem consumers. Related: #708

Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.61 to 1.0.62. - [Release notes](https://github.com/serde-rs/json/releases) - [Commits](serde-rs/json@v1.0.61...v1.0.62) Signed-off-by: dependabot-preview[bot] <support@dependabot.com> Co-authored-by: dependabot-preview[bot] <27856297+dependabot-preview[bot]@users.noreply.github.com>

sandreim · 2024-05-29T13:35:50Z

implemented by #2125

Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023

the-right-joyce added this to parachains team board Oct 18, 2023

the-right-joyce moved this to Backlog in parachains team board Oct 18, 2023

s0me0ne-unkn0wn mentioned this issue Nov 2, 2023

Make candidate validation bounded again #2125

Merged

s0me0ne-unkn0wn moved this from Backlog to In Progress in parachains team board Nov 2, 2023

s0me0ne-unkn0wn moved this from In Progress to Review in progress in parachains team board Nov 2, 2023

github-merge-queue bot pushed a commit that referenced this issue Jan 21, 2024

Make candidate validation bounded again (#2125)

d37a456

This PR aims to channel the backpressure of the PVF host's preparation and execution queues to the candidate validation subsystem consumers. Related: #708

sandreim closed this as completed May 29, 2024

github-project-automation bot moved this from Review in progress to Completed in parachains team board May 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

candidate-validation subsystem makes bounded channels pointless #708

candidate-validation subsystem makes bounded channels pointless #708

eskimor commented Feb 22, 2023 •

edited

Loading

sandreim commented Feb 22, 2023 •

edited

Loading

rphmeier commented Feb 22, 2023 •

edited

Loading

s0me0ne-unkn0wn commented Feb 27, 2023

rphmeier commented Feb 28, 2023

s0me0ne-unkn0wn commented Feb 28, 2023

eskimor commented Feb 28, 2023

Polkadot-Forum commented Apr 6, 2023

sandreim commented May 29, 2024

candidate-validation subsystem makes bounded channels pointless #708

candidate-validation subsystem makes bounded channels pointless #708

Comments

eskimor commented Feb 22, 2023 • edited Loading

sandreim commented Feb 22, 2023 • edited Loading

rphmeier commented Feb 22, 2023 • edited Loading

s0me0ne-unkn0wn commented Feb 27, 2023

rphmeier commented Feb 28, 2023

s0me0ne-unkn0wn commented Feb 28, 2023

eskimor commented Feb 28, 2023

Polkadot-Forum commented Apr 6, 2023

sandreim commented May 29, 2024

eskimor commented Feb 22, 2023 •

edited

Loading

sandreim commented Feb 22, 2023 •

edited

Loading

rphmeier commented Feb 22, 2023 •

edited

Loading