
candidate-validation subsystem makes bounded channels pointless #708

Closed
eskimor opened this issue Feb 22, 2023 · 8 comments

@eskimor (Member) commented Feb 22, 2023

candidate-validation unconditionally spawns tasks on each new message.

This is bad for at least four reasons:

  1. It defeats the point of the bounded message channel, which is to apply backpressure to the sending subsystem when necessary.
  2. Related to 1, this unconditional spawning is essentially unbounded, so we might run into situations where we run out of memory.
  3. Due to the unboundedness, latency is also unbounded.
  4. It weakens our understanding of the overall system: the bounded channel to candidate-validation always appears empty, even when the subsystem is already severely loaded.

Superficially everything seems to work as is, but some checking of this unbounded queueing (there is more than one such queue, if I remember correctly) is in order. It might be causing hidden issues.

How it should work: we should indeed spawn a couple of workers in parallel, but track how many are active; once there are too many, start blocking on incoming messages.
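
A minimal sketch of that idea (the message type, the limit, and the tokio primitives here are placeholders for illustration, not the actual subsystem code): wait for a free worker slot before taking the next message, so that once all slots are busy the bounded channel backs up and the senders feel the pressure.

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

/// Hypothetical stand-in for a candidate-validation request.
struct ValidationRequest;

async fn validate(_req: ValidationRequest) {
    // ... run the PVF and report the result back to the requester ...
}

async fn run(mut rx: mpsc::Receiver<ValidationRequest>) {
    // Placeholder limit; the real number would come from configuration.
    const MAX_PARALLEL_VALIDATIONS: usize = 4;
    let slots = Arc::new(Semaphore::new(MAX_PARALLEL_VALIDATIONS));

    loop {
        // Wait for a free worker slot *before* taking the next message.
        // While we wait here, the bounded channel is not drained, so the
        // sending subsystems eventually block on their sends: backpressure.
        let permit = slots.clone().acquire_owned().await.expect("semaphore is never closed");

        let Some(req) = rx.recv().await else { break };

        tokio::spawn(async move {
            validate(req).await;
            drop(permit); // frees the slot once the validation is done
        });
    }
}
```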

@sandreim (Contributor) commented Feb 22, 2023

Isn't this still bounded by the maximum number of cores × forks? IIRC we can only run 2 PVFs in parallel (which means we cannot get severely loaded), but there is no backpressure, as you pointed out.

Indeed, we could just stop processing messages based on the validation backend load. We could implement this using FuturesUnordered instead of spawned tasks, so it's easy to bound the parallel work.
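
Something like the following rough sketch, with placeholder types and an arbitrary cap: the loop only reads the next message while the number of in-flight validations is below the cap, otherwise it waits for one of them to finish first.

```rust
use futures::{channel::mpsc, select, stream::{FuturesUnordered, StreamExt}};

const MAX_PARALLEL_VALIDATIONS: usize = 4; // placeholder limit

struct ValidationRequest; // hypothetical stand-in

async fn validate(_req: ValidationRequest) { /* ... */ }

async fn run(mut rx: mpsc::Receiver<ValidationRequest>) {
    let mut in_flight = FuturesUnordered::new();

    loop {
        if in_flight.len() < MAX_PARALLEL_VALIDATIONS {
            // Room for more work: race the next message against finishing work.
            select! {
                maybe_req = rx.next() => match maybe_req {
                    Some(req) => in_flight.push(validate(req)),
                    None => break, // channel closed
                },
                _ = in_flight.select_next_some() => {}
            }
        } else {
            // At capacity: stop reading the channel until a validation
            // finishes. This is where the backpressure kicks in.
            in_flight.select_next_some().await;
        }
    }

    // Let any remaining validations finish.
    while in_flight.next().await.is_some() {}
}
```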

@rphmeier (Contributor) commented Feb 22, 2023

I think we do more than 2 PVFs in parallel, based on the number of worker child processes the node is configured to have.

Worth noting that candidate-backing, dispute-coordinator, and approval-voting all spawn background tasks to handle the message send and wait for the validation result. This offloads the unboundedness to the sending side as well, although, as Andrei points out, it's still bounded by the amount of work those subsystems have to do themselves.
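
For illustration, that sending-side pattern looks roughly like this (hypothetical message and spawner types, not the real overseer API): the requesting subsystem fires off a detached task that does the bounded send and waits for the answer, so its own main loop never blocks on candidate-validation, which is exactly where the unboundedness moves to.

```rust
use futures::channel::{mpsc, oneshot};
use futures::future::BoxFuture;
use futures::SinkExt;

// Hypothetical stand-in for the real request message.
struct CandidateValidationMessage {
    // Placeholder for the real result type ("is the candidate valid?").
    response: oneshot::Sender<bool>,
}

// Called from e.g. candidate-backing's main loop. Instead of awaiting the
// result inline (which would propagate backpressure), it spawns a background
// task and returns immediately.
fn request_validation(
    mut to_validation: mpsc::Sender<CandidateValidationMessage>,
    spawn: impl Fn(BoxFuture<'static, ()>),
) {
    let (tx, rx) = oneshot::channel();
    let fut: BoxFuture<'static, ()> = Box::pin(async move {
        // Both the bounded send and the wait for the answer happen off the
        // subsystem's main loop, so the main loop never slows down no matter
        // how loaded candidate-validation is.
        if to_validation.send(CandidateValidationMessage { response: tx }).await.is_ok() {
            let _valid = rx.await;
            // ... act on the result ...
        }
    });
    spawn(fut);
}
```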

@s0me0ne-unkn0wn (Contributor) commented:

> I think we do more than 2 PVFs in parallel, based on the number of worker child processes the node is configured to have.

Currently, the PVF host config limits queues to 2 execution workers and one preparation worker.

I agree it should be checked. The candidate validation subsystem serves as an aggregation point for validation requests from different subsystems to pass them to the PVF hosts, so it should implement proper backpressure and buffering.

@rphmeier (Contributor) commented:

Off-topic side note, but I don't think that 2 execution workers will be enough. We probably need 6 or 8.

@s0me0ne-unkn0wn (Contributor) commented:

> Probably need 6 or 8

That was the case at one point (until November 2021, according to the commit history), but it turned out (#4273) that the more workers we have, the less deterministically the PVF host behaves. Validators have different hardware specs and different background loads, and executing a lot of PVFs in parallel results in timeouts and OOM conditions on some of them. It's also rather hard to test in the Versi environment, where all the nodes run under nearly identical conditions.

We have now switched from wall-clock timeouts to measuring CPU time in preparation jobs, and we're going to introduce a memory limit for preparation workers (#6687). That should make preparation more deterministic and allow us to lift the restrictions on the number of running workers, at least partially. But more needs to be done for execution workers.

@eskimor (Member, Author) commented Feb 28, 2023

Indeed, it's quite likely that we'll need to bump up the minimum hardware spec requirements for this.

@Polkadot-Forum commented:

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/ux-implications-of-pvf-executor-environment-versioning/2519/3

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@s0me0ne-unkn0wn s0me0ne-unkn0wn moved this from Backlog to In Progress in parachains team board Nov 2, 2023
@s0me0ne-unkn0wn s0me0ne-unkn0wn moved this from In Progress to Review in progress in parachains team board Nov 2, 2023
claravanstaden added a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023
github-merge-queue bot pushed a commit that referenced this issue Jan 21, 2024
This PR aims to channel the backpressure of the PVF host's preparation
and execution queues to the candidate validation subsystem consumers.

Related: #708
bkchr pushed a commit that referenced this issue Apr 10, 2024
@sandreim (Contributor) commented:

Implemented by #2125.

@github-project-automation github-project-automation bot moved this from Review in progress to Completed in parachains team board May 29, 2024