candidate-validation subsystem makes bounded channels pointless #708
Comments
Isn't this still bounded by the maximum number of cores × forks? IIRC we can only run 2 PVFs in parallel (which means we cannot get severely loaded), but there is no backpressure, as you pointed out. Indeed, we could just stop processing messages based on the validation backend's load. We could implement this with FuturesUnordered instead of spawned tasks, which would make it easy to put bounds on the parallel work; see the sketch below.
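A minimal sketch of that idea, assuming a `futures`-based event loop; `Request`, `Outcome`, `validate`, and `MAX_PARALLEL` are illustrative stand-ins, not the actual subsystem API:

```rust
use futures::channel::mpsc;
use futures::stream::{FuturesUnordered, StreamExt};

struct Request; // hypothetical validation request
struct Outcome; // hypothetical validation result

// Stand-in for handing a request to the PVF backend.
async fn validate(_req: Request) -> Outcome {
    Outcome
}

const MAX_PARALLEL: usize = 2; // illustrative bound

async fn run(mut rx: mpsc::Receiver<Request>) {
    let mut in_flight = FuturesUnordered::new();
    while let Some(req) = rx.next().await {
        if in_flight.len() >= MAX_PARALLEL {
            // At capacity: wait for one job to finish before accepting
            // more work. Not reading from `rx` here is what propagates
            // backpressure to the (bounded) sending side.
            in_flight.next().await;
        }
        in_flight.push(validate(req));
    }
    // Channel closed: drain the remaining jobs.
    while in_flight.next().await.is_some() {}
}
```

Because the jobs live in the event loop's own `FuturesUnordered` rather than detached tasks, the loop can see how many are in flight and simply stop pulling messages when the bound is hit.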
I think we do more than 2 PVFs in parallel, based on the number of worker child processes the node is configured with. Worth noting that candidate-backing, dispute-coordinator, and approval-voting all spawn background tasks to send the message and wait for the validation result, which offloads the unboundedness to the sending side as well. Although, as Andrei points out, it's still bounded by the amount of work those subsystems have to do themselves.
Currently, the PVF host config limits the queues to 2 execution workers and one preparation worker. I agree it should be checked. The candidate-validation subsystem serves as an aggregation point for validation requests from different subsystems and passes them on to the PVF host, so it should implement proper backpressure and buffering.
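For illustration, a minimal sketch of such an aggregation point with a bounded queue in front of the PVF host, assuming `futures::channel::mpsc`; `ValidationRequest` and `QUEUE_CAPACITY` are hypothetical, not the real config:

```rust
use futures::channel::mpsc;
use futures::{SinkExt, StreamExt};

struct ValidationRequest; // hypothetical request type

// Hypothetical capacity; the point is only that it is finite.
const QUEUE_CAPACITY: usize = 10;

// The bounded queue sitting between the aggregation point and the PVF host.
fn pvf_queue() -> (mpsc::Sender<ValidationRequest>, mpsc::Receiver<ValidationRequest>) {
    mpsc::channel(QUEUE_CAPACITY)
}

async fn forward(
    mut from_subsystems: mpsc::Receiver<ValidationRequest>,
    mut to_pvf_host: mpsc::Sender<ValidationRequest>,
) {
    while let Some(req) = from_subsystems.next().await {
        // `send` resolves only once the bounded queue has room, so the
        // PVF host's load is felt upstream instead of queuing unboundedly.
        if to_pvf_host.send(req).await.is_err() {
            break; // receiver dropped: the PVF host is gone
        }
    }
}
```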
Off-topic side note, but I don't think that 2 execution workers will be enough. We probably need 6 or 8.
That was the case at one point (until November 2021, according to the commit history), but it turned out (#4273) that the more workers we have, the less deterministically the PVF host behaves. Validators run on different hardware specs with different background loads, and executing many PVFs in parallel results in timeouts and OOM conditions on some of them. It's also rather hard to test in the Versi environment, where all the nodes run in nearly identical conditions. We have now switched from measuring wall-clock timeouts to measuring CPU time in preparation jobs, and we're going to introduce a preparation worker memory limit (#6687). That should allow us to make preparation more deterministic and to lift the restrictions on the number of running workers, at least partially. But more needs to be done for execution workers.
Indeed, it's quite likely that we need to bump up the minimum hardware spec requirements for this.
This issue has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/ux-implications-of-pvf-executor-environment-versioning/2519/3
This PR aims to channel the backpressure of the PVF host's preparation and execution queues through to the consumers of the candidate-validation subsystem. Related: #708
Implemented by #2125.
candidate-validation unconditionally spawns tasks on each new message.
This is bad for at least three reasons. Superficially everything seems to work as is, but some checking of this unbounded queuing (there is more than one such queue, if I remember correctly) is in order. It might be causing hidden issues.
How it should work: do spawn a few workers in parallel, but track how many are active; once there are too many, start blocking on incoming messages. A sketch follows below.
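A minimal sketch of that suggestion using a semaphore to track active workers, assuming a tokio runtime; `Job`, `handle`, and `MAX_ACTIVE` are illustrative, not the actual subsystem code:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

struct Job; // hypothetical unit of validation work

// Stand-in for the actual validation work.
async fn handle(_job: Job) {}

const MAX_ACTIVE: usize = 2; // illustrative bound

async fn run(mut rx: tokio::sync::mpsc::Receiver<Job>) {
    let permits = Arc::new(Semaphore::new(MAX_ACTIVE));
    while let Some(job) = rx.recv().await {
        // Blocks here while MAX_ACTIVE jobs are running, so incoming
        // messages back up in the bounded channel instead of spawning
        // an unbounded number of tasks.
        let permit = permits
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore is never closed");
        tokio::spawn(async move {
            handle(job).await;
            drop(permit); // frees a slot for the next message
        });
    }
}
```

Holding the permit inside the spawned task means the loop stops pulling from the channel once `MAX_ACTIVE` jobs are in flight, which is exactly where the backpressure gets restored.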