PVF: Add a new class of "possibly invalid" errors #660

Closed
mrcnski opened this issue Apr 10, 2023 · 5 comments
Labels
C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be.

Comments

@mrcnski
Contributor

mrcnski commented Apr 10, 2023


Overview

PVF execution plays a critical role in our dispute process. If we execute a candidate block and find it invalid, we vote against it. If validators' votes disagree, a dispute is initiated.

Now, in the real world we may try to execute a candidate and have it fail due to a hardware fault, operator error, some rare bug, etc. In this case, voting against would initiate a dispute over a possibly valid candidate, which is not what we want: it may get the validator slashed!

So, right now we have two main failure states:

  1. InvalidCandidate, where we always vote against.
  2. InternalError, where we never vote against, but do retry once (since polkadot#7011, "PVF: Don't dispute on missing artifact") to let transient error conditions clear.

And we actually do have a third state already: InvalidCandidate::AmbiguousWorkerDeath, which happens when the worker process dies for an unknown reason. In this case we retry once, but if it still fails, we do vote against. This would happen, for example, if PVF execution takes up a large amount of memory and gets OOM-killed in a reproducible way.
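
Roughly sketched (with simplified names for illustration, not the actual polkadot-sdk types), the current behavior is:

enum CandidateError {
    // Deterministic execution failure: always vote against.
    InvalidCandidate,
    // Local, possibly transient problem: never vote against, retry once.
    InternalError,
    // Worker died for an unknown reason: retry once, then vote against.
    AmbiguousWorkerDeath,
}

// What to do after a failed execution attempt.
enum Action {
    VoteAgainst,
    Retry,
    Abstain,
}

fn on_failure(err: &CandidateError, already_retried: bool) -> Action {
    match err {
        CandidateError::InvalidCandidate => Action::VoteAgainst,
        CandidateError::InternalError if !already_retried => Action::Retry,
        CandidateError::InternalError => Action::Abstain,
        CandidateError::AmbiguousWorkerDeath if !already_retried => Action::Retry,
        CandidateError::AmbiguousWorkerDeath => Action::VoteAgainst,
    }
}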

Proposal

So, I am proposing that we extend this third category, giving it a catch-all name like PossiblyInvalid. We would retry these, and only on continued failure deem the candidate invalid.

Other errors that we could treat this way:

  1. Timeout due to exceeded CPU time. Currently this always results in InvalidCandidate. But under conditions of high local load we may do very little useful work while CPU time is still being counted, eventually timing out on a valid candidate. This could be retried after a delay, in the hope that the load has died down (see the sketch after this list). We have to be careful, though: retrying may add even more load, but on the other hand it may prevent disputes!
    a. (We may also want to detect conditions of high load and deal with it somehow.)
    b. NOTE: There may not be enough time for a retry in candidate validation from backing.
  2. We can treat RuntimeConstruction this way. Right now it always results in InvalidCandidate, though it should be a local issue. #661 ("PVF: Consider treating RuntimeConstruction as an internal execution error") would be the full fix, but in the meantime it would make sense to retry in this case before voting against.
  3. Any others?
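
For the timeout case, the retry-after-delay idea could look something like this sketch (execute_candidate, RETRY_DELAY, and the error/outcome types are made-up placeholders, not the real validation host API):

use std::thread::sleep;
use std::time::Duration;

// Hypothetical placeholder for the delay before retrying.
const RETRY_DELAY: Duration = Duration::from_secs(1);

enum ExecError { Invalid, Internal, PossiblyInvalid }
enum Outcome { Valid, Invalid, Abstain }

// Stand-in for the actual PVF execution call.
fn execute_candidate() -> Result<(), ExecError> {
    Err(ExecError::PossiblyInvalid)
}

fn validate_with_retry() -> Outcome {
    match execute_candidate() {
        Ok(()) => Outcome::Valid,
        Err(ExecError::Invalid) => Outcome::Invalid,
        Err(ExecError::Internal) => Outcome::Abstain,
        Err(ExecError::PossiblyInvalid) => {
            // Give transient conditions (e.g. high local load) a chance to clear.
            sleep(RETRY_DELAY);
            match execute_candidate() {
                Ok(()) => Outcome::Valid,
                Err(ExecError::Internal) => Outcome::Abstain,
                // Still failing after the retry: deem the candidate invalid.
                Err(_) => Outcome::Invalid,
            }
        }
    }
}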

Having this separate variant would also make the code easier to reason about, and would make it easier to add more "retry-then-vote-against" cases in the future.

Also, it would be nice to make these three distinct categories clear in the documentation!

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@the-right-joyce the-right-joyce added C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be. T8-parachains_engineering and removed J7-mentor labels Aug 25, 2023
@jpserrat
Contributor

Hey @mrcnski, I'm looking for another issue to work on. Do you still recommend this one or is there another that you think would be better?

@mrcnski
Contributor Author

mrcnski commented Oct 11, 2023

Yeah, it would be good to have! Just note that for your other PR, we will need to make a fix so that the "1. Timeout due to exceeded CPU time" case works correctly. I'll add a comment there.

BTW, I'm at a work retreat right now, so may be slow to respond.

@eagr
Contributor

eagr commented Oct 13, 2023

enum ValidationError {
    // preparation issue that is deterministic
    Deterministic,
    // may-be-transient preparation issue caused by internal conditions
    Internal,
    // vote-against execution issue
    Invalid,
    // fail-once-more-vote-against execution issue
    PossiblyInvalid,
}

What do you think? @mrcnski

@mrcnski
Contributor Author

mrcnski commented Oct 17, 2023

That looks great to me @eagr! I would just make a few changes:

enum ValidationError {
    // preparation issue that is deterministic
    Preparation,
    // may-be-transient issue with preparation or execution, caused by internal conditions
    Internal,
    // vote-against execution issue
    Invalid,
    // fail-once-more-vote-against execution issue
    PossiblyInvalid,
}
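
Building on that enum, the failure handling could then dispatch something like this (illustrative only: the Action type and already_retried flag are made up, and treating a deterministic Preparation failure as a vote-against is an assumption, not something settled here):

enum Action { VoteAgainst, Retry, Abstain }

fn on_error(err: &ValidationError, already_retried: bool) -> Action {
    match err {
        // Assumption: a deterministic preparation failure counts against the candidate.
        ValidationError::Preparation => Action::VoteAgainst,
        // Possibly-transient local problem: never vote against.
        ValidationError::Internal => Action::Abstain,
        // Deterministic execution failure: vote against immediately.
        ValidationError::Invalid => Action::VoteAgainst,
        // Retry once; only repeated failure is deemed invalid.
        ValidationError::PossiblyInvalid if !already_retried => Action::Retry,
        ValidationError::PossiblyInvalid => Action::VoteAgainst,
    }
}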

And FYI, per discussions with @Overkillus we may not want to retry in backing. Only in approval. Can be a separate issue from this one though.
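
A minimal sketch of that gating (ExecContext and should_retry are hypothetical names, just to illustrate the backing-vs-approval distinction):

#[derive(PartialEq)]
enum ExecContext { Backing, Approval }

fn should_retry(ctx: ExecContext, already_retried: bool) -> bool {
    // Only approval has enough time budget for a retry.
    ctx == ExecContext::Approval && !already_retried
}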

@mrcnski
Contributor Author

mrcnski commented Nov 22, 2023

Closing as completed in #2406.

We decided in #2438 that we won't do "1. Timeout due to exceeded CPU time".

For "2. We can treat RuntimeConstruction this way", there is an issue already for the full fix #661. It should be pretty simple now that substrate is no longer in a separate repo. 😀

@mrcnski mrcnski closed this as completed Nov 22, 2023