PVF: Add a new class of "possibly invalid" errors #660

Closed
mrcnski opened this issue Apr 10, 2023 · 5 comments
Labels
C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be.

Comments

@mrcnski
Contributor

mrcnski commented Apr 10, 2023


Overview

PVF execution plays a critical role in our dispute process. If we execute a candidate block and find it invalid, we vote against it. If validators' votes disagree, a dispute is initiated.

Now, in the real world we may try to execute a candidate and have it fail due to a hardware fault, operator error, some rare bug, etc. In this case, voting against would initiate a dispute over a possibly valid candidate, which is not what we want: it may get the validator slashed!

So, right now we have two main failure states:

  1. InvalidCandidate, where we always vote against.
  2. InternalError, where we never vote against, but do retry once (since polkadot#7011, "PVF: Don't dispute on missing artifact") to let transient error conditions clear.

And we actually do have a third state already: InvalidCandidate::AmbiguousWorkerDeath, which happens when the worker process dies for an unknown reason. In this case we retry once, but if it still fails, we do vote against. This would happen, for example, if PVF execution takes up a large amount of memory and gets OOM-killed in a reproducible way.
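
Roughly sketched (with simplified names for illustration, not the actual polkadot-sdk types), the current behavior is:

enum CandidateError {
    // Deterministic execution failure: always vote against.
    InvalidCandidate,
    // Local, possibly transient problem: never vote against, retry once.
    InternalError,
    // Worker died for an unknown reason: retry once, then vote against.
    AmbiguousWorkerDeath,
}

// What to do after a failed execution attempt.
enum Action {
    VoteAgainst,
    Retry,
    Abstain,
}

fn on_failure(err: &CandidateError, already_retried: bool) -> Action {
    match err {
        CandidateError::InvalidCandidate => Action::VoteAgainst,
        CandidateError::InternalError if !already_retried => Action::Retry,
        CandidateError::InternalError => Action::Abstain,
        CandidateError::AmbiguousWorkerDeath if !already_retried => Action::Retry,
        CandidateError::AmbiguousWorkerDeath => Action::VoteAgainst,
    }
}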

Proposal

So, I am proposing that we extend this third category, giving it a catch-all name like PossiblyInvalid. We would retry these, and only on continued failure deem the candidate invalid.

Other errors that we could treat this way:

  1. Timeout due to exceeded CPU time. Currently this always results in InvalidCandidate. But under conditions of high local load we may do very little useful work while CPU time is still being counted, eventually timing out on a valid candidate. This could be retried after a delay, in the hope that the load has died down (see the sketch after this list). We have to be careful, though: retrying may add even more load, but on the other hand it may prevent disputes!
    a. (We may also want to detect conditions of high load and deal with it somehow.)
    b. NOTE: There may not be enough time for a retry in candidate validation from backing.
  2. We can treat RuntimeConstruction this way. Right now it always results in InvalidCandidate, though it should be a local issue. #661 ("PVF: Consider treating RuntimeConstruction as an internal execution error") would be the full fix, but in the meantime it would make sense to retry in this case before voting against.
  3. Any others?
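
For the timeout case, the retry-after-delay idea could look something like this sketch (execute_candidate, RETRY_DELAY, and the error/outcome types are made-up placeholders, not the real validation host API):

use std::thread::sleep;
use std::time::Duration;

// Hypothetical placeholder for the delay before retrying.
const RETRY_DELAY: Duration = Duration::from_secs(1);

enum ExecError { Invalid, Internal, PossiblyInvalid }
enum Outcome { Valid, Invalid, Abstain }

// Stand-in for the actual PVF execution call.
fn execute_candidate() -> Result<(), ExecError> {
    Err(ExecError::PossiblyInvalid)
}

fn validate_with_retry() -> Outcome {
    match execute_candidate() {
        Ok(()) => Outcome::Valid,
        Err(ExecError::Invalid) => Outcome::Invalid,
        Err(ExecError::Internal) => Outcome::Abstain,
        Err(ExecError::PossiblyInvalid) => {
            // Give transient conditions (e.g. high local load) a chance to clear.
            sleep(RETRY_DELAY);
            match execute_candidate() {
                Ok(()) => Outcome::Valid,
                Err(ExecError::Internal) => Outcome::Abstain,
                // Still failing after the retry: deem the candidate invalid.
                Err(_) => Outcome::Invalid,
            }
        }
    }
}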

Having this separate variant would also make the code easier to reason about, and would make it easier to add more "retry-then-vote-against" cases in the future.

Also, it would be nice to make these three distinct categories clear in the documentation!

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@the-right-joyce the-right-joyce added C1-mentor A task where a mentor is available. Please indicate in the issue who the mentor could be. T8-parachains_engineering and removed J7-mentor labels Aug 25, 2023
@jpserrat
Contributor

Hey @mrcnski, I'm looking for another issue to work on. Do you still recommend this one or is there another that you think would be better?

@mrcnski
Contributor Author

mrcnski commented Oct 11, 2023

Yeah, it would be good to have! Just note that for your other PR, we will need to make a fix so that the "1. Timeout due to exceeded CPU time" case works correctly. I'll add a comment there.

BTW, I'm at a work retreat right now, so may be slow to respond.

@eagr
Contributor

eagr commented Oct 13, 2023

enum ValidationError {
    // preparation issue that is deterministic
    Deterministic,
    // may-be-transient preparation issue caused by internal conditions
    Internal,
    // vote-against execution issue
    Invalid,
    // fail-once-more-vote-against execution issue
    PossiblyInvalid,
}

What do you think? @mrcnski

@mrcnski
Contributor Author

mrcnski commented Oct 17, 2023

That looks great to me @eagr! I would just make a few changes:

enum ValidationError {
    // preparation issue that is deterministic
    Preparation,
    // may-be-transient issue with preparation or execution, caused by internal conditions
    Internal,
    // vote-against execution issue
    Invalid,
    // fail-once-more-vote-against execution issue
    PossiblyInvalid,
}
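
Building on that enum, the failure handling could then dispatch something like this (illustrative only: the Action type and already_retried flag are made up, and treating a deterministic Preparation failure as a vote-against is an assumption, not something settled here):

enum Action { VoteAgainst, Retry, Abstain }

fn on_error(err: &ValidationError, already_retried: bool) -> Action {
    match err {
        // Assumption: a deterministic preparation failure counts against the candidate.
        ValidationError::Preparation => Action::VoteAgainst,
        // Possibly-transient local problem: never vote against.
        ValidationError::Internal => Action::Abstain,
        // Deterministic execution failure: vote against immediately.
        ValidationError::Invalid => Action::VoteAgainst,
        // Retry once; only repeated failure is deemed invalid.
        ValidationError::PossiblyInvalid if !already_retried => Action::Retry,
        ValidationError::PossiblyInvalid => Action::VoteAgainst,
    }
}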

And FYI, per discussions with @Overkillus we may not want to retry in backing. Only in approval. Can be a separate issue from this one though.
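
A minimal sketch of that gating (ExecContext and should_retry are hypothetical names, just to illustrate the backing-vs-approval distinction):

#[derive(PartialEq)]
enum ExecContext { Backing, Approval }

fn should_retry(ctx: ExecContext, already_retried: bool) -> bool {
    // Only approval has enough time budget for a retry.
    ctx == ExecContext::Approval && !already_retried
}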

@mrcnski
Contributor Author

mrcnski commented Nov 22, 2023

Closing as completed in #2406.

We decided in #2438 that we won't do "1. Timeout due to exceeded CPU time".

For "2. We can treat RuntimeConstruction this way", there is an issue already for the full fix #661. It should be pretty simple now that substrate is no longer in a separate repo. 😀

@mrcnski mrcnski closed this as completed Nov 22, 2023