
Engine API: add validAncestorHash to engine_executePayload #84

Merged (6 commits, Oct 29, 2021)

Conversation

@djrtwo (Contributor) commented Oct 13, 2021

NOTE: Currently based on #83 and pointed to that PR as the base. Trying to keep the diff legible while continuing to be able to work. Will rebase and point this PR to main once #83 is merged

INVALID alone is insufficient to determine the validity of ancestor blocks that previously returned SYNCING. As per discussions at interop, add validAncestorHash to the return value to point to the latest (by block number) valid ancestor of the payload being processed. In the event of INVALID, this can thus invalidate an arbitrary number of blocks in the chain, starting with the payload.

Did discuss this with @holiman to ensure that this is feasible in geth.

Additionally:

  • Provide guidance on how long to try to resolve a payload's dependencies before responding with SYNCING -- SECONDS_PER_SLOT / 30 (0.4s)
  • Do not make PoW vs PoS an exceptional case w.r.t. the SYNCING return value. Requiring that dependencies of blocks with PoW ancestors be found, rather than returning SYNCING, could result in deadlocks.
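For illustration only (not part of the PR itself): a minimal Python sketch, under assumed data structures, of how a CL client could consume the proposed field to invalidate a whole optimistically imported branch rather than just the single payload. The `ExecutePayloadResult` type and the `ancestors` list are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ExecutePayloadResult:
    status: str                            # "VALID" | "INVALID" | "SYNCING"
    valid_ancestor_hash: Optional[bytes]   # the field proposed in this PR

def blocks_to_invalidate(result: ExecutePayloadResult,
                         payload_hash: bytes,
                         ancestors: List[bytes]) -> List[bytes]:
    """Return the block hashes the CL should mark INVALID.

    `ancestors` is the CL's view of the payload's branch, ordered
    oldest -> newest and ending with `payload_hash` itself.
    """
    if result.status != "INVALID":
        return []
    if result.valid_ancestor_hash is None or result.valid_ancestor_hash not in ancestors:
        # No usable ancestor info: only the payload itself is known to be bad.
        return [payload_hash]
    # Everything after the latest valid ancestor is invalid, not just the payload.
    cut = ancestors.index(result.valid_ancestor_hash)
    return ancestors[cut + 1:]
```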

@mkalinin (Collaborator) left a comment

Looks good to me! Made a couple of suggestions

Four review suggestions on src/engine/specification.md (outdated, resolved)
@holiman (Contributor) commented Oct 14, 2021

cc @karalabe PTAL?

@djrtwo (Contributor, Author) commented Oct 18, 2021

@karalabe and @holiman

Looking to get clarity on the "correct ancestor" requirement in this PR

@djrtwo djrtwo changed the title add validAncestorHash to engine_executePayload Engine API: add validAncestorHash to engine_executePayload Oct 18, 2021
Base automatically changed from remove-prepare-payload to main October 18, 2021 23:04
@holiman (Contributor) commented Oct 19, 2021

A couple of cases:

  • EL is behind. We're trying to catch up, and we're currently at height X. CL tells us about M (way ahead). We can't say much about it, since we'd have to reach out to the network and see if X is an ancestor of M. In this case, we could report back "Sorry, we're at height X, can't evaluate height M yet".
  • EL is syncing. CL tells us about M. The 'last good' state we know of is genesis. Technically, we can't even be certain that genesis is an ancestor of M.
  • We have the parent, the parent is fine, but this block is bad. We can return parentHash.
  • We have the parent blocks, but one of the ancestors failed validation. This scenario has happened when there have been consensus splits and long sidechains have developed. I'm not sure what the expectation on the EL is in this case, whether to track sidechains or not (currently we keep sidechains until we move things from the hot db to the ancient db, after 30K blocks; at that point, we remove non-canon blocks).

But it may be difficult to pinpoint the valid ancestor, as opposed to a 'current valid known block/height' (which may or may not be an ancestor).
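For illustration, a minimal sketch (assumed helpers, not geth code) of how an EL might map the cases above onto the proposed response shape; `chain.has_block`, `chain.is_valid`, `chain.latest_valid_ancestor`, and `chain.validate` are hypothetical.

```python
def execute_payload(payload, chain):
    """Return (status, latest_valid_hash) for the cases discussed above."""
    parent = payload.parent_hash
    if not chain.has_block(parent):
        # Cases 1 and 2: EL is behind or still syncing; it cannot even tell
        # whether its current head (or genesis) is an ancestor of the payload.
        return ("SYNCING", None)
    if not chain.is_valid(parent):
        # Case 4: the parent is known but sits on an invalidated side chain;
        # point back to the last ancestor that did validate (if still known).
        return ("INVALID", chain.latest_valid_ancestor(parent))
    if not chain.validate(payload):
        # Case 3: the parent is fine, this block is bad.
        return ("INVALID", parent)
    return ("VALID", payload.block_hash)
```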

@djrtwo (Contributor, Author) commented Oct 19, 2021

Due to the requirement that the CL optimistically process beacon blocks when the EL returns SYNCING (which can be because a few blocks, or the entire chain, are missing), when the CL gets an INVALID response it does not have enough information to validate or invalidate any previously SYNCING blocks.

The onus could then be on either the CL or the EL to figure things out. The proposed solution has the EL try to figure it out by essentially reporting how much of the branch in question is invalid.

The primary problem I now see is if the CL is asking about EL blocks that are entirely unavailable (e.g. due to being part of an invalid chain of >30k depth that has been pruned). This would leave the EL unable to respond with anything other than SYNCING because it cannot find the branch that the payload belongs to.
(Note that the 30k+ depth case would require a 1/3 attacker that does something slashable and can only try to trick (likely just deadlock) syncing nodes.)

Your cases:

  • This is the SYNCING return and is expected behavior for this case, or even for much deeper cases
  • Again, in this case, the EL would return SYNCING because it does not have the info to evaluate an unknown branch here
  • Yep, looks good
  • This is likely workable in most cases. The problem case would be if the beacon chain optimistically synced 30k+ blocks that had an invalid EL and which were pruned away by the EL (this is ~3.5 days of blocks). We thus have a (very unlikely) edge case where validators tricked a CL into thinking some beacon branch was viable when it ultimately had an invalid EL. This is a failure case and might require manual intervention (which is fine), but the problem is whether the EL or CL can figure out that there was a failure. I suspect that, as currently specified, the EL would just keep returning SYNCING, which would prevent the CL from "making decisions" and serving data from endpoints as canonical (which is good), but you could end up in a deadlock of sorts whereby the EL can never resolve the branch being queried about.

iiuc, the only time the latest valid ancestor would be inaccessible is if this "bad" chain was pruned and EL could no longer find it when CL tries to insert a payload. So the questions are:

  1. In such a 30k+ deep bad chain, can the EL remember adequate info to respond with INVALID instead? Answer: no, because it's not just a node that previously had the info and pruned it; it is also new nodes that are coming online
  2. Can the CL detect this issue in a reasonable way? That is, can it -- through just seeing SYNCING responses -- decide that it might be on an unavailable/bad chain and needs to look elsewhere or throw a critical error

Or do we make the assumption that the CL will, in almost all realistic cases, not be led down a simultaneously invalid and unavailable EL chain when syncing, because this would require a 1/3 slashable actor? And if it did happen, it would only result in a small set of nodes being tricked and put into a bad sync state.
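For illustration, a hypothetical sketch of the CL-side bookkeeping implied by this discussion: blocks imported while the EL reports SYNCING stay marked optimistic and are not served as canonical, and a long-lived optimistic state can be surfaced as a warning rather than silently deadlocking. The class and the threshold are invented for the example.

```python
import time
from typing import Optional, Set

class OptimisticTracker:
    """Track blocks imported optimistically while the EL keeps answering SYNCING."""

    def __init__(self, stall_warning_seconds: float = 3.5 * 24 * 3600):
        # Roughly the ~3.5-day / 30k-block window discussed above; the exact
        # threshold is an arbitrary choice for this example.
        self.stall_warning_seconds = stall_warning_seconds
        self.optimistic_blocks: Set[bytes] = set()
        self.first_syncing_at: Optional[float] = None

    def on_syncing(self, block_hash: bytes) -> None:
        self.optimistic_blocks.add(block_hash)
        if self.first_syncing_at is None:
            self.first_syncing_at = time.monotonic()

    def on_resolved(self, block_hash: bytes) -> None:
        # Called once the EL returns VALID (or the block is invalidated).
        self.optimistic_blocks.discard(block_hash)
        if not self.optimistic_blocks:
            self.first_syncing_at = None

    def possibly_stuck(self) -> bool:
        """True if the node has been optimistic so long that the branch may be
        unavailable/invalid and manual intervention may be needed."""
        return (self.first_syncing_at is not None and
                time.monotonic() - self.first_syncing_at > self.stall_warning_seconds)
```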

@djrtwo (Contributor, Author) commented Oct 27, 2021

Okay, based on our convo today, we are going to keep the functionality as specified here -- return the latest valid hash (`latestValidHash`) in the event that the payload is INVALID.

There are still some cases where the EL might never be able to resolve SYNCING, due to many reasons -- e.g. peering issues, an unavailable invalid chain, etc. But this is no different from today.

How to surface such a sync failure is not in the scope of this PR, and generally not in the scope of the core of the engine API.

Two review suggestions on src/engine/specification.md (outdated, resolved)
4. In the case when the parent block is unknown, client software **MUST** pull the block from the network and take one of the following actions depending on the parent block's properties:
   - If the parent block is a PoW block as per the [EIP-3675](https://eips.ethereum.org/EIPS/eip-3675#specification) definition, then all missing dependencies of the payload **MUST** be pulled from the network and validated accordingly. The call **MUST** be responded to according to the validity of the payload and the chain of its ancestors.
   - If the parent block is a PoS block as per the [EIP-3675](https://eips.ethereum.org/EIPS/eip-3675#specification) definition, then the call **MAY** be responded to with `SYNCING` status and the sync process **SHOULD** be initiated accordingly.
3. Client software **MUST** return `{status: SYNCING, latestValidHash: None}` if the client software does not have the requisite data available locally to validate the payload and cannot retrieve the required data in less than `SECONDS_PER_SLOT / 30` (0.4s in the Mainnet configuration), or if the sync process is already in progress. In the event that requisite data to validate the payload is missing (e.g. the client does not have the payload identified by `parentHash`), the client software **SHOULD** initiate the sync process.
Collaborator:

What is the rationale for the timeout? Does the timer start ticking upon receiving the call, and does it include the corresponding disk/cache access that tries to pull up the parent's state? If we want to give the EL some time to go to the wire and pull e.g. a parent block, would this timeout cover the execution of the parent block or just pulling it from the wire?

@djrtwo (Contributor, Author) commented Oct 28, 2021

I think the timeout only includes non-local retrieval. If you have the info locally (in a disk or memory cache), you should not be responding with SYNCING because you are in fact synced.

The idea is that we want to give some guidance on when EL should decide, "I don't have the info locally required and don't have enough time to quickly sync to respond to the call"
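A minimal sketch of that guidance, assuming hypothetical `local_db` and `network` helpers: a local hit never produces SYNCING, and only the network round trip is bounded by the ~0.4s budget.

```python
SECONDS_PER_SLOT = 12
RETRIEVAL_BUDGET = SECONDS_PER_SLOT / 30   # 0.4s in the mainnet configuration

def resolve_parent(parent_hash, local_db, network):
    block = local_db.get(parent_hash)
    if block is not None:
        # Disk/memory cache hit: we are synced, the budget never applies.
        return block
    # Not available locally: try the wire, but only within the budget.
    block = network.fetch_block(parent_hash, timeout=RETRIEVAL_BUDGET)
    if block is None:
        # Give up for this call: initiate sync and answer SYNCING.
        network.start_sync(parent_hash)
    return block
```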

Collaborator:

So, the timer starts ticking when the EL client initiates the sync process and stops when the missing block is received and executed, i.e. the timeout applies to the sync process. I think the timeout mechanism should be specified in more detail to avoid bugs due to misreading the spec.

I am thinking about cases where this timeout is beneficial for the node. Here is what comes to mind:

  • Temporary outage of the EL software leading to the EL's head falling behind the CL's. If the diff is only one EL block, then the validator will likely not miss its attestation opportunity, as the dependency will be resolved without the EL turning into SYNCING; if it's more than a block, then it will likely turn into SYNCING

IMO, this mechanism could be even more beneficial if it targeted the logical beacon chain clock instead of just absolute time. I.e. "if the initiated sync process isn't finished by TIMESTAMP + (SECONDS_PER_SLOT / 3 - 1) then the EL must respond with SYNCING". In SECONDS_PER_SLOT / 3 - 1, the 1 is a placeholder and should be the time normally needed to execute the payload, respond back to the CL, and for the CL to update its fork choice state. The target here is to have the beacon block accepted before the attestation boundary.

Though, any of these approaches looks complicated and bug-prone. Taking into account how infrequently such cases appear, do we really want this mechanism to be implemented? And if we do, the next question is whether the spec is a good place for it, or whether it could be part of the implementation. My current impression is that this is premature optimisation; when the need for this optimisation becomes apparent, it could be addressed by implementers without requiring a change to the spec.
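For comparison, a small sketch of the clock-based alternative described above (not adopted in this PR): give up in time for the block to be processed before the attestation boundary at one third of the slot. The one-second margin is the placeholder mentioned above.

```python
SECONDS_PER_SLOT = 12
PROCESSING_MARGIN = 1   # placeholder: execute payload + respond + update fork choice

def syncing_deadline(slot_start_timestamp: float) -> float:
    """Latest wall-clock time at which the EL should stop trying to resolve
    dependencies and respond with SYNCING instead."""
    return slot_start_timestamp + SECONDS_PER_SLOT / 3 - PROCESSING_MARGIN
```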

@djrtwo (Contributor, Author):

I agree with pulling QoS out of this PR. I'll pull it out into a separate issue for discussion

@djrtwo (Contributor, Author):

removed here b6ded72

@djrtwo (Contributor, Author):

and opened relevant issue here -- #91

@lightclient (Member) commented

Apologies for the force push on your PR @djrtwo -- was just resolving the conflicts from #85.

@mkalinin (Collaborator) left a comment

LGTM!
