
Engine API: add validAncestorHash to engine_executePayload #84

Merged (6 commits, Oct 29, 2021)

Conversation

@djrtwo (Contributor) commented Oct 13, 2021

NOTE: Currently based on #83 and pointed to that PR as the base. Trying to keep the diff legible while continuing to be able to work. Will rebase and point this PR to main once #83 is merged

INVALID alone is insufficient to determine the validity of ancestor blocks that previously returned SYNCING. As per discussions at interop, add validAncestorHash to the return value to point to the latest (by block number) valid ancestor of the payload being processed. In the event of INVALID, this can thus invalidate an arbitrary number of blocks in the chain, starting with the payload.

Did discuss this with @holiman to ensure that this is feasible in geth.

Additionally:

  • Provide guidance on how long to try to resolve a payload's dependencies before responding with SYNCING -- SECONDS_PER_SLOT / 30 (0.4s)
  • Do not make PoW vs PoS an exceptional case w.r.t. the SYNCING return value. Requiring that dependencies of blocks with PoW ancestors be found, rather than returning SYNCING, could result in deadlocks.
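For illustration only (not part of the PR itself): a minimal Python sketch, under assumed data structures, of how a CL client could consume the proposed field to invalidate a whole optimistically imported branch rather than just the single payload. The `ExecutePayloadResult` type and the `ancestors` list are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ExecutePayloadResult:
    status: str                            # "VALID" | "INVALID" | "SYNCING"
    valid_ancestor_hash: Optional[bytes]   # the field proposed in this PR

def blocks_to_invalidate(result: ExecutePayloadResult,
                         payload_hash: bytes,
                         ancestors: List[bytes]) -> List[bytes]:
    """Return the block hashes the CL should mark INVALID.

    `ancestors` is the CL's view of the payload's branch, ordered
    oldest -> newest and ending with `payload_hash` itself.
    """
    if result.status != "INVALID":
        return []
    if result.valid_ancestor_hash is None or result.valid_ancestor_hash not in ancestors:
        # No usable ancestor info: only the payload itself is known to be bad.
        return [payload_hash]
    # Everything after the latest valid ancestor is invalid, not just the payload.
    cut = ancestors.index(result.valid_ancestor_hash)
    return ancestors[cut + 1:]
```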

@mkalinin (Collaborator) left a comment

Looks good to me! Made a couple of suggestions

Four review suggestions on src/engine/specification.md (outdated, resolved)
@holiman (Contributor) commented Oct 14, 2021

cc @karalabe PTAL?

@djrtwo (Contributor, Author) commented Oct 18, 2021

@karalabe and @holiman

Looking to get clarity on the "correct ancestor" requirement in this PR

@djrtwo djrtwo changed the title add validAncestorHash to engine_executePayload Engine API: add validAncestorHash to engine_executePayload Oct 18, 2021
Base automatically changed from remove-prepare-payload to main October 18, 2021 23:04
@holiman (Contributor) commented Oct 19, 2021

A couple of cases:

  • EL is behind. We're trying to catch up, and we're currently at height X. CL tells us about M (way ahead). We can't say much about it, since we'd have to reach out to the network and see if X is an ancestor of M. In this case, we could report back "Sorry, we're at height X, can't evaluate height M yet".
  • EL is syncing. CL tells us about M. The 'last good' state we know of is genesis. Technically, we can't even be certain that genesis is an ancestor of M.
  • We have the parent, the parent is fine, but this block is bad. We can return parentHash.
  • We have the parent blocks, but one of the ancestors failed validation. This scenario has happened when there have been consensus splits and long sidechains have developed. I'm not sure what the expectation on the EL is in this case, whether to track sidechains or not (currently we keep sidechains until we move things from the hot db to the ancient db, after 30K blocks; at that point, we remove non-canon blocks).

But it may be difficult to pinpoint the valid ancestor, as opposed to a 'current valid known block/height' (which may or may not be an ancestor).
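For illustration, a minimal sketch (assumed helpers, not geth code) of how an EL might map the cases above onto the proposed response shape; `chain.has_block`, `chain.is_valid`, `chain.latest_valid_ancestor`, and `chain.validate` are hypothetical.

```python
def execute_payload(payload, chain):
    """Return (status, latest_valid_hash) for the cases discussed above."""
    parent = payload.parent_hash
    if not chain.has_block(parent):
        # Cases 1 and 2: EL is behind or still syncing; it cannot even tell
        # whether its current head (or genesis) is an ancestor of the payload.
        return ("SYNCING", None)
    if not chain.is_valid(parent):
        # Case 4: the parent is known but sits on an invalidated side chain;
        # point back to the last ancestor that did validate (if still known).
        return ("INVALID", chain.latest_valid_ancestor(parent))
    if not chain.validate(payload):
        # Case 3: the parent is fine, this block is bad.
        return ("INVALID", parent)
    return ("VALID", payload.block_hash)
```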

@djrtwo (Contributor, Author) commented Oct 19, 2021

Due to the requirement that the CL optimistically process beacon blocks when the EL returns SYNCING (which can be because a few blocks, or the entire chain, are missing), when the CL gets an INVALID response it does not have enough information to validate or invalidate any previously SYNCING blocks.

The onus could then be on either the CL or the EL to figure things out. The proposed solution has the EL try to figure it out by essentially reporting how much of the branch in question is invalid.

The primary problem I now see is if the CL is asking about EL blocks that are entirely unavailable (e.g. due to being part of an invalid chain of >30k depth that has been pruned). This would leave the EL unable to respond with anything other than SYNCING because it cannot find the branch that the payload belongs to.
(Note that the 30k+ depth case would require a 1/3 attacker that does something slashable and can only try to trick (likely just deadlock) syncing nodes.)

Your cases:

  • This is the SYNCING return and is expected behavior for this case, or even for much deeper cases
  • Again, in this case, the EL would return SYNCING because it does not have the info to evaluate an unknown branch here
  • Yep, looks good
  • This is likely workable in most cases. The problem case would be if the beacon chain optimistically synced 30k+ blocks that had an invalid EL and which were pruned away by the EL (this is ~3.5 days of blocks). We thus have a (very unlikely) edge case where validators tricked a CL into thinking some beacon branch was viable when it ultimately had an invalid EL. This is a failure case and might require manual intervention (which is fine), but the problem is whether the EL or CL can figure out that there was a failure. I suspect that, as currently specified, the EL would just keep returning SYNCING, which would prevent the CL from "making decisions" and serving data from endpoints as canonical (which is good), but you could end up in a deadlock of sorts whereby the EL can never resolve the branch being queried about.

iiuc, the only time the latest valid ancestor would be inaccessible is if this "bad" chain was pruned and EL could no longer find it when CL tries to insert a payload. So the questions are:

  1. In such a 30k+ deep bad chain, can the EL remember adequate info to respond with INVALID instead? Answer: no, because it's not just a node that previously had the info and pruned it; it is also new nodes that are coming online
  2. Can the CL detect this issue in a reasonable way? That is, can it -- through just seeing SYNCING responses -- decide that it might be on an unavailable/bad chain and needs to look elsewhere or throw a critical error

Or do we make the assumption that the CL will, in almost all realistic cases, not be led down a simultaneously invalid and unavailable EL chain when syncing, because this would require a 1/3 slashable actor? And if it did happen, it would only result in a small set of nodes being tricked and put into a bad sync state.
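For illustration, a hypothetical sketch of the CL-side bookkeeping implied by this discussion: blocks imported while the EL reports SYNCING stay marked optimistic and are not served as canonical, and a long-lived optimistic state can be surfaced as a warning rather than silently deadlocking. The class and the threshold are invented for the example.

```python
import time
from typing import Optional, Set

class OptimisticTracker:
    """Track blocks imported optimistically while the EL keeps answering SYNCING."""

    def __init__(self, stall_warning_seconds: float = 3.5 * 24 * 3600):
        # Roughly the ~3.5-day / 30k-block window discussed above; the exact
        # threshold is an arbitrary choice for this example.
        self.stall_warning_seconds = stall_warning_seconds
        self.optimistic_blocks: Set[bytes] = set()
        self.first_syncing_at: Optional[float] = None

    def on_syncing(self, block_hash: bytes) -> None:
        self.optimistic_blocks.add(block_hash)
        if self.first_syncing_at is None:
            self.first_syncing_at = time.monotonic()

    def on_resolved(self, block_hash: bytes) -> None:
        # Called once the EL returns VALID (or the block is invalidated).
        self.optimistic_blocks.discard(block_hash)
        if not self.optimistic_blocks:
            self.first_syncing_at = None

    def possibly_stuck(self) -> bool:
        """True if the node has been optimistic so long that the branch may be
        unavailable/invalid and manual intervention may be needed."""
        return (self.first_syncing_at is not None and
                time.monotonic() - self.first_syncing_at > self.stall_warning_seconds)
```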

@djrtwo (Contributor, Author) commented Oct 27, 2021

Okay, based on our convo today, we are going to keep the functionality as specified here -- return the latest valid hash (`latestValidHash`) in the event that the payload is INVALID.

There are still some cases where the EL might never be able to resolve SYNCING, due to many reasons -- e.g. peering issues, an unavailable invalid chain, etc. But this is no different from today.

How to surface such a sync failure is not in the scope of this PR, and generally not in the scope of the core of the engine API.

Two review suggestions on src/engine/specification.md (outdated, resolved)
4. In the case when the parent block is unknown, client software **MUST** pull the block from the network and take one of the following actions depending on the parent block's properties:
   - If the parent block is a PoW block as per the [EIP-3675](https://eips.ethereum.org/EIPS/eip-3675#specification) definition, then all missing dependencies of the payload **MUST** be pulled from the network and validated accordingly. The call **MUST** be responded to according to the validity of the payload and the chain of its ancestors.
   - If the parent block is a PoS block as per the [EIP-3675](https://eips.ethereum.org/EIPS/eip-3675#specification) definition, then the call **MAY** be responded to with `SYNCING` status and the sync process **SHOULD** be initiated accordingly.
3. Client software **MUST** return `{status: SYNCING, latestValidHash: None}` if the client software does not have the requisite data available locally to validate the payload and cannot retrieve the required data in less than `SECONDS_PER_SLOT / 30` (0.4s in the Mainnet configuration), or if the sync process is already in progress. In the event that requisite data to validate the payload is missing (e.g. the client does not have the payload identified by `parentHash`), the client software **SHOULD** initiate the sync process.
Collaborator:

What is the rationale for the timeout? Does the timer start ticking upon receiving the call, and does it include the corresponding disk/cache access that tries to pull up the parent's state? If we want to give the EL some time to go to the wire and pull e.g. a parent block, would this timeout cover the execution of the parent block or just pulling it from the wire?

@djrtwo (Contributor, Author) commented Oct 28, 2021

I think the timeout only includes non-local retrieval. If you have the info locally (in a disk or memory cache), you should not be responding with SYNCING because you are in fact synced.

The idea is that we want to give some guidance on when EL should decide, "I don't have the info locally required and don't have enough time to quickly sync to respond to the call"
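A minimal sketch of that guidance, assuming hypothetical `local_db` and `network` helpers: a local hit never produces SYNCING, and only the network round trip is bounded by the ~0.4s budget.

```python
SECONDS_PER_SLOT = 12
RETRIEVAL_BUDGET = SECONDS_PER_SLOT / 30   # 0.4s in the mainnet configuration

def resolve_parent(parent_hash, local_db, network):
    block = local_db.get(parent_hash)
    if block is not None:
        # Disk/memory cache hit: we are synced, the budget never applies.
        return block
    # Not available locally: try the wire, but only within the budget.
    block = network.fetch_block(parent_hash, timeout=RETRIEVAL_BUDGET)
    if block is None:
        # Give up for this call: initiate sync and answer SYNCING.
        network.start_sync(parent_hash)
    return block
```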

Collaborator:

So, the timer starts ticking when the EL client initiates the sync process and stops when the missing block is received and executed, i.e. the timeout applies to the sync process. I think the timeout mechanism should be specified in more detail to avoid bugs due to misreading the spec.

I am thinking about cases where this timeout is beneficial for the node. Here is what comes to mind:

  • Temporary outage of the EL software leading to the EL's head falling behind the CL's. If the diff is only one EL block, then the validator will likely not miss its attestation opportunity, as the dependency will be resolved without the EL turning into SYNCING; if it's more than a block, then it will likely turn into SYNCING

IMO, this mechanism could be even more beneficial if it targeted the logical beacon chain clock instead of just absolute time. I.e. "if the initiated sync process isn't finished by TIMESTAMP + (SECONDS_PER_SLOT / 3 - 1) then the EL must respond with SYNCING". In SECONDS_PER_SLOT / 3 - 1, the 1 is a placeholder and should be the time normally needed to execute the payload, respond back to the CL, and for the CL to update its fork choice state. The target here is to have the beacon block accepted before the attestation boundary.

Though, any of these approaches looks complicated and bug-prone. Taking into account how infrequently such cases appear, do we really want this mechanism to be implemented? And if we do, the next question is whether the spec is a good place for it, or whether it could be part of the implementation. My current impression is that this is premature optimisation; when the need for this optimisation becomes apparent, it could be addressed by implementers without requiring a change to the spec.
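For comparison, a small sketch of the clock-based alternative described above (not adopted in this PR): give up in time for the block to be processed before the attestation boundary at one third of the slot. The one-second margin is the placeholder mentioned above.

```python
SECONDS_PER_SLOT = 12
PROCESSING_MARGIN = 1   # placeholder: execute payload + respond + update fork choice

def syncing_deadline(slot_start_timestamp: float) -> float:
    """Latest wall-clock time at which the EL should stop trying to resolve
    dependencies and respond with SYNCING instead."""
    return slot_start_timestamp + SECONDS_PER_SLOT / 3 - PROCESSING_MARGIN
```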

@djrtwo (Contributor, Author):

I agree with pulling QoS out of this PR. I'll pull it out into a separate issue for discussion

@djrtwo (Contributor, Author):

removed here b6ded72

@djrtwo (Contributor, Author):

and opened relevant issue here -- #91

@lightclient (Member) commented

Apologies for the force push on your PR @djrtwo -- was just resolving the conflicts from #85.

@mkalinin (Collaborator) left a comment

LGTM!
