eth/downloader: fix skeleton cleanup #28581
Conversation
Force-pushed c1305e3 to af91057
This looks good to me, but it's really one of these PRs where the effects are non-trivial. Maybe @karalabe can review, or we'll need to have a quick chat about it.
```diff
 	// If the filled header is below and discontinuous with the linked subchain,
 	// something's corrupted internally. Report an error and refuse to do anything.
-	if number < s.progress.Subchains[0].Tail {
+	if number+1 < s.progress.Subchains[0].Tail {
```
I'm unsure that this is correct.

`number` is the last item newly filled. When I start a sync loop, let's say I have N blocks in my chain. Then I skeleton sync, and will reconstruct headers from HEAD to N+1, at which point I have my subchain linked to the local chain.

I start backfilling. If I fail to fill anything, `cleanStales` does not get called. If I successfully fill 1 block, then `filled` will be N+1 (the newly filled one block). In that case `number == filled == N+1`, which is `== s.progress.Subchains[0].Tail`. The only way to hit this error would be to backfill something we already had in our chain.

With the proposed modification however, backfilling my local head block would make the check pass, but that should not be possible, because I can only backfill stuff from my subchain, not 1 block below it.
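For concreteness, here is a tiny runnable sketch (toy numbers, not the real skeleton code) of where the old and new boundary checks agree and where they diverge:

```go
package main

import "fmt"

func main() {
	// Toy numbers for the scenario above: N = 100 local blocks, the skeleton
	// subchain links to the local chain at tail N+1, and one block is filled.
	tail := uint64(101)   // s.progress.Subchains[0].Tail
	filled := uint64(101) // number == filled == N+1

	fmt.Println("old check errors:", filled < tail)   // false: boundary passes
	fmt.Println("new check errors:", filled+1 < tail) // false: still passes

	// The two checks only diverge at filled == tail-1, i.e. "backfilling" the
	// local head itself, which the subchain should never allow.
	filled = 100
	fmt.Println("old check errors:", filled < tail)   // true
	fmt.Println("new check errors:", filled+1 < tail) // false
}
```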
> If I fail to fill anything, cleanStales does not get called

I am not sure about that. `cleanStales` will be invoked anyway, regardless of whether we successfully fill something or not. The `CurrentSnapBlock()` will be returned as the newly filled block. Thus it's theoretically possible that (1) the tail is `N+1` and (2) filled is `N`, in which case the error will be reported.
According to the code below, we will only call `cleanStales` if something was filled.
```go
if filled := s.filler.suspend(); filled != nil {
	// If something was filled, try to delete stale sync helpers. If
	// unsuccessful, warn the user, but not much else we can do (it's
	// a programming error, just let users report an issue and don't
	// choke in the meantime).
	if err := s.cleanStales(filled); err != nil {
		log.Error("Failed to clean stale beacon headers", "err", err)
	}
}
```
That said, you are also right that suspending the filler will always return the head snap block (or the genesis, I guess, in case we're brand new).
```go
defer func() {
	b.lock.Lock()
	b.filling = false
	b.filled = b.downloader.blockchain.CurrentSnapBlock()
	b.lock.Unlock()
}()
```
Interesting, I guess this was a refactor after I designed the original idea. It kind of makes the filled check in my first code segment moot.
Ok, I agree with this +1, but we need to fix some comments that mention that it cannot happen. Namely:

- https://github.com/ethereum/go-ethereum/blob/master/eth/downloader/beaconsync.go#L52

  ```go
  // suspend cancels any background downloader threads and returns the last header
  // that has been successfully backfilled (potentially in a previous run), or the genesis.
  ```

- https://github.com/ethereum/go-ethereum/blob/master/eth/downloader/skeleton.go#L164

  ```go
  // The method should return the last block header that has been successfully
  // backfilled (in the current or a previous run), falling back to the genesis.
  ```

- https://github.com/ethereum/go-ethereum/blob/master/eth/downloader/skeleton.go#L385

- We should add an error log if `filled` is nil (vs. the current code that checks for non-nil-ness as a happy path only).
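A minimal sketch of the suggested nil handling, reusing the names from the snippet quoted earlier (an illustration only, not the final code):

```go
filled := s.filler.suspend()
if filled == nil {
	// Suspending should always yield the last filled header (or the genesis);
	// a nil result is a programming error, so surface it loudly.
	log.Error("Latest filled block is not available")
	return
}
// If something was filled, try to delete stale sync helpers; on failure warn
// the user, but there is not much else we can do.
if err := s.cleanStales(filled); err != nil {
	log.Error("Failed to clean stale beacon headers", "err", err)
}
```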
```go
		batch = s.db.NewBatch()
	)
	s.progress.Subchains[0].Tail = number
```
Ah hmm, indeed this is wonky. If the tail is the first "missing" header, then it should be `filled+1`.
eth/downloader/skeleton.go (outdated diff):

```go
	// If more headers were filled than available, push the entire subchain
	// forward to keep tracking the node's block imports.
	//
	// Note that the new tail will be the filled one in this case, which is
```
This comment is not valid any more. The new tail might be either `filled` (if the HEAD itself was filled) or `filled+1` if something earlier was filled. This is problematic for L1159, which sets the Head to `number`. In that case, you could end up with Tail > Head.

L1164 also feels broken with the fixed logic, because we're writing `filled` as a skeleton header, but that's not the case any more if it was just filled. This might arguably be broken even in the current code, but it's at least uniformly broken. With the new code, you can end up with a skeleton header of `filled`, but maybe a Tail of `filled+1`, which is inconsistent.
Sorry, I don't get it. I don't think it's possible to have HEAD as `filled` but TAIL as `filled+1`.

In the case of filling everything, both HEAD and TAIL should be set to `filled` in the new logic, right?
eth/downloader/skeleton.go (outdated diff):

```go
	// chain is consumed.
	newTail := filled
	if number < s.progress.Subchains[0].Head {
		newTail = rawdb.ReadSkeletonHeader(s.db, number+1)
```
This is problematic. `newTail` can either be `filled` or `filled+1`, but the code below acts as if it were one or the other. It would be very hard to reason about.

IMO if the happy path is `filled+1`, then we should special-case `filled` and have a condition that does all the cleanups necessary for "filling the entire subchain"; then we can have the remainder of the code handle the partial-fill scenario. Handling both cases together below makes the code very hard to digest IMO.
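A rough sketch of the suggested restructuring, using the names from the diffs above (assuming `number` is the filled header's number and `Subchains[0]` is the lead subchain):

```go
if number >= s.progress.Subchains[0].Head {
	// The subchain is fully consumed: anchor both ends at the filled header
	// so the skeleton keeps tracking the node's block imports.
	s.progress.Subchains[0].Head = number
	s.progress.Subchains[0].Tail = number
} else {
	// The subchain is partially consumed: the new tail is the first header
	// still missing, which must be a child of the filled one.
	tail := rawdb.ReadSkeletonHeader(s.db, number+1)
	if tail.ParentHash != filled.Hash() {
		return fmt.Errorf("filled header is discontinuous with subchain: %d %s", number, filled.Hash())
	}
	s.progress.Subchains[0].Tail = number + 1
}
```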
I am wondering: if we fill the whole chain, should we delete the skeleton chain entirely? Theoretically it's the best way to handle the case; we just need to confirm that a non-existent skeleton won't break the other parts.
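If we went that way, the cleanup might look roughly like this (a hypothetical sketch; it reuses `rawdb.DeleteSkeletonHeader`, which the existing cleanup loop already relies on):

```go
// Hypothetical full-consumption cleanup: drop every remaining skeleton header
// once the subchain has been entirely merged into the local chain.
batch := s.db.NewBatch()
for n := s.progress.Subchains[0].Tail; n <= s.progress.Subchains[0].Head; n++ {
	rawdb.DeleteSkeletonHeader(batch, n)
}
if err := batch.Write(); err != nil {
	log.Crit("Failed to delete consumed skeleton headers", "err", err)
}
```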
Force-pushed af91057 to 4d3662c
@karalabe I updated the cleanup logic, not sure if this version is easier to understand. Please take another look.
Force-pushed 4d3662c to 2388edd
Force-pushed 2388edd to 7d7979d
Wait, we still haven't merged this!?
@karalabe PTAL
eth/downloader/skeleton.go (outdated diff):

```go
	// The skeleton chain is partially consumed, set the new tail as filled+1.
	tail := rawdb.ReadSkeletonHeader(s.db, number+1)
	if tail.ParentHash != filled.Hash() {
		return fmt.Errorf("filled header is discontinuous with subchain: %d %s", number, filled.Hash())
```
Whilst I agree that this clause is an error, as it should never happen, I'm wondering if returning an error could end up getting us stuck in some weird way if a bug is hit, vs. being able to self-recover if we blindly delete the "overlapping" stuff that was side-filled?
I guess the worst case is that we keep all the skeleton headers in the database, but the normal sync procedure won't be aborted/stuck? I would prefer to print out the error for now and see if anyone really hits this situation.
Force-pushed 7d7979d to eead5a2
I have addressed the comments, @karalabe, please take another look!
SGTM
This pull request fixes a corner case in skeleton header deletion.

The background: Geth uses a backward-sync mechanism to fetch and assemble the skeleton header chain locally (guided by the consensus client), and then extends the block chain by consuming the skeleton headers. The consumed skeleton headers are deleted from the database to avoid accumulating junk.

The assumption always held is that a complete skeleton header chain is linked with the local blockchain, specifically `skeleton.tail.parent == blockchain.head`. However, the original deletion logic always set `blockchain.head` as the new tail and deleted the headers before it. In this fix, the next header after the filled block is regarded as the new tail, keeping the guarantee that the skeleton chain is linked with the blockchain instead of overlapping with it.

It's not a really critical issue, because we just leave one more skeleton header in the database, but it results in some weird behaviors which we really want to avoid.
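Here is a toy model of the tail update described above (the real logic lives in `eth/downloader/skeleton.go`; the types and numbers are simplified for illustration):

```go
package main

import "fmt"

// subchain is a stripped-down stand-in for the skeleton's progress tracking.
type subchain struct{ Head, Tail uint64 }

// cleanStales mimics the fixed tail update after headers up to `filled` have
// been consumed into the local chain: the subchain should link to the local
// head (tail.parent == chain.head), not overlap with it.
func cleanStales(sc *subchain, filled uint64) {
	if filled >= sc.Head {
		// The whole subchain was consumed; keep tracking the chain head.
		sc.Head, sc.Tail = filled, filled
		return
	}
	// Partially consumed: the new tail is the first missing header, one above
	// the filled block (the old logic set it to the filled block itself).
	sc.Tail = filled + 1
}

func main() {
	sc := &subchain{Head: 100, Tail: 90}
	cleanStales(sc, 95)
	fmt.Println(sc.Tail) // 96: the subchain links to, not overlaps, block 95
}
```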
A: Avoid unexpected chain truncation

Before spinning up a new sync cycle, the downloader needs to figure out the common ancestor of the local chain and the skeleton header chain via the `findBeaconAncestor` function. Due to the fact that the original deletion logic set the tail to the local chain head, the common ancestor would be `chain.head-1`. Therefore, the excess block, along with its other data (e.g. receipts), would be truncated from the ancient store by this logic.

This behavior is really unexpected in normal cases and should only occur if the network reorg is deeper than our local chain. Unfortunately, it occurs a lot in practice. I think we can avoid it by applying this fix.
B: Annoying error log for deleting skeleton headers

```
Failed to clean stale beacon headers err="filled header below beacon header tail: 16554016 < 16554017"
```

This kind of log is pretty common, and I realized that the filled header is just one block before the tail. My hunch is that we spin up the backfiller but terminate it without extending any blocks into the local chain. It can be avoided with this fix.
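Plugging the numbers from the log line above into the old and new boundary checks shows why the warning disappears (a small illustrative snippet, not the real code path):

```go
package main

import "fmt"

func main() {
	// Numbers from the log above: nothing was extended, so the "filled"
	// block reported by the backfiller is one below the skeleton tail.
	tail, filled := uint64(16554017), uint64(16554016)

	fmt.Println("old check errors:", filled < tail)   // true: the noisy log
	fmt.Println("new check errors:", filled+1 < tail) // false: fixed
}
```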