core/state/snapshot: faster snapshot generation #22504
Conversation
I agree that it doesn't make a whole lot of difference to use stackTrie for small tries (although, there's still some difference), but the even larger upside is that it saves us from even resolving the trie in the first place.
Generation completed on
This one now needs a rebase
Rebased!
	log.Crit("Invalid account encountered during snapshot creation", "err", err)
}
data := SlimAccountRLP(acc.Nonce, acc.Balance, acc.Root, acc.CodeHash)

// If the account is not yet in-progress, write it out
if accMarker == nil || !bytes.Equal(accountHash[:], accMarker) {
I need to think about this a bit wrt continuations if the account changes while the generator is suspended.
accTrie, err := trie.NewSecure(dl.root, dl.triedb)
if err != nil {
	// The account trie is missing (GC), surf the chain until one becomes available
	stats.Log("Trie missing, state snapshotting paused", dl.root, dl.genMarker)
We need to ensure this scenario is handled correctly (Geth is restarted and the pruned tries are missing). Your code might still handle it correctly, but I think it's useful to have a more user friendly error message for it (also separates the expected missing root error from unexpected missing trie node errors).
Yes, this scenario is still handled. In the new generator, the error returned by generateRange will abort the generation and wait for the external signal to resume it. There are two generateRange calls, one for the accounts and another for the storages. The error returned by the storage generateRange is just propagated to the outer call stack, so eventually all the internal errors are handled by the account generateRange.
// The procedure is aborted, either by external signal or internal error
if err != nil {
	if abort == nil { // aborted by internal error, wait for the signal
		abort = <-dl.genAbort
	}
	abort <- stats
	return
}
//
// Here append the original value to ensure that the number of keys and
// values is the same.
vals = append(vals, common.CopyBytes(iter.Value()))
not sure - can .Value be called twice?
Yes.
core/state/snapshot/generate.go
	stackTr.TryUpdate(key, common.CopyBytes(vals[i]))
}
if gotRoot := stackTr.Hash(); gotRoot != root {
	return &proofResult{keys: keys, vals: vals, diskMore: false, trieMore: false, proofErr: errors.New("wrong root")}, nil
Maybe expand "wrong root" with "wrong root: have %x, want %x"?
Also maybe make this one liner into many lines (same for the proofResult below). Since we init a ton of fields, it would look a lot cleaner.
Also unsure if we want to explicitly set the "more" fields to false? That would be the default either way. Did you want to make it explicit?
Yeah IIRC it was to show that it was intentional
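As a sketch, the suggested reformat plus the expanded error could look roughly like this (the field names are the ones from the diff above; the fmt.Errorf call and exact format verbs are assumptions, not the merged code):

if gotRoot := stackTr.Hash(); gotRoot != root {
	return &proofResult{
		keys: keys,
		vals: vals,
		// diskMore/trieMore omitted here: they default to false anyway
		proofErr: fmt.Errorf("wrong root: have %#x want %#x", gotRoot, root),
	}, nil
}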
	last = keys[len(keys)-1]
}
// Generate the Merkle proofs for the first and last element
if origin == nil {
This is more of a question. Checking a slice for nil-ness in general is a bit dangerous, because []byte(nil) != []byte{}. I.e. the empty slice and the nil slice behave identically for all intents and purposes, but the equality check against nil succeeds only for one of them. In most of our code, we do len(slice) == 0 as that works for both cases.
IMHO if there's no very very good reason to handle the two cases separately, checking the length is preferable as it won't run into weird Heisenbugs (e.g. pass in an origin deserialized from some input, ending up "empty" instead of "nil").
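A minimal, self-contained illustration of the nil-vs-empty pitfall (toy code, not from the PR):

package main

import "fmt"

func main() {
	var nilSlice []byte    // nil slice
	emptySlice := []byte{} // non-nil, zero-length slice

	fmt.Println(nilSlice == nil, emptySlice == nil)       // true false
	fmt.Println(len(nilSlice) == 0, len(emptySlice) == 0) // true true
}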
//
// The proof result will be returned if the range proving is finished, otherwise
// the error will be returned to abort the entire procedure.
func (dl *diskLayer) proveRange(stats *generatorStats, root common.Hash, prefix []byte, kind string, origin []byte, max int, valueConvertFn func([]byte) ([]byte, error)) (*proofResult, error) {
Wouldn't origin be better as common.Hash? Unless there's a reason to handle nil and 0x00..0 separately, I think it would make things less brittle not to have a pointer.
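A sketch of what the suggested signature would mean inside proveRange (an assumed variant, not the merged code):

// With origin as a common.Hash, the "no origin" case is just the zero
// value, so there is no nil-vs-empty ambiguity left to worry about.
if origin == (common.Hash{}) {
	// first range: start proving from the beginning of the key space
}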
eth/protocols/snap/sync.go
@@ -2634,5 +2697,5 @@ func (s *Syncer) reportHealProgress(force bool) {
		bytecode = fmt.Sprintf("%d@%v", s.bytecodeHealSynced, s.bytecodeHealBytes.TerminalString())
	)
	log.Info("State heal in progress", "nodes", trienode, "codes", bytecode,
-		"pending", s.healer.scheduler.Pending())
+		"accounts", s.accountHealed, "bytes", s.accountHealedBytes, "storages", s.storageHealed, "bytes", s.storageHealedBytes, "pending", s.healer.scheduler.Pending())
You have two contextual fields named bytes, one for accounts, one for storage.
WCGW
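One straightforward fix is to give the two byte counters distinct keys; the key names below are illustrative, not necessarily what the PR ended up using:

log.Info("State heal in progress",
	"accounts", s.accountHealed, "abytes", s.accountHealedBytes,
	"storages", s.storageHealed, "sbytes", s.storageHealedBytes,
	"nodes", trienode, "codes", bytecode, "pending", s.healer.scheduler.Pending())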
core/state/snapshot/snapshot.go
// Iterate over and mark all layers stale
for _, layer := range t.layers {
	switch layer := layer.(type) {
	case *diskLayer:
		// If the base layer is generating, abort it and save
		if layer.genAbort != nil {
-			abort := make(chan *generatorStats)
+			abort := make(chan *generatorStats, 1) // Discard the stats
Is this a good idea? Previously we waited until the running generator actually stopped. Here we only wait until it gets its signal and immediately go ahead executing before the generator had a chance to stop. Why was the previous one bad?
I.e. "Discard the stats" is fine, but "not even wait for the stats" is a different thing.
Namely this path https://github.com/ethereum/go-ethereum/pull/22504/files#diff-83ae0dd12996e07452e863cf8b072366443d6c6fbb0fff341399c9c8ce5fe050R551 will do a whole lot of shuffling and locking before it returns.
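A sketch of the behavioural difference being discussed, simplified from the snippets above (variable names follow the diff; this is not the final code):

// Unbuffered channel: the caller blocks on the receive until the
// generator has sent back its stats, i.e. has actually wound down.
abort := make(chan *generatorStats)
layer.genAbort <- abort
<-abort // generator is done by the time we get past this line

// Buffered channel with the stats discarded: the caller moves on as soon
// as the generator has picked up the signal, possibly before it has
// finished shutting down.
abortBuffered := make(chan *generatorStats, 1)
layer.genAbort <- abortBuffered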
This version of geth has frozen for me. I only have basic logs, see the
* eth/protocols: persist received state segments
* core: initial implementation
* core/state/snapshot: add tests
* core, eth: updates
* eth/protocols/snapshot: count flat state size
* core/state: add metrics
* core/state/snapshot: skip unnecessary deletion
* core/state/snapshot: rename
* core/state/snapshot: use the global batch
* core/state/snapshot: add logs and fix wiping
* core/state/snapshot: fix
* core/state/snapshot: save generation progress even if the batch is empty
* core/state/snapshot: fixes
* core/state/snapshot: fix initial account range length
* core/state/snapshot: fix initial account range
* eth/protocols/snap: store flat states during the healing
* eth/protocols/snap: print logs
* core/state/snapshot: refactor (#4)
  * core/state/snapshot: refactor
  * core/state/snapshot: tiny fix and polish
  Co-authored-by: rjl493456442 <garyrong0905@gmail.com>
* core, eth: fixes
* core, eth: fix healing writer
* core, trie, eth: fix paths
* eth/protocols/snap: fix encoding
* eth, core: add debug log
* core/state/generate: release iterator asap (#5)
  core/state/snapshot: less copy
  core/state/snapshot: revert split loop
  core/state/snapshot: handle storage becoming empty, improve test robustness
  core/state: test modified codehash
  core/state/snapshot: polish
* core/state/snapshot: optimize stats counter
* core, eth: add metric
* core/state/snapshot: update comments
* core/state/snapshot: improve tests
* core/state/snapshot: replace secure trie with standard trie
* core/state/snapshot: wrap return as the struct
* core/state/snapshot: skip wiping correct states
* core/state/snapshot: updates
* core/state/snapshot: fixes
* core/state/snapshot: fix panic due to reference flaw in closure
* core/state/snapshot: fix errors in state generation logic + fix log output
* core/state/snapshot: remove an error case
* core/state/snapshot: fix condition-check for exhausted snap state
* core/state/snapshot: use stackTrie for small tries
* core/state/snapshot: don't resolve small storage tries in vain
* core/state/snapshot: properly clean up storage of deleted accounts
* core/state/snapshot: avoid RLP-encoding in some cases + minor nitpicks
* core/state/snapshot: fix error (+testcase)
* core/state/snapshot: clean up tests a bit
* core/state/snapshot: work in progress on better tests
* core/state/snapshot: polish code
* core/state/snapshot: fix trie iteration abortion trigger
* core/state/snapshot: fixes flaws
* core/state/snapshot: remove panic
* core/state/snapshot: fix abort
* core/state/snapshot: more tests (plus failing testcase)
* core/state/snapshot: more testcases + fix for failing test
* core/state/snapshot: testcase for malformed data
* core/state/snapshot: some test nitpicks
* core/state/snapshot: improvements to logging
* core/state/snapshot: testcase to demo error in abortion
* core/state/snapshot: fix abortion
* cmd/geth: make verify-state report the root
* trie: fix failing test
* core/state/snapshot: add timer metrics
* core/state/snapshot: fix metrics
* core/state/snapshot: udpate tests
* eth/protocols/snap: write snapshot account even if code or state is needed
* core/state/snapshot: fix diskmore check
* core/state/snapshot: review fixes
* core/state/snapshot: improve error message
* cmd/geth: rename 'error' to 'err' in logs
* core/state/snapshot: fix some review concerns
* core/state/snapshot, eth/protocols/snap: clear snapshot marker when starting/resuming snap sync
* core: add error log
* core/state/snapshot: use proper timers for metrics collection
* core/state/snapshot: address some review concerns
* eth/protocols/snap: improved log message
* eth/protocols/snap: fix heal logs to condense infos
* core/state/snapshot: wait for generator termination before restarting
* core/state/snapshot: revert timers to counters to track total time

Co-authored-by: Martin Holst Swende <martin@swende.se>
Co-authored-by: Péter Szilágyi <peterke@gmail.com>
This PR improves snapshot generation (post-snap-sync) by orders of magnitude.
How snapshot generation works now
When doing a snap sync, we download 'slices' of the trie, individually verified, and use these slices to fill our trie.
The slices of accounts/storage themselves are forgotten.
After this is done, we iterate the entire account trie, and every storage trie, to reconstruct the snapshot database.
This process takes hundreds of hours, and the disk IO is extreme during this process.
With this PR
This PR implements the basic idea described here: https://gist.github.com/rjl493456442/85832a0c760f2bafe2a69e33efe68c60 .
For starters, it stores the snapshots into the database. The account (and storage) data do not actually match up to whatever
root we'll eventually wind up on, but that's for later.
After the snap sync is done, we start generating the snapshot, as before. But this time, we have some (potentially stale) data already laying there.
We proceed in batches (ranges). In this PR, the range-size for accounts is 200, and the range-size for storage is 1024. For any given range (the first one starting at 0x00..), we check the existing flat data against the trie, and each element is either
* already correct and left alone (untouched) (but also check the storage),
* present but stale, so rewritten (updated) (and check the storage), or
* missing, so written out (created) (and check the storage).

The same algo is used for storage tries. The idea being that most (99%) of all ranges are fine, and that out of the ranges which are not fine, most individual elements are fine.
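The shape of that per-range loop, shown as an abstract toy where a plain hash stands in for the Merkle range proof (none of this is geth code; it only illustrates "verify the whole range cheaply, touch individual elements only when that fails"):

package main

import (
	"crypto/sha256"
	"fmt"
)

// rangeHash is a stand-in for a range proof: a digest over the ordered
// key/value pairs of one range.
func rangeHash(keys, vals []string) [32]byte {
	h := sha256.New()
	for i := range keys {
		h.Write([]byte(keys[i]))
		h.Write([]byte(vals[i]))
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	// Authoritative ("trie") view and possibly stale ("snapshot") view.
	keys := []string{"a", "b", "c"}
	trieVals := []string{"1", "2", "3"}
	snapVals := []string{"1", "9", "3"} // element "b" is stale

	if rangeHash(keys, snapVals) == rangeHash(keys, trieVals) {
		fmt.Println("range ok: leave every element untouched")
		return
	}
	// Cheap whole-range check failed: reconcile element by element.
	for i := range keys {
		if snapVals[i] != trieVals[i] {
			fmt.Printf("update %s: %s -> %s\n", keys[i], snapVals[i], trieVals[i])
		}
	}
}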
This improves the IO performance of generation by orders of magnitude.
Some other optimizations:
For small storage tries (at most the range size of 1024 elements), where we can load the entire range into memory, we can do the verification against the expected root using the stackTrie. Just feed everything in there, and have it spit out the root, which we can then compare. This means that we don't have to resolve the trie for that storage root at all.
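Roughly how that small-trie shortcut looks, reusing the stackTr calls visible in the diff earlier in this thread (the trie.NewStackTrie(nil) construction is an assumption about the API of that era, not a quote from the PR):

// Feed the whole (small) range into a stack trie and compare the root it
// produces with the expected one; no stored trie nodes get resolved at all.
stackTr := trie.NewStackTrie(nil)
for i, key := range keys {
	stackTr.TryUpdate(key, common.CopyBytes(vals[i]))
}
if gotRoot := stackTr.Hash(); gotRoot != root {
	// The flat data does not match the expected storage root; fall back to
	// regenerating this range from the trie.
}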
Experimental results (mainnet)
During the process, it read 1.5TB from disk/leveldb (same same). It wrote 22GB in the first 15 minutes (compaction?), and ended up totalling 24GB leveldb writes, 35GB disk writes.