
core/state/snapshot: faster snapshot generation #22504

Merged · 75 commits · Apr 14, 2021

Conversation

rjl493456442 (Member) commented Mar 16, 2021

This PR improves snapshot generation (post-snap-sync) by orders of magnitude.

How snapshot generation works now

When doing a snap sync, we download 'slices' of the trie, individually verified, and use these slices to fill our trie.
The slices of accounts/storage themselves are forgotten.

After this is done, we iterate the entire account trie, and every storage trie, to reconstruct the snapshot database.
This process takes hundreds of hours, and the disk IO during it is extreme.

With this PR

This PR implements the basic idea described here: https://gist.github.com/rjl493456442/85832a0c760f2bafe2a69e33efe68c60 .

For starters, it stores the snapshot data into the database during the sync. The account (and storage) data do not actually match up with whatever
root we eventually wind up on, but that is dealt with later.

After the snap sync is done, we start generating the snapshot, as before. But this time, we already have some (potentially stale) data lying there.

We proceed in batches (ranges). In this PR, the range-size for accounts is 200, and the range-size for storage is 1024.

For any given range, we do the following (a sketch of this loop follows below):

  • Ask the snapshot db for that range (e.g. "give me 200 accounts starting from 0x00...").
  • Check whether that range can be proven against the main account trie.
    • If the proof is OK, we know the entire account-set in that range is correct. We still have to check the storage of each of those accounts.
    • If the proof is not OK, we have some iteration to do. We iterate the trie over that range, and for each leaf:
      • If the leaf key/value is identical to the snap-data, leave it (untouched) (but still check the storage).
      • If the leaf key is identical but the value differs, update the data (updated) (and check the storage).
      • If the key did not exist in the snapshot, write it out (created) (and check the storage).
    • Once the iteration is done, remove any straggling elements that were left in the snapshot but no longer exist in the trie.

The same algorithm is used for storage tries. The idea is that roughly 99% of all ranges are fine, and that within the ranges which are not, most individual elements are fine.
This improves the IO performance of generation by orders of magnitude.
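
A minimal Go sketch of this per-range reconciliation, assuming both the snapshot entries and the trie leaves for the range are already loaded and sorted by key; the names (kv, reconcileRange, checkStorage) are illustrative, not the PR's actual code:

package snapgen

import "bytes"

// kv is a key/value pair, either from the snapshot database or from the trie.
type kv struct {
	key, val []byte
}

// reconcileRange reconciles one range of (potentially stale) snapshot entries
// against the authoritative trie leaves for the same key range. proven reports
// whether the snapshot range was verified against the trie root via a range
// proof; checkStorage stands in for the per-account storage verification.
func reconcileRange(snapKVs, trieKVs []kv, proven bool, checkStorage func(key []byte)) (toWrite, toDelete []kv) {
	if proven {
		// The whole range matches the trie: nothing to rewrite, but the
		// storage of every account in the range still has to be checked.
		for _, it := range snapKVs {
			checkStorage(it.key)
		}
		return nil, nil
	}
	// The range proof failed: walk both sorted sets side by side.
	i, j := 0, 0
	for i < len(trieKVs) {
		switch {
		case j >= len(snapKVs) || bytes.Compare(trieKVs[i].key, snapKVs[j].key) < 0:
			// Key exists in the trie but not in the snapshot: created.
			toWrite = append(toWrite, trieKVs[i])
			checkStorage(trieKVs[i].key)
			i++
		case bytes.Equal(trieKVs[i].key, snapKVs[j].key):
			if !bytes.Equal(trieKVs[i].val, snapKVs[j].val) {
				// Same key, different value: updated.
				toWrite = append(toWrite, trieKVs[i])
			} // else: untouched
			checkStorage(trieKVs[i].key)
			i++
			j++
		default:
			// Key exists only in the snapshot: stale, delete it.
			toDelete = append(toDelete, snapKVs[j])
			j++
		}
	}
	// Any straggling snapshot entries beyond the last trie leaf are stale too.
	toDelete = append(toDelete, snapKVs[j:]...)
	return toWrite, toDelete
}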

Some other optimizations:

  • For small storage tries (smaller than 1024 slots), where we can load the entire range into memory, we can verify it against the expected root
    using the stackTrie: just feed everything in, have it spit out the root, and compare. This means we don't have to resolve the trie for that
    storage root at all. A sketch of this check follows below.
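
A rough sketch of that small-trie check, assuming the range's keys and values are already in memory; the helper is hypothetical, but the stack-trie calls mirror the ones quoted later in this review:

package snapgen

import (
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/trie"
)

// verifyWithStackTrie hashes a fully-loaded storage range with a stack trie
// and compares the result against the expected storage root, avoiding any
// resolution of the stored trie nodes.
func verifyWithStackTrie(root common.Hash, keys, vals [][]byte) (bool, error) {
	stackTr := trie.NewStackTrie(nil)
	for i, key := range keys {
		if err := stackTr.TryUpdate(key, vals[i]); err != nil {
			return false, err
		}
	}
	return stackTr.Hash() == root, nil
}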

Experimental results (mainnet)

INFO [03-17|20:43:10.408] Generated state snapshot                 accounts=122184516 slots=422055852 storage=42.15GiB   elapsed=6h48m25.947s

During the process, it read 1.5TB from disk/leveldb (the two figures were the same). It wrote 22GB in the first 15 minutes (compaction?), and ended up totalling 24GB of leveldb writes and 35GB of disk writes.

holiman (Contributor) commented Mar 16, 2021

I agree that it doesn't make a whole lot of difference to use the stackTrie for small tries (although there's still some difference), but the even larger upside is that it saves us from having to resolve the trie in the first place.

holiman (Contributor) commented Mar 19, 2021

Generation completed on bench04 in under nine hours, while also importing blocks:

Generated state snapshot accounts=122401787 slots=422712064 storage=42.21GiB elapsed=8h31m47.746s

holiman (Contributor) commented Mar 24, 2021

This one now needs a rebase

holiman (Contributor) commented Mar 24, 2021

Rebased!

	log.Crit("Invalid account encountered during snapshot creation", "err", err)
}
data := SlimAccountRLP(acc.Nonce, acc.Balance, acc.Root, acc.CodeHash)

// If the account is not yet in-progress, write it out
if accMarker == nil || !bytes.Equal(accountHash[:], accMarker) {
Member:

I need to think about this a bit wrt continuations if the account changes while the generator is suspended.

accTrie, err := trie.NewSecure(dl.root, dl.triedb)
if err != nil {
	// The account trie is missing (GC), surf the chain until one becomes available
	stats.Log("Trie missing, state snapshotting paused", dl.root, dl.genMarker)
Member:

We need to ensure this scenario is handled correctly (Geth is restarted and the pruned tries are missing). Your code might still handle it correctly, but I think it's useful to have a more user friendly error message for it (also separates the expected missing root error from unexpected missing trie node errors).

rjl493456442 (Member Author):

Yes, this scenario is still handled. In the new generator, the error returned by generateRange will abort the generation and wait for an external signal to resume it.

There are two generateRange calls, one for the accounts and another for the storage. The error returned by the storage generateRange is just propagated up the call stack, so eventually all internal errors are handled by the account-level generateRange.

rjl493456442 (Member Author):

// The procedure is aborted, either by an external signal or an internal error
if err != nil {
	if abort == nil { // aborted by internal error, wait for the signal
		abort = <-dl.genAbort
	}
	abort <- stats
	return
}

holiman changed the title from "[WIP] snapshot generation" to "core/state/snapshot: faster snapshot generation" on Mar 30, 2021
//
// Here append the original value to ensure that the number of keys and
// values stays the same.
vals = append(vals, common.CopyBytes(iter.Value()))
Contributor:

not sure - can .Value be called twice?

rjl493456442 (Member Author):

Yes.

	stackTr.TryUpdate(key, common.CopyBytes(vals[i]))
}
if gotRoot := stackTr.Hash(); gotRoot != root {
	return &proofResult{keys: keys, vals: vals, diskMore: false, trieMore: false, proofErr: errors.New("wrong root")}, nil
Member:

Maybe expand "wrong root" into "wrong root: have %x, want %x"?
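
For reference, a standalone sketch of what the expanded error message could look like (illustrative, not the committed wording):

package main

import "fmt"

func main() {
	var have, want = [32]byte{0x01}, [32]byte{0x02}

	// Instead of errors.New("wrong root"), include both hashes so a mismatch
	// is debuggable from the log alone.
	err := fmt.Errorf("wrong root: have %x, want %x", have, want)
	fmt.Println(err)
}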

Member:

Also, maybe break this one-liner into multiple lines (same for the proofResult below). Since we init a ton of fields, it would look a lot cleaner.

Member:

Also unsure if we want to explicitly set the more fields to false? That would be the default either way. Did you want to make it explicit?

Contributor:

Yeah IIRC it was to show that it was intentional

	last = keys[len(keys)-1]
}
// Generate the Merkle proofs for the first and last element
if origin == nil {
karalabe (Member) commented Apr 14, 2021:

This is more of a question. Checking a slice for nil-ness is in general a bit dangerous, because []byte(nil) != []byte{}. I.e. the empty slice and the nil slice behave identically for all intents and purposes, but an equality check against nil succeeds only for one of them. In most of our code, we do len(slice) == 0, as that works for both cases.

IMHO, if there's no very good reason to handle the two cases separately, checking the length is preferable, as it won't run into weird Heisenbugs (e.g. passing in an origin deserialized from some input and ending up "empty" instead of "nil").
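
A quick standalone illustration of the distinction (not PR code):

package main

import "fmt"

func main() {
	var nilSlice []byte    // nil slice
	emptySlice := []byte{} // non-nil, zero-length slice

	fmt.Println(nilSlice == nil)   // true
	fmt.Println(emptySlice == nil) // false: "empty" is not "nil"

	// len() treats both the same, which is why len(slice) == 0 is the
	// safer check when the two cases don't need to be distinguished.
	fmt.Println(len(nilSlice) == 0, len(emptySlice) == 0) // true true
}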

//
// The proof result will be returned if the range proving is finished, otherwise
// the error will be returned to abort the entire procedure.
func (dl *diskLayer) proveRange(stats *generatorStats, root common.Hash, prefix []byte, kind string, origin []byte, max int, valueConvertFn func([]byte) ([]byte, error)) (*proofResult, error) {
Member:

Wouldn't origin be better as common.Hash? Unless there's a reason to handle nil and 0x00..0 separately, I think it would make things less brittle not to have a pointer.

@@ -2634,5 +2697,5 @@ func (s *Syncer) reportHealProgress(force bool) {
 		bytecode = fmt.Sprintf("%d@%v", s.bytecodeHealSynced, s.bytecodeHealBytes.TerminalString())
 	)
 	log.Info("State heal in progress", "nodes", trienode, "codes", bytecode,
-		"pending", s.healer.scheduler.Pending())
+		"accounts", s.accountHealed, "bytes", s.accountHealedBytes, "storages", s.storageHealed, "bytes", s.storageHealedBytes, "pending", s.healer.scheduler.Pending())
Member:

You have two contextual fields named bytes, one for accounts, one for storage.

@karalabe karalabe added this to the 1.10.3 milestone Apr 14, 2021
karalabe (Member) left a comment:

WCGW

	// Iterate over and mark all layers stale
	for _, layer := range t.layers {
		switch layer := layer.(type) {
		case *diskLayer:
			// If the base layer is generating, abort it and save
			if layer.genAbort != nil {
-				abort := make(chan *generatorStats)
+				abort := make(chan *generatorStats, 1) // Discard the stats
Member:

Is this a good idea? Previously we waited until the running generator actually stopped. Here we only wait until it gets its signal and immediately go ahead executing, before the generator has had a chance to stop. Why was the previous behaviour bad?

Member:

I.e. "discard the stats" is fine, but "not even waiting for the stats" is a different thing.
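
A standalone sketch of the handshake under discussion; the type and field names are made up, but it shows why reading the reply from an unbuffered channel guarantees the generator has finished its shutdown work before the caller proceeds:

package main

import (
	"fmt"
	"time"
)

type stats struct{ accounts int }

func main() {
	genAbort := make(chan chan *stats)

	// A long-running "generator" that stops when asked and reports its stats.
	go func() {
		for {
			select {
			case abort := <-genAbort:
				time.Sleep(100 * time.Millisecond) // simulated cleanup before stopping
				fmt.Println("generator: stopped")
				abort <- &stats{accounts: 42}
				return
			default:
				time.Sleep(10 * time.Millisecond) // simulated work
			}
		}
	}()

	// Synchronous handshake: an unbuffered reply channel plus a read on it
	// blocks the caller until the generator has reached its stop path.
	abort := make(chan *stats)
	genAbort <- abort
	st := <-abort // blocks until the generator replies
	fmt.Println("caller: generator stopped with", st.accounts, "accounts")

	// With a buffered reply channel the generator's send never blocks, so the
	// caller may skip reading the stats -- but then it also continues before
	// the generator has actually wound down, which is the concern raised above.
}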


@karalabe karalabe merged commit 7088f1e into ethereum:master Apr 14, 2021
chfast (Member) commented Apr 15, 2021

This version of geth has frozen for me. I only have basic logs; see the ^C and the dates when I restarted it.

INFO [04-07|09:46:14.722] Unindexed transactions                   blocks=1        txs=11        tail=9841444  elapsed="254.679µs"
INFO [04-07|09:46:21.646] Imported new chain segment               blocks=1        txs=218       mgas=12.456  elapsed=109.213ms    mgasps=114.055 number=12191444 hash="18c101…29066c" dirty=1007.32MiB
INFO [04-07|09:46:21.648] Unindexed transactions                   blocks=1        txs=131       tail=9841445  elapsed=1.364ms
INFO [04-07|09:46:26.972] Imported new chain segment               blocks=1        txs=472       mgas=12.480  elapsed=261.418ms    mgasps=47.741  number=12191445 hash="eadbb5…fa28b8" dirty=1008.42MiB
INFO [04-07|09:46:26.973] Unindexed transactions                   blocks=1        txs=148       tail=9841446  elapsed=1.403ms
INFO [04-07|09:46:28.771] Imported new chain segment               blocks=1        txs=162       mgas=12.487  elapsed=105.096ms    mgasps=118.812 number=12191446 hash="21aa9d…510d1c" dirty=1007.27MiB
INFO [04-07|09:46:28.772] Unindexed transactions                   blocks=1        txs=66        tail=9841447  elapsed="642.044µs"
INFO [04-07|09:46:30.545] Imported new chain segment               blocks=1        txs=172       mgas=12.470  elapsed=91.096ms     mgasps=136.891 number=12191447 hash="2de0d0…c97558" dirty=1007.19MiB
INFO [04-07|09:46:30.546] Unindexed transactions                   blocks=1        txs=76        tail=9841448  elapsed="855.324µs"
INFO [04-07|09:46:30.797] Imported new chain segment               blocks=1        txs=172       mgas=12.470  elapsed=61.501ms     mgasps=202.766 number=12191447 hash="72994f…ace3c2" dirty=1007.19MiB
INFO [04-07|09:46:55.150] Imported new chain segment               blocks=1        txs=267       mgas=12.487  elapsed=133.402ms    mgasps=93.601  number=12191448 hash="63db49…63f2b3" dirty=1007.56MiB
INFO [04-07|09:46:55.152] Unindexed transactions                   blocks=1        txs=162       tail=9841449  elapsed=1.343ms
INFO [04-07|09:47:03.588] Imported new chain segment               blocks=1        txs=169       mgas=12.492  elapsed=109.627ms    mgasps=113.952 number=12191449 hash="c1ca0e…b67b04" dirty=1007.84MiB
INFO [04-07|09:47:03.589] Unindexed transactions                   blocks=1        txs=72        tail=9841450  elapsed="836.354µs"
INFO [04-07|09:47:03.876] Imported new chain segment               blocks=1        txs=258       mgas=12.489  elapsed=98.781ms     mgasps=126.431 number=12191449 hash="3b29a9…9ff15e" dirty=1008.28MiB
^CINFO [04-07|09:47:12.546] Deep froze chain segment                 blocks=7        elapsed=240.836ms    number=12101449 hash="cacbce…379ebf"
INFO [04-07|09:47:35.850] Chain reorg detected                     number=12191448 hash="63db49…63f2b3" drop=1 dropfrom="c1ca0e…b67b04" add=2 addfrom="fbe37a…d3ebe1"
INFO [04-07|09:47:42.086] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=98.74KiB  throttle=664
ERROR[04-07|10:05:04.039] Snapshot extension registration failed   peer=99619233 err="peer connected on snap without compatible eth support"
INFO [04-07|10:17:42.031] Writing clean trie cache to disk         path=/home/chfast/.ethereum/geth/triecache threads=1
INFO [04-07|10:17:42.033] Regenerated local transaction journal    transactions=0 accounts=0
ERROR[04-09|15:07:43.043] Snapshot extension registration failed   peer=3af25a30 err="peer connected on snap without compatible eth support"
INFO [04-09|15:36:51.344] New local node record                    seq=145 id=455fea992e705d0d ip=127.0.0.1   udp=30303 tcp=30303
INFO [04-15|12:07:20.168] Got interrupt, shutting down...
INFO [04-15|12:07:20.184] IPC endpoint closed                      url=/home/chfast/.ethereum/geth.ipc
INFO [04-15|12:07:20.202] Imported new chain segment               blocks=1        txs=313       mgas=12.458  elapsed=194h19m44.492s mgasps=0.000   number=12191450 hash="fbe37a…d3ebe1" age=1w1d2h    dirty=1008.32MiB
INFO [04-15|12:07:20.204] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=98.74KiB  throttle=664
INFO [04-15|12:07:20.223] Unindexed transactions                   blocks=1        txs=53        tail=9841451  elapsed=19.721ms
INFO [04-15|12:07:20.226] Regenerated local transaction journal    transactions=0 accounts=0
INFO [04-15|12:07:20.232] Ethereum protocol stopped
INFO [04-15|12:07:20.339] Transaction pool stopped
ERROR[04-15|12:07:22.167] Snapshot extension registration failed   peer=35ba3061 err="peer connected on snap without compatible eth support"
INFO [04-15|12:07:24.389] Persisted the clean trie cache           path=/home/chfast/.ethereum/geth/triecache elapsed=4.220s
INFO [04-15|12:07:24.465] Writing cached state to disk             block=12191450 hash="fbe37a…d3ebe1" root="a5e7bc…fa3b3d"
INFO [04-15|12:07:30.198] Looking for peers                        peercount=0 tried=48 static=0
INFO [04-15|12:07:40.223] Looking for peers                        peercount=0 tried=65 static=0
INFO [04-15|12:07:50.251] Looking for peers                        peercount=1 tried=59 static=0
INFO [04-15|12:07:55.579] Persisted trie from memory database      nodes=2333036           size=598.89MiB time=31.114416674s gcnodes=8371981 gcsize=3.16GiB gctime=31.961939672s livenodes=406010 livesize=156.47MiB
INFO [04-15|12:07:55.596] Writing cached state to disk             block=12191449 hash="3b29a9…9ff15e" root="07c4c4…d7575c"
INFO [04-15|12:07:55.647] Persisted trie from memory database      nodes=3204              size=1.29MiB   time=50.705232ms   gcnodes=0       gcsize=0.00B   gctime=0s            livenodes=402806 livesize=155.19MiB
INFO [04-15|12:07:55.647] Writing cached state to disk             block=12191323 hash="1f0dc1…e0c591" root="6234d2…885a34"
INFO [04-15|12:07:57.505] Persisted trie from memory database      nodes=122409            size=45.07MiB  time=1.857697633s  gcnodes=0       gcsize=0.00B   gctime=0s            livenodes=280397 livesize=110.12MiB
INFO [04-15|12:07:57.505] Writing snapshot state to disk           root="c7166f…ce5918"
INFO [04-15|12:07:57.505] Persisted trie from memory database      nodes=0                 size=0.00B     time="4.806µs"     gcnodes=0       gcsize=0.00B   gctime=0s            livenodes=280397 livesize=110.12MiB
INFO [04-15|12:07:58.704] New local node record                    seq=146 id=455fea992e705d0d ip=5.226.70.43 udp=30303 tcp=30303
INFO [04-15|12:07:58.709] Writing clean trie cache to disk         path=/home/chfast/.ethereum/geth/triecache threads=8
INFO [04-15|12:07:59.727] Persisted the clean trie cache           path=/home/chfast/.ethereum/geth/triecache elapsed=1.017s
INFO [04-15|12:07:59.727] Blockchain stopped

atif-konasl pushed a commit to frozeman/pandora-execution-engine that referenced this pull request on Oct 15, 2021; its squashed commit message lists the changes:
* eth/protocols: persist received state segments

* core: initial implementation

* core/state/snapshot: add tests

* core, eth: updates

* eth/protocols/snapshot: count flat state size

* core/state: add metrics

* core/state/snapshot: skip unnecessary deletion

* core/state/snapshot: rename

* core/state/snapshot: use the global batch

* core/state/snapshot: add logs and fix wiping

* core/state/snapshot: fix

* core/state/snapshot: save generation progress even if the batch is empty

* core/state/snapshot: fixes

* core/state/snapshot: fix initial account range length

* core/state/snapshot: fix initial account range

* eth/protocols/snap: store flat states during the healing

* eth/protocols/snap: print logs

* core/state/snapshot: refactor (#4)

* core/state/snapshot: refactor

* core/state/snapshot: tiny fix and polish

Co-authored-by: rjl493456442 <garyrong0905@gmail.com>

* core, eth: fixes

* core, eth: fix healing writer

* core, trie, eth: fix paths

* eth/protocols/snap: fix encoding

* eth, core: add debug log

* core/state/generate: release iterator asap (#5)

core/state/snapshot: less copy

core/state/snapshot: revert split loop

core/state/snapshot: handle storage becoming empty, improve test robustness

core/state: test modified codehash

core/state/snapshot: polish

* core/state/snapshot: optimize stats counter

* core, eth: add metric

* core/state/snapshot: update comments

* core/state/snapshot: improve tests

* core/state/snapshot: replace secure trie with standard trie

* core/state/snapshot: wrap return as the struct

* core/state/snapshot: skip wiping correct states

* core/state/snapshot: updates

* core/state/snapshot: fixes

* core/state/snapshot: fix panic due to reference flaw in closure

* core/state/snapshot: fix errors in state generation logic + fix log output

* core/state/snapshot: remove an error case

* core/state/snapshot: fix condition-check for exhausted snap state

* core/state/snapshot: use stackTrie for small tries

* core/state/snapshot: don't resolve small storage tries in vain

* core/state/snapshot: properly clean up storage of deleted accounts

* core/state/snapshot: avoid RLP-encoding in some cases + minor nitpicks

* core/state/snapshot: fix error (+testcase)

* core/state/snapshot: clean up tests a bit

* core/state/snapshot: work in progress on better tests

* core/state/snapshot: polish code

* core/state/snapshot: fix trie iteration abortion trigger

* core/state/snapshot: fixes flaws

* core/state/snapshot: remove panic

* core/state/snapshot: fix abort

* core/state/snapshot: more tests (plus failing testcase)

* core/state/snapshot: more testcases + fix for failing test

* core/state/snapshot: testcase for malformed data

* core/state/snapshot: some test nitpicks

* core/state/snapshot: improvements to logging

* core/state/snapshot: testcase to demo error in abortion

* core/state/snapshot: fix abortion

* cmd/geth: make verify-state report the root

* trie: fix failing test

* core/state/snapshot: add timer metrics

* core/state/snapshot: fix metrics

* core/state/snapshot: udpate tests

* eth/protocols/snap: write snapshot account even if code or state is needed

* core/state/snapshot: fix diskmore check

* core/state/snapshot: review fixes

* core/state/snapshot: improve error message

* cmd/geth: rename 'error' to 'err' in logs

* core/state/snapshot: fix some review concerns

* core/state/snapshot, eth/protocols/snap: clear snapshot marker when starting/resuming snap sync

* core: add error log

* core/state/snapshot: use proper timers for metrics collection

* core/state/snapshot: address some review concerns

* eth/protocols/snap: improved log message

* eth/protocols/snap: fix heal logs to condense infos

* core/state/snapshot: wait for generator termination before restarting

* core/state/snapshot: revert timers to counters to track total time

Co-authored-by: Martin Holst Swende <martin@swende.se>
Co-authored-by: Péter Szilágyi <peterke@gmail.com>