Snapshot restoration overhaul #11219

Merged: dvdplm merged 29 commits into master from dp/chore/ensure-we-ignore-old-snapshots on Oct 31, 2019

Conversation

@dvdplm (Collaborator) commented Oct 29, 2019

This PR does several things and is quite noisy.
The general thrust is to make snapshot syncing easier to understand and debug, and along the way it makes a few tweaks to better adapt to the current harsh snapshotting environment.

  • Separate snapshot sync logging from general sync (`-l snapshot_sync`) and improve logging messages in general.

  • Add documentation and comments where missing/lacking.

  • Ensure we pick the snapshot with the highest block number whenever there's a tie.

    Selecting a peer to sync a snapshot from gives preference to snapshots that are seeded by 3 or more peers. On today's Ethereum it is rare for nodes to have snapshots from the same point in the chain, so we always end up with a single peer to download from. Before this PR we did not sort the peers by how recent their snapshot is when there's a tie (i.e. when each snapshot is seeded by a single peer). Fixed here.

  • Do not skip snapshots further away than 30k blocks from the highest block seen.
    Currently we look for peers that seed snapshots close to the highest block seen on the network (where "close" means within 30k blocks of the tip). When a node starts up we wait for some time (5 sec, increased here to 10 sec) to let peers connect. If we find a suitable peer to sync a snapshot from at the end of that delay, we start the download; if none is found we start slow-syncing (or, if --warp-barrier is used, we stall).
    Before this PR, when looking for a suitable snapshot we used the highest block seen on the network to check whether a peer had a snapshot within 30k blocks of that block number. This means that in a situation where all available snapshots are old, we would often fail to start a snapshot sync at all. What's worse, the longer we delayed starting a snapshot sync (to let more peers connect, in the hope of finding a good snapshot), the more likely we were to have seen a high block and thus the less likely we became to accept any snapshot. This behaviour is quite surprising and tricky to reason about.
    Here that highest-seen block number comparison is removed and we switch to something simpler: pick the best suitable snapshot we can find within 10 seconds.

  • A small but possibly controversial change: before this PR, when a remote peer sent us a snapshot chunk we had already seen, we disconnected the peer and stopped the download. After this PR we ignore the chunk and keep going instead. I do not think this is problematic, but I have never seen a dupe happen so I can't really be sure.

  • Don't warp-sync twice

    With the highest-seen block number criterion gone, a node can end up using a snapshot that is a fair bit away from the tip, which means that after completion it needs to slow-sync for quite a while to catch up. Before this change a freshly restored node could find a new snapshot and start warp syncing all over again; this PR adds a check: if our own best block is already higher than the --warp-barrier, we do not start another warp sync.

  • Avoid excessive looping over Vecs.

    Use a HashSet instead of a Vec to store the pending chunks and remove items from the set as chunks are processed. Calculate and store the total number of chunks in the `Snapshot` struct instead of counting the pending chunks each time. A rough sketch of the new bookkeeping follows below.
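
    A minimal sketch of that bookkeeping (not the code from this PR: `H256` is replaced by a plain byte array, and `total_chunks`/`done_chunks` are illustrative names):

```rust
use std::collections::HashSet;

type H256 = [u8; 32]; // stand-in for ethereum_types::H256 so the sketch is self-contained

/// Trimmed-down illustration of the chunk bookkeeping described above.
struct Snapshot {
    pending_state_chunks: HashSet<H256>,
    pending_block_chunks: HashSet<H256>,
    /// Total number of chunks in the manifest, computed once instead of
    /// re-counting the pending collections on every progress report.
    total_chunks: usize,
}

impl Snapshot {
    fn new(state_hashes: Vec<H256>, block_hashes: Vec<H256>) -> Self {
        let total_chunks = state_hashes.len() + block_hashes.len();
        Snapshot {
            pending_state_chunks: state_hashes.into_iter().collect(),
            pending_block_chunks: block_hashes.into_iter().collect(),
            total_chunks,
        }
    }

    /// Chunks processed so far: the stored total minus whatever is still pending.
    fn done_chunks(&self) -> usize {
        self.total_chunks - self.pending_state_chunks.len() - self.pending_block_chunks.len()
    }
}
```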

Use `snapshot_sync` as logging target
* master:
  Type annotation for next_key() matching of json filter options (#11192)
  Crypto primitives removed from ethkey (#11174)
Adjust WAIT_PEERS_TIMEOUT to be a multiple of MAINTAIN_SYNC_TIMER to try to fix snapshot startup problems

Docs, todos, comments
* master:
  [informant]: `MillisecondDuration` -> `as_millis()` (#11211)
  Step duration map configuration parameter ported from the POA Network fork (#10902)
  Upgrade jsonrpc to latest (#11206)
  [export hardcoded sync]: use debug for `H256` (#11204)
  Pause pruning while snapshotting (#11178)
Do not skip snapshots further away than 30k blocks from the highest block seen

Currently we look for peers that seed snapshots that are close to the highest block seen on the network (where "close" means within 30k blocks). When a node starts up we wait for some time (5 sec, increased here to 10 sec) to let peers connect, and if we have found a suitable peer to sync a snapshot from at the end of that delay, we start the download; if none is found and --warp-barrier is used we stall, otherwise we start a slow-sync.
When looking for a suitable snapshot, we use the highest block seen on the network to check if a peer has a snapshot that is within 30k blocks of that highest block number. This means that in a situation where all available snapshots are older than that, we will often fail to start a snapshot sync at all. What's worse, the longer we delay starting a snapshot sync (to let more peers connect, in the hope of finding a good snapshot), the more likely we are to have seen a high block and thus the less likely we become to accept a snapshot.
This commit removes the comparison with the highest block number entirely and picks the best snapshot we can find within 10 seconds.
Add a `ChunkType::Dupe` variant so that we do not disconnect a peer if they happen to send us a duplicate chunk (just ignore the chunk and keep going)

Resolve some documentation todos, add more
Check if our own block is beyond the given warp barrier (can happen after we've completed a warp sync but are not quite yet synced up to the tip) and if so, don't sync.
More docs, resolve todos.
Dial down some `sync` debug level logging to trace
Avoid iterating over all snapshot block/state hashes to find the next work item

Use a HashSet instead of a Vec and remove items from the set as chunks are processed. Calculate and store the total number of chunks in the `Snapshot` struct instead of counting pending chunks each time.
@dvdplm dvdplm self-assigned this Oct 29, 2019
@dvdplm dvdplm added the A3-inprogress ⏳ Pull request is in progress. No review needed at this stage. label Oct 29, 2019
@dvdplm dvdplm marked this pull request as ready for review October 30, 2019 10:30
@dvdplm dvdplm requested a review from ngotchac October 30, 2019 10:30
@dvdplm dvdplm added A0-pleasereview 🤓 Pull request needs code review. and removed A3-inprogress ⏳ Pull request is in progress. No review needed at this stage. labels Oct 30, 2019
@dvdplm dvdplm requested a review from ordian October 30, 2019 10:30
@dvdplm dvdplm added B0-patch-stable 🕷 Pull request should also be back-ported to the stable branch. B1-patch-beta 🕷🕷 M4-core ⛓ Core client code / Rust. labels Oct 30, 2019
Cargo.toml (review thread: outdated, resolved)
ethcore/sync/src/chain/handler.rs (review thread: resolved)
ethcore/snapshot/src/service.rs (review thread: outdated, resolved)
ethcore/sync/src/chain/handler.rs (review thread: resolved)
ethcore/sync/src/snapshot_sync.rs (review thread: outdated, resolved)
ethcore/sync/src/sync_io.rs (review thread: outdated, resolved)
ethcore/sync/src/sync_io.rs (review thread: outdated, resolved)
@ngotchac (Contributor) commented:
I left a couple of comments above. One thing I didn't mention is the change to accept far-away snapshots. I'm not quite convinced this is a good idea: as a user, if I warp-sync my node I would expect to be able to use it a few minutes later (after syncing to the top of the chain). Without any limit, I could actually spend quite some time syncing a snapshot and then a couple of days syncing to the top of the chain. And if there's a new and better snapshot while I'm syncing the old one, I'm basically screwed.

@dvdplm (Collaborator, Author) commented Oct 30, 2019

the change to accept far-away snapshots.

Yeah, this is the most important change here. Hopefully they will not be "far far away". The idea is: use the best possible snapshot we can find.

as a user, if I warp-sync my node I would expect to be able to use it a few minutes after

That would be nice, but that is simply not the reality today. Obviously we can hope that after the Pause pruning while snapshotting PR things will improve, but I just don't think we should expect a return to the good old days where many nodes produce snapshots every 5k blocks. Today this is not the case.

I also think that the previous logic was a bit wonky, where the choice of peer to download from was highly dependent on seeing a high block on the chain. This led to a lot of head scratching wondering why a decent snapshot was rejected all of a sudden. I do not want to revert to that.

And if there's a new and better snapshot while I'm syncing the old one, I'm basically screwed.

Why do you think that? You will not see the new, better snapshot for sure (this was the case before this PR too), but once you've restored the one you were working on you could always restart the node with a higher --warp-barrier and do it again. Not saying that's good, obviously, but it's not terrible either. Note that users don't really have a way to know what snapshots are out there, so they don't really know what to expect (and I think we should consider adding a parity command or an oracle that gives this information).

ethcore/types/src/snapshot.rs (review thread: outdated, resolved)
ethcore/sync/src/chain/mod.rs (review thread: outdated, resolved)
@@ -676,7 +676,6 @@ impl<C> Service<C> where C: SnapshotClient + ChainInfo {
} else if manifest.state_hashes.contains(&hash) {
true
} else {
warn!(target: "snapshot", "Hash of the content of {:?} not present in the manifest block/state hashes.", path);

Collaborator:

why removed?

dvdplm (Collaborator, Author):

Because it's not an abnormal condition, it's expected when we start syncing.

dvdplm (Collaborator, Author):

To elaborate on that: when we see that there was a partial restoration going on before us, we try to re-use the chunks we already have on disk. This warning looks scary to users but it's not really something we need to worry about I think: it just means we can't use the chunk because we're now working on a different snapshot.
I can put it back as a trace! perhaps?

dvdplm (Collaborator, Author):

Is that a yes? Put the log back as a `trace!`?

Collaborator:

either way is fine :)

ethcore/snapshot/src/lib.rs (review thread: outdated, resolved)
const HEADERS_TIMEOUT: Duration = Duration::from_secs(15);
const BODIES_TIMEOUT: Duration = Duration::from_secs(20);
const RECEIPTS_TIMEOUT: Duration = Duration::from_secs(10);
const FORK_HEADER_TIMEOUT: Duration = Duration::from_secs(3);
const SNAPSHOT_MANIFEST_TIMEOUT: Duration = Duration::from_secs(5);
/// Max time to wait for a peer to send us a snapshot manifest.
const SNAPSHOT_MANIFEST_TIMEOUT: Duration = Duration::from_secs(15);

Collaborator:

What is an average manifest size? Why is the timeout tripled?

@dvdplm (Collaborator, Author), Oct 30, 2019:

The manifest files are fairly small, ~100-150 kB. It's an rlp-encoded `ManifestData`, containing the hashes of the block/state chunks you see on disk in your snapshot/ folder.
Why tripled? I simply wanted to give all possible snapshotting peers a decent chance to be considered.
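
For context, the manifest is roughly this shape (a sketch from memory, simplified; the real `ManifestData` lives under ethcore/types and uses `ethereum_types::H256` plus rlp encoding, so details may differ):

```rust
type H256 = [u8; 32]; // stand-in for ethereum_types::H256

/// Rough shape of a snapshot manifest: a small structure listing the chunk
/// hashes plus the point in the chain the snapshot was taken at.
struct ManifestData {
    version: u64,            // snapshot format version
    state_hashes: Vec<H256>, // hashes of the state chunks in the snapshot/ folder
    block_hashes: Vec<H256>, // hashes of the block chunks
    state_root: H256,        // state root the state chunks reconstruct
    block_number: u64,       // block the snapshot was taken at
    block_hash: H256,
}
```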

Contributor:

I think you're mistaken: this is a timeout for the Snapshot Manifest packet to arrive from a peer after it has been asked.

dvdplm (Collaborator, Author):

It kind of boils down to the same thing though right?

  • a manifest request is sent to a peer and the asking is set to SnapshotManifest
  • every now and then we check if they are still in PeerAsking::SnapshotManifest
  • when the peer responds with the manifest rlp we reset the asking
  • …unless of course the timeout is triggered before they respond, in which case they are disconnected

Did I get that right?

Either way I updated the doc string, not sure why it's not showing up here.
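
A minimal sketch of that flow, assuming a hypothetical `Peer` struct with an `asking` state and an `ask_time` timestamp (`PeerAsking::SnapshotManifest` and `SNAPSHOT_MANIFEST_TIMEOUT` are the names discussed in this thread):

```rust
use std::time::{Duration, Instant};

const SNAPSHOT_MANIFEST_TIMEOUT: Duration = Duration::from_secs(5);

/// What we are currently waiting for from a peer (subset, for illustration).
#[derive(PartialEq)]
enum PeerAsking { Nothing, SnapshotManifest }

struct Peer { asking: PeerAsking, ask_time: Instant }

/// Called periodically: if we asked this peer for a manifest and it has not
/// answered within the timeout, the caller disconnects it.
fn manifest_request_timed_out(peer: &Peer) -> bool {
    peer.asking == PeerAsking::SnapshotManifest
        && peer.ask_time.elapsed() > SNAPSHOT_MANIFEST_TIMEOUT
}

/// When the manifest arrives in time, the asking state is simply reset.
fn on_manifest_received(peer: &mut Peer) {
    peer.asking = PeerAsking::Nothing;
}
```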

Contributor:

I thought you were confusing it with WAIT_PEERS_TIMEOUT, which is used to wait and see if there are more than 3 peers with the same snapshot, and to go ahead anyway after the timeout.
The SNAPSHOT_MANIFEST_TIMEOUT is just a timeout on the request, which in most cases will be fulfilled well before the timeout anyway, since the manifest is quite small. So increasing it shouldn't change the behaviour at all.
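
To spell out the distinction, a sketch of the two timeouts with the values given in this PR's description (illustrative, not the exact source):

```rust
use std::time::Duration;

// How long we wait at startup for more peers to announce snapshots before
// settling for the best candidate seen so far (3+ seeders preferred).
const WAIT_PEERS_TIMEOUT: Duration = Duration::from_secs(10);

// Per-request deadline for a peer to answer a manifest request; usually met
// comfortably, since manifests are small.
const SNAPSHOT_MANIFEST_TIMEOUT: Duration = Duration::from_secs(5);
```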

ethcore/sync/src/snapshot_sync.rs (review thread: outdated, resolved)
Cargo.toml (review thread: outdated, resolved)
@ordian ordian added this to the 2.7 milestone Oct 30, 2019
trace!(target: "sync", "Ignored unknown chunk: {:x}", hash);
Err(())

self.pending_block_chunks.take(&hash)

Collaborator:

this removes `hash` from `pending_block_chunks`, which wasn't done before

@dvdplm (Collaborator, Author), Oct 30, 2019:

Right, this is what I tried to describe under "Avoid excessive looping over Vecs". Before, the `pending_*_chunks` were `Vec`s, initialized once, never mutated during the restoration, and used to look up hashes (like here) and to calculate the total number of chunks. I refactored it to use a `HashSet` and remove chunks as we go. Here it also saves a `clone()`.
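
Roughly what that refactor buys, as a standalone sketch (hypothetical helper names; `H256` replaced by a plain byte array):

```rust
use std::collections::HashSet;

type H256 = [u8; 32]; // stand-in for ethereum_types::H256

// Old shape: scan the Vec for the hash and clone it out if present.
fn lookup_vec(pending: &[H256], hash: &H256) -> Option<H256> {
    pending.iter().find(|h| *h == hash).cloned() // O(n) scan plus a clone
}

// New shape: HashSet::take removes the entry and returns it by value,
// so the lookup is O(1) and nothing needs to be cloned.
fn lookup_set(pending: &mut HashSet<H256>, hash: &H256) -> Option<H256> {
    pending.take(hash)
}
```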

/// memory usage down.
// See https://github.com/paritytech/parity-common/issues/255
#[ignore_malloc_size_of = "no impl for IndexSet (yet)"]
pending_block_chunks: IndexSet<H256>,

Contributor:

I'm not sure it's worth the extra dependency on indexmap. The only benefit is removing the chunk from the set when it's done.

@ngotchac (Contributor) left a comment:

Looks good, I'm just still not convinced that changing the snapshot manifest timeout from 5s to 15s is necessary, though.

@dvdplm (Collaborator, Author) commented Oct 31, 2019

@ngotchac reverted to 5 sec. :)

@ngotchac ngotchac added A8-looksgood 🦄 Pull request is reviewed well. and removed A0-pleasereview 🤓 Pull request needs code review. labels Oct 31, 2019
@dvdplm dvdplm merged commit 8c2199d into master Oct 31, 2019
@dvdplm (Collaborator, Author) commented Oct 31, 2019

@s3krit this one may be tricky to backport, just a heads up.

@ordian ordian deleted the dp/chore/ensure-we-ignore-old-snapshots branch October 31, 2019 15:25
This was referenced Nov 5, 2019
dvdplm added a commit that referenced this pull request Nov 6, 2019
* master: (21 commits)
  ropsten #6631425 foundation #8798209 (#11201)
  Update list of bootnodes for xDai chain (#11236)
  ethcore/res: add mordor testnet configuration (#11200)
  [chain specs]: activate `Istanbul` on mainnet (#11228)
  [builtin]: support `multiple prices and activations` in chain spec (#11039)
  Insert explicit warning into the panic hook (#11225)
  Snapshot restoration overhaul (#11219)
  Fix docker centos build (#11226)
  retry on gitlab system failures (#11222)
  Update bootnodes. (#11203)
  Use provided usd-per-eth value if an endpoint is specified (#11209)
  Use a lock instead of atomics for snapshot Progress (#11197)
  [informant]: `MillisecondDuration` -> `as_millis()` (#11211)
  Step duration map configuration parameter ported from the POA Network fork (#10902)
  Upgrade jsonrpc to latest (#11206)
  [export hardcoded sync]: use debug for `H256` (#11204)
  Pause pruning while snapshotting (#11178)
  Type annotation for next_key() matching of json filter options (#11192)
  Crypto primitives removed from ethkey (#11174)
  Made ecrecover implementation trait public (#11188)
  ...
niklasad1 pushed a commit that referenced this pull request Nov 7, 2019
* Comments and todos
Use `snapshot_sync` as logging target

* fix compilation

* More todos, more logs

* Fix picking snapshot peer: prefer the one with the highest block number
More docs, comments, todos

* Adjust WAIT_PEERS_TIMEOUT to be a multiple of MAINTAIN_SYNC_TIMER to try to fix snapshot startup problems
Docs, todos, comments

* Tabs

* Formatting

* Don't build new rlp::EMPTY_LIST_RLP instances

* Dial down debug logging

* Don't warn about missing hashes in the manifest: it's normal
Log client version on peer connect

* Cleanup

* Do not skip snapshots further away than 30k blocks from the highest block seen

Currently we look for peers that seed snapshots that are close to the highest block seen on the network (where "close" means within 30k blocks). When a node starts up we wait for some time (5 sec, increased here to 10 sec) to let peers connect, and if we have found a suitable peer to sync a snapshot from at the end of that delay, we start the download; if none is found and --warp-barrier is used we stall, otherwise we start a slow-sync.
When looking for a suitable snapshot, we use the highest block seen on the network to check if a peer has a snapshot that is within 30k blocks of that highest block number. This means that in a situation where all available snapshots are older than that, we will often fail to start a snapshot sync at all. What's worse, the longer we delay starting a snapshot sync (to let more peers connect, in the hope of finding a good snapshot), the more likely we are to have seen a high block and thus the less likely we become to accept a snapshot.
This commit removes the comparison with the highest block number entirely and picks the best snapshot we can find within 10 seconds.

* lockfile

* Add a `ChunkType::Dupe` variant so that we do not disconnect a peer if they happen to send us a duplicate chunk (just ignore the chunk and keep going)
Resolve some documentation todos, add more

* tweak log message

* Don't warp sync twice
Check if our own block is beyond the given warp barrier (can happen after we've completed a warp sync but are not quite yet synced up to the tip) and if so, don't sync.
More docs, resolve todos.
Dial down some `sync` debug level logging to trace

* Avoid iterating over all snapshot block/state hashes to find the next work item

Use a HashSet instead of a Vec and remove items from the set as chunks are processed. Calculate and store the total number of chunks in the `Snapshot` struct instead of counting pending chunks each time.

* Address review grumbles

* Log correct number of bytes written to disk

* Revert ChunkType::Dup change

* whitespace grumble

* Cleanup debugging code

* Fix docs

* Fix import and a typo

* Fix test impl

* Use `indexmap::IndexSet` to ensure chunk hashes are accessed in order

* Revert increased SNAPSHOT_MANIFEST_TIMEOUT: 5sec should be enough
s3krit pushed a commit that referenced this pull request Nov 11, 2019
* [CI] check evmbin build (#11096)
* Correct EIP-712 encoding (#11092)
* [client]: Fix for incorrectly dropped consensus messages (#11082) (#11086)
* Update hardcoded headers (foundation, classic, kovan, xdai, ewc, ...) (#11053)
* Add cargo-remote dir to .gitignore (?)
* Update light client headers: ropsten 6631425 foundation 8798209 (#11201)
* Update list of bootnodes for xDai chain (#11236)
* ethcore/res: add mordor testnet configuration (#11200)
* [chain specs]: activate Istanbul on mainnet (#11228)
* [builtin]: support multiple prices and activations in chain spec (#11039)
* [receipt]: add sender & receiver to RichReceipts (#11179)
* [ethcore/builtin]: do not panic in blake2pricer on short input (#11180)
* Made ecrecover implementation trait public (#11188)
* Fix docker centos build (#11226)
* Update MIX bootnodes. (#11203)
* Insert explicit warning into the panic hook (#11225)
* Use provided usd-per-eth value if an endpoint is specified (#11209)
* Cleanup stratum a bit (#11161)
* Add Constantinople EIPs to the dev (instant_seal) config (#10809) (already backported)
* util Host: fix a double Read Lock bug in fn Host::session_readable() (#11175)
* ethcore client: fix a double Read Lock bug in fn Client::logs() (#11172)
* Type annotation for next_key() matching of json filter options (#11192)
* Upgrade jsonrpc to latest (#11206)
* [dependencies]: jsonrpc 14.0.1 (#11183)
* Upgrade to jsonrpc v14 (#11151)
* Switching sccache from local to Redis (#10971)
* Snapshot restoration overhaul (#11219)
* Add new line after writing block to hex file. (#10984)
* Pause pruning while snapshotting (#11178)
* Change how RPCs eth_call and eth_estimateGas handle "Pending" (#11127)
* Fix block detail updating (#11015)
* Make InstantSeal Instant again #11186
* Filter out some bad ropsten warp snapshots (#11247)
Labels
A8-looksgood 🦄 Pull request is reviewed well. B0-patch-stable 🕷 Pull request should also be back-ported to the stable branch. M4-core ⛓ Core client code / Rust.

3 participants