Better logging when backfilling ancient blocks fail #10796
Print total blocks imported, closes #10792
Pull out call to migrated_blocks() from replace_client_db()
…ong while) Call abort_restore() when restoration fails
… chain back to genesis
Looks sane, modulo TODOs and the deadlock.
ethcore/src/snapshot/service.rs
Outdated
    let find_range = || -> Option<(H256, H256)> {
        // TODO: In theory, if the current best_block is > new first_block (i.e. they overlap)
I'd keep the comment, but drop the `TODO:` prefix :)
@@ -695,8 +721,9 @@ impl Service {
    Ok(()) |
    Err(Error::Snapshot(SnapshotError::RestorationAborted)) => (),
    Err(e) => {
        // TODO: after this we're sometimes deadlocked
We should actually always deadlock, because we already have `self.restoration.lock()` acquired in line 720. parking_lot locks are not re-entrant.
Instead of calling `abort_restore()` we should rather use `*restoration = None` here, and maybe add a trace line if you care about it.
Alternatively we could have `abort_restore` take the locked `restoration`. IMHO if we do that for some methods already we should do that for all of them.
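To make the double-locking concern concrete, here is a minimal sketch. It is an assumption-laden stand-in: it uses `std::sync::Mutex` rather than `parking_lot`, the `Option<u32>` payload is a hypothetical placeholder for the real restoration state, and it probes with `try_lock()` so the example can observe the contention without actually hanging.

```rust
use std::sync::{Mutex, TryLockError};

/// Returns true if a second lock attempt on the same thread cannot succeed
/// while the first guard is still alive.
/// NOTE: std::sync::Mutex stands in for parking_lot here (an assumption made
/// so the sketch needs no external crate).
fn second_lock_would_block() -> bool {
    let restoration: Mutex<Option<u32>> = Mutex::new(Some(1));
    let _guard = restoration.lock().unwrap();
    // try_lock() never blocks, so it reports the contention instead of hanging.
    let blocked = matches!(restoration.try_lock(), Err(TryLockError::WouldBlock));
    blocked
}

/// Returns true if the mutex can be taken again once the first guard is dropped.
fn lock_succeeds_after_release() -> bool {
    let restoration: Mutex<Option<u32>> = Mutex::new(Some(1));
    {
        let mut guard = restoration.lock().unwrap();
        *guard = None; // the equivalent of `*restoration = None`
    } // guard dropped here, lock released
    let free = restoration.try_lock().is_ok();
    free
}
```

A blocking `lock()` in place of `try_lock()` in the first function would wait on a lock that the same thread is holding.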
> We should actually always deadlock

Well, for some reason we do not: the deadlock is pretty rare, I've seen it twice.

> Instead of calling abort_restore() we should rather use *restoration = None

That's what `abort_restore()` does though: here. Why is it better to do it here?
> Why is it better to do it here?

To avoid locking twice?
As I said, the second option should be `abort_restore_with_restoration(locked_res: &mut Option<Restoration>)` so we can keep the logic there, but just pass the existing lock.
So the reason I removed `*restoration = None` from here and replaced it with a call to `abort_restore()` was to collect all shutdown actions in one place. Not that `abort_restore()` is doing very much atm, but I still think it's good to have a single point of abortion.
Is locking twice really a concern though? This is an error handler and I don't think we care much about performance.

> abort_restore_with_restoration(locked_res…)

The `restoration` is a member of the struct here, so I'm not sure what the point of passing it as a param is, tbh. Now you got me confused!
Since `parking_lot::Mutex` is not re-entrant, locking it twice in the same thread can (or does) lead to a deadlock. So a call like this:
    fn some(&self) {
        let mut restoration = self.restoration.lock();
        ....
        self.abort_restore();
    }
    fn abort_restore(&self) {
        *self.restoration.lock() = None
    }
is guaranteed to deadlock. I thought always, but I suspect it might be just random.
In the same file I suppose we already had deadlock issues, so the `with_restoration` pattern emerged, where instead of locking resources locally in a particular function you get passed a (mutable) reference to the locked resource:
    let mut restoration = self.restoration.lock();
    self.feed_chunk_with_restoration(&mut restoration)
I propose to use exactly the same pattern for `abort_restore`. I'm totally fine grouping all the shutdown actions there, but to prevent potential double-locking and potential deadlocks we can pass the locked resource, so the first example becomes:
    fn some(&self) {
        let mut restoration = self.restoration.lock();
        ....
        self.abort_restore(&mut restoration);
    }
    fn abort_restore(&self, res: &mut Option<Restoration>) {
        *res = None
    }
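For reference, a compilable sketch of the pattern proposed above. It is illustrative only: `std::sync::Mutex` replaces `parking_lot`, and `Restoration`, `chunks_fed`, the `chunk_ok` flag and the `String` error type are hypothetical simplifications, not the real `Service` in `ethcore/src/snapshot/service.rs`.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the real restoration state.
struct Restoration {
    chunks_fed: usize,
}

struct Service {
    restoration: Mutex<Option<Restoration>>,
}

impl Service {
    fn new() -> Self {
        Service { restoration: Mutex::new(Some(Restoration { chunks_fed: 0 })) }
    }

    /// Takes the lock exactly once and passes the locked state to the helpers.
    fn feed_chunk(&self, chunk_ok: bool) -> Result<(), String> {
        let mut restoration = self.restoration.lock().unwrap();
        let result = Self::feed_chunk_with_restoration(&mut restoration, chunk_ok);
        if result.is_err() {
            // Reuse the guard we already hold; no second lock() call.
            Self::abort_restore_with_restoration(&mut restoration);
        }
        result
    }

    fn feed_chunk_with_restoration(
        res: &mut Option<Restoration>,
        chunk_ok: bool,
    ) -> Result<(), String> {
        if !chunk_ok {
            return Err("bad chunk".to_string());
        }
        match res {
            Some(r) => {
                r.chunks_fed += 1;
                Ok(())
            }
            None => Err("no restoration in progress".to_string()),
        }
    }

    /// Single place for shutdown actions, operating on the already-locked state.
    fn abort_restore_with_restoration(res: &mut Option<Restoration>) {
        *res = None;
    }

    fn chunks_fed(&self) -> Option<usize> {
        self.restoration.lock().unwrap().as_ref().map(|r| r.chunks_fed)
    }

    fn is_restoring(&self) -> bool {
        self.restoration.lock().unwrap().is_some()
    }
}
```

The shutdown logic still lives in one function, but only the outermost caller ever touches the mutex.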
I see, thank you for explaining.
Is this an option?
    let r = {
        let mut restoration = self.restoration.lock();
        self.feed_chunk_with_restoration(&mut restoration, hash, chunk, is_state)
    };
    match r {
        …
        Err(e) => {
            …
            self.abort_restore();
            …
        }
    }
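A self-contained sketch of this scoped-lock variant, under the same caveats as before: `std::sync::Mutex` stands in for `parking_lot`, and the `Restoration` type, the `chunk_ok` flag and the `String` error are hypothetical placeholders. The point it shows is that the guard is dropped at the end of the inner block, so `abort_restore()` can take the lock itself.

```rust
use std::sync::Mutex;

/// Hypothetical placeholder for the real restoration state.
struct Restoration;

struct Service {
    restoration: Mutex<Option<Restoration>>,
}

impl Service {
    fn new() -> Self {
        Service { restoration: Mutex::new(Some(Restoration)) }
    }

    fn feed_chunk(&self, chunk_ok: bool) -> Result<(), String> {
        let r = {
            let mut restoration = self.restoration.lock().unwrap();
            Self::feed_chunk_with_restoration(&mut restoration, chunk_ok)
        }; // guard goes out of scope here, so the mutex is free again

        if r.is_err() {
            // Safe: this thread no longer holds the lock.
            self.abort_restore();
        }
        r
    }

    fn feed_chunk_with_restoration(
        res: &mut Option<Restoration>,
        chunk_ok: bool,
    ) -> Result<(), String> {
        match (res.is_some(), chunk_ok) {
            (true, true) => Ok(()),
            (true, false) => Err("bad chunk".to_string()),
            (false, _) => Err("no restoration in progress".to_string()),
        }
    }

    /// Locks on its own, as the existing abort_restore() is described as doing.
    fn abort_restore(&self) {
        *self.restoration.lock().unwrap() = None;
    }

    fn is_restoring(&self) -> bool {
        self.restoration.lock().unwrap().is_some()
    }
}
```

One trade-off of this shape is that the lock is released between the failed feed and the abort, so another thread could in principle acquire it in that window.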
According to docs:

> Attempts to lock a mutex in the thread which already holds the lock will result in a deadlock.

So I'm not sure either why it didn't always deadlock. The proposed fix looks good.
Minor nits, otherwise LGTM.
ethcore/src/snapshot/service.rs
Outdated
    // so the useful set of blocks is defined as:
    // [0 ... min(new.first_block, best_ancient_block or best_block)]
    //
    // If, for whatever reason, the old db does not have ancient blocks (i.e.
    // `best_ancient_block` is `None`) AND a non-zero `first_block`, such that the old db looks
I think `best_ancient_block` being `None` can either mean "no ancient blocks imported" or "all ancient blocks imported (no gaps)". So the comment is misleading, although in the case when `first_block` is `Some` the former is true.
Right, I should move the parens I think.
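As a rough illustration of the range in the code comment under review, here is one literal reading of `min(new.first_block, best_ancient_block or best_block)`. The function name, the plain `u64` block numbers and the parameter names are assumptions made for this sketch. It also bakes in the "fall back to `best_block` when `best_ancient_block` is `None`" interpretation, which, as noted above, is not the only thing `None` can mean.

```rust
/// Upper end of the range [0 ... upper] of old blocks worth salvaging,
/// under one literal reading of the comment in the diff.
/// All names and types here are hypothetical.
fn salvage_upper_bound(
    new_first_block: u64,
    best_ancient_block: Option<u64>,
    best_block: u64,
) -> u64 {
    // "best_ancient_block or best_block": fall back when no ancient block is recorded.
    let old_top = best_ancient_block.unwrap_or(best_block);
    // Never reach past the first block of the new snapshot.
    new_first_block.min(old_top)
}
```

For example, with a new snapshot starting at block 100, an old DB whose best ancient block is 40 yields 40, while an old DB with no ancient-block marker and a best block of 500 yields 100.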
Handle databases with holes
After snapshots have been downloaded and imported into the new DB, we try to salvage existing blocks from the current DB before switching to the new DB (this happens in `migrate_blocks()`). The current code assumes that the blocks in the current DB form a chain all the way down to the genesis block (i.e. they are "ancient blocks"), which is not necessarily true. There are situations where the current DB has some blocks in it that do not stretch all the way back to genesis.
I think the way this situation can occur is:
Before this PR the final step of the second snapshot will fail with an `UnlinkedAncientBlockChain` error because the previous DB's blocks do not stretch all the way back to 0. With this PR, the attempt to salvage the old blocks is simply abandoned and the new snapshot is allowed to complete, effectively tossing away the content of the previous DB.
I suspect that it might be possible to do better than that, but best do this step by step.
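The decision described above can be sketched roughly as follows. This is a toy model, not the code in the PR: block numbers in a sorted slice stand in for whatever linkage check the real `migrate_blocks()` performs, and the `Salvage` enum and `plan_salvage` function are hypothetical names.

```rust
/// Outcome of trying to reuse blocks from the old database (hypothetical type).
#[derive(Debug, PartialEq)]
enum Salvage {
    /// The old blocks reach genesis; this many can be carried over.
    Migrate(usize),
    /// The old blocks are missing or have a hole before genesis; discard them
    /// and let the new snapshot complete.
    Abandon,
}

/// Checks that `old_blocks` (sorted ascending) is exactly 0, 1, 2, ... with no
/// gap, i.e. that the old chain stretches all the way back to block 0.
fn plan_salvage(old_blocks: &[u64]) -> Salvage {
    let contiguous_from_genesis = !old_blocks.is_empty()
        && old_blocks
            .iter()
            .enumerate()
            .all(|(i, &number)| number == i as u64);

    if contiguous_from_genesis {
        Salvage::Migrate(old_blocks.len())
    } else {
        // Previously this situation surfaced as a hard error; here it is a
        // normal outcome that simply skips the migration.
        Salvage::Abandon
    }
}
```

With this shape, a DB holding only blocks 5 to 7 is reported as `Abandon` rather than failing the restoration.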
The diff is rather noisy but the key change of the PR starts here.
ref. #10793
Print total blocks imported, closes #10792