Better logging when backfilling ancient blocks fail #10796

dvdplm · 2019-06-26T15:49:10Z

Handle databases with holes

After snapshots have been downloaded and imported into the new DB, we try to salvage existing blocks from the current DB before switching to the new DB (this happens in migrate_blocks()).

The current code assumes that the blocks in the current DB form a chain all the way down to the genesis block (i.e. they are "ancient blocks"), which is not necessarily true. There are situations where the current DB has some blocks in it that do not stretch all the way back to genesis.

I think the way this situation can occur is:

1. start a node with an empty DB
2. a snapshot is downloaded and the chain populated with [n..m] blocks, where m is close to the tip and n = m - 10k blocks
3. maybe the node stays up and imports blocks and is generally fine
4. the node is stopped and naturally falls out of sync
5. the node is started again and, as it's too far off the tip, a new snapshot is downloaded
6. the previous DB will have blocks [i..j], where i != 0 (it'll be 10k blocks before wherever the tip was in the FIRST snapshot) and j is further back than the current tip - 10k

Before this PR, the final step of the second snapshot will fail with an UnlinkedAncientBlockChain error because the previous DB's blocks do not stretch all the way back to 0.

With this PR, the attempt to salvage the old blocks is simply abandoned and the new snapshot is allowed to complete, effectively tossing away the content of the previous DB.
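
To make the failure mode concrete, here is a minimal sketch of a chain check that walks parent hashes towards genesis. It is not the code from this PR: `Header`, `check_links_to_genesis` and the use of `u64` as a stand-in for `H256` are assumptions made for illustration. Only the error name `UnlinkedAncientBlockChain` and the idea of reporting the missing parent hash come from the PR itself.

```rust
use std::collections::HashMap;

/// Stand-in for a block hash (the real code uses `H256`); an assumption for brevity.
type Hash = u64;

/// Hypothetical, stripped-down block header.
#[derive(Debug, Clone)]
pub struct Header {
    pub number: u64,
    pub parent_hash: Hash,
}

#[derive(Debug, PartialEq)]
pub enum SnapshotError {
    /// The chain is broken: the block with this hash is missing from the DB.
    UnlinkedAncientBlockChain(Hash),
}

/// Walks from `start` towards genesis by following parent hashes.
/// Returns the number of blocks visited, or an error naming the first
/// missing block. Assumes the stored headers contain no cycles.
pub fn check_links_to_genesis(
    db: &HashMap<Hash, Header>,
    start: Hash,
) -> Result<u64, SnapshotError> {
    let mut current = start;
    let mut visited = 0;
    loop {
        let header = db
            .get(&current)
            .ok_or(SnapshotError::UnlinkedAncientBlockChain(current))?;
        visited += 1;
        if header.number == 0 {
            return Ok(visited);
        }
        current = header.parent_hash;
    }
}
```

A DB holding only blocks [i..j] with i != 0 makes this walk stop at the parent of block i, which is the situation in which salvaging the old blocks is skipped.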

I suspect that it might be possible to do better than that, but best do this step by step.

The diff is rather noisy but the key change of the PR starts here.

ref. #10793


tomusdrw

Looks sane, modulo TODOs and the deadlock.

tomusdrw · 2019-06-28T10:18:25Z

ethcore/src/snapshot/service.rs

 		let find_range = || -> Option<(H256, H256)> {
+			// TODO: In theory, if the current best_block is > new first_block (i.e. they overlap)


I'd keep the comment, but drop the TODO: prefix :)


tomusdrw · 2019-06-28T10:27:25Z

ethcore/src/snapshot/service.rs

@@ -695,8 +721,9 @@ impl Service {
 			Ok(()) |
 			Err(Error::Snapshot(SnapshotError::RestorationAborted)) => (),
 			Err(e) => {
+				// TODO: after this we're sometimes deadlocked


We should actually always deadlock - because we already have self.restoration.lock() acquired in line 720. parking_lot locks are not re-entrant.

Instead of calling abort_restore() we should rather use *restoration = None here, and maybe add a trace line if you care about it.
Alternatively, we could have abort_restore take the locked restoration. IMHO if we do that for some methods already, we should do that for all of them.

> We should actually always deadlock

Well, for some reason we do not. It's pretty rare, I've seen it twice.

> Instead of calling abort_restore() we should rather use *restoration = None

That's what abort_restore() does though: here. Why is it better to do it here?

> Why is it better to do it here?

To avoid locking twice?

As I said, the second option should be abort_restore_with_restoration(locked_res: &mut Option<Restoration>) so we can keep the logic there, but just pass the existing lock.

So the reason I removed *restoration = None from here and replaced it with a call to abort_restore() was to collect all shutdown actions in one place. Not that abort_restore() is doing very much atm, but I still think it's good to have a single point of abortion.
Is locking twice really a concern though? This is an error handler and I don't think we care much about performance.

> abort_restore_with_restoration(locked_res…)

The restoration is a member of the struct here, so I'm not sure what the point of passing it as a param is, tbh. Now you got me confused!

Since parking_lot::Mutex is not re-entrant, locking it twice in the same thread can (or does) lead to a deadlock. So a call like this:

    fn some(&self) {
        let mut restoration = self.restoration.lock();
        ....
        self.abort_restore();
    }

    fn abort_restore(&self) {
        *self.restoration.lock() = None
    }

is guaranteed to deadlock; I thought always, but I suspect it might be just random.

In the same file I suppose we already had deadlock issues, so the with_restoration pattern emerged, where instead of locking resources locally in a particular function you get passed a (mutable) reference to the locked resource:

    let mut restoration = self.restoration.lock();
    self.feed_chunk_with_restoration(&mut restoration)

I propose to use exactly the same pattern for abort_restore. I'm totally fine grouping all the shutdown actions there, but to prevent potential double-locking and potential deadlocks we can pass the locked resource, so the first example becomes:

    fn some(&self) {
        let mut restoration = self.restoration.lock();
        ....
        self.abort_restore(&mut restoration);
    }

    fn abort_restore(&self, res: &mut Option<Restoration>) {
        *res = None
    }
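
The with_restoration pattern proposed above can be sketched as a self-contained example. To stay dependency-free it uses `std::sync::Mutex` instead of `parking_lot::Mutex` (so `lock()` returns a `Result` and needs `unwrap()`); both are non-reentrant. `Restoration` is an empty placeholder, and `Service::new`, `fail_while_locked` and `is_restoring` are hypothetical helpers added only for illustration.

```rust
use std::sync::Mutex;

/// Hypothetical placeholder for the real `Restoration` state.
#[derive(Debug)]
pub struct Restoration;

pub struct Service {
    restoration: Mutex<Option<Restoration>>,
}

impl Service {
    pub fn new() -> Self {
        Service { restoration: Mutex::new(Some(Restoration)) }
    }

    /// Takes the lock itself; must not be called while the caller holds the guard.
    pub fn abort_restore(&self) {
        let mut restoration = self.restoration.lock().unwrap();
        Self::abort_restore_with_restoration(&mut restoration);
    }

    /// Single place for shutdown actions; works on an already locked value.
    fn abort_restore_with_restoration(restoration: &mut Option<Restoration>) {
        *restoration = None;
    }

    /// Error path that already holds the lock and passes it on.
    /// Returns whether a second lock attempt was refused.
    pub fn fail_while_locked(&self) -> bool {
        let mut restoration = self.restoration.lock().unwrap();
        // A second `lock()` here would never return on this thread (it may
        // deadlock or panic); `try_lock` shows the mutex is unavailable
        // without blocking.
        let second_lock_unavailable = self.restoration.try_lock().is_err();
        Self::abort_restore_with_restoration(&mut restoration);
        second_lock_unavailable
    }

    pub fn is_restoring(&self) -> bool {
        self.restoration.lock().unwrap().is_some()
    }
}
```

Keeping the shutdown logic in `abort_restore_with_restoration` preserves the single point of abortion, while callers that already hold the guard never lock a second time.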

I see, thank you for explaining.

Is this an option?

    let r = {
        let mut restoration = self.restoration.lock();
        self.feed_chunk_with_restoration(&mut restoration, hash, chunk, is_state)
    };
    match r {
        …
        Err(e) => {
            …
            self.abort_restore();
            …
        }
    }
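
The scoped-lock alternative can be tried out in isolation. The sketch below is an illustration under assumptions, not the PR's code: `ChunkService`, the `u32` chunk counter standing in for `Restoration`, and the `String` error are all hypothetical, and `std::sync::Mutex` replaces `parking_lot::Mutex` to avoid an external dependency.

```rust
use std::sync::Mutex;

/// Hypothetical service holding restoration progress as a chunk counter.
pub struct ChunkService {
    restoration: Mutex<Option<u32>>,
}

impl ChunkService {
    pub fn new() -> Self {
        ChunkService { restoration: Mutex::new(Some(0)) }
    }

    /// Works on the already locked state; fails on an empty chunk.
    fn feed_chunk_with_restoration(
        restoration: &mut Option<u32>,
        chunk: &[u8],
    ) -> Result<(), String> {
        if chunk.is_empty() {
            return Err("empty chunk".to_string());
        }
        if let Some(fed) = restoration.as_mut() {
            *fed += 1;
        }
        Ok(())
    }

    /// Locks on its own, so the caller must not hold the guard.
    pub fn abort_restore(&self) {
        *self.restoration.lock().unwrap() = None;
    }

    pub fn feed_chunk(&self, chunk: &[u8]) {
        // The guard lives only inside this block and is dropped at its end.
        let result = {
            let mut restoration = self.restoration.lock().unwrap();
            Self::feed_chunk_with_restoration(&mut restoration, chunk)
        };
        if result.is_err() {
            // The lock is free again, so locking inside `abort_restore` is safe.
            self.abort_restore();
        }
    }

    pub fn chunks_fed(&self) -> Option<u32> {
        *self.restoration.lock().unwrap()
    }
}
```

The trade-off is that the lock is released between the failed feed and the abort, so another thread could in principle take it in between.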



ordian · 2019-06-28T13:33:53Z

ethcore/src/snapshot/service.rs

@@ -695,8 +721,9 @@ impl Service {
 			Ok(()) |
 			Err(Error::Snapshot(SnapshotError::RestorationAborted)) => (),
 			Err(e) => {
+				// TODO: after this we're sometimes deadlocked


According to docs:

> Attempts to lock a mutex in the thread which already holds the lock will result in a deadlock.

So I'm not sure either why it didn't always deadlock. The proposed fix looks good.


ascjones

Minor nits, otherwise LGTM.


ascjones · 2019-07-01T10:31:02Z

ethcore/src/snapshot/service.rs

 		// so the useful set of blocks is defined as:
 		// [0 ... min(new.first_block, best_ancient_block or best_block)]
+		//
+		// If, for whatever reason, the old db does not have ancient blocks (i.e.
+		// `best_ancient_block` is `None`) AND a non-zero `first_block`, such that the old db looks


I think best_ancient_block being None can mean either "no ancient blocks imported" or "all ancient blocks imported (no gaps)". So the comment is misleading, although in the case when first_block is Some, the former is true.

Right, I should move the parens I think.
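
The rule in the code comment under review can be sketched as a small function. This is an illustration of the comment's wording, not the implementation in service.rs: `OldChainInfo` and `highest_block_to_migrate` are hypothetical names, and plain `u64` block numbers are assumed.

```rust
/// Hypothetical summary of the old database's chain info.
pub struct OldChainInfo {
    /// First block present after a gap, if the DB does not start at genesis.
    pub first_block: Option<u64>,
    /// Highest ancient block backfilled so far, if any.
    pub best_ancient_block: Option<u64>,
    pub best_block: u64,
}

/// Returns the highest block number worth migrating, i.e. the upper bound of
/// `[0 ... min(new.first_block, best_ancient_block or best_block)]`,
/// or `None` when the old blocks should not be salvaged at all.
pub fn highest_block_to_migrate(old: &OldChainInfo, new_first_block: u64) -> Option<u64> {
    match (old.best_ancient_block, old.first_block) {
        // No ancient blocks and a gap below `first_block`: nothing links to genesis.
        (None, Some(first)) if first > 0 => None,
        // Ancient blocks have been backfilled from genesis up to `ancient`.
        (Some(ancient), _) => Some(ancient.min(new_first_block)),
        // No gap recorded: the old DB is complete up to `best_block`.
        (None, _) => Some(old.best_block.min(new_first_block)),
    }
}
```

The first match arm covers the case discussed above: `best_ancient_block` is `None` and `first_block` is non-zero, so `None` can only mean "no ancient blocks imported".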


dvdplm added 11 commits June 26, 2019 17:46

Better logging when backfilling ancient blocks fail

088cac8

Print total blocks imported, closes #10792

finalize() doesn't need Engine

3c13109

Pull out call to migrated_blocks() from replace_client_db()

More logs

020acf1

Clarify that the percentage may be misleading

690b805

Remove replace_client_db() and replace with a straight call to restor…

c516d1d

…e_db()

Include the parent_hash in UnlinkedAncientBlockChain errors

9f6d5d4

Add a new RestorationStatus varian: Finalizing (as it can take a looo…

5e2bfb9

…ong while) Call abort_restore() when restoration fails

Add missing cases for new variant

62c65aa

typos

947d1e4

Typo and derive Debug

d9ab3b9

Do not attempt to salvage existing blocks unless they form a complete…

72c58b5

… chain back to genesis

dvdplm self-assigned this Jun 28, 2019

dvdplm marked this pull request as ready for review June 28, 2019 09:56

dvdplm requested review from ascjones and tomusdrw June 28, 2019 09:57

dvdplm added the A0-pleasereview 🤓 Pull request needs code review. label Jun 28, 2019

dvdplm added 3 commits June 28, 2019 12:00

Fix test

f027d4b

Revert "Fix test"

2f5a1c6

This reverts commit f027d4b.

Fix test again

e019c8e

tomusdrw approved these changes Jun 28, 2019


dvdplm added 4 commits June 28, 2019 13:00

Update comment

6bf1ad3

Be careful about locks

12fd856

fix test failure

f577c8f

ordian reviewed Jun 28, 2019


ordian added this to the 2.6 milestone Jun 28, 2019

ordian added the M4-core ⛓ Core client code / Rust. label Jun 28, 2019

Do not defer returning an error when the chain is broken

323edb7

ascjones approved these changes Jul 1, 2019


dvdplm added 2 commits July 1, 2019 13:39

Review feedback

fc220b0

no hex formatting for Option

9e17d61

dvdplm merged commit 5dc5be1 into master Jul 1, 2019

dvdplm deleted the dp/chore/debug-synching-UnlinkedAncientBlockChain branch July 1, 2019 12:41
