eth/protocols/snap: snap sync testing #22179

Merged: 22 commits into ethereum:master, Jan 25, 2021

Conversation

holiman (Contributor) commented on Jan 15, 2021:

This is a work in progress - the tests currently just hang, which isn't ideal. However, they do enable some fairly deep testing of the snap protocol, and the reason I rebased it again and am working on it is to see whether it can be used to trigger #22172.

Also, in general, I think it would be good to have these kinds of fairly high-level protocol tests.

holiman (Contributor, Author) commented on Jan 17, 2021:

I think this shows a few errors, or maybe let's call them "discussion points".

  1. If cancel is invoked, the Sync does not exit with an error, but just exits with nil, which seems strange since the sync didn't complete successfully.
  2. If a remote peer delivers storage tries in very small increments, the requestor eventually stalls. There are still storage tasks, but the snippet below causes them not to be retrieved, for some reason:
		// Skip tasks that are already retrieving (or done with) all small states
		if len(task.SubTasks) == 0 && len(task.stateTasks) == 0 {
			continue
		}

  3. The current implementation, when delivering storage keys, treats the limit a bit oddly: the server side only aborts after a key that goes above the limit has already been added. If a server instead respects the limit strictly, the recipient complains that there are more elements available (see the sketch below).
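
To make the limit semantics concrete, here is a minimal, hypothetical sketch (storageEntry, serveInclusive and serveStrict are illustrative names, not the actual snap server code) contrasting the two behaviours: stopping only after the first entry that crosses the byte limit versus stopping before it.

	// Hypothetical illustration of the two limit semantics, not geth code.
	type storageEntry struct {
		key  []byte
		data []byte
	}

	// serveInclusive mirrors the current behaviour described above: it appends
	// the entry that crosses the limit and only then stops, so a response can
	// overshoot the byte budget by up to one entry.
	func serveInclusive(entries []storageEntry, limit int) []storageEntry {
		var out []storageEntry
		size := 0
		for _, e := range entries {
			out = append(out, e)
			size += len(e.key) + len(e.data)
			if size > limit {
				break
			}
		}
		return out
	}

	// serveStrict stops before the entry that would cross the limit (always
	// returning at least one entry). A requester expecting the inclusive
	// behaviour may then wrongly conclude that more elements are available.
	func serveStrict(entries []storageEntry, limit int) []storageEntry {
		var out []storageEntry
		size := 0
		for _, e := range entries {
			if len(out) > 0 && size+len(e.key)+len(e.data) > limit {
				break
			}
			out = append(out, e)
			size += len(e.key) + len(e.data)
		}
		return out
	}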

holiman (Contributor, Author) commented on Jan 19, 2021:

I pushed another fix; now there's only one remaining stall, in TestSyncWithStorageAndOneCappedPeer, where one peer delivers storage items, but only in chunks of at most 500 bytes at a time.
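
For illustration, a capped test peer of that kind could be sketched roughly as below (a hypothetical helper, not the actual test code in this PR): it serves correct data but truncates every storage response to a fixed byte budget, forcing the syncer to keep re-requesting.

	// Hypothetical sketch of a byte-capped storage response.
	type slot struct {
		hash []byte
		body []byte
	}

	// cappedStorageResponse returns slots in order, but stops as soon as the
	// accumulated size reaches maxBytes (e.g. 500), so each response carries
	// only a tiny slice of the requested range.
	func cappedStorageResponse(all []slot, maxBytes int) []slot {
		var out []slot
		size := 0
		for _, s := range all {
			out = append(out, s)
			size += len(s.hash) + len(s.body)
			if size >= maxBytes {
				break // the syncer must re-request the remainder
			}
		}
		return out
	}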

holiman (Contributor, Author) commented on Jan 19, 2021:

@karalabe in 01e4846, I added revertals to storage requests when they fail due to bad proofs. The testcases in this PR hit both of those clauses individually, which is why they were added.

However, the same scenario applies to all types of requests: code, trie, etc. Should we add revertals for all other cases where a response is invalid? And if so, maybe we should do it in a more generic fashion, instead of adding them one by one?
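
One possible shape for such a generic mechanism, purely as a sketch (the request and syncer types below are hypothetical stand-ins, not the existing sync.go API): every in-flight request carries its own revert callback, so invalid responses of any kind go through a single failure path.

	import "log"

	// Hypothetical stand-ins for the real request/syncer types.
	type request struct {
		id     uint64
		revert func() // puts the request's task back into the retrieval queue
	}

	type syncer struct{}

	// onInvalidResponse is the single generic failure path: log once, revert
	// once, regardless of whether the response was accounts, storage,
	// bytecode or trie nodes.
	func (s *syncer) onInvalidResponse(req *request, err error) {
		log.Printf("invalid response: reqid=%d err=%v", req.id, err)
		req.revert()
	}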

holiman (Contributor, Author) commented on Jan 19, 2021:

Another question: in most places in sync.go, we update the bloom filter at the same time we write batch data to the database (the pattern is sketched after the lists below).

We do it in

  • processStorageResponse
  • processBytecodeResponse
  • forwardAccountTask

We do not do it in

  • processTrienodeHealResponse
  • processBytecodeHealResponse
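
For context, the pattern in question looks roughly like this (hypothetical names; the real logic lives in the process*Response methods): every hash written to the batch is also added to the sync bloom, so later has-it-already checks can skip the database. The implicit question is presumably whether the two heal paths should follow the same pattern.

	// Hypothetical sketch of the write-plus-bloom pattern, with stand-in
	// interfaces for the real ethdb batch and sync bloom.
	type kvBatch interface {
		Put(key, value []byte) error
	}

	type syncBloom interface {
		Add(key []byte)
	}

	func writeWithBloom(batch kvBatch, bloom syncBloom, hash, data []byte) error {
		if err := batch.Put(hash, data); err != nil {
			return err
		}
		// Mirror the write into the bloom filter so that later "do we already
		// have this item?" checks can be answered without hitting the database.
		bloom.Add(hash)
		return nil
	}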

Inline review thread on this diff:

 	size := common.StorageSize(len(hashes) * common.HashLength)
 	for _, account := range accounts {
 		size += common.StorageSize(len(account))
 	}
 	for _, node := range proof {
 		size += common.StorageSize(len(node))
 	}
-	logger := peer.logger.New("reqid", id)
+	logger := peer.Log().New("reqid", peer.ID())
Member commented on this diff:
This is wrong, pls revert. We need the request id (in id), not the peer's id.

holiman (Contributor, Author) commented on Jan 20, 2021:

Last remaining blocker:

$ go test . -run TriePanic
TRACE[01-20|15:53:31.702] Fetching range of accounts               id=nice-a reqid=4037200794235010051 root="c2661f…b0463a" origin="000000…000000" limit="0fffff…ffffff" bytes=512.00KiB
TRACE[01-20|15:53:31.702] Delivering range of accounts             id=nice-a reqid=4037200794235010051 hashes=1 accounts=1 proofs=1 bytes=210.00B
panic: interface conversion: trie.node is nil, not *trie.fullNode

goroutine 20 [running]:
github.com/ethereum/go-ethereum/trie.unsetInternal(0xa5c2a0, 0xc00007ed70, 0xc00002b9a0, 0x41, 0x41, 0xc00002b9f0, 0x41, 0x41, 0x20, 0xa5c1a0)
	/home/user/go/src/github.com/ethereum/go-ethereum/trie/proof.go:302 +0xd65
github.com/ethereum/go-ethereum/trie.VerifyRangeProof(0x6657bd231f66c2, 0x723c83101df021d9, 0xea1a61864a975b5e, 0x3a46b00c81ac76fa, 0xc00014a630, 0x20, 0x20, 0xc000352400, 0x20, 0x20, ...)
	/home/user/go/src/github.com/ethereum/go-ethereum/trie/proof.go:563 +0x7e5
github.com/ethereum/go-ethereum/eth/protocols/snap.(*Syncer).OnAccounts(0xc000147b00, 0xa62960, 0xc00016a3f0, 0x380704bb7b4d7c03, 0xc0003522a0, 0x1, 0x1, 0xc00034c140, 0x1, 0x1, ...)
	/home/user/go/src/github.com/ethereum/go-ethereum/eth/protocols/snap/sync.go:2033 +0x876
github.com/ethereum/go-ethereum/eth/protocols/snap.defaultAccountRequestHandler(0xc00016a3f0, 0x380704bb7b4d7c03, 0x6657bd231f66c2, 0x723c83101df021d9, 0xea1a61864a975b5e, 0x3a46b00c81ac76fa, 0x0, 0x0, 0x0, 0x0, ...)
	/home/user/go/src/github.com/ethereum/go-ethereum/eth/protocols/snap/sync_test.go:220 +0x11b
created by github.com/ethereum/go-ethereum/eth/protocols/snap.(*testPeer).RequestAccountRange
	/home/user/go/src/github.com/ethereum/go-ethereum/eth/protocols/snap/sync_test.go:162 +0x365

... for now. And there's still the open question about the bloom filter above.

holiman (Contributor, Author) commented on Jan 20, 2021:

@rjl493456442 do you have any good ideas on what the correct fix for the prover would be? The case is basically a very, very small trie whose root is a shortnode, if I understand correctly.

Inline review thread on this diff:

@@ -622,6 +629,7 @@ func (s *Syncer) loadSyncStatus() {
 			log.Debug("Scheduled account sync task", "from", task.Next, "last", task.Last)
 		}
 		s.tasks = progress.Tasks
+		s.snapped = len(s.tasks) == 0
Member commented on this diff:
This might be a dumb comment, but everywhere else we need to hold a lock for accessing s, especially s.snapped. The only place I can find where loadSyncStatus() is called is on Line 528, but we don't hold the lock there anymore.

holiman (Contributor, Author) replied:

Yeah it's a bit funky. loadSyncStatus touches a lot of internals in an unsafe way, and right after, the code accesses s.tasks.

	s.loadSyncStatus()
	if len(s.tasks) == 0 && s.healer.scheduler.Pending() == 0 {

So probably that lock-holding should be extended to cover that portion too.

I don't think it's a real issue, however, because I don't think we enter there in a concurrent way.
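
For reference, a minimal sketch of what extending the lock could look like, assuming s.lock is the mutex guarding s.tasks and the other fields loadSyncStatus touches (this illustrates the suggestion above, it is not a change in this PR):

	// Take the lock around both loading the status and inspecting the tasks,
	// instead of leaving loadSyncStatus and the follow-up check unguarded.
	s.lock.Lock()
	s.loadSyncStatus()
	done := len(s.tasks) == 0 && s.healer.scheduler.Pending() == 0
	s.lock.Unlock()

	if done {
		// nothing left to sync or heal
	}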

Member replied:

AFAIK loadSyncStatus is only called when you create the thing, so there's no concurrency at that point.

holiman marked this pull request as ready for review on January 22, 2021, 09:33.
holiman (Contributor, Author) commented on Jan 22, 2021:

Rebased on top of @rjl493456442's panic fix.

karalabe added this to the 1.10.0 milestone on Jan 25, 2021.
karalabe (Member) left a review comment:

SGTM, although it would be nice to look into how we could speed it up. Currently it's enormously slow, even with parallel execution.
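
As a general aside (not a claim about how these particular tests are structured), the usual lever for suites like this in Go is subtest parallelism: every subtest that calls t.Parallel() is run concurrently with the other parallel subtests, bounded by the -parallel flag. A hypothetical shape:

	import "testing"

	// Illustrative test names; the real scenarios live in sync_test.go.
	func TestSnapSyncScenarios(t *testing.T) {
		scenarios := []string{"noStorage", "withStorage", "cappedPeer"}
		for _, name := range scenarios {
			name := name // capture the loop variable for the closure
			t.Run(name, func(t *testing.T) {
				t.Parallel() // independent scenarios run concurrently
				// ... build source peers, run the syncer, verify the result ...
			})
		}
	}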

karalabe merged commit 797b081 into ethereum:master on Jan 25, 2021.
bulgakovk pushed a commit to bulgakovk/go-ethereum that referenced this pull request on Jan 26, 2021:
* eth/protocols/snap: make timeout configurable

* eth/protocols/snap: snap sync testing

* eth/protocols/snap: test to trigger panic

* eth/protocols/snap: fix race condition on timeouts

* eth/protocols/snap: return error on cancelled sync

* squashme: updates + test causing panic + properly serve accounts in order

* eth/protocols/snap: revert failing storage response

* eth/protocols/snap: revert on bad responses (storage, code)

* eth/protocols/snap: fix account handling stall

* eth/protocols/snap: fix remaining revertal-issues

* eth/protocols/snap: timeouthandler for bytecode requests

* eth/protocols/snap: debugging + fix log message

* eth/protocols/snap: fix misspelliings in docs

* eth/protocols/snap: fix race in bytecode handling

* eth/protocols/snap: undo deduplication of storage roots

* synctests: refactor + minify panic testcase

* eth/protocols/snap: minor polishes

* eth: minor polishes to make logs more useful

* eth/protocols/snap: remove excessive logs from the test runs

* eth/protocols/snap: stress tests with concurrency

* eth/protocols/snap: further fixes to test cancel channel handling

* eth/protocols/snap: extend test timeouts on CI

Co-authored-by: Péter Szilágyi <peterke@gmail.com>
holiman mentioned this pull request on Feb 3, 2021.