core, eth: split eth package, implement snap protocol #21482

Merged
23 commits merged into ethereum:master from the snapshot-sync-b branch on Dec 14, 2020

Conversation

karalabe
Member

This PR is a rebase of #20800. Since the code changed quite a lot and I needed to do an ugly rebase, I wanted the old code to exist somewhere to compare against if something doesn't behave correctly.

WIP

Database ethdb.Database // Database for direct sync insertions
Chain *core.BlockChain // Blockchain to serve data from
TxPool txPool // Transaction pool to propagate from
Network uint64 // Network identifier to adfvertise
Contributor

Suggested change
Network uint64 // Network identifier to adfvertise
Network uint64 // Network identifier to advertise


func (pm *ProtocolManager) runPeer(p *peer) error {
if !pm.chainSync.handlePeerEvent(p) {
// runEthPeer
Contributor

missing documentation

// Read the next message from the remote peer, and ensure it's fully consumed
msg, err := p.rw.ReadMsg()
if err != nil {
// runSnapPeer
Contributor

missing documentation here too
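
Purely as a sketch of what such doc comments could say (the wording and the simplified stub signatures below are assumptions, not the documentation that actually landed in the PR):

```go
package eth

// Illustrative stub only; the real handlers take the concrete peer types from this PR.
type peer struct{}

// runEthPeer registers the remote as an `eth` peer with the sync machinery and
// services its message loop until the peer disconnects or errors out.
func runEthPeer(p *peer) error { return nil }

// runSnapPeer does the same for a `snap` satellite connection, reading and
// dispatching snapshot-protocol messages until the connection is torn down.
func runSnapPeer(p *peer) error { return nil }
```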

if eth != nil {
eth.Peer.Disconnect(p2p.DiscUselessPeer)
}
if snap != nil {
Contributor

is there a reason for doing the additional check for snap != nil down here too?

can you disconnect the useless peer after unregistering it, after L346?

eth/peerset.go (outdated, resolved)
func (d *Downloader) RegisterPeer(id string, version int, peer Peer) error {
logger := log.New("peer", id)
func (d *Downloader) RegisterPeer(id string, version uint, peer Peer) error {
logger := log.New("peer", id[:16])
Contributor

geth copydb causes this to panic:

	logger := log.New("peer", id[:16])
	logger.Trace("Registering sync peer")
	if err := d.peers.Register(newPeerConnection(id, version, peer, logger)); err != nil {
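
As a quick illustration of the panic (the short id below is invented, and the length guard is just one possible workaround, not necessarily the fix taken here): Go slice expressions are bounds-checked, so id[:16] panics whenever the peer id is shorter than 16 bytes, which is apparently what geth copydb ends up registering.

```go
package main

import "fmt"

func main() {
	id := "copydb" // stand-in for a short synthetic peer id (made up)
	// id[:16] would panic here: slice bounds out of range [:16] with length 6.
	short := id
	if len(short) > 16 {
		short = short[:16] // only truncate when there is actually something to truncate
	}
	fmt.Println(short)
}
```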

@karalabe karalabe force-pushed the snapshot-sync-b branch 2 times, most recently from dcdc3de to 0bf5df5 on October 9, 2020 09:25
@holiman holiman left a comment

A few comments

eth/protocols/snap/handler.go (outdated, resolved)
eth/protocols/snap/handler.go (outdated, resolved)
eth/protocols/snap/handler.go (resolved)
eth/protocols/snap/handler.go (outdated, resolved)
@holiman holiman left a comment

Some potential DoS vectors (?)

eth/protocols/snap/handler.go (outdated, resolved)
eth/protocols/snap/handler.go (outdated, resolved)
eth/protocols/snap/handler.go (outdated, resolved)
trie/proof.go (outdated)
Comment on lines 464 to 466
for i := 0; i < len(keys)-1; i++ {
if bytes.Compare(keys[i], keys[i+1]) >= 0 {
return errors.New("range is not monotonically increasing"), false
return nil, nil, false, errors.New("range is not monotonically increasing")
}
}
Contributor

Wouldn't it make sense to have this check include firstKey and lastKey as well? Something like this:

	var pKey = firstKey
	for i := 0; i < len(keys)-1; i++ {
		if bytes.Compare(pKey, keys[i]) >= 0 {
			return nil, nil, false, errors.New("range is not monotonically increasing")
		}
		pKey = keys[i]
	}
	if bytes.Compare(pKey, keys[len(keys)-1]) >= 0 {
		return nil, nil, false, errors.New("range is not monotonically increasing")
	}

Contributor

Or rather

	// Ensure the received batch is monotonic increasing.
	if firstKey != nil && len(keys) > 0 {
		if bytes.Compare(firstKey, keys[0]) > 0 {
			return nil, nil, false, errors.New("range is not monotonically increasing [0]")
		}
	}
	for i := 0; i < len(keys)-1; i++ {
		if bytes.Compare(keys[i], keys[i+1]) >= 0 {
			return nil, nil, false, errors.New("range is not monotonically increasing [1]")
		}
	}
	if lastKey != nil && len(keys) > 0 {
		if bytes.Compare(keys[len(keys)-1], lastKey) > 0 {
			return nil, nil, false, errors.New("range is not monotonically increasing [2]")
		}
	}

@holiman holiman Nov 6, 2020

For the snap implementation, this doesn't matter, since AFAICT the snap callers always only send in the origin and the last key as firstKey and lastKey.
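
As an aside, a tiny runnable illustration of the boundary case the extra comparisons above are about (the byte values are made up): a batch that is sorted internally but lies entirely below firstKey passes the original pairwise loop, and only the added firstKey comparison flags it.

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	firstKey := []byte{0x05}
	keys := [][]byte{{0x03}, {0x04}} // internally sorted, but entirely below firstKey

	inOrder := true
	for i := 0; i < len(keys)-1; i++ {
		if bytes.Compare(keys[i], keys[i+1]) >= 0 {
			inOrder = false
		}
	}
	fmt.Println(inOrder)                              // true: the original pairwise check passes
	fmt.Println(bytes.Compare(firstKey, keys[0]) > 0) // true: the added firstKey check rejects the range
}
```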

Comment on lines +106 to +113
defer func() {
// Cancel active request timers on exit. Also set peers to idle so they're
// available for the next sync.
for _, req := range active {
req.timer.Stop()
req.peer.SetNodeDataIdle(int(req.nItems), time.Now())
}
}()
Contributor

Hm, odd that we don't have this on master, if it's something we should have?

@karalabe
Member Author

Random memo: We can eventually make the protocol types (messages, numbers) public so they can be used by external testing frameworks without redefining them locally.

@holiman
Contributor

holiman commented Dec 12, 2020

On a branch of mine, snapshot-sync-c, I rebased this onto current master and did two regular fast-syncs for some sanity checking. Both completed in ~7-8 hours: https://geth-bench.ethdevops.io/d/Jpk-Be5Wk/dual-geth?orgId=1&from=1607712736289&to=1607756400000&var-exp=bench03&var-master=bench04&var-percentile=50. All good!

@holiman
Contributor

holiman commented Dec 12, 2020

On that same this-PR-on-master branch, I set bench03 and 04 to do snap-sync instead (ansible-playbook configure_bench03-04.yaml -t wipe,prep,bench -e "img_a=holiman/geth-experimental:latest" -e "img_b=holiman/geth-experimental:latest" -e '{"geth_args_custom":["--snapshot"]}' -e "syncmode=snap"). Bench03 seems to have picked up a peer to sync with (using DNS discovery only), without even having been added as a priority client at the bootnode(s).

Note though, they're not running with karalabe#42, so they might stall eventually.

@holiman
Contributor

holiman commented Dec 12, 2020

bench03 got synced in about 6.5 hours: https://geth-bench.ethdevops.io/d/Jpk-Be5Wk/dual-geth?orgId=1&from=1607762821723&to=1607788883492&var-exp=bench03&var-master=bench04&var-percentile=50

It seems that bench03 had only one snap-capable peer

enode://f6b362a821ba4898b52fffc7a365a494791a3090b3c7ca855f0f292be97be0c7417d0d48d8ccc520c464802a25a5cb34891f64696171f7755b2a4cb348b52027@52.56.142.13:30303

@holiman
Contributor

holiman commented Dec 12, 2020

Hm, it's this one: https://github.com/ethereum/discv4-dns-lists/blob/master/snap.mainnet.ethdisco.net/nodes.json#L18. So apparently there are 6 snap-1 nodes on mainnet - that particular one was bench01.

@holiman
Contributor

holiman commented Dec 13, 2020

Later in the evening, at about 20:30, bench04 also found a snap peer, and finished the sync in ~3 hours. https://geth-bench.ethdevops.io/d/Jpk-Be5Wk/dual-geth?orgId=1&from=1607764411539&to=1607828594396&var-exp=bench03&var-master=bench04&var-percentile=50

@holiman holiman merged commit 017831d into ethereum:master Dec 14, 2020
@holiman holiman added this to the 1.10.0 milestone Dec 14, 2020
} else {
cfg.TrieCleanCache += cfg.SnapshotCache
cfg.SnapshotCache = 0 // Disabled
}
Member Author

if !ctx.GlobalIsSet(SnapshotFlag.Name) && cfg.SyncMode != downloader.SnapSync {
	cfg.TrieCleanCache += cfg.SnapshotCache
	cfg.SnapshotCache = 0 // Disabled
}

// If snap-sync is requested, this flag is also required
if cfg.SyncMode == downloader.SnapSync {
log.Info("Snap sync requested, enabling --snapshot")
ctx.Set(SnapshotFlag.Name, "true")
@karalabe karalabe Jan 5, 2021

This line is a noop.

@@ -460,7 +460,7 @@ func (d *Downloader) syncWithPeer(p *peerConnection, hash common.Hash, td *big.I
}
}()
if p.version < 64 {
return errTooOld
return fmt.Errorf("%w, peer version: %d", errTooOld, p.version)
Member Author

%w: :P
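
For context on the %w remark: wrapping with %w keeps errTooOld in the error chain, so callers that match on the sentinel via errors.Is keep working while the message gains the peer version. A minimal standalone sketch (only the version check and the fmt.Errorf line mirror the diff above; the sentinel's message text is made up):

```go
package main

import (
	"errors"
	"fmt"
)

var errTooOld = errors.New("peer is too old") // illustrative message, not the real sentinel text

func checkVersion(version uint) error {
	if version < 64 {
		// %w wraps the sentinel, so the extra context doesn't break matching upstream.
		return fmt.Errorf("%w, peer version: %d", errTooOld, version)
	}
	return nil
}

func main() {
	err := checkVersion(63)
	fmt.Println(err)                       // "peer is too old, peer version: 63"
	fmt.Println(errors.Is(err, errTooOld)) // true: the wrapped sentinel still matches
}
```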

@hadv
Contributor

hadv commented Nov 10, 2021

@karalabe Would you explain in a bit more detail why we need the heal phase for the snap protocol? Thank you so much!

@karalabe
Member Author

@hadv Geth maintains 1 persistent snapshot layer on disk and 128 diff layers in memory. This means that if you start syncing the head snapshot, after 128 blocks (~30 minutes) that snapshot will not be available in the network any more. Geth then switches to continuing the sync on a new snapshot, but that won't match the old one. We do this until we sync everything, but by that time we have pieces of many different snapshots (the count depending on how fast you can sync up). The heal phase just "glues" these snapshot layers together, plus integrates any changes that have occurred since you downloaded a particular snapshot segment.
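
A very rough toy model of the situation described above (illustrative only, none of this is the real go-ethereum implementation): the network only serves recent state roots, so a long range download ends up stitched together from pieces fetched against different roots, and the heal phase is what reconciles that patchwork onto a single final root.

```go
package main

import "fmt"

// piece records one downloaded key range and the "state root" it was served from.
type piece struct {
	root        int // modelled simply as the block height the root belongs to
	first, last int // key range covered by this piece
}

func main() {
	const totalKeys = 100 // keys we need to cover
	const window = 25     // keys we manage to fetch before the served root moves on (~128 blocks in reality)

	head := 0
	var pieces []piece
	for covered := 0; covered < totalKeys; covered += window {
		head += window // the chain advanced while we were downloading
		pieces = append(pieces, piece{root: head, first: covered, last: covered + window - 1})
	}
	fmt.Println("downloaded pieces:", pieces) // every piece belongs to a different root

	// Heal phase, conceptually: walk the state at the final root and re-fetch
	// whatever is missing or stale, i.e. at worst everything that was fetched
	// against an older root and has changed since.
	stale := 0
	for _, p := range pieces {
		if p.root != head {
			stale += p.last - p.first + 1
		}
	}
	fmt.Printf("up to %d of %d keys may need healing against root %d\n", stale, totalKeys, head)
}
```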
