
[Merged by Bors] - sync2: multipeer: fix edge cases #6447

Closed
wants to merge 14 commits into develop from sync2/fix-multipeer

Conversation

ivan4th
Contributor

@ivan4th ivan4th commented Nov 11, 2024


Motivation

Split sync could become blocked when there were slow peers. Their
subranges are reassigned to other peers, and there were bugs causing
indefinite blocking and panics in these cases. Moreover, after other
peers have synced the slow peers' subranges ahead of them, syncing
against the slow peers needs to be interrupted, as it's no longer needed.

In multipeer sync, when every peer has failed to sync, e.g. due to a
temporary connection interruption, we don't need to wait for the full
sync interval; a shorter wait time between retries is used instead.

Description

This fixes the aforementioned multipeer sync issues and adds tests.
It also adds sync interval randomization to avoid network load spikes.

This adds the possibility of taking a connection from the pool to use
via the Executor interface and returning it later when it's no longer
needed. This avoids connection pool overhead when many queries need to
be made but read transactions are not required.

Using read transactions instead of plain connections has the side
effect of blocking WAL checkpoints. Using a single connection for the
multiple SQL queries executed during sync avoids noticeable overhead
due to SQLite connection pool delays.
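To illustrate the intended usage pattern, here's a minimal sketch (the Pool and Executor shapes below are simplified stand-ins for illustration, not the exact go-spacemesh API):

```go
package example

// Executor is a simplified stand-in for the query-execution interface.
type Executor interface {
	Exec(query string, args ...any) error
}

// Pool is a simplified stand-in for a connection pool that lets a caller
// borrow a connection and return it later.
type Pool interface {
	// Take removes a connection from the pool; the returned release func
	// must be called to put it back.
	Take() (Executor, func(), error)
}

// runSyncQueries borrows a single connection for the duration of many
// queries, avoiding per-query pool overhead without holding a read
// transaction (which would block WAL checkpoints).
func runSyncQueries(p Pool, keys [][]byte) error {
	exec, release, err := p.Take()
	if err != nil {
		return err
	}
	defer release() // return the connection when it's no longer needed

	for _, k := range keys {
		if err := exec.Exec("SELECT 1 FROM items WHERE key = ?", k); err != nil {
			return err
		}
	}
	return nil
}
```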

Also, this change fixes memory overuse in DBSet. When initializing a
DBSet from a database table, there's no need to use an FPTree with a
big preallocated pool for the new entries that are added during recent sync.

codecov bot commented Nov 11, 2024

Codecov Report

Attention: Patch coverage is 72.22222% with 25 lines in your changes missing coverage. Please review.

Project coverage is 79.9%. Comparing base (06f74fa) to head (989dc18).
Report is 2 commits behind head on develop.

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| sync2/p2p.go | 68.1% | 13 Missing and 1 partial ⚠️ |
| sync2/multipeer/multipeer.go | 71.4% | 4 Missing and 2 partials ⚠️ |
| sync2/multipeer/split_sync.go | 78.2% | 4 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##           develop   #6447   +/-   ##
=======================================
  Coverage     79.8%   79.9%           
=======================================
  Files          353     353           
  Lines        46540   46602   +62     
=======================================
+ Hits         37161   37248   +87     
+ Misses        7268    7244   -24     
+ Partials      2111    2110    -1     

☔ View full report in Codecov by Sentry.

@ivan4th ivan4th mentioned this pull request Nov 25, 2024
Comment on lines +406 to +408
interval := time.Duration(
float64(mpr.cfg.SyncInterval) *
(1 + mpr.cfg.SyncIntervalSpread*(rand.Float64()*2-1)))
Member

I'm a bit confused by this interval calculation. Would it be simpler if SyncIntervalSpread were defined as a time.Duration giving the maximum deviation from the interval?

interval := mpr.cfg.SyncInterval + rand.N(mpr.cfg.SyncIntervalSpread)

This would uniformly generate a duration in [SyncInterval, SyncInterval+SyncIntervalSpread), while the current definition is [SyncInterval, SyncInterval+2*SyncIntervalSpread), which is a bit odd to me?

Contributor Author

The idea was for SyncIntervalSpread to be a floating-point number in 0..1 and to have intervals in [SyncInterval - SyncInterval*SyncIntervalSpread, SyncInterval + SyncInterval*SyncIntervalSpread].
We could of course also use MinSyncInterval and MaxSyncInterval, but I'm not sure which is more convenient.
My idea was that if I e.g. want the actual sync interval to be uniformly spread across SyncInterval +/- 25%, I just set SyncIntervalSpread to 0.25.
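For illustration, a small self-contained sketch of the fractional-spread calculation quoted above (the 5-minute interval and 0.25 spread are just example values, not the project defaults):

```go
package main

import (
	"fmt"
	"math/rand/v2"
	"time"
)

func main() {
	syncInterval := 5 * time.Minute // example SyncInterval
	spread := 0.25                  // example SyncIntervalSpread: fraction of SyncInterval, 0..1

	// rand.Float64()*2-1 is uniform in [-1, 1), so the result is uniform in
	// [SyncInterval*(1-spread), SyncInterval*(1+spread)), i.e. 5m +/- 25% here.
	interval := time.Duration(float64(syncInterval) * (1 + spread*(rand.Float64()*2-1)))
	fmt.Println(interval) // somewhere between 3m45s and 6m15s
}
```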

Contributor Author

Maybe I should reflect this simpler explanation in the comments, including the godoc comments for the config struct.

Member

I would prefer the Min- and Max- config options, but if you think a fractional spread is easier then go for that. But please add some explanation - in the config and/or here - of what the values mean 🙂

@spacemesh-bors spacemesh-bors bot changed the base branch from sync2/dbset-conns to develop December 4, 2024 13:58
@ivan4th
Contributor Author

ivan4th commented Dec 4, 2024

Will submit another PR for config validation

@ivan4th
Contributor Author

ivan4th commented Dec 4, 2024

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Dec 4, 2024
@spacemesh-bors

spacemesh-bors bot commented Dec 4, 2024

Build failed:

@ivan4th
Contributor Author

ivan4th commented Dec 4, 2024

Unrelated failure in hare4 TestHare/equivocators
https://github.com/spacemeshos/go-spacemesh/actions/runs/12167651564/job/33936709181

@ivan4th
Contributor Author

ivan4th commented Dec 4, 2024

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Dec 4, 2024
@spacemesh-bors

spacemesh-bors bot commented Dec 4, 2024

Build failed:

@ivan4th
Contributor Author

ivan4th commented Dec 5, 2024

    logger.go:146: 2024-12-04T21:38:00.893Z	WARN	TestAdminEvents.proposalBuilder	failed to build proposal	{"sessionId": "bf89c65c", "lid": 30, "error": "missing beacon for epoch 3"}
    logger.go:146: 2024-12-04T21:38:00.893Z	ERROR	TestAdminEvents.beacon	failed to set up epoch	{"epoch": 3, "error": "zero epoch weight provided"}
    node_test.go:1006: 
        	Error Trace:	D:/a/go-spacemesh/go-spacemesh/node/node_test.go:1006
        	Error:      	Received unexpected error:
        	            	rpc error: code = DeadlineExceeded desc = context deadline exceeded
        	Test:       	TestAdminEvents
        	Messages:   	stream 0
    logger.go:146: 2024-12-04T21:38:04.981Z	INFO	TestAdminEvents.hare	weak coin reporter exited
    node_test.go:962: 
        	Error Trace:	D:/a/go-spacemesh/go-spacemesh/node/node_test.go:962
        	            				C:/hostedtoolcache/windows/go/1.23.4/x64/src/testing/testing.go:1176
        	            				C:/hostedtoolcache/windows/go/1.23.4/x64/src/testing/testing.go:1354
        	            				C:/hostedtoolcache/windows/go/1.23.4/x64/src/testing/testing.go:1684
        	            				C:/hostedtoolcache/windows/go/1.23.4/x64/src/runtime/panic.go:629
        	            				C:/hostedtoolcache/windows/go/1.23.4/x64/src/testing/testing.go:1006
        	            				D:/a/go-spacemesh/go-spacemesh/node/node_test.go:1006
        	Error:      	Received unexpected error:
        	            	init poet server: failed to listen: listen tcp 127.0.0.1:50001: bind: An attempt was made to access a socket in a way forbidden by its access permissions.
        	Test:       	TestAdminEvents
    testing.go:1232: TempDir RemoveAll cleanup: remove C:\Users\RUNNER~1\AppData\Local\Temp\TestAdminEvents1376418511\001\local.sql: The process cannot access the file because it is being used by another process.

=== FAIL: node TestAdminEvents_MultiSmesher (unknown)

This is unrelated to the db changes in this PR.

@ivan4th
Contributor Author

ivan4th commented Dec 5, 2024

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Dec 5, 2024
@spacemesh-bors

spacemesh-bors bot commented Dec 5, 2024

Pull request successfully merged into develop.

Build succeeded:

@spacemesh-bors spacemesh-bors bot changed the title sync2: multipeer: fix edge cases [Merged by Bors] - sync2: multipeer: fix edge cases Dec 5, 2024
@spacemesh-bors spacemesh-bors bot closed this Dec 5, 2024
@spacemesh-bors spacemesh-bors bot deleted the sync2/fix-multipeer branch December 5, 2024 01:39