
[Merged by Bors] - sync2: multipeer: fix edge cases #6447

Closed
wants to merge 14 commits into develop from sync2/fix-multipeer

Conversation

ivan4th
Contributor

@ivan4th ivan4th commented Nov 11, 2024


Motivation

Split sync could become blocked when there were slow peers. Their
subranges are reassigned to other peers, and there were bugs causing
indefinite blocking and panics in these cases. Moreover, after other
peers have synced the slow peers' subranges ahead of them, syncing
against the slow peers needs to be interrupted, as it's no longer needed.

In multipeer sync, when every peer has failed to sync, e.g. due to a
temporary connection interruption, we don't need to wait for the full
sync interval; a shorter wait time between retries is used instead.

Description

This fixes the aforementioned multipeer sync issues and adds tests.
It also adds sync interval randomization to avoid network load spikes.

This adds the possibility of taking a connection from the pool to use
via the Executor interface and returning it later when it's no longer
needed. This avoids connection pool overhead when many queries need to
be made but read transactions are not required.

Using read transactions instead of plain connections has the side
effect of blocking WAL checkpoints. Using a single connection for the
multiple SQL queries executed during sync avoids noticeable overhead
due to SQLite connection pool delays.
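To illustrate the intended usage pattern, here's a minimal sketch (the Pool and Executor shapes below are simplified stand-ins for illustration, not the exact go-spacemesh API):

```go
package example

// Executor is a simplified stand-in for the query-execution interface.
type Executor interface {
	Exec(query string, args ...any) error
}

// Pool is a simplified stand-in for a connection pool that lets a caller
// borrow a connection and return it later.
type Pool interface {
	// Take removes a connection from the pool; the returned release func
	// must be called to put it back.
	Take() (Executor, func(), error)
}

// runSyncQueries borrows a single connection for the duration of many
// queries, avoiding per-query pool overhead without holding a read
// transaction (which would block WAL checkpoints).
func runSyncQueries(p Pool, keys [][]byte) error {
	exec, release, err := p.Take()
	if err != nil {
		return err
	}
	defer release() // return the connection when it's no longer needed

	for _, k := range keys {
		if err := exec.Exec("SELECT 1 FROM items WHERE key = ?", k); err != nil {
			return err
		}
	}
	return nil
}
```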

Also, this change fixes memory overuse in DBSet. When initializing a
DBSet from a database table, there's no need to use an FPTree with a
big preallocated pool for the new entries that are added during recent sync.

codecov bot commented Nov 11, 2024

Codecov Report

Attention: Patch coverage is 72.22222% with 25 lines in your changes missing coverage. Please review.

Project coverage is 79.9%. Comparing base (06f74fa) to head (989dc18).
Report is 2 commits behind head on develop.

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| sync2/p2p.go | 68.1% | 13 Missing and 1 partial ⚠️ |
| sync2/multipeer/multipeer.go | 71.4% | 4 Missing and 2 partials ⚠️ |
| sync2/multipeer/split_sync.go | 78.2% | 4 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##           develop   #6447   +/-   ##
=======================================
  Coverage     79.8%   79.9%           
=======================================
  Files          353     353           
  Lines        46540   46602   +62     
=======================================
+ Hits         37161   37248   +87     
+ Misses        7268    7244   -24     
+ Partials      2111    2110    -1     

☔ View full report in Codecov by Sentry.

@ivan4th ivan4th mentioned this pull request Nov 25, 2024
Comment on lines +406 to +408
interval := time.Duration(
float64(mpr.cfg.SyncInterval) *
(1 + mpr.cfg.SyncIntervalSpread*(rand.Float64()*2-1)))
Member

I'm a bit confused by this interval calculation. Would it be simpler if SyncIntervalSpread were defined as a time.Duration giving the maximum deviation from the interval?

interval := mpr.cfg.SyncInterval + rand.N(mpr.cfg.SyncIntervalSpread)

This would uniformly generate a duration in [SyncInterval, SyncInterval+SyncIntervalSpread), while the current definition is [SyncInterval, SyncInterval+2*SyncIntervalSpread), which is a bit odd to me?

Contributor Author

The idea was for SyncIntervalSpread to be a floating-point number in 0..1 and to have intervals in [SyncInterval - SyncInterval*SyncIntervalSpread, SyncInterval + SyncInterval*SyncIntervalSpread].
We could of course also use MinSyncInterval and MaxSyncInterval, but I'm not sure which is more convenient.
My idea was that if I e.g. want the actual sync interval to be uniformly spread across SyncInterval +/- 25%, I just set SyncIntervalSpread to 0.25.
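For illustration, a small self-contained sketch of the fractional-spread calculation quoted above (the 5-minute interval and 0.25 spread are just example values, not the project defaults):

```go
package main

import (
	"fmt"
	"math/rand/v2"
	"time"
)

func main() {
	syncInterval := 5 * time.Minute // example SyncInterval
	spread := 0.25                  // example SyncIntervalSpread: fraction of SyncInterval, 0..1

	// rand.Float64()*2-1 is uniform in [-1, 1), so the result is uniform in
	// [SyncInterval*(1-spread), SyncInterval*(1+spread)), i.e. 5m +/- 25% here.
	interval := time.Duration(float64(syncInterval) * (1 + spread*(rand.Float64()*2-1)))
	fmt.Println(interval) // somewhere between 3m45s and 6m15s
}
```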

Contributor Author

Maybe I should reflect this simpler explanation in the comments, including the godoc comments for the config struct.

Member

I would prefer the Min- and Max- config options, but if you think a fractional spread is easier then go for that. But please add some explanation - in the config and/or here - of what the values mean 🙂

@spacemesh-bors spacemesh-bors bot changed the base branch from sync2/dbset-conns to develop December 4, 2024 13:58
@ivan4th
Contributor Author

ivan4th commented Dec 4, 2024

Will submit another PR for config validation

@ivan4th
Contributor Author

ivan4th commented Dec 4, 2024

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Dec 4, 2024
@spacemesh-bors

spacemesh-bors bot commented Dec 4, 2024

Build failed:

@ivan4th
Contributor Author

ivan4th commented Dec 4, 2024

Unrelated failure in hare4 TestHare/equivocators
https://github.com/spacemeshos/go-spacemesh/actions/runs/12167651564/job/33936709181

@ivan4th
Contributor Author

ivan4th commented Dec 4, 2024

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Dec 4, 2024
@spacemesh-bors

spacemesh-bors bot commented Dec 4, 2024

Build failed:

@ivan4th
Contributor Author

ivan4th commented Dec 5, 2024

    logger.go:146: 2024-12-04T21:38:00.893Z	WARN	TestAdminEvents.proposalBuilder	failed to build proposal	{"sessionId": "bf89c65c", "lid": 30, "error": "missing beacon for epoch 3"}
    logger.go:146: 2024-12-04T21:38:00.893Z	ERROR	TestAdminEvents.beacon	failed to set up epoch	{"epoch": 3, "error": "zero epoch weight provided"}
    node_test.go:1006: 
        	Error Trace:	D:/a/go-spacemesh/go-spacemesh/node/node_test.go:1006
        	Error:      	Received unexpected error:
        	            	rpc error: code = DeadlineExceeded desc = context deadline exceeded
        	Test:       	TestAdminEvents
        	Messages:   	stream 0
    logger.go:146: 2024-12-04T21:38:04.981Z	INFO	TestAdminEvents.hare	weak coin reporter exited
    node_test.go:962: 
        	Error Trace:	D:/a/go-spacemesh/go-spacemesh/node/node_test.go:962
        	            				C:/hostedtoolcache/windows/go/1.23.4/x64/src/testing/testing.go:1176
        	            				C:/hostedtoolcache/windows/go/1.23.4/x64/src/testing/testing.go:1354
        	            				C:/hostedtoolcache/windows/go/1.23.4/x64/src/testing/testing.go:1684
        	            				C:/hostedtoolcache/windows/go/1.23.4/x64/src/runtime/panic.go:629
        	            				C:/hostedtoolcache/windows/go/1.23.4/x64/src/testing/testing.go:1006
        	            				D:/a/go-spacemesh/go-spacemesh/node/node_test.go:1006
        	Error:      	Received unexpected error:
        	            	init poet server: failed to listen: listen tcp 127.0.0.1:50001: bind: An attempt was made to access a socket in a way forbidden by its access permissions.
        	Test:       	TestAdminEvents
    testing.go:1232: TempDir RemoveAll cleanup: remove C:\Users\RUNNER~1\AppData\Local\Temp\TestAdminEvents1376418511\001\local.sql: The process cannot access the file because it is being used by another process.

=== FAIL: node TestAdminEvents_MultiSmesher (unknown)

This is unrelated to the db changes in this PR.

@ivan4th
Contributor Author

ivan4th commented Dec 5, 2024

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Dec 5, 2024
@spacemesh-bors

spacemesh-bors bot commented Dec 5, 2024

Pull request successfully merged into develop.

Build succeeded:

@spacemesh-bors spacemesh-bors bot changed the title sync2: multipeer: fix edge cases [Merged by Bors] - sync2: multipeer: fix edge cases Dec 5, 2024
@spacemesh-bors spacemesh-bors bot closed this Dec 5, 2024
@spacemesh-bors spacemesh-bors bot deleted the sync2/fix-multipeer branch December 5, 2024 01:39