
Share peers between syncing strategies #2814

Closed · wanted to merge 34 commits

Conversation

@dmitry-markin (Contributor) commented on Dec 26, 2023:

Introduce the ability to share peers between syncing strategies and reserve them for requests. This is needed to run `GapSync` as a separate strategy and, ultimately, to run Sync 2.0 alongside `ChainSync`.

@dmitry-markin added the T0-node label ("This PR/Issue is related to the topic 'node'.") on Dec 26, 2023
Base automatically changed from dm-warp-sync-strategy to master January 12, 2024 16:47
@dmitry-markin changed the title from "[draft] Share peers between syncing strategies" to "Share peers between syncing strategies" on Jan 30, 2024
@dmitry-markin marked this pull request as ready for review on January 30, 2024 14:24
@altonen (Contributor) left a comment:

Did a first pass and left some comments but I will have to go over it again.

I'm not super excited about `peer_best_blocks`, or about the fact that we store peers both in `PeerPool` and in each strategy and then have to deal at runtime with all the possible inconsistencies that follow from that. Ideally `PeerPool` would store strategy-specific data, but I don't know how feasible that is and I'll do some testing. I also think we may run into as-yet-unknown issues if `GapSync`, `ChainSync` and Sync 2.0 (or whatever its name will be) have independent (and potentially differing) views of peers. If the peer's best and common numbers were stored in `PeerPool`, all strategies could query and update them and we'd get rid of `peer_best_blocks`. What do you think?

I also think `allowed_requests` has to go, because `ChainSync` is now checking the availability of a peer three times, and the likelihood that at some point during future refactorings one of them goes out of sync with the others is non-zero.
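For illustration, a minimal sketch of a shared `PeerPool` along the lines suggested above, i.e. one that also owns the per-peer best/common numbers so all strategies query and update the same view. The `try_reserve_peer`/`free_peer` names mirror the PR, but the fields and `update_common_number` are hypothetical, and `PeerId` is stood in by a plain `u64` so the snippet compiles on its own:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type PeerId = u64; // primitive stand-in so the sketch compiles on its own

/// Per-peer data shared between all syncing strategies.
struct SharedPeer {
    best_number: u64,
    common_number: u64,
    reserved: bool,
}

/// A pool that owns the peers' best/common numbers, so every strategy queries
/// and updates the same view instead of keeping its own copy.
#[derive(Default)]
struct PeerPool {
    peers: HashMap<PeerId, SharedPeer>,
}

impl PeerPool {
    fn add_peer(&mut self, peer_id: PeerId, best_number: u64) {
        self.peers
            .insert(peer_id, SharedPeer { best_number, common_number: 0, reserved: false });
    }

    /// Reserve a peer for a request; returns `false` if it is already in use.
    fn try_reserve_peer(&mut self, peer_id: &PeerId) -> bool {
        match self.peers.get_mut(peer_id) {
            Some(peer) if !peer.reserved => {
                peer.reserved = true;
                true
            },
            _ => false,
        }
    }

    /// Free a peer once its request has completed or been canceled.
    fn free_peer(&mut self, peer_id: &PeerId) {
        if let Some(peer) = self.peers.get_mut(peer_id) {
            peer.reserved = false;
        }
    }

    /// Any strategy can bump the common number it has negotiated with the peer.
    fn update_common_number(&mut self, peer_id: &PeerId, number: u64) {
        if let Some(peer) = self.peers.get_mut(peer_id) {
            peer.common_number = peer.common_number.max(number);
        }
    }
}

fn main() {
    let pool = Arc::new(Mutex::new(PeerPool::default()));
    pool.lock().unwrap().add_peer(1, 100);
    assert!(pool.lock().unwrap().try_reserve_peer(&1));
    assert!(!pool.lock().unwrap().try_reserve_peer(&1)); // already reserved by a strategy
    pool.lock().unwrap().update_common_number(&1, 50);
    pool.lock().unwrap().free_peer(&1);
}
```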

state: Option<StateStrategy<B>>,
chain_sync: Option<ChainSync<B, Client>>,
peer_pool: Arc<Mutex<PeerPool>>,
peer_best_blocks: HashMap<PeerId, (B::Hash, NumberFor<B>)>,
Contributor:

Why is this here?

@dmitry-markin (Author):

It's needed to seed the peers when switching between strategies. Otherwise, for example, the state strategy won't be aware of the best hash/number (and won't know which peer to request the state from) until a block announcement is received from the peer.
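For context, a rough sketch (not the PR's code) of how `peer_best_blocks` can seed a newly activated strategy when switching, assuming an `add_peer`-style entry point on the strategy; all types and the `switch_to_state_strategy` helper are illustrative stand-ins:

```rust
use std::collections::HashMap;

// Primitive stand-ins so the sketch compiles on its own.
type PeerId = u64;
type Hash = [u8; 32];
type Number = u64;

struct StateStrategy;

impl StateStrategy {
    fn new(_target: Hash) -> Self {
        StateStrategy
    }

    /// The freshly created strategy learns about already-connected peers here,
    /// instead of waiting for their next block announcement.
    fn add_peer(&mut self, _peer_id: PeerId, _best_hash: Hash, _best_number: Number) {}
}

/// Switch to the state strategy, seeding it with the peers recorded in `peer_best_blocks`.
fn switch_to_state_strategy(
    peer_best_blocks: &HashMap<PeerId, (Hash, Number)>,
    target: Hash,
) -> StateStrategy {
    let mut state = StateStrategy::new(target);
    for (peer_id, (best_hash, best_number)) in peer_best_blocks {
        state.add_peer(*peer_id, *best_hash, *best_number);
    }
    state
}

fn main() {
    let peer_best_blocks = HashMap::from([(1u64, ([0u8; 32], 100u64))]);
    let _state = switch_to_state_strategy(&peer_best_blocks, [0u8; 32]);
}
```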

#[must_use]
fn add_peer_inner(
/// Process new peers assigning proper states and initiating requests.
fn handle_new_peers(
Contributor:

It's unclear to me why this function is needed and why the new peers can't be handled in `ChainSync::block_requests()`.

@dmitry-markin (Author):

I'll look into whether this can be simplified.

Comment on lines 1367 to 1368
self.actions.push(ChainSyncAction::CancelRequest { peer_id });
self.peer_pool.lock().free_peer(&peer_id);
Contributor:

I wonder if there's a race condition here. The peer is immediately freed, but the cancellation is postponed until `SyncingEngine` processes the event, so if some other strategy selects the peer while the request hasn't been canceled yet, it could result in two in-flight requests. Maybe the strategy/`SyncingStrategy` should be notified by `SyncingEngine` once the request is actually canceled, so that freeing the peer is safe.

@dmitry-markin (Author):

It's even worse than that. When another strategy initiates a request, the old one is automatically dropped. And when we finally cancel the request, we can cancel a legitimate request of another strategy.

I'm not super happy with any kind of ACKing and bookkeeping, but it looks like we need to keep track of pending cancellations in every strategy and free peers in `PeerPool` when something like `on_request_cancelled()` is heard from `SyncingEngine`.

Contributor:

Is there a reason why pending cancellations couldn't be stored in `SyncingStrategy`, so that once the request has been canceled, `SyncingEngine` calls `SyncingStrategy`, which releases the peer?

@dmitry-markin (Author):

This means intercepting request cancellation actions in `SyncingStrategy` on their way to `SyncingEngine`. Not the best design ever, but it's probably better than dealing with it in every strategy.
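A rough sketch of that interception idea, purely illustrative: `SyncingStrategy` records pending cancellations on their way to `SyncingEngine` and only frees the peer once an `on_request_cancelled()`-style acknowledgement comes back. Names and structure are assumptions, not the actual implementation:

```rust
use std::collections::HashSet;

type PeerId = u64; // primitive stand-in

/// What the strategies emit; `SyncingStrategy` inspects these before forwarding
/// them to `SyncingEngine`.
enum Action {
    CancelRequest { peer_id: PeerId },
    // ... other actions elided
}

#[derive(Default)]
struct PeerPool {
    reserved: HashSet<PeerId>,
}

impl PeerPool {
    fn free_peer(&mut self, peer_id: &PeerId) {
        self.reserved.remove(peer_id);
    }
}

#[derive(Default)]
struct SyncingStrategy {
    peer_pool: PeerPool,
    /// Cancellations forwarded to `SyncingEngine` but not yet acknowledged.
    pending_cancellations: HashSet<PeerId>,
}

impl SyncingStrategy {
    /// Intercept actions on their way to `SyncingEngine`: remember which peers
    /// have a cancellation in flight instead of freeing them immediately.
    fn process_action(&mut self, action: &Action) {
        if let Action::CancelRequest { peer_id } = action {
            self.pending_cancellations.insert(*peer_id);
        }
    }

    /// Called by `SyncingEngine` once the request has really been canceled;
    /// only now is it safe to hand the peer back to the pool.
    fn on_request_cancelled(&mut self, peer_id: PeerId) {
        if self.pending_cancellations.remove(&peer_id) {
            self.peer_pool.free_peer(&peer_id);
        }
    }
}

fn main() {
    let mut strategy = SyncingStrategy::default();
    strategy.peer_pool.reserved.insert(7);
    strategy.process_action(&Action::CancelRequest { peer_id: 7 });
    assert!(strategy.peer_pool.reserved.contains(&7)); // still reserved until the ACK
    strategy.on_request_cancelled(7);
    assert!(!strategy.peer_pool.reserved.contains(&7));
}
```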

Contributor:

If we stored the peer's best and common blocks in `PeerPool`, we'd get rid of `peer_best_blocks`, allow `SyncingEngine` to replace its `HashMap<PeerId, Peer<B>>` with `PeerPool`, and also allow it to free the peer once the request is canceled.

But now that I'm thinking about this whole approach again, is there any fundamental reason why `GapSync` and `ChainSync` couldn't send a request to the same peer at the same time? I believe this limitation only applies to the sending end, namely the strategies, and it made sense before, when everything was one state machine. The request handlers shouldn't care how many requests from the same peer are in the queue. If `GapSync` and `ChainSync` are separate, is there a valid reason why they couldn't send simultaneous requests to a peer? We must ascertain that two copies of the same request are not sent, because that'd get us banned, which implies the existence of some kind of `PeerPool` for shared data, but is there a reason why they couldn't send two different requests at the same time? So the limitation of one request per peer would still apply, but per strategy.

@dmitry-markin (Author):

> is there any fundamental reason why `GapSync` and `ChainSync` couldn't send a request to the same peer at the same time?

Valid point: there is nothing in the block request handler that forbids "simultaneous" requests. But if we get rid of the global `PeerPool`, we'll need to introduce a way to cancel specific requests, as otherwise strategies could cancel each other's requests.

Contributor:

Each strategy could be identified by a unique key, and when a strategy returns `StrategyAction::SendRequest { PeerId, Request }`, `SyncingStrategy` would convert it to `SyncingAction::SendRequest { PeerId, Key, Request }`, and `PendingResponses` would keep track of responses keyed by `(PeerId, Key)`.

@dmitry-markin (Author) commented on Feb 5, 2024:

Yes, that's what I had in mind — attaching strategy IDs to requests.
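A small sketch of what attaching strategy keys to requests could look like; the `StrategyKey` variants and the action shapes here are illustrative assumptions, not the final API:

```rust
use std::collections::HashMap;

type PeerId = u64; // primitive stand-in
type Request = Vec<u8>;

/// Identifies which strategy a request (and its eventual response) belongs to,
/// so one strategy canceling its own request cannot clobber another's.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum StrategyKey {
    Warp,
    State,
    ChainSync,
    GapSync,
}

/// Emitted by an individual strategy, without a key.
struct StrategyAction {
    peer_id: PeerId,
    request: Request,
}

/// What `SyncingStrategy` forwards to `SyncingEngine`, tagged with the key.
struct SyncingAction {
    peer_id: PeerId,
    key: StrategyKey,
    request: Request,
}

fn tag(action: StrategyAction, key: StrategyKey) -> SyncingAction {
    SyncingAction { peer_id: action.peer_id, key, request: action.request }
}

fn main() {
    // Pending responses are tracked per (peer, strategy) pair, so ChainSync and
    // GapSync can each have one request in flight to the same peer.
    let mut pending_responses: HashMap<(PeerId, StrategyKey), Request> = HashMap::new();

    let a = tag(StrategyAction { peer_id: 1, request: vec![0xaa] }, StrategyKey::ChainSync);
    let b = tag(StrategyAction { peer_id: 1, request: vec![0xbb] }, StrategyKey::GapSync);

    pending_responses.insert((a.peer_id, a.key), a.request);
    pending_responses.insert((b.peer_id, b.key), b.request);
    assert_eq!(pending_responses.len(), 2);
}
```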

common_number: Zero::zero(),
best_hash,
best_number,
state: PeerSyncState::New,
@dmitry-markin (Author):

A new peer and an Available peer are handled differently in the code, and the idea was to postpone the peer initialization that can lead to requests (`handle_new_peers()`) until `actions()`, so that we can call the strategies in a specific order and implement priorities for reserving peers (e.g., Sync 2.0, then `ChainSync`, then `GapSync`).
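A toy sketch of that ordering: `actions()` hands the new peers to the strategies in a fixed priority order, so a higher-priority strategy reserves peers before lower-priority ones see them. All names here are stand-ins, not the PR's code:

```rust
type PeerId = u64; // primitive stand-in

struct ChainSync;
impl ChainSync {
    /// Reserve (here: simply take) every new peer this strategy wants.
    fn handle_new_peers(&mut self, new_peers: &mut Vec<PeerId>) {
        new_peers.clear();
    }
}

struct GapSync;
impl GapSync {
    /// Lower priority: only sees whatever the strategies above left over.
    fn handle_new_peers(&mut self, _new_peers: &mut Vec<PeerId>) {}
}

/// `actions()` initializes new peers in a fixed priority order, so that
/// higher-priority strategies get to reserve peers first.
fn actions(new_peers: &mut Vec<PeerId>, chain_sync: &mut ChainSync, gap_sync: &mut GapSync) {
    // e.g. Sync 2.0 would be called first here, then ChainSync, then GapSync.
    chain_sync.handle_new_peers(new_peers);
    gap_sync.handle_new_peers(new_peers);
}

fn main() {
    let mut new_peers = vec![1, 2, 3];
    actions(&mut new_peers, &mut ChainSync, &mut GapSync);
    assert!(new_peers.is_empty());
}
```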


(Some(PeerSyncState::Available), None)
} else {
if self.peer_pool.try_reserve_peer(&peer_id) {
Contributor:

Nit: Can we move this `if` into the `else if` one line above? All the indentation in this file makes it a bit hard to follow.

as the peer is reserved by another syncing strategy.",
);

(None, None)
Contributor:

Nit: I think you could directly return `None` here and then get rid of the `if let Some()` a few lines below.

Ok(None)
},
}
self.allowed_requests.add(&peer_id);
Contributor:

Hmm, not directly related to this PR, but how does this `allowed_requests` work? It looks like it contains peers that we could potentially send a block request to. However, I don't fully understand why we regularly reset it to `All`, like here.

@dmitry-markin (Author):

I don't completely understand the logic behind it either; it looks like the only useful thing it does is block block requests during the state download in fast sync.

let mut matcher = self.extra_justifications.matcher();
std::iter::from_fn(move || {
if let Some((peer, request)) = matcher.next(peers) {
if let Some((peer_id, request)) = matcher.next(peers, peer_pool) {
// TODO: reserve the peer in `PeerPool`.
Contributor:

Leftover todo?

Edit: Ah, it looks like it's already done in the `next` call.

for mut available_peer in self.peer_pool.lock().available_peers() {
let peer_id = available_peer.peer_id();
if let Some(peer) = self.peers.get_mut(&peer_id) {
if peer.state.is_available() && peer.common_number >= sync.target_number() {
Contributor:

Here we check `peer.state.is_available()` even though the peer is in the `peer_pool` as available. I understand that the pool is shared between the strategies, but I am not sure whether it is legal for a peer to be available in the peer pool but not in `chain_sync`.

@dmitry-markin (Author):

Yes, it should not be needed.

@dmitry-markin (Author):

@skunert Thanks for reviewing the PR, but I'm about to publish another one that should completely supersede it, so please don't spend more time reviewing this one for now.

@dmitry-markin (Author):

Closing in favor of #3224.

github-merge-queue bot pushed a commit that referenced this pull request Feb 13, 2024
This PR should supersede #2814 and accomplish the same with fewer changes. It's needed to run sync strategies in parallel, like running `ChainSync` and `GapSync` as independent strategies, and running `ChainSync` and Sync 2.0 alongside each other.

The difference with #2814
is that we allow simultaneous requests to remote peers initiated by
different strategies, as this is not tracked on the remote node in any
way. Therefore, `PeerPool` is not needed.

CC @skunert

---------

Co-authored-by: Sebastian Kunert <skunert49@gmail.com>
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this pull request Mar 25, 2024
Labels: T0-node ("This PR/Issue is related to the topic 'node'.")
Project status: Blocked ⛔️
3 participants