
Use inbound peerset slots when a substream is received, rather than a connection #7464

Merged (42 commits) on Nov 16, 2020

Conversation

@tomaka (Contributor) commented on Oct 29, 2020:

Fix #7074

This implements the change described in #7074.

New protocol between behaviour and handler

It took me a lot of time, but I came up with this refactoring of the communication between the NetworkBehaviour and the ProtocolsHandler. I believe that the new asynchronous protocol between the behaviour and the handler is sound; in other words, it is not prone to race conditions.

The handler can now accept the following incoming messages:

  • Open
  • Close

It can also emit these messages:

  • OpenResultOk or OpenResultErr
  • CloseResult
  • OpenDesired
  • CloseDesired
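
For concreteness, here is a minimal sketch of the two message sets as plain Rust enums. The `NotifsHandlerIn` name matches the code quoted later in this conversation; the `NotifsHandlerOut` name and the payload-free shape of the variants are simplifying assumptions, not the actual handler API.

```rust
/// Messages the behaviour can send to the handler (sketch; payloads omitted).
pub enum NotifsHandlerIn {
    /// Ask the handler to open the notifications substream.
    Open,
    /// Ask the handler to close the notifications substream.
    Close,
}

/// Messages the handler can emit towards the behaviour (sketch; payloads omitted).
pub enum NotifsHandlerOut {
    /// Answer to a previous `Open`: the substream is now open.
    OpenResultOk,
    /// Answer to a previous `Open`: the opening has failed.
    OpenResultErr,
    /// Answer to a previous `Close`: the substream is now closed.
    CloseResult,
    /// While closed, the remote would like the state to become "open".
    OpenDesired,
    /// While open, the remote would like the state to become "closed".
    CloseDesired,
}
```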

Each Open message sent to the handler will result in a corresponding OpenResultOk or OpenResultErr being sent back. Each Close message will result in a corresponding CloseResult being sent back.
Thanks to this correspondence, the behaviour can track the state (open or closed) of the handler. For instance, if the behaviour sends a Close, then receives an OpenDesired, it knows that the OpenDesired was sent before the Close was processed.

When in closed state, the handler can also send an OpenDesired, indicating that the remote would like the state to transition to "open".
When in open state, the handler can conversely send a CloseDesired, indicating that the remote would like the state to transition to "closed".

As a small but acceptable hack, an OpenDesired should always be answered by sending either Open or Close; otherwise, the handler cannot differentiate between the situation where the behaviour is still undecided and the situation where the behaviour wants to remain in the "closed" state.
It is forbidden to remain "open" when a CloseDesired is received, and as such there is no similar ambiguity for that message. The behaviour must send Close as soon as possible.
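
As a sketch of these two rules, using the illustrative enums above (`answer_desired`, `peerset_accepts` and `send` are hypothetical stand-ins, not the real API):

```rust
fn answer_desired(msg: NotifsHandlerOut, peerset_accepts: bool, send: impl Fn(NotifsHandlerIn)) {
    match msg {
        // An `OpenDesired` must eventually be answered with `Open` or `Close`;
        // staying silent would be ambiguous between "still undecided" and
        // "wants to remain closed".
        NotifsHandlerOut::OpenDesired if peerset_accepts => send(NotifsHandlerIn::Open),
        NotifsHandlerOut::OpenDesired => send(NotifsHandlerIn::Close),
        // Remaining open after a `CloseDesired` is forbidden: always answer
        // with `Close` as soon as possible.
        NotifsHandlerOut::CloseDesired => send(NotifsHandlerIn::Close),
        // The `*Result*` messages are acknowledgements and need no answer.
        _ => {}
    }
}
```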

Inbound slots

At the moment, an inbound slot in the peerset is allocated whenever a TCP connection with a new peer is received.

This PR modifies this behaviour: peerset inbound slots are now allocated when an OpenDesired is received from the handler.
As before, the slots are freed when the substream is closed.

This ties the inbound peerset slots to the concept of substreams, rather than the concept of TCP connections.
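
A minimal sketch of the new allocation point, assuming the `sc_peerset::Peerset::incoming`/`dropped` calls that appear elsewhere in this conversation (`on_handler_event` and the index plumbing are illustrative, not the actual behaviour code):

```rust
use libp2p::PeerId;
use sc_peerset::{IncomingIndex, Peerset};

fn on_handler_event(
    peerset: &mut Peerset,
    peer_id: PeerId,
    next_incoming_index: IncomingIndex,
    event: NotifsHandlerOut,
) {
    // The inbound slot is requested only when the remote actually asks to
    // open the notifications substream, not when the TCP connection is
    // established. Freeing the slot still happens when the substream closes.
    if let NotifsHandlerOut::OpenDesired = event {
        peerset.incoming(peer_id, next_incoming_index);
    }
}
```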

The motivation behind this change is to ultimately introduce peerset slots allocated to specific notification protocols (see #6087). In the future, the messages emitted by the handler would be specific to a notifications protocol. For example, when a substream using the parachains collation protocol is opened, only a parachains collation slot would be allocated.

About opening_and_closing

In the implementation, I've opted to track, in the behaviour, the exact state of the handler. Being strict rather than tolerant will let us discover bugs sooner.

A consequence of tracking the exact state is that we verify that each OpenResultOk/OpenResultErr/CloseResult message emitted by the handler actually matches the state tracked in the behaviour.

But if the behaviour sends an Open message to the handler, then a Close message, then wants to send an Open message again, the amount of state to track starts to grow. Rather than introducing a Vec, I went with putting the connection in a field called opening_and_closing and "banning" the peer for 5 seconds. In other words, we wait 5 seconds before trying to send an Open message again.
Considering that this can only happen in case of quick state transitions coming from the peerset, I believe that this behaviour is completely acceptable, as it mimics the existing banning system.
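
A hypothetical condensed view of that transition; the real state machine in behaviour.rs tracks much more, so this only illustrates the parking-plus-ban idea:

```rust
use std::time::{Duration, Instant};

/// Per-connection substream state, reduced to the two variants that matter here.
enum ConnState {
    /// `Open` was sent; waiting for `OpenResultOk`/`OpenResultErr`.
    Opening,
    /// `Open` then `Close` were sent; waiting for the open result, then `CloseResult`.
    OpeningAndClosing,
}

fn disable(conn: &mut ConnState, banned_until: &mut Option<Instant>) {
    if let ConnState::Opening = conn {
        *conn = ConnState::OpeningAndClosing;
        // Mimic the existing banning system: don't attempt another `Open`
        // for 5 seconds, by which time the pending results will have drained.
        *banned_until = Some(Instant::now() + Duration::from_secs(5));
    }
}
```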

@tomaka added the A0-please_review (Pull request needs code review.), B0-silent (Changes should not be mentioned in any release notes.) and C1-low (PR touches the given topic and has a low impact on builders.) labels on Oct 29, 2020
@tomaka requested review from romanb and mxinden on October 29, 2020, 17:22
@tomaka commented on Oct 29, 2020:

Sorry for the lack of thorough explanations at the moment. This PR was very tiring to make, and I'm just very happy/relieved to finally open it and shut down my IDE. I'll edit the OP to provide more context tomorrow or next week.

@tomaka marked this pull request as ready for review on October 30, 2020, 11:57
@tomaka added the A3-in_progress (Pull request is in progress. No review needed at this stage.) label and removed A0-please_review on Oct 30, 2020
@tomaka commented on Oct 30, 2020:

Marked as ready for CI purposes only.

@tomaka commented on Nov 2, 2020:

Ready for review.

I apologize in advance for any headache caused by having to focus on this state machine code.
If you agree that the general structure is good, I'll start a burnin with debug-assertions = true. This way, we would trigger the various debug_assert!s I've put everywhere in case of a problem.

@tomaka added the A0-please_review label and removed A3-in_progress on Nov 2, 2020
Outdated review thread on client/network/src/protocol/generic_proto/behaviour.rs (resolved).
```diff
@@ -795,25 +1012,50 @@ impl GenericProto {
 		debug!(target: "sub-libp2p", "PSM => Accept({:?}, {:?}): Obsolete incoming,
 			sending back dropped", index, incoming.peer_id);
 		debug!(target: "sub-libp2p", "PSM <= Dropped({:?})", incoming.peer_id);
-		self.peerset.dropped(incoming.peer_id);
+		self.peerset.dropped(incoming.peer_id); // TODO: is that correct?!
```
Review comment (Contributor):

How do we find out?

Outdated review thread on client/network/src/protocol/generic_proto/behaviour.rs (resolved).

```rust
if opening_and_closing.is_empty() && closed.is_empty() && closing.is_empty() {
    if let Some(until) = banned_until {
        *entry.get_mut() = PeerState::Banned { until };
```
Review comment (Contributor):

When are these entries eventually removed if we don't reconnect? I think so far we always removed any particular entry on inject_disconnected().

@tomaka (author) replied:

That's a good point.
Banned concerns nodes that we aren't connected to, so if we removed them in inject_disconnected, that would be incorrect as well.

I'm going to fix that by adding a timer to them, and removing them in poll when the timer triggers.
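
For illustration, a minimal sketch of that fix, assuming a `futures_timer::Delay` per banned peer (`poll_banned_cleanup` and the `Vec` layout are hypothetical; the real `Banned` entries live in the peers map):

```rust
use futures::prelude::*; // for `poll_unpin`
use futures_timer::Delay;
use libp2p::PeerId;
use std::task::Context;

/// Called from `poll`: garbage-collects `Banned` entries whose timer has
/// fired, so peers we never reconnect to do not leak state forever.
fn poll_banned_cleanup(banned: &mut Vec<(PeerId, Delay)>, cx: &mut Context<'_>) {
    banned.retain_mut(|(_, delay)| delay.poll_unpin(cx).is_pending());
}
```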

```rust
    open_desired.is_empty()
{
    if let Some(until) = banned_until {
        *entry.get_mut() = PeerState::Banned { until };
```
Review comment (Contributor):

Same as above, I'm wondering when these entries are removed if we don't reconnect to that peer.

Three more outdated review threads on client/network/src/protocol/generic_proto/behaviour.rs (resolved).
@tomaka commented on Nov 4, 2020:

I've found the issue with the tests: when the peer is in the Enabled state, with one connection Open and one Closed, and the Open connection closes, we don't try to open the one that is Closed; instead we stay Enabled with no opening ever happening.

I'll fix that at the same time as I refactor the connections to be in only one list.
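
A sketch of what that fix amounts to (hypothetical; the real code keeps richer per-connection state than this three-variant enum):

```rust
/// Per-connection substream state, reduced for illustration.
#[derive(PartialEq)]
enum SubstreamState { Open, Opening, Closed }

/// When the last open or opening substream disappears while the peer is
/// Enabled, fall back to opening one of the still-Closed connections instead
/// of silently staying Enabled with nothing ever opening.
fn on_substream_closed(connections: &mut [SubstreamState]) {
    let any_active = connections
        .iter()
        .any(|c| matches!(c, SubstreamState::Open | SubstreamState::Opening));
    if !any_active {
        if let Some(c) = connections.iter_mut().find(|c| **c == SubstreamState::Closed) {
            *c = SubstreamState::Opening; // and send `NotifsHandlerIn::Open` to it
        }
    }
}
```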

Sorry for how long this PR is taking. This huge state machine requires a lot of focus, which I'm not necessarily good at.

@tomaka commented on Nov 4, 2020:

I did the refactoring. It looks like the tests are still failing, so I'll do more investigation/reviewing/debugging.

@mxinden (Contributor) left a review:

> But if the behaviour sends an Open message to the handler, then a Close message, then wants to send an Open message again, it starts being a lot of state to track. Rather than introducing a Vec, I went with putting the connection in a field called opening_and_closing and "banning" the peer for 5 seconds. In other words, we wait for 5 seconds before trying to send an Open message again.

This seems a bit like a hack to me, but to be honest I am unable to come up with an alternative.

> I'd appreciate a review of the actual details of all the state machine transitions, but I understand if you don't want to.

I took a deeper look, resulting in the questions below. I agree that the sheer size of these state machines makes this quite complex. I hope the debug_assert!s bring up any outstanding bugs.

> I suggest continuing the burnin (with debug-assertions = true) over the weekend, and merging early next week.

That sounds good to me.

Comment on lines 56 to 62
```rust
/// - `Requested`: No open connection, but requested by the peerset. Currently dialing.
/// - `Disabled`: Has open TCP connection(s) unbeknownst to the peerset. No substream is open.
/// - `Enabled`: Has open TCP connection(s), acknowledged by the peerset.
///   - Notifications substreams are open on at least one connection, and external
///     API has been notified.
///   - Notifications substreams aren't open.
/// - `Incoming`: Has open TCP connection(s) and remote would like to open substreams.
```
Review comment (Contributor):

Suggested change:
```diff
-/// - `Requested`: No open connection, but requested by the peerset. Currently dialing.
-/// - `Disabled`: Has open TCP connection(s) unbeknownst to the peerset. No substream is open.
-/// - `Enabled`: Has open TCP connection(s), acknowledged by the peerset.
-///   - Notifications substreams are open on at least one connection, and external
-///     API has been notified.
-///   - Notifications substreams aren't open.
-/// - `Incoming`: Has open TCP connection(s) and remote would like to open substreams.
+/// - [`PeerState::Requested`]: No open connection, but requested by the peerset. Currently dialing.
+/// - [`PeerState::Disabled`]: Has open TCP connection(s) unbeknownst to the peerset. No substream is open.
+/// - [`PeerState::Enabled`]: Has open TCP connection(s), acknowledged by the peerset.
+///   - Notifications substreams are open on at least one connection, and external
+///     API has been notified.
+///   - Notifications substreams aren't open.
+/// - [`PeerState::Incoming`]: Has open TCP connection(s) and remote would like to open substreams.
```

Should catch documentation drifting apart from the implementation, right?

```rust
Some(v) => v,
None => {
    error!(target: "sub-libp2p", "Overflow in next_incoming_index");
    return
```
Review comment (Contributor):

Given that next_incoming_id is a u64, I think it is safe to just panic here, no? Say we open 100 connections per second: a node would only reach this after roughly 5.8 billion years.
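
A quick back-of-envelope check of that claim, at 100 increments per second:

```rust
fn main() {
    let counter_range = u64::MAX as f64;                          // ≈ 1.84e19 values
    let increments_per_year = 100.0 * 60.0 * 60.0 * 24.0 * 365.0; // ≈ 3.15e9 per year
    // Prints roughly 5.85e9, i.e. about 5.8 billion years before overflow.
    println!("{:.2e} years", counter_range / increments_per_year);
}
```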

```rust
/// Connection is in the `Closed` state but a [`NotifsHandlerIn::Open`] message then a
/// [`NotifsHandlerIn::Close`] message has been sent. An `OpenResultOk`/`OpenResultErr` message
/// followed with a `CloseResult` message are expected.
OpeningAndClosing,
```
Review comment (Contributor):

Suggested change:
```diff
-OpeningAndClosing,
+OpeningThenClosing,
```

Would be more intuitive for me. I am fine either way.

@tomaka commented on Nov 16, 2020:

Since everything worked well during the weekend with debug-assertions = true, unless someone intends to review more, I'm planning to merge this later today.

@tomaka commented on Nov 16, 2020:

(this PR still needs at least one approving review to be merged)

@tomaka commented on Nov 16, 2020:

bot merge

@ghost commented on Nov 16, 2020:

Trying merge.

@ghost commented on Nov 16, 2020:

Merge failed: "At least 2 approving reviews are required by reviewers with write access."

@romanb self-requested a review on November 16, 2020, 15:39
@romanb (Contributor) left a review:

I still want to take a closer look at the most recent changes, but there is no reason to stall the PR. If I happen to have any important feedback, I will get in touch.

@tomaka commented on Nov 16, 2020:

@romanb Thanks

To be clear, my eagerness to merge this PR is also due to the fact that #7374, which is the next thing I will work on, will require substantial modifications to behaviour.rs and handler.rs again.

While the code changed by this PR might not be bug-free, our defensive-programming style, questionable though it is, means that bugs in this code will normally not be able to crash the node. To me, this makes it acceptable to merge this intermediate code.

@tomaka commented on Nov 16, 2020:

bot merge

@ghost commented on Nov 16, 2020:

Trying merge.

This pull request was closed.