
Use inbound peerset slots when a substream is received, rather than a connection #7464

Merged (42 commits) on Nov 16, 2020

Conversation

@tomaka (Contributor) commented on Oct 29, 2020:

Fix #7074

This implements the change described in #7074.

New protocol between behaviour and handler

It took me a lot of time, but I came up with this refactoring of the communication between the NetworkBehaviour and the ProtocolsHandler. I believe that the new asynchronous protocol between the behaviour and the handler is sound; in other words, it is not prone to race conditions.

The handler can now accept the following incoming messages:

  • Open
  • Close

It can also emit these messages:

  • OpenResultOk or OpenResultErr
  • CloseResult
  • OpenDesired
  • CloseDesired
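
For concreteness, here is a minimal sketch of the two message sets as plain Rust enums. The `NotifsHandlerIn` name matches the code quoted later in this conversation; the `NotifsHandlerOut` name and the payload-free shape of the variants are simplifying assumptions, not the actual handler API.

```rust
/// Messages the behaviour can send to the handler (sketch; payloads omitted).
pub enum NotifsHandlerIn {
    /// Ask the handler to open the notifications substream.
    Open,
    /// Ask the handler to close the notifications substream.
    Close,
}

/// Messages the handler can emit towards the behaviour (sketch; payloads omitted).
pub enum NotifsHandlerOut {
    /// Answer to a previous `Open`: the substream is now open.
    OpenResultOk,
    /// Answer to a previous `Open`: the opening has failed.
    OpenResultErr,
    /// Answer to a previous `Close`: the substream is now closed.
    CloseResult,
    /// While closed, the remote would like the state to become "open".
    OpenDesired,
    /// While open, the remote would like the state to become "closed".
    CloseDesired,
}
```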

Each Open message sent to the handler will result in a corresponding OpenResultOk or OpenResultErr being sent back. Each Close message will result in a corresponding CloseResult being sent back.
Thanks to this correspondence, the behaviour can track the state (open or closed) of the handler. For instance, if the behaviour sends a Close, then receives an OpenDesired, it knows that the OpenDesired was sent before the Close was processed.

When in closed state, the handler can also send an OpenDesired, indicating that the remote would like the state to transition to "open".
When in open state, the handler can conversely send a CloseDesired, indicating that the remote would like the state to transition to "closed".

As a small but acceptable hack, an OpenDesired should always be answered by sending either Open or Close; otherwise, the handler cannot differentiate between the situation where the behaviour is still undecided and the situation where the behaviour wants to remain in the "closed" state.
It is forbidden to remain "open" when a CloseDesired is received, and as such there is no similar ambiguity for that message. The behaviour must send Close as soon as possible.
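
As a sketch of these two rules, using the illustrative enums above (`answer_desired`, `peerset_accepts` and `send` are hypothetical stand-ins, not the real API):

```rust
fn answer_desired(msg: NotifsHandlerOut, peerset_accepts: bool, send: impl Fn(NotifsHandlerIn)) {
    match msg {
        // An `OpenDesired` must eventually be answered with `Open` or `Close`;
        // staying silent would be ambiguous between "still undecided" and
        // "wants to remain closed".
        NotifsHandlerOut::OpenDesired if peerset_accepts => send(NotifsHandlerIn::Open),
        NotifsHandlerOut::OpenDesired => send(NotifsHandlerIn::Close),
        // Remaining open after a `CloseDesired` is forbidden: always answer
        // with `Close` as soon as possible.
        NotifsHandlerOut::CloseDesired => send(NotifsHandlerIn::Close),
        // The `*Result*` messages are acknowledgements and need no answer.
        _ => {}
    }
}
```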

Inbound slots

At the moment, an inbound slot in the peerset is allocated whenever a TCP connection with a new peer is received.

This PR modifies this behaviour: peerset inbound slots are now allocated when an OpenDesired is received from the handler.
As before, the slots are freed when the substream is closed.

This ties the inbound peerset slots to the concept of substreams, rather than the concept of TCP connections.
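
A minimal sketch of the new allocation point, assuming the `sc_peerset::Peerset::incoming`/`dropped` calls that appear elsewhere in this conversation (`on_handler_event` and the index plumbing are illustrative, not the actual behaviour code):

```rust
use libp2p::PeerId;
use sc_peerset::{IncomingIndex, Peerset};

fn on_handler_event(
    peerset: &mut Peerset,
    peer_id: PeerId,
    next_incoming_index: IncomingIndex,
    event: NotifsHandlerOut,
) {
    // The inbound slot is requested only when the remote actually asks to
    // open the notifications substream, not when the TCP connection is
    // established. Freeing the slot still happens when the substream closes.
    if let NotifsHandlerOut::OpenDesired = event {
        peerset.incoming(peer_id, next_incoming_index);
    }
}
```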

The motivation behind this change is to ultimately introduce peerset slots allocated to specific notification protocols (see #6087). In the future, the messages emitted by the handler would be specific to a notifications protocol. For example, when a substream using the parachains collation protocol is opened, only a parachains collation slot would be allocated.

About opening_and_closing

In the implementation, I've opted to track, in the behaviour, the exact state of the handler. Being strict rather than tolerant will let us discover bugs sooner.

A consequence of tracking the exact state is that we verify that each OpenResultOk/OpenResultErr/CloseResult message emitted by the handler actually matches the state tracked in the behaviour.

But if the behaviour sends an Open message to the handler, then a Close message, then wants to send an Open message again, the amount of state to track starts to grow. Rather than introducing a Vec, I went with putting the connection in a field called opening_and_closing and "banning" the peer for 5 seconds. In other words, we wait 5 seconds before trying to send an Open message again.
Considering that this can only happen in case of quick state transitions coming from the peerset, I believe that this behaviour is completely acceptable, as it mimics the existing banning system.
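
A hypothetical condensed view of that transition; the real state machine in behaviour.rs tracks much more, so this only illustrates the parking-plus-ban idea:

```rust
use std::time::{Duration, Instant};

/// Per-connection substream state, reduced to the two variants that matter here.
enum ConnState {
    /// `Open` was sent; waiting for `OpenResultOk`/`OpenResultErr`.
    Opening,
    /// `Open` then `Close` were sent; waiting for the open result, then `CloseResult`.
    OpeningAndClosing,
}

fn disable(conn: &mut ConnState, banned_until: &mut Option<Instant>) {
    if let ConnState::Opening = conn {
        *conn = ConnState::OpeningAndClosing;
        // Mimic the existing banning system: don't attempt another `Open`
        // for 5 seconds, by which time the pending results will have drained.
        *banned_until = Some(Instant::now() + Duration::from_secs(5));
    }
}
```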

@tomaka added the A0-please_review (Pull request needs code review.), B0-silent (Changes should not be mentioned in any release notes.) and C1-low (PR touches the given topic and has a low impact on builders.) labels on Oct 29, 2020
@tomaka requested review from romanb and mxinden on October 29, 2020, 17:22
@tomaka commented on Oct 29, 2020:

Sorry for the lack of thorough explanations at the moment. This PR was very tiring to make, and I'm just very happy/relieved to finally open it and shut down my IDE. I'll edit the OP to provide more context tomorrow or next week.

@tomaka marked this pull request as ready for review on October 30, 2020, 11:57
@tomaka added the A3-in_progress (Pull request is in progress. No review needed at this stage.) label and removed A0-please_review on Oct 30, 2020
@tomaka commented on Oct 30, 2020:

Marked as ready for CI purposes only.

@tomaka commented on Nov 2, 2020:

Ready for review.

I apologize in advance for any headache caused by having to focus on this state machine code.
If you agree that the general structure is good, I'll start a burnin with debug-assertions = true. This way, we would trigger the various debug_assert!s I've put everywhere in case of a problem.

@tomaka added the A0-please_review label and removed A3-in_progress on Nov 2, 2020
Outdated review thread on client/network/src/protocol/generic_proto/behaviour.rs (resolved).
```diff
@@ -795,25 +1012,50 @@ impl GenericProto {
 		debug!(target: "sub-libp2p", "PSM => Accept({:?}, {:?}): Obsolete incoming,
 			sending back dropped", index, incoming.peer_id);
 		debug!(target: "sub-libp2p", "PSM <= Dropped({:?})", incoming.peer_id);
-		self.peerset.dropped(incoming.peer_id);
+		self.peerset.dropped(incoming.peer_id); // TODO: is that correct?!
```
Review comment (Contributor):

How do we find out?

Outdated review thread on client/network/src/protocol/generic_proto/behaviour.rs (resolved).

```rust
if opening_and_closing.is_empty() && closed.is_empty() && closing.is_empty() {
    if let Some(until) = banned_until {
        *entry.get_mut() = PeerState::Banned { until };
```
Review comment (Contributor):

When are these entries eventually removed if we don't reconnect? I think so far we always removed any particular entry on inject_disconnected().

@tomaka (author) replied:

That's a good point.
Banned concerns nodes that we aren't connected to, so if we removed them in inject_disconnected, that would be incorrect as well.

I'm going to fix that by adding a timer to them, and removing them in poll when the timer triggers.
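
For illustration, a minimal sketch of that fix, assuming a `futures_timer::Delay` per banned peer (`poll_banned_cleanup` and the `Vec` layout are hypothetical; the real `Banned` entries live in the peers map):

```rust
use futures::prelude::*; // for `poll_unpin`
use futures_timer::Delay;
use libp2p::PeerId;
use std::task::Context;

/// Called from `poll`: garbage-collects `Banned` entries whose timer has
/// fired, so peers we never reconnect to do not leak state forever.
fn poll_banned_cleanup(banned: &mut Vec<(PeerId, Delay)>, cx: &mut Context<'_>) {
    banned.retain_mut(|(_, delay)| delay.poll_unpin(cx).is_pending());
}
```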

```rust
    open_desired.is_empty()
{
    if let Some(until) = banned_until {
        *entry.get_mut() = PeerState::Banned { until };
```
Review comment (Contributor):

Same as above, I'm wondering when these entries are removed if we don't reconnect to that peer.

Three more outdated review threads on client/network/src/protocol/generic_proto/behaviour.rs (resolved).
@tomaka commented on Nov 4, 2020:

I've found the issue with the tests: when the peer is in the Enabled state, with one connection Open and one Closed, and the Open connection closes, we don't try to open the one that is Closed; instead we stay Enabled with no opening ever happening.

I'll fix that at the same time as I refactor the connections to be in only one list.
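
A sketch of what that fix amounts to (hypothetical; the real code keeps richer per-connection state than this three-variant enum):

```rust
/// Per-connection substream state, reduced for illustration.
#[derive(PartialEq)]
enum SubstreamState { Open, Opening, Closed }

/// When the last open or opening substream disappears while the peer is
/// Enabled, fall back to opening one of the still-Closed connections instead
/// of silently staying Enabled with nothing ever opening.
fn on_substream_closed(connections: &mut [SubstreamState]) {
    let any_active = connections
        .iter()
        .any(|c| matches!(c, SubstreamState::Open | SubstreamState::Opening));
    if !any_active {
        if let Some(c) = connections.iter_mut().find(|c| **c == SubstreamState::Closed) {
            *c = SubstreamState::Opening; // and send `NotifsHandlerIn::Open` to it
        }
    }
}
```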

Sorry for how long this PR is taking. This huge state machine requires a lot of focus, which I'm not necessarily good at.

@tomaka commented on Nov 4, 2020:

I did the refactoring. It looks like the tests are still failing, so I'll do more investigation/reviewing/debugging.

@mxinden (Contributor) left a review:

> But if the behaviour sends an Open message to the handler, then a Close message, then wants to send an Open message again, it starts being a lot of state to track. Rather than introducing a Vec, I went with putting the connection in a field called opening_and_closing and "banning" the peer for 5 seconds. In other words, we wait for 5 seconds before trying to send an Open message again.

This seems a bit like a hack to me, but to be honest I am unable to come up with an alternative.

> I'd appreciate a review of the actual details of all the state machine transitions, but I understand if you don't want to.

I took a deeper look, resulting in the questions below. I agree that the sheer size of these state machines makes this quite complex. I hope the debug_assert!s bring up any outstanding bugs.

> I suggest continuing the burnin (with debug-assertions = true) over the weekend, and merging early next week.

That sounds good to me.

Comment on lines 56 to 62
```rust
/// - `Requested`: No open connection, but requested by the peerset. Currently dialing.
/// - `Disabled`: Has open TCP connection(s) unbeknownst to the peerset. No substream is open.
/// - `Enabled`: Has open TCP connection(s), acknowledged by the peerset.
///   - Notifications substreams are open on at least one connection, and external
///     API has been notified.
///   - Notifications substreams aren't open.
/// - `Incoming`: Has open TCP connection(s) and remote would like to open substreams.
```
Review comment (Contributor):

Suggested change:
```diff
-/// - `Requested`: No open connection, but requested by the peerset. Currently dialing.
-/// - `Disabled`: Has open TCP connection(s) unbeknownst to the peerset. No substream is open.
-/// - `Enabled`: Has open TCP connection(s), acknowledged by the peerset.
-///   - Notifications substreams are open on at least one connection, and external
-///     API has been notified.
-///   - Notifications substreams aren't open.
-/// - `Incoming`: Has open TCP connection(s) and remote would like to open substreams.
+/// - [`PeerState::Requested`]: No open connection, but requested by the peerset. Currently dialing.
+/// - [`PeerState::Disabled`]: Has open TCP connection(s) unbeknownst to the peerset. No substream is open.
+/// - [`PeerState::Enabled`]: Has open TCP connection(s), acknowledged by the peerset.
+///   - Notifications substreams are open on at least one connection, and external
+///     API has been notified.
+///   - Notifications substreams aren't open.
+/// - [`PeerState::Incoming`]: Has open TCP connection(s) and remote would like to open substreams.
```

Should catch documentation drifting apart from the implementation, right?

```rust
Some(v) => v,
None => {
    error!(target: "sub-libp2p", "Overflow in next_incoming_index");
    return
```
Review comment (Contributor):

Given that next_incoming_id is a u64, I think it is safe to just panic here, no? Say we open 100 connections per second: a node would only reach this after roughly 5.8 billion years.
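
A quick back-of-envelope check of that claim, at 100 increments per second:

```rust
fn main() {
    let counter_range = u64::MAX as f64;                          // ≈ 1.84e19 values
    let increments_per_year = 100.0 * 60.0 * 60.0 * 24.0 * 365.0; // ≈ 3.15e9 per year
    // Prints roughly 5.85e9, i.e. about 5.8 billion years before overflow.
    println!("{:.2e} years", counter_range / increments_per_year);
}
```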

```rust
/// Connection is in the `Closed` state but a [`NotifsHandlerIn::Open`] message then a
/// [`NotifsHandlerIn::Close`] message has been sent. An `OpenResultOk`/`OpenResultErr` message
/// followed with a `CloseResult` message are expected.
OpeningAndClosing,
```
Review comment (Contributor):

Suggested change:
```diff
-OpeningAndClosing,
+OpeningThenClosing,
```

Would be more intuitive for me. I am fine either way.

@tomaka commented on Nov 16, 2020:

Since everything worked well during the weekend with debug-assertions = true, unless someone intends to review more, I'm planning to merge this later today.

@tomaka commented on Nov 16, 2020:

(this PR still needs at least one approving review to be merged)

@tomaka commented on Nov 16, 2020:

bot merge

@ghost commented on Nov 16, 2020:

Trying merge.

@ghost commented on Nov 16, 2020:

Merge failed: "At least 2 approving reviews are required by reviewers with write access."

@romanb self-requested a review on November 16, 2020, 15:39
@romanb (Contributor) left a review:

I still want to take a closer look at the most recent changes, but there is no reason to stall the PR. If I happen to have any important feedback, I will get in touch.

@tomaka commented on Nov 16, 2020:

@romanb Thanks

To be clear, my eagerness to merge this PR is also due to the fact that #7374, which is the next thing I will work on, will require substantial modifications to behaviour.rs and handler.rs again.

While the code changed by this PR might not be bug-free, our defensive-programming style, questionable though it is, means that bugs in this code will normally not be able to crash the node. To me, this makes it acceptable to merge this intermediate code.

@tomaka commented on Nov 16, 2020:

bot merge

@ghost commented on Nov 16, 2020:

Trying merge.

This pull request was closed.