Use inbound peerslot slots when a substream is received, rather than a connection #7464
Conversation
Sorry for the lack of thorough explanations at the moment. This PR was very tiring to make, and I'm just very happy/relieved to finally open it and shut down my IDE. I'll edit the OP to provide more context tomorrow or next week.

Marked as ready for CI purposes only.

Ready for review. I apologize in advance for any headache caused by having to focus on this state machine code.
```diff
@@ -795,25 +1012,50 @@ impl GenericProto {
 	debug!(target: "sub-libp2p", "PSM => Accept({:?}, {:?}): Obsolete incoming,
 		sending back dropped", index, incoming.peer_id);
 	debug!(target: "sub-libp2p", "PSM <= Dropped({:?})", incoming.peer_id);
-	self.peerset.dropped(incoming.peer_id);
+	self.peerset.dropped(incoming.peer_id); // TODO: is that correct?!
```
How do we find out?
```rust
if opening_and_closing.is_empty() && closed.is_empty() && closing.is_empty() {
	if let Some(until) = banned_until {
		*entry.get_mut() = PeerState::Banned { until };
```
When are these entries eventually removed if we don't reconnect? I think so far we always removed any particular entry on `inject_disconnected()`.
That's a good point. `Banned` concerns nodes that we aren't connected to, so if we removed them in `inject_disconnected`, that would be incorrect as well. I'm going to fix that by adding a timer to them, and removing them in `poll` when the timer triggers.
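For illustration, a minimal sketch of that removal step. The map and types below are stand-ins invented for this example, not the actual `GenericProto` fields, and the real fix also needs a timer to wake `poll` up once a ban expires:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Stand-in types for the example; not the actual `GenericProto` internals.
type PeerId = u64;

enum PeerState {
    Banned { until: Instant },
    Connected,
}

// Hypothetical cleanup step driven from `poll`: once a ban has expired, the
// `Banned` entry is removed so peers we never reconnect to don't leak entries.
fn cleanup_expired_bans(peers: &mut HashMap<PeerId, PeerState>) {
    let now = Instant::now();
    peers.retain(|_, state| match state {
        PeerState::Banned { until } => *until > now,
        _ => true,
    });
}

fn main() {
    let mut peers = HashMap::new();
    peers.insert(1, PeerState::Banned { until: Instant::now() });
    peers.insert(2, PeerState::Connected);
    std::thread::sleep(Duration::from_millis(10));
    cleanup_expired_bans(&mut peers);
    assert!(!peers.contains_key(&1)); // expired ban removed
    assert!(peers.contains_key(&2)); // other peers untouched
}
```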
```rust
	open_desired.is_empty()
{
	if let Some(until) = banned_until {
		*entry.get_mut() = PeerState::Banned { until };
```
Same as above, I'm wondering when these entries are removed if we don't reconnect to that peer.
Co-authored-by: Roman Borschel <romanb@users.noreply.github.com>
I've found the issue with the tests: it happens in the situation where the peer is in the … state. I'll fix that at the same time as I refactor the connections to be in only one list. Sorry for how long this PR is taking. This huge state machine requires a lot of focus, which I'm not necessarily good at.

I did the refactoring. It looks like the tests are still failing, so I'll do more investigation/reviewing/debugging.
> But if the behaviour sends an `Open` message to the handler, then a `Close` message, then wants to send an `Open` message again, it starts being a lot of state to track. Rather than introducing a `Vec`, I went with putting the connection in a field called `opening_and_closing` and "banning" the peer for 5 seconds. In other words, we wait for 5 seconds before trying to send an `Open` message again.

This seems a bit like a hack to me, but to be honest I am unable to come up with an alternative.

I'd appreciate a review of the actual details of all the state machine transitions, but I understand if you don't want to.
I took a deeper look, resulting in the questions below. I agree that the mere size of these state machines makes it quite complex. I hope the `debug_assert`s bring up any outstanding bugs.

I suggest continuing the burn-in (with `debug-assertions = true`) over the weekend, and eventually merging early next week.
That sounds good to me.
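As a generic illustration of the kind of defensive check being relied on here (an invented example, not code from this PR): with `debug-assertions = true`, as during the burn-in, an impossible state transition aborts the node, while an ordinary release build compiles the check away.

```rust
// Invented example: a defensive check on a state transition. With
// `debug-assertions = true` an unexpected message aborts the process;
// in a normal release build the check compiles to nothing.
#[derive(Debug, PartialEq)]
enum TrackedState { Opening, Open }

fn on_open_result_ok(state: &mut TrackedState) {
    // An `OpenResultOk` must only arrive while we believe an `Open` is in flight.
    debug_assert_eq!(*state, TrackedState::Opening, "unexpected OpenResultOk");
    *state = TrackedState::Open;
}

fn main() {
    let mut state = TrackedState::Opening;
    on_open_result_ok(&mut state);
    assert_eq!(state, TrackedState::Open);
}
```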
```rust
/// - `Requested`: No open connection, but requested by the peerset. Currently dialing.
/// - `Disabled`: Has open TCP connection(s) unbeknownst to the peerset. No substream is open.
/// - `Enabled`: Has open TCP connection(s), acknowledged by the peerset.
///   - Notifications substreams are open on at least one connection, and external
///     API has been notified.
///   - Notifications substreams aren't open.
/// - `Incoming`: Has open TCP connection(s) and remote would like to open substreams.
```
```diff
-/// - `Requested`: No open connection, but requested by the peerset. Currently dialing.
-/// - `Disabled`: Has open TCP connection(s) unbeknownst to the peerset. No substream is open.
-/// - `Enabled`: Has open TCP connection(s), acknowledged by the peerset.
-///   - Notifications substreams are open on at least one connection, and external
-///     API has been notified.
-///   - Notifications substreams aren't open.
-/// - `Incoming`: Has open TCP connection(s) and remote would like to open substreams.
+/// - [`PeerState::Requested`]: No open connection, but requested by the peerset. Currently dialing.
+/// - [`PeerState::Disabled`]: Has open TCP connection(s) unbeknownst to the peerset. No substream is open.
+/// - [`PeerState::Enabled`]: Has open TCP connection(s), acknowledged by the peerset.
+///   - Notifications substreams are open on at least one connection, and external
+///     API has been notified.
+///   - Notifications substreams aren't open.
+/// - [`PeerState::Incoming`]: Has open TCP connection(s) and remote would like to open substreams.
```
Should catch documentation drifting apart from the implementation, right?
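For readers without the file open, here is a rough sketch of the state machine these doc comments describe. The variant names follow the documentation above; the fields shown are simplifications invented for the example, not the actual definitions in `generic_proto`.

```rust
use std::time::Instant;

// Rough sketch only; fields are illustrative.
#[allow(dead_code)]
enum PeerState {
    /// No open connection, but requested by the peerset. Currently dialing.
    Requested,
    /// Has open TCP connection(s) unbeknownst to the peerset. No substream is open.
    Disabled,
    /// Has open TCP connection(s), acknowledged by the peerset. Notification
    /// substreams may or may not be open yet.
    Enabled,
    /// Has open TCP connection(s) and the remote would like to open substreams.
    Incoming,
    /// Not connected; we refuse to reconnect until the given instant.
    Banned { until: Instant },
}

fn main() {
    // Example only: a freshly requested peer starts in `Requested`.
    let _state = PeerState::Requested;
}
```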
```rust
Some(v) => v,
None => {
	error!(target: "sub-libp2p", "Overflow in next_incoming_index");
	return
```
Given that `next_incoming_id` is a `u64`, I think it is safe to just panic here, no? Say we open 100 connections per second; a node would only reach this after roughly 5.8 billion years.
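A quick back-of-the-envelope check of that figure (assuming a steady 100 new incoming connections per second):

```rust
// Rough arithmetic only: how long until a u64 counter incremented
// 100 times per second overflows.
fn main() {
    let per_second: u64 = 100;
    let seconds_per_year: u64 = 60 * 60 * 24 * 365;
    let years = u64::MAX / per_second / seconds_per_year;
    println!("{} years until overflow", years); // ~5_849_424_173 years
}
```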
```rust
/// Connection is in the `Closed` state but a [`NotifsHandlerIn::Open`] message then a
/// [`NotifsHandlerIn::Close`] message has been sent. An `OpenResultOk`/`OpenResultErr` message
/// followed with a `CloseResult` message are expected.
OpeningAndClosing,
```
```diff
-OpeningAndClosing,
+OpeningThenClosing,
```
Would be more intuitive for me. I am fine either way.
Co-authored-by: Max Inden <mail@max-inden.de>
Since everything worked well during the weekend with `debug-assertions` enabled, I'd like to go ahead and merge.

(this PR still needs at least one approving review to be merged)

bot merge

Trying merge.

Merge failed:
I still want to take a closer look at the most recent changes, but there is no reason to stall the PR. If I happen to have any important feedback, I will get in touch.
@romanb Thanks! To be clear, my eagerness to merge this PR is also due to the fact that #7374, which is the next thing I will work on, will require substantial modifications to this code. While the code changed by this PR might not be bug-free, our defensive-programming style, even though it is questionable, means that bugs in this code will normally not be able to crash the node. To me, this makes it acceptable to merge this intermediate code.

bot merge

Trying merge.
Fix #7074
This implements the change described in #7074.
## New protocol between behaviour and handler

It took me a lot of time, but I came up with this refactoring of the communication between the `NetworkBehaviour` and the `ProtocolsHandler`. I believe that the new asynchronous protocol between the behaviour and handler is sound, in other words that it is not prone to race conditions.

The handler can now accept the following incoming messages:

- `Open`
- `Close`

It can also emit these messages:

- `OpenResultOk` or `OpenResultErr`
- `CloseResult`
- `OpenDesired`
- `CloseDesired`

Each `Open` message sent to the handler will result in a corresponding `OpenResultOk` or `OpenResultErr` being sent back. Each `Close` message will result in a corresponding `CloseResult` being sent back.

Thanks to this correspondence, the behaviour can track the state (open or closed) of the handler. For instance, if the behaviour sends a `Close`, then receives an `OpenDesired`, it knows that the `OpenDesired` was sent before the `Close` was processed.

When in the closed state, the handler can also send an `OpenDesired`, indicating that the remote would like the state to transition to "open".

When in the open state, the handler can, vice versa, send a `CloseDesired`, indicating that the remote would like the state to transition to "closed".

As a small acceptable hack, an `OpenDesired` should always be answered by sending either `Open` or `Close`; otherwise the handler cannot differentiate between the situation where the behaviour is still undecided and the situation where the behaviour wants to remain in the "closed" state.

It is forbidden to remain "open" when a `CloseDesired` is received, and as such there is no similar ambiguity for that message. The behaviour must send `Close` as soon as possible.
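To make the message sets and the two response rules above easier to see at a glance, here is a rough sketch in Rust. The `NotifsHandlerIn` name appears in the code touched by this PR; `NotifsHandlerOut`, the omission of payloads, and the `react` helper are assumptions made purely for illustration.

```rust
/// Messages the behaviour sends to the handler (sketch; payloads omitted).
enum NotifsHandlerIn {
    /// Ask the handler to open the notifications substream(s).
    Open,
    /// Ask the handler to close the notifications substream(s).
    Close,
}

/// Messages the handler reports back to the behaviour (sketch; payloads omitted).
#[allow(dead_code)]
enum NotifsHandlerOut {
    /// Answer to `Open`: the substream is now open.
    OpenResultOk,
    /// Answer to `Open`: the substream could not be opened.
    OpenResultErr,
    /// Answer to `Close`: the substream is now closed.
    CloseResult,
    /// The remote would like the substream to be open.
    OpenDesired,
    /// The remote would like the substream to be closed.
    CloseDesired,
}

/// Invented helper showing the two response rules: `OpenDesired` is always
/// answered with either `Open` or `Close` (so "undecided" and "stay closed"
/// can be told apart), and `CloseDesired` is always answered with `Close`.
fn react(msg: &NotifsHandlerOut, peerset_accepts: bool) -> Option<NotifsHandlerIn> {
    match msg {
        NotifsHandlerOut::OpenDesired if peerset_accepts => Some(NotifsHandlerIn::Open),
        NotifsHandlerOut::OpenDesired => Some(NotifsHandlerIn::Close),
        NotifsHandlerOut::CloseDesired => Some(NotifsHandlerIn::Close),
        // The `*Result*` messages only acknowledge earlier requests.
        _ => None,
    }
}

fn main() {
    assert!(matches!(
        react(&NotifsHandlerOut::OpenDesired, true),
        Some(NotifsHandlerIn::Open)
    ));
    assert!(matches!(
        react(&NotifsHandlerOut::CloseDesired, false),
        Some(NotifsHandlerIn::Close)
    ));
}
```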
## Inbound slots

At the moment, an inbound slot in the peerset is allocated whenever a TCP connection with a new peer is received.

This PR modifies this behaviour: peerset inbound slots are now allocated when an `OpenDesired` is received from the handler. Similarly, the slots are freed when the substream is closed (this was already the case before).

This ties the inbound peerset slots to the concept of substreams, rather than the concept of TCP connections.

The motivation behind this change is to ultimately introduce peerset slots attributed to certain notification protocols (see #6087). In the future, the messages emitted by the handler would be specific to a notifications protocol. For example, when a substream using the parachains collation protocol is opened, only a parachains collation slot would be allocated.
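As an illustration of the timing change, here is a minimal sketch with a stand-in slot counter; the type and method names below are invented for the example and are not the real `sc-peerset` API.

```rust
// Stand-in for the peerset's inbound slot accounting (illustration only).
struct Peerset {
    used_inbound_slots: usize,
    max_inbound_slots: usize,
}

impl Peerset {
    // Try to allocate an inbound slot for a peer that wants a substream.
    fn try_allocate_inbound(&mut self) -> bool {
        if self.used_inbound_slots < self.max_inbound_slots {
            self.used_inbound_slots += 1;
            true
        } else {
            false
        }
    }

    // Free the slot again when the substream is closed.
    fn release_inbound(&mut self) {
        self.used_inbound_slots -= 1;
    }
}

fn main() {
    let mut peerset = Peerset { used_inbound_slots: 0, max_inbound_slots: 1 };

    // Before this PR: a slot would be taken when the TCP connection is established.
    // After this PR: nothing happens at connection establishment.

    // A slot is only taken once the handler reports `OpenDesired`...
    assert!(peerset.try_allocate_inbound());

    // ...and it is freed when the notifications substream is closed.
    peerset.release_inbound();
    assert_eq!(peerset.used_inbound_slots, 0);
}
```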
## About `opening_and_closing`

In the implementation, I've opted to track, in the behaviour, the exact state of the handler. Not being tolerant will let us discover bugs sooner.

A consequence of tracking the exact state is that we verify that each `OpenResultOk`/`OpenResultErr`/`CloseResult` message emitted by the handler actually matches the state tracked in the behaviour.

But if the behaviour sends an `Open` message to the handler, then a `Close` message, then wants to send an `Open` message again, that becomes a lot of state to track. Rather than introducing a `Vec`, I went with putting the connection in a field called `opening_and_closing` and "banning" the peer for 5 seconds. In other words, we wait for 5 seconds before trying to send an `Open` message again.

Considering that this can only happen in case of quick state transitions coming from the peerset, I believe that this behaviour is completely ok, as it mimics the existing banning system.
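A simplified sketch of that transition follows; the state and field names are invented for the example, and only the 5-second delay comes from the description above.

```rust
use std::time::{Duration, Instant};

// Simplified sketch: when the behaviour has already sent `Open` and now needs
// to close again, the connection is parked in an "opening then closing" state
// and the peer is banned for 5 seconds, so a new `Open` is not attempted
// before the previous exchange has settled.
enum ConnState {
    Opening,
    OpeningThenClosing,
}

struct PeerInfo {
    conn: ConnState,
    banned_until: Option<Instant>,
}

fn close_while_opening(peer: &mut PeerInfo) {
    // `Open` is in flight; send `Close` right behind it and remember that we
    // now expect an `OpenResultOk`/`OpenResultErr` followed by a `CloseResult`.
    peer.conn = ConnState::OpeningThenClosing;
    // Mimic the existing banning system: wait 5 seconds before re-opening.
    peer.banned_until = Some(Instant::now() + Duration::from_secs(5));
}

fn main() {
    let mut peer = PeerInfo { conn: ConnState::Opening, banned_until: None };
    close_while_opening(&mut peer);
    assert!(matches!(peer.conn, ConnState::OpeningThenClosing));
    assert!(peer.banned_until.is_some());
}
```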