This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

No longer actively open legacy substreams #7076

Merged
22 commits merged into paritytech:master from no-longer-open-substream on Oct 16, 2020

Conversation


@tomaka tomaka commented Sep 10, 2020

Tackles bullet number 2 in this comment.

Based on top of #7075
Shouldn't be merged now, as we need to publish a version between #7075 and this.
The intention of this PR, for now, is to check whether CI is green, in order to make sure that #7075 is working properly. If we merge a broken version of #7075, we will have to wait again for a bugfix release.

This PR changes legacy.rs to no longer pro-actively open a legacy substream.
A consequence of this change is that we can no longer establish outgoing connections to nodes that don't have #7075 (hence the need for a release). However, we can still receive incoming connections from nodes that don't have #7075.

The diff is quite large because of all the side clean-ups, but the core part of the changes is that legacy.rs no longer emits OutboundSubstreamRequest.

I've opted to keep the timeout system on the listening side as long as #7074 isn't resolved. After 60 seconds of inactivity on the legacy substream, the connection is force-closed, thereby freeing the peerset slot.
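
(For illustration only: a minimal sketch of the inactivity-timeout idea described above. The names and structure are hypothetical, not the actual legacy.rs code.)

```rust
use std::time::{Duration, Instant};

/// Hypothetical tracker mirroring the mechanism described above: the
/// listening side keeps the connection only while the legacy substream shows
/// activity, and force-closes it after 60 seconds of inactivity, freeing the
/// peerset slot.
struct LegacyInactivityTimeout {
    last_activity: Instant,
    timeout: Duration,
}

impl LegacyInactivityTimeout {
    fn new() -> Self {
        LegacyInactivityTimeout {
            last_activity: Instant::now(),
            timeout: Duration::from_secs(60),
        }
    }

    /// Called whenever a message is received on the legacy substream.
    fn on_activity(&mut self) {
        self.last_activity = Instant::now();
    }

    /// Returns true once the connection should be force-closed.
    fn should_force_close(&self) -> bool {
        self.last_activity.elapsed() >= self.timeout
    }
}

fn main() {
    let mut tracker = LegacyInactivityTimeout::new();
    tracker.on_activity();
    assert!(!tracker.should_force_close());
}
```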

@tomaka tomaka added the A0-please_review (Pull request needs code review.), B5-clientnoteworthy, and C3-medium (PR touches the given topic and has a medium impact on builders.) labels Sep 10, 2020
@tomaka tomaka marked this pull request as ready for review September 10, 2020 14:26

tomaka commented Sep 10, 2020

Switching to non-draft, as the point is to run the CI.


tomaka commented Sep 10, 2020

> I've opted to keep the timeout system on the listening side as long as #7074 isn't resolved. After 60 seconds of inactivity on the legacy substream, the connection is force-closed, thereby freeing the peerset slot.

I've now realized that, since we no longer open legacy substreams, all connections between peers would always close after 60 seconds.

At the moment, the objective of this PR is to prove that #7075 works well. I'll fix that after #7075 is approved or merged.


mxinden commented Sep 15, 2020

@tomaka let me know once you would like another review on this pull request.


tomaka commented Sep 15, 2020

Ready for review.

I've removed the timeout system from the legacy substream entirely.

With notification protocols, the listening side proactively tries to open substreams; if these get refused, the keep-alive system ends up closing the connection.

The existing timeout system for the legacy substream was there for the situation where the dialer doesn't support notification substreams, and we don't know whether it intends to open a legacy substream.
This is no longer relevant, as nodes that don't support notification protocol substreams are too old to be supported.
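
(For illustration only: a minimal sketch of the keep-alive reasoning above. `SubstreamState` and `keep_alive` are stand-ins, not types or functions from this PR.)

```rust
/// Stand-in for the state of a notification substream towards a peer.
enum SubstreamState {
    Open,
    Pending,
    Refused,
}

/// Keep the connection alive while at least one notification substream is
/// open or still being negotiated; once everything has been refused, nothing
/// keeps the connection up and it eventually gets closed.
fn keep_alive(substreams: &[SubstreamState]) -> bool {
    substreams.iter().any(|s| !matches!(s, SubstreamState::Refused))
}

fn main() {
    assert!(keep_alive(&[SubstreamState::Open, SubstreamState::Refused]));
    assert!(keep_alive(&[SubstreamState::Pending]));
    assert!(!keep_alive(&[SubstreamState::Refused]));
}
```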


tomaka commented Sep 15, 2020

Needs a burnin, but after the release of 0.8.24.

ProtocolState::Disabled { .. } | ProtocolState::Poisoned |
ProtocolState::KillAsap => KeepAlive::No,
ProtocolState::Init { .. } | ProtocolState::Normal { .. } => KeepAlive::Yes,
ProtocolState::Opening { .. } | ProtocolState::Disabled { .. } |
@tomaka (PR author) commented on the hunk above:

Opening is now No because of the removal of the timeout.
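
(To make the hunk above easier to read, here is a simplified sketch of where `Opening` plausibly ends up after this change. The enums are stand-ins for the real `ProtocolState` and `KeepAlive`; this is not the verbatim code.)

```rust
// Simplified stand-ins: the real variants carry fields (`{ .. }`).
enum ProtocolState { Init, Opening, Normal, Disabled, Poisoned, KillAsap }
enum KeepAlive { Yes, No }

/// After the removal of the timeout, `Opening` no longer keeps the
/// connection alive and moves to the `KeepAlive::No` arm.
fn connection_keep_alive(state: &ProtocolState) -> KeepAlive {
    match state {
        ProtocolState::Init | ProtocolState::Normal => KeepAlive::Yes,
        ProtocolState::Opening
        | ProtocolState::Disabled
        | ProtocolState::Poisoned
        | ProtocolState::KillAsap => KeepAlive::No,
    }
}

fn main() {
    assert!(matches!(connection_keep_alive(&ProtocolState::Opening), KeepAlive::No));
    assert!(matches!(connection_keep_alive(&ProtocolState::Normal), KeepAlive::Yes));
}
```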

@mxinden mxinden left a comment

> Needs a burnin, but after the release of 0.8.24.

👍

num: Option<usize>,
err: ProtocolsHandlerUpgrErr<EitherError<NotificationsHandshakeError, io::Error>>
num: usize,
err: ProtocolsHandlerUpgrErr<NotificationsHandshakeError>
) {
match (err, num) {
@mxinden commented on the hunk above:

I don't think it still makes sense to match on num, right?
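
(For illustration only: the simplification being suggested might look roughly like this. `UpgradeError` and `on_upgrade_error` are placeholders, not the actual `ProtocolsHandlerUpgrErr` API or a function from this PR.)

```rust
/// Placeholder error type; the point is only that the match can be driven by
/// the error alone, with `num` kept purely for logging.
enum UpgradeError {
    Timeout,
    Handshake(String),
}

/// Hypothetical handler, not the real one.
fn on_upgrade_error(num: usize, err: UpgradeError) {
    match err {
        UpgradeError::Timeout => {
            eprintln!("substream {}: upgrade timed out", num);
        }
        UpgradeError::Handshake(reason) => {
            eprintln!("substream {}: handshake failed: {}", num, reason);
        }
    }
}

fn main() {
    on_upgrade_error(0, UpgradeError::Handshake("unexpected protocol name".to_string()));
    on_upgrade_error(1, UpgradeError::Timeout);
}
```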


tomaka commented Sep 18, 2020

Let's wait until Monday to start the burnin, so that more nodes have upgraded.


tomaka commented Sep 21, 2020

Burnin' report:

  • The percentage of filled slots seems to be unable to go above 70%. While the metrics don't distinguish between in and out slots, my assumption (as detailed below) is that all in slots are full while out slots aren't.
  • The number of connections forcibly closed by the local node has dropped to almost 0.
  • The number of connections forcibly closed by the remote has almost doubled.
  • The number of connections closed because of the keep-alive timeout has increased from ~5/min to ~20/min.
  • The number of dialing errors caused by an invalid PeerId is skyrocketing, from ~25/min to ~200/min.

Force-closing a connection is, in general, almost always caused by the dialing side not opening a legacy substream in time after the listening side has reserved a slot (see #7074). As expected, the node with this PR consequently almost never force-closes connections anymore.
Instead, the situation where the dialer doesn't open substreams when expected is now supposed to trigger the keep-alive timeout rather than force-closing the connection.
Since the node with this PR no longer opens a legacy substream, its outgoing connections are force-closed by nodes on 0.8.23 and below.

I don't really have a strong explanation for the dialing errors caused by an invalid PeerId, other than that, since the outgoing slots aren't full, the local node repeatedly tries to connect to older nodes, which it wouldn't normally have to do.

In other words, so far the observations are consistent with what is expected. It's disappointing that only ~25% of the network (roughly guessed from looking at the telemetry) seems to be using 0.8.24, which is not enough to even fill all the slots of this single burnin node.


tomaka commented Sep 23, 2020

> The percentage of filled slots seems to be unable to go above 70%. While the metrics don't distinguish between in and out slots, my assumption (as detailed below) is that all in slots are full while out slots aren't.

Filled slots now at 85%, which probably follows the nodes upgrading to 0.8.24.


tomaka commented Sep 25, 2020

The node looks like it is behaving normally.
My only worry concerns the increase in the InvalidPeerId errors. See also #7198.


tomaka commented Sep 25, 2020

I've been informed that the increase in InvalidPeerId errors corresponds to a period when the node had --reserved-nodes flags. For 2 days, the PR was burned in without these flags, and it looked normal. I don't think this InvalidPeerId problem is related in any way to this PR, and would go for merging.


mxinden commented Sep 29, 2020

> and would go for merging.

As far as I understand, this pull request requires #7075 to be widely deployed. #7075 is only part of Polkadot v0.8.24. If we merge this pull request now, it will be part of Polkadot v0.8.25.

Are we sure right now that v0.8.24 will be widely deployed once v0.8.25 is released?

@tomaka tomaka removed the A0-please_review (Pull request needs code review.) label Oct 1, 2020

tomaka commented Oct 16, 2020

We had a couple of announcements asking validators to upgrade to at least 0.8.24.
One can see on telemetry that approximately one third of the network still uses 0.8.23 or older, which in absolute terms is still high, but low enough that merging this PR isn't really risky anymore.


tomaka commented Oct 16, 2020

bot merge


ghost commented Oct 16, 2020

Trying merge.

@ghost ghost merged commit ec18346 into paritytech:master Oct 16, 2020
@tomaka tomaka deleted the no-longer-open-substream branch October 16, 2020 11:07
This pull request was closed.