This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

No longer actively open legacy substreams #7076

Merged
22 commits merged into paritytech:master from no-longer-open-substream on Oct 16, 2020

Conversation


@tomaka tomaka commented Sep 10, 2020

Tackles bullet number 2 in this comment.

Based on top of #7075
Shouldn't be merged now, as we need to publish a version between #7075 and this.
The intention of this PR, for now, is to check whether CI is green, in order to make sure that #7075 is working properly. If we merge a broken version of #7075, we will have to wait again for a bugfix release.

This PR changes legacy.rs to no longer pro-actively open a legacy substream.
A consequence of this change is that we can no longer establish outgoing connections to nodes that don't have #7075 (hence the need for a release). However, we can still receive incoming connections from nodes that don't have #7075.

The diff is quite large because of all the side clean-ups, but the core part of the changes is that legacy.rs no longer emits OutboundSubstreamRequest.

I've opted to keep the timeout system on the listening side as long as #7074 isn't resolved. After 60 seconds of inactivity on the legacy substream, the connection is force-closed, thereby freeing the peerset slot.
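
(For illustration only: a minimal sketch of the inactivity-timeout idea described above. The names and structure are hypothetical, not the actual legacy.rs code.)

```rust
use std::time::{Duration, Instant};

/// Hypothetical tracker mirroring the mechanism described above: the
/// listening side keeps the connection only while the legacy substream shows
/// activity, and force-closes it after 60 seconds of inactivity, freeing the
/// peerset slot.
struct LegacyInactivityTimeout {
    last_activity: Instant,
    timeout: Duration,
}

impl LegacyInactivityTimeout {
    fn new() -> Self {
        LegacyInactivityTimeout {
            last_activity: Instant::now(),
            timeout: Duration::from_secs(60),
        }
    }

    /// Called whenever a message is received on the legacy substream.
    fn on_activity(&mut self) {
        self.last_activity = Instant::now();
    }

    /// Returns true once the connection should be force-closed.
    fn should_force_close(&self) -> bool {
        self.last_activity.elapsed() >= self.timeout
    }
}

fn main() {
    let mut tracker = LegacyInactivityTimeout::new();
    tracker.on_activity();
    assert!(!tracker.should_force_close());
}
```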

@tomaka tomaka added the A0-please_review (Pull request needs code review.), B5-clientnoteworthy, and C3-medium (PR touches the given topic and has a medium impact on builders.) labels Sep 10, 2020
@tomaka tomaka marked this pull request as ready for review September 10, 2020 14:26

tomaka commented Sep 10, 2020

Switching to non-draft, as the point is to run the CI.


tomaka commented Sep 10, 2020

> I've opted to keep the timeout system on the listening side as long as #7074 isn't resolved. After 60 seconds of inactivity on the legacy substream, the connection is force-closed, thereby freeing the peerset slot.

I've now realized that, since we no longer open legacy substreams, all connections between peers would always close after 60 seconds.

At the moment, the objective of this PR is to prove that #7075 works well. I'll fix that after #7075 is approved or merged.


mxinden commented Sep 15, 2020

@tomaka let me know once you would like another review on this pull request.


tomaka commented Sep 15, 2020

Ready for review.

I've removed the timeout system from the legacy substream entirely.

With notification protocols, the listening side proactively tries to open substreams; if these get refused, the keep-alive system ends up closing the connection.

The existing timeout system for the legacy substream was there for the situation where the dialer doesn't support notification substreams, and we don't know whether it intends to open a legacy substream.
This is no longer relevant, as nodes that don't support notification protocol substreams are too old to be supported.
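
(For illustration only: a minimal sketch of the keep-alive reasoning above. `SubstreamState` and `keep_alive` are stand-ins, not types or functions from this PR.)

```rust
/// Stand-in for the state of a notification substream towards a peer.
enum SubstreamState {
    Open,
    Pending,
    Refused,
}

/// Keep the connection alive while at least one notification substream is
/// open or still being negotiated; once everything has been refused, nothing
/// keeps the connection up and it eventually gets closed.
fn keep_alive(substreams: &[SubstreamState]) -> bool {
    substreams.iter().any(|s| !matches!(s, SubstreamState::Refused))
}

fn main() {
    assert!(keep_alive(&[SubstreamState::Open, SubstreamState::Refused]));
    assert!(keep_alive(&[SubstreamState::Pending]));
    assert!(!keep_alive(&[SubstreamState::Refused]));
}
```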


tomaka commented Sep 15, 2020

Needs a burnin, but after the release of 0.8.24.

ProtocolState::Disabled { .. } | ProtocolState::Poisoned |
ProtocolState::KillAsap => KeepAlive::No,
ProtocolState::Init { .. } | ProtocolState::Normal { .. } => KeepAlive::Yes,
ProtocolState::Opening { .. } | ProtocolState::Disabled { .. } |
@tomaka (PR author) commented on the hunk above:

Opening is now No because of the removal of the timeout.
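
(To make the hunk above easier to read, here is a simplified sketch of where `Opening` plausibly ends up after this change. The enums are stand-ins for the real `ProtocolState` and `KeepAlive`; this is not the verbatim code.)

```rust
// Simplified stand-ins: the real variants carry fields (`{ .. }`).
enum ProtocolState { Init, Opening, Normal, Disabled, Poisoned, KillAsap }
enum KeepAlive { Yes, No }

/// After the removal of the timeout, `Opening` no longer keeps the
/// connection alive and moves to the `KeepAlive::No` arm.
fn connection_keep_alive(state: &ProtocolState) -> KeepAlive {
    match state {
        ProtocolState::Init | ProtocolState::Normal => KeepAlive::Yes,
        ProtocolState::Opening
        | ProtocolState::Disabled
        | ProtocolState::Poisoned
        | ProtocolState::KillAsap => KeepAlive::No,
    }
}

fn main() {
    assert!(matches!(connection_keep_alive(&ProtocolState::Opening), KeepAlive::No));
    assert!(matches!(connection_keep_alive(&ProtocolState::Normal), KeepAlive::Yes));
}
```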

@mxinden mxinden left a comment

> Needs a burnin, but after the release of 0.8.24.

👍

num: Option<usize>,
err: ProtocolsHandlerUpgrErr<EitherError<NotificationsHandshakeError, io::Error>>
num: usize,
err: ProtocolsHandlerUpgrErr<NotificationsHandshakeError>
) {
match (err, num) {
@mxinden commented on the hunk above:

I don't think it still makes sense to match on num, right?
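
(For illustration only: the simplification being suggested might look roughly like this. `UpgradeError` and `on_upgrade_error` are placeholders, not the actual `ProtocolsHandlerUpgrErr` API or a function from this PR.)

```rust
/// Placeholder error type; the point is only that the match can be driven by
/// the error alone, with `num` kept purely for logging.
enum UpgradeError {
    Timeout,
    Handshake(String),
}

/// Hypothetical handler, not the real one.
fn on_upgrade_error(num: usize, err: UpgradeError) {
    match err {
        UpgradeError::Timeout => {
            eprintln!("substream {}: upgrade timed out", num);
        }
        UpgradeError::Handshake(reason) => {
            eprintln!("substream {}: handshake failed: {}", num, reason);
        }
    }
}

fn main() {
    on_upgrade_error(0, UpgradeError::Handshake("unexpected protocol name".to_string()));
    on_upgrade_error(1, UpgradeError::Timeout);
}
```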


tomaka commented Sep 18, 2020

Let's wait until Monday to start the burnin, so that more nodes have upgraded.


tomaka commented Sep 21, 2020

Burnin' report:

  • The percentage of filled slots seems to be unable to go above 70%. While the metrics don't distinguish between in and out slots, my assumption (as detailed below) is that all in slots are full while out slots aren't.
  • The number of connections forcibly closed by the local node has dropped to almost 0.
  • The number of connections forcibly closed by the remote has almost doubled.
  • The number of connections closed because of the keep-alive timeout has increased from ~5/min to ~20/min.
  • The number of dialing errors caused by an invalid PeerId is skyrocketing, from ~25/min to ~200/min.

Force-closing a connection is, in general, almost always caused by the dialing side not opening a legacy substream in time after the listening side has reserved a slot (see #7074). As expected, the node with this PR consequently almost never force-closes connections anymore.
Instead, the situation where the dialer doesn't open substreams when expected is now supposed to trigger the keep-alive timeout rather than force-closing the connection.
Since the node with this PR no longer opens a legacy substream, its outgoing connections are force-closed by nodes on 0.8.23 and below.

I don't really have a strong explanation for the dialing errors caused by an invalid PeerId, other than that, since the outgoing slots aren't full, the local node repeatedly tries to connect to older nodes, which it wouldn't normally have to do.

In other words, so far the observations are consistent with what is expected. It's disappointing that only ~25% of the network (roughly guessed from looking at the telemetry) seems to be using 0.8.24, which is not enough to even fill all the slots of this single burnin node.


tomaka commented Sep 23, 2020

> The percentage of filled slots seems to be unable to go above 70%. While the metrics don't distinguish between in and out slots, my assumption (as detailed below) is that all in slots are full while out slots aren't.

Filled slots now at 85%, which probably follows the nodes upgrading to 0.8.24.


tomaka commented Sep 25, 2020

The node looks like it is behaving normally.
My only worry concerns the increase in the InvalidPeerId errors. See also #7198.


tomaka commented Sep 25, 2020

I've been informed that the increase in InvalidPeerId errors corresponds to a period when the node had --reserved-nodes flags. For 2 days, the PR was burned in without these flags, and it looked normal. I don't think this InvalidPeerId problem is related in any way to this PR, and would go for merging.


mxinden commented Sep 29, 2020

> and would go for merging.

As far as I understand, this pull request requires #7075 to be widely deployed. #7075 is only part of Polkadot v0.8.24. If we merge this pull request now, it will be part of Polkadot v0.8.25.

Are we sure right now that v0.8.24 will be widely deployed once v0.8.25 is released?

@tomaka tomaka removed the A0-please_review (Pull request needs code review.) label Oct 1, 2020

tomaka commented Oct 16, 2020

We had a couple of announcements asking validators to upgrade to at least 0.8.24.
One can see on telemetry that approximately one third of the network still uses 0.8.23 or older, which in absolute terms is still high, but low enough that merging this PR isn't really risky anymore.


tomaka commented Oct 16, 2020

bot merge


ghost commented Oct 16, 2020

Trying merge.

@ghost ghost merged commit ec18346 into paritytech:master Oct 16, 2020
@tomaka tomaka deleted the no-longer-open-substream branch October 16, 2020 11:07
This pull request was closed.