-
Notifications
You must be signed in to change notification settings - Fork 712
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent opening another substream right after the connection is closed #1563
base: master
Are you sure you want to change the base?
Conversation
Back-off wasn't applied to locally closed or rejected connections which could cause `ProtocolController` to reconnect to the node right after a connection to them had been closed. Remote mightn't have noticed that a connection was closed and the immediate reconnection would cause remote side substream to not work as it was no longer writable.
TBH, I'm not sure that this is going to fix the issue with the collation protocol — it doesn't matter if the connection was reinitiated 0 seconds or 10 seconds after it was closed, unless the remote writes something into the substream. On the other hand, delaying a connection for 5-10 seconds might introduce unneeded latency in case the connection was closed intentionally or by error (not sure how this is going to play with 6s block times in case a connection needs to be reestablished). I would prefer we correctly detect the closed substream on the remote, instead of introducing a delay here. |
It seems the PR is not fixing the issue referenced, because it only applies the back-off to locally-disconnected connections, not rejected by the remote ones. |
This is a fix for paritytech/substrate#13778 but it relates to #1499, The PR description is wrong. There is already a back-off for This change is orthogonal to detecting a closed substream and I fail to understand your concerns regarding the delay since back-off mechanism is already used in |
Back-offs of this PR are added to It looks like the right place to insert a back-off if a substream was closed by the remote (i.e., rejected by polkadot-sdk/substrate/client/network/src/protocol/notifications/behaviour.rs Lines 1744 to 1745 in 76724ce
It seems the back-off is already in place when handling
I was just afraid that there might have been legitimate reasons to close a substream, and then open it again, e.g. after one block (hence 6 s). Or there could be behavior like the one we saw with async backing where substreams were closed then opened again. This comment mostly applies to the original version of this PR at the moment of writing where the back-off was applied if the local node had closed the substream. Applying a back-off if the remote has closed a substream seems less risky. |
I don't believe this analysis to be correct, or at least it doesn't cover the full issue as I see it. The reason for inbound spam/rapid reconnections was local node disconnecting a remote node and then immediately reconnecting. If it was the remote that disconnected the local node, there would anyway be a delay in detecting it since the closed connection would be noticed only when the local node tried to send data. The issue is that there is no back-off applied to a reconnection if local node was the one that disconnected the node. This is what paritytech/substrate#13778 is about, as is shown in the logs here: paritytech/substrate#13778 (comment).
I'm not convinced what async backing changes were doing was legitimate in terms of peer connections, as evident by the |
Yeah, makes sense. But may be on block announcement substream it's enough to advertise a new block to detect that the substream is closed? I'm probably missing the complete picture of the issue described in paritytech/substrate#13778 (comment). Do you understand what causes the local node to disconnect the remote and not vice versa? |
I'm not actively advocating for reserved peers being special in respect to connection chilling. Mostly I'm just afraid that adding back-offs to outbound connections might break something. |
I think we should still try and detect closed subtreams but it's hard to say which of these are undocumented implementation details and which are bugs. As perverse as it is, detecting a closed substream causes its own set of issues but we probably just have to bite the bullet.
In the testing code I was working with, I just issued
I would argue rapid reconnections were behind the issues we saw earlier this year with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still a little bit hesitant to insert delays if the local node was the one who closed the substream, but otherwise the logic of backing-off seems correct.
I went through
I would put more though into it before fixing the first issue, but it seems the second issue might lead to the connection spam when the remote closes only a substream and not the entire connection. @altonen could you check my reasoning, please? |
I remember we discussed this in some other issue but I can't remember which it was but I think we should think about moving all banning logic to The first part of your analysis looks correct but I'm not sure which spam exactly you're referring to in the second point. Spam caused by the remote or local node? This PR attempted to fix the issue where local node disconnects a peer and opens it immediately again but you're correct in that it looks like there is no back-off applied if remote closed the substream. I think the spam is caused because there is no back-off on new outbound substreams. |
I'm referring to spam caused by the remote node reopening substream after being rejected. I'm not sure if we still have this issue after all the refactorings of sync protocol substream validation.
Are you referring to the case when the remote node closes the substream and then opens it again immediately? I still don't understand why would the remote node close the substream in the first place. Apart from the collation protocol, where it should be fixed now (and where introducing the delays might affect collators' operation). |
The issue we had with syncing spamming substreams was because |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving to move forward with this.
@dmitry-markin @lexnv do we still need this? |
I was originally skeptical regarding this PR. May be we don't need it, but if the issue it tries to fix is still there, we might need to fix it another way. I'll have a look. |
Back-off wasn't applied to locally closed or rejected connections which could cause
ProtocolController
to reconnect to the node right after a connection to them had been closed. Remote mightn't have noticed that a connection was closed and the immediate reconnection would cause remote side substream to not work as it was no longer writable.Fixes paritytech/substrate#13778
Relates to #1499