swarm: Remove ConnectionHandler::Error
#3591
Comments
Based on intuition, with no particular instance in mind, I am not sure this is possible, i.e. I expect us to find a use-case for `ConnectionHandlerEvent::Close`.
Currently, we tend to close the connection upon certain errors. That is wrong in my opinion. Just because one or more streams on one handler fail doesn't mean the connection is faulty or that the other peer is necessarily misbehaving. Do we agree that in those instances, setting `KeepAlive::No` is the better approach?
Agreed.
I think I can pick up this issue as "a spiritual follow-up" of #3577. I scouted the code, and it seems that …
Sure! Let me give you some more context so you can work through this properly. It is a slightly more advanced issue! First, I'd recommend you read through #3353 (comment). Then have a look at this section in case you haven't: https://github.com/libp2p/rust-libp2p/blob/master/docs/coding-guidelines.md#hierarchical-state-machines. Each `ConnectionHandler` can currently close the entire connection via `ConnectionHandlerEvent::Close`. Currently, we often do that upon what is termed "fatal" errors. However, those errors are usually only fatal for this one handler, not for the connection as a whole.
Does this make sense? Regarding the implementation, I'd prefer if we can have very small PRs, ideally one per protocol / crate.
OMG, where has this gem been all my Rust life? I've googled a lot on SMs, but this was never among the results.
Agreed, let me draft a small PR for some protocol, work out all the quirks, and then attempt to apply the same procedure to the rest.
Our connection state machine is one of my favourites: rust-libp2p/swarm/src/connection.rs (line 95 at 1b09b8c).
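For readers new to the pattern, here is a minimal, self-contained sketch of the hierarchical state machine idea: a parent machine owns several child machines and aggregates their answers, so no single child decides the parent's fate. All types below are made up for illustration and are not the actual `libp2p-swarm` code.

```rust
// Toy model of a hierarchical state machine: a `Connection` (parent) drives
// several `Handler`s (children). A child can fail without the parent closing;
// the parent only shuts down once no child wants to keep it alive.
// All types here are illustrative stand-ins, not the real libp2p-swarm API.

#[derive(Debug)]
enum HandlerState {
    Active,         // handler has live streams
    Failed(String), // a stream-level error occurred; handler is idle now
    Idle,           // nothing to do, but no error either
}

struct Handler {
    state: HandlerState,
}

impl Handler {
    // The child reports whether it still needs the connection.
    fn keep_alive(&self) -> bool {
        matches!(self.state, HandlerState::Active)
    }
}

struct Connection {
    handlers: Vec<Handler>,
}

impl Connection {
    // The parent makes the close decision by aggregating its children,
    // instead of any single child closing the connection on its own.
    fn should_close(&self) -> bool {
        self.handlers.iter().all(|h| !h.keep_alive())
    }
}

fn main() {
    let conn = Connection {
        handlers: vec![
            Handler { state: HandlerState::Failed("stream reset".into()) },
            Handler { state: HandlerState::Active },
        ],
    };
    // One handler failed, but another is still active, so the connection stays up.
    assert!(!conn.should_close());
    println!("close connection? {}", conn.should_close());
}
```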
Looking forward to it! Thank you for your help :)
Look, there is an approach I found in rust-libp2p/protocols/gossipsub/src/handler.rs (lines 303 to 314 at 1b09b8c), compared to the following piece at lines 327 to 329 of the same file. In the former case, this particular handler will be closed, while the error is still reported to the parent. Would the former approach be acceptable for the latter case? It seems to achieve the stated goals: the handler will be closed, but the error will also be propagated to the parent, where it can be appropriately reported.
I don't think it is worth reporting these errors to the parent. They might be transient, i.e. only a single stream might have failed but otherwise the connection is intact. This actually makes me think: it could be that we have other streams that are still operating (i.e. …)
I think the above applies to all protocols. This is a slight deviation from what I previously thought we should do! Setting `KeepAlive::No` … The combination of the above means that in case all our streams fail, we will eventually close the connection because they are set to … — see rust-libp2p/protocols/gossipsub/src/handler.rs (lines 341 to 346 at 1b09b8c).
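As a hedged illustration of that pattern inside a single handler (using stand-in types, not the real gossipsub handler or the real `KeepAlive` type): on a stream error the handler merely clears the failed stream and derives its keep-alive answer from what is left, never closing the connection itself.

```rust
use std::time::{Duration, Instant};

// Illustrative stand-in for libp2p-swarm's `KeepAlive`; not the real type.
#[derive(Debug, PartialEq)]
enum KeepAlive {
    Yes,
    Until(Instant),
    No,
}

// Hypothetical handler with one inbound and one outbound stream slot.
struct ExampleHandler {
    inbound: Option<&'static str>, // stand-ins for real stream state machines
    outbound: Option<&'static str>,
    idle_timeout: Duration,
}

impl ExampleHandler {
    // On a stream-level error we only clear the affected stream. We do NOT
    // close the connection: other protocols may still be using it.
    fn on_outbound_error(&mut self, _error: &str) {
        self.outbound = None;
    }

    // The keep-alive answer is derived from the streams we still have. Once
    // everything is gone, we answer with a grace period (or simply `No`) and
    // let the swarm close the connection if nobody else needs it.
    fn connection_keep_alive(&self) -> KeepAlive {
        if self.inbound.is_some() || self.outbound.is_some() {
            KeepAlive::Yes
        } else {
            KeepAlive::Until(Instant::now() + self.idle_timeout)
        }
    }
}

fn main() {
    let mut handler = ExampleHandler {
        inbound: None,
        outbound: Some("negotiated"),
        idle_timeout: Duration::from_secs(10),
    };
    assert_eq!(handler.connection_keep_alive(), KeepAlive::Yes);

    handler.on_outbound_error("remote reset the stream");
    // After the failure we merely stop keeping the connection alive.
    assert!(matches!(handler.connection_keep_alive(), KeepAlive::Until(_)));
}
```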
I just read through this. I'm one of the people that have been closing connections on errors (in gossipsub) :p

I'm interested in the negotiation errors and have a few questions about how to handle some of the logic. I see these errors often. I think a good portion of the time it is when trying to connect to a go-libp2p client and, for one reason or another (usually they have too many peers), they drop the initial connection. From our side, it can look like a negotiation error. In the earlier days, I think they were not responding but also not dropping the connection; it was up to us to drop the connection when negotiation failed. In the new version, we may keep these stale connections alive until the KeepAlive timeout ends?

There were also attacks where an attacker would consume all of the local file descriptors by constantly opening many new connections and keeping them alive. We mitigated this somewhat by capping the number of inbound connections, but also by dropping connections pretty fast when negotiations or connections failed. Could we be at risk if we now leave these connections alive for longer?

When a node goes offline, we can identify it via pings or various other negotiation timeouts. In the past I've found it useful to drop connections faster so that we can run discovery and search for new peers, rather than keeping stale old connections alive.

I guess all of these can be mitigated by the keep-alive timeout? Curious about your thoughts on these.
Unless a …

The longer discussion around this is in #3353 (comment). I think eagerly closing connections on unsupported protocols is a reasonable policy. I also think that policy does not belong into … I don't want to derail this too much, but the entire …
Yeah, I agree. We probably need a standardized way of letting users decide which protocols should allow the connection to be kept alive. Maybe in the derive macro, when composing behaviours, we could tell the macro which sub-behaviours we want to keep the connection alive for.
I'd think that it depends on the protocol whether … While I do agree with @mxinden's proposal above, I'm not sure whether it addresses the issue @AgeManning raised: if we need to interact with a broken peer (like go-ipfs sounds to be) who "signals" the desire that we should close the connection by reporting a nondescript stream negotiation failure, then we might need to keep the ability for a stream protocol handler to close the connection. Alternatively, go-ipfs could be fixed …

Regarding the overall topic: it might well be reasonable to run a protocol on a stream that performs admin coordination. Such a handler may then want to shut down the connection when agreed with the peer, independent of other stream handlers setting `KeepAlive::Yes`.
Configuration is one option (pun intended) but adding configuration options is a fight we are not going to win as more use cases come up. I am a fan of thinking in policies and mechanisms.
Yes, that is the idea. I am starting to wonder whether the entire …
Even determining whether a remote supports any of our protocols is difficult. We can only learn that through identify, and if that works, a naive intersection of the protocols will at least yield that one.
From a selfish perspective (gossipsub), it would seem to make sense that each behaviour has the ability to "disable" itself independently of other behaviours (such that, for all intents and purposes, a "DisableBehaviour" event looks the same as a "CloseConnection" at the behaviour level). I guess this is what the KeepAlive is attempting to do. If a peer does something the behaviour doesn't like, it has the ability to drop all substreams and prevent new ones from being created. Other behaviours work independently until all become disabled and the connection gets closed altogether. It's been a while since I looked at the KeepAlive, but from memory it does this, but without the ability to disable re-established substreams?
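A rough sketch of that "disable" idea with made-up types (not the real gossipsub or swarm API): a disabled handler drops its substreams, refuses new ones, and stops asking for the connection, while the connection only closes once no handler on it is still in use.

```rust
// Illustrative model of a per-protocol "disable" switch; not real libp2p API.
struct ProtocolHandler {
    disabled: bool,
    substreams: Vec<String>, // stand-in for live substream state machines
}

impl ProtocolHandler {
    // Disabling drops all substreams and blocks new ones, but leaves the
    // connection itself to the other protocols sharing it.
    fn disable(&mut self) {
        self.disabled = true;
        self.substreams.clear();
    }

    fn accepts_new_substreams(&self) -> bool {
        !self.disabled
    }

    fn wants_connection(&self) -> bool {
        !self.disabled && !self.substreams.is_empty()
    }
}

fn main() {
    let mut gossipsub = ProtocolHandler { disabled: false, substreams: vec!["mesh".into()] };
    let kademlia = ProtocolHandler { disabled: false, substreams: vec!["query".into()] };

    // Gossipsub decides this peer misbehaved and disables itself.
    gossipsub.disable();
    assert!(!gossipsub.accepts_new_substreams());

    // The connection stays up because another protocol still uses it.
    let any_in_use = [&gossipsub, &kademlia].iter().any(|h| h.wants_connection());
    let close_connection = !any_in_use;
    println!("close connection? {close_connection}"); // false
}
```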
Previously, we closed the entire connection upon receiving too many upgrade errors. This is unnecessarily aggressive. For example, an upgrade error may be caused by the remote dropping a stream during the initial handshake, which is completely isolated from other protocols running on the same connection. Instead of closing the connection, we now set `KeepAlive::No`. Related: #3591. Resolves: #3690. Pull-Request: #3625.
Previously, the `libp2p-ping` module came with a policy to close a connection after X failed pings. This is only one of many possible policies for how users might want to do connection management. We remove this policy without a replacement. If users wish to restore this functionality, they can easily implement such a policy themselves: the default value of `max_failures` was 1, so to restore the previous behaviour users can simply close the connection upon the first received ping error. In this same patch, we also simplify the API of `ping::Event` by removing the `ping::Success` layer and instead reporting the RTT to the peer directly. Related: #3591. Pull-Request: #3947.
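As a sketch of how users could restore that policy themselves, assuming a recent libp2p version where `ping::Event` carries a `result: Result<Duration, Failure>` and the swarm is driven as a stream (exact field names and builder APIs vary between releases):

```rust
use libp2p::{futures::StreamExt, ping, swarm::{Swarm, SwarmEvent}};

// User-side replacement for the removed policy: close the connection to a
// peer upon the first ping failure (the old default of `max_failures = 1`).
// `ping::Event` fields are matched with `..` because they differ slightly
// across libp2p versions.
async fn enforce_ping_policy(swarm: &mut Swarm<ping::Behaviour>) {
    loop {
        let event = swarm.select_next_some().await;
        if let SwarmEvent::Behaviour(ping::Event { peer, result, .. }) = event {
            if result.is_err() {
                // Ignore the error if we were already disconnected.
                let _ = swarm.disconnect_peer_id(peer);
            }
        }
    }
}
```

Keeping this in user code rather than inside `libp2p-ping` is exactly the policy-vs-mechanism split discussed above: the library reports ping outcomes, the application decides what they mean for the connection.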
@mxinden I added this to the milestone but I'd also like to make another patch-release of …
To make a reservation with a relay, a user calls `Swarm::listen_on` with an address of the relay, suffixed with a `/p2p-circuit` protocol. Similarly, to establish a circuit to another peer, a user needs to call `Swarm::dial` with such an address. Upon success, the `Swarm` then issues a `SwarmEvent::NewListenAddr` event in case of a successful reservation or a `SwarmEvent::ConnectionEstablished` in case of a successful connect. The story is different for errors. Somewhat counterintuitively, the actual reason for an error during these operations is only reported as `relay::Event`s without a direct correlation to the user's `Swarm::listen_on` or `Swarm::dial` calls. With this PR, we send these errors back "into" the `Transport` and report them as `SwarmEvent::ListenerClosed` or `SwarmEvent::OutgoingConnectionError`. This is conceptually more correct. Additionally, by sending these errors back to the transport, we no longer use `ConnectionHandlerEvent::Close`, which entirely closes the underlying relay connection. In case the connection is not used for something else, it will be closed by the keep-alive algorithm. Resolves: #4717. Related: #3591. Related: #4718. Pull-Request: #4745.
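For illustration, a small sketch of the address construction involved; the relay address below is a placeholder, and a real one would also include the relay's `/p2p/<relay-peer-id>` component:

```rust
use libp2p::multiaddr::{Multiaddr, Protocol};

fn main() {
    // Placeholder relay address; a real one also includes /p2p/<relay-peer-id>.
    let relay_addr: Multiaddr = "/ip4/192.0.2.1/tcp/4001".parse().unwrap();

    // Suffix the relay address with /p2p-circuit to request a reservation
    // (via Swarm::listen_on) or to dial another peer through the relay
    // (via Swarm::dial, usually further suffixed with /p2p/<target-peer-id>).
    let circuit_addr = relay_addr.with(Protocol::P2pCircuit);
    println!("{circuit_addr}");

    // With this change, failures of these operations surface as
    // SwarmEvent::ListenerClosed / SwarmEvent::OutgoingConnectionError
    // instead of only as relay::Event notifications.
}
```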
We refactor the `libp2p-dcutr` API to only emit a single event: whether the hole-punch was successful or not. All other intermediate events are removed. Hole-punching is something that we try to do automatically as soon as we are connected to a peer over a relayed connection. The lack of explicit user intent means any event we emit is at best informational and not a "response" that the user would wait for. Thus, I chose to not expose the details of why the hole-punch failed but return an opaque error. Lastly, this PR also removes the usage of `ConnectionHandlerEvent::Close`. Just because something went wrong during the DCUtR handshake, doesn't mean we should close the relayed connection. Related: #3591. Pull-Request: #4749.
To remove the usages of `ConnectionHandlerEvent::Close` from the relay-server, we unify what used to be called `CircuitFailedReason` and `FatalUpgradeError`. Whilst the errors may be fatal for the particular circuit, they are not necessarily fatal for the entire connection. Related: #3591. Resolves: #4716. Pull-Request: #4718.
Related: #3591. Pull-Request: #3913.
Based on #3353 (comment), I am opening this issue to track the work around removing `ConnectionHandlerEvent::Close`. Closing a connection is the only usage of `ConnectionHandler::Error`, but we've agreed that `ConnectionHandler`s themselves should not be allowed to close connections because they don't fully own them and might disrupt other protocols.

Tasks

- `ConnectionHandlerEvent::Close` #4714
- `OneShotHandler` #4715
- `Transport::{listen_on,dial}` #4745
- `ConnectionHandler`s close connections #4755