swarm: Remove ConnectionHandler::Error
#3591
Comments
Based on intuition, with no particular instance in mind, I am not sure this is possible, i.e. I expect us to find a use-case for `ConnectionHandlerEvent::Close`.
Currently, we tend to close the connection upon certain errors. That is wrong in my opinion. Just because one or more streams on one handler fail doesn't mean the connection is faulty or that the other peer is necessarily misbehaving. Do we agree that in those instances, setting `KeepAlive::No` is the better approach?
Agreed.
I think I can pick up this issue as "a spiritual follow-up" of #3577. I scouted the code, and it seems that …
Sure! Let me give you some more context so you can work through this properly. It is a slightly more advanced issue! First, I'd recommend you read through #3353 (comment). Then have a look at this section in case you haven't: https://github.com/libp2p/rust-libp2p/blob/master/docs/coding-guidelines.md#hierarchical-state-machines. Each `ConnectionHandler` can currently close the entire connection via `ConnectionHandlerEvent::Close`. Currently, we often do that upon what is termed "fatal" errors. However, those errors are usually only fatal for this one handler, not for the connection as a whole.
Does this make sense? Regarding the implementation, I'd prefer if we can have very small PRs, ideally one per protocol / crate.
OMG, where has this gem been all my Rust life? I've googled a lot on SMs, but this was never among the results.
Agreed, let me draft a small PR for some protocol, work out all the quirks, and then attempt to apply the same procedure to the rest.
Our connection state machine is one of my favourites: rust-libp2p/swarm/src/connection.rs (line 95 at 1b09b8c).
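For readers new to the pattern, here is a minimal, self-contained sketch of the hierarchical state machine idea: a parent machine owns several child machines and aggregates their answers, so no single child decides the parent's fate. All types below are made up for illustration and are not the actual `libp2p-swarm` code.

```rust
// Toy model of a hierarchical state machine: a `Connection` (parent) drives
// several `Handler`s (children). A child can fail without the parent closing;
// the parent only shuts down once no child wants to keep it alive.
// All types here are illustrative stand-ins, not the real libp2p-swarm API.

#[derive(Debug)]
enum HandlerState {
    Active,         // handler has live streams
    Failed(String), // a stream-level error occurred; handler is idle now
    Idle,           // nothing to do, but no error either
}

struct Handler {
    state: HandlerState,
}

impl Handler {
    // The child reports whether it still needs the connection.
    fn keep_alive(&self) -> bool {
        matches!(self.state, HandlerState::Active)
    }
}

struct Connection {
    handlers: Vec<Handler>,
}

impl Connection {
    // The parent makes the close decision by aggregating its children,
    // instead of any single child closing the connection on its own.
    fn should_close(&self) -> bool {
        self.handlers.iter().all(|h| !h.keep_alive())
    }
}

fn main() {
    let conn = Connection {
        handlers: vec![
            Handler { state: HandlerState::Failed("stream reset".into()) },
            Handler { state: HandlerState::Active },
        ],
    };
    // One handler failed, but another is still active, so the connection stays up.
    assert!(!conn.should_close());
    println!("close connection? {}", conn.should_close());
}
```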
Looking forward to it! Thank you for your help :)
Look, there is an approach I found in rust-libp2p/protocols/gossipsub/src/handler.rs (lines 303 to 314 at 1b09b8c), compared to the following piece at lines 327 to 329 of the same file. In the former case, this particular handler will be closed, while the error is still reported to the parent. Would the former approach be acceptable for the latter case? It seems to achieve the stated goals: the handler will be closed, but the error will also be propagated to the parent, where it can be appropriately reported.
I don't think it is worth reporting these errors to the parent. They might be transient, i.e. only a single stream might have failed but otherwise the connection is intact. This actually makes me think: it could be that we have other streams that are still operating (i.e. …)
I think the above applies to all protocols. This is a slight deviation from what I previously thought we should do! Setting `KeepAlive::No` … The combination of the above means that in case all our streams fail, we will eventually close the connection because they are set to … — see rust-libp2p/protocols/gossipsub/src/handler.rs (lines 341 to 346 at 1b09b8c).
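As a hedged illustration of that pattern inside a single handler (using stand-in types, not the real gossipsub handler or the real `KeepAlive` type): on a stream error the handler merely clears the failed stream and derives its keep-alive answer from what is left, never closing the connection itself.

```rust
use std::time::{Duration, Instant};

// Illustrative stand-in for libp2p-swarm's `KeepAlive`; not the real type.
#[derive(Debug, PartialEq)]
enum KeepAlive {
    Yes,
    Until(Instant),
    No,
}

// Hypothetical handler with one inbound and one outbound stream slot.
struct ExampleHandler {
    inbound: Option<&'static str>, // stand-ins for real stream state machines
    outbound: Option<&'static str>,
    idle_timeout: Duration,
}

impl ExampleHandler {
    // On a stream-level error we only clear the affected stream. We do NOT
    // close the connection: other protocols may still be using it.
    fn on_outbound_error(&mut self, _error: &str) {
        self.outbound = None;
    }

    // The keep-alive answer is derived from the streams we still have. Once
    // everything is gone, we answer with a grace period (or simply `No`) and
    // let the swarm close the connection if nobody else needs it.
    fn connection_keep_alive(&self) -> KeepAlive {
        if self.inbound.is_some() || self.outbound.is_some() {
            KeepAlive::Yes
        } else {
            KeepAlive::Until(Instant::now() + self.idle_timeout)
        }
    }
}

fn main() {
    let mut handler = ExampleHandler {
        inbound: None,
        outbound: Some("negotiated"),
        idle_timeout: Duration::from_secs(10),
    };
    assert_eq!(handler.connection_keep_alive(), KeepAlive::Yes);

    handler.on_outbound_error("remote reset the stream");
    // After the failure we merely stop keeping the connection alive.
    assert!(matches!(handler.connection_keep_alive(), KeepAlive::Until(_)));
}
```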
I just read through this. I'm one of the people that have been closing connections on errors (in gossipsub) :p

I'm interested in the negotiation errors and have a few questions about how to handle some of the logic. I see these errors often. I think a good portion of the time it is when trying to connect to a go-libp2p client and, for one reason or another (usually they have too many peers), they drop the initial connection. From our side, it can look like a negotiation error. In the earlier days, I think they were not responding but also not dropping the connection; it was up to us to drop the connection when negotiation failed. In the new version, we may keep these stale connections alive until the KeepAlive timeout ends?

There were also attacks where an attacker would consume all of the local file descriptors by constantly opening many new connections and keeping them alive. We mitigated this somewhat by capping the number of inbound connections, but also by dropping connections pretty fast when negotiations or connections failed. Could we be at risk if we now leave these connections alive for longer?

When a node goes offline, we can identify it via pings or various other negotiation timeouts. In the past I've found it useful to drop connections faster so that we can run discovery and search for new peers, rather than keeping stale old connections alive.

I guess all of these can be mitigated by the keep-alive timeout? Curious about your thoughts on these.
Unless a …

The longer discussion around this is in #3353 (comment). I think eagerly closing connections on unsupported protocols is a reasonable policy. I also think that policy does not belong into … I don't want to derail this too much, but the entire …
Yeah, I agree. We probably need a standardized way of letting users decide which protocols should allow the connection to be kept alive. Maybe in the derive macro, when composing behaviours, we could tell the macro which sub-behaviours we want to keep the connection alive for.
I'd think that it depends on the protocol whether … While I do agree with @mxinden's proposal above, I'm not sure whether it addresses the issue @AgeManning raised: if we need to interact with a broken peer (like go-ipfs sounds to be) who "signals" the desire that we should close the connection by reporting a nondescript stream negotiation failure, then we might need to keep the ability for a stream protocol handler to close the connection. Alternatively, go-ipfs could be fixed …

Regarding the overall topic: it might well be reasonable to run a protocol on a stream that performs admin coordination. Such a handler may then want to shut down the connection when agreed with the peer, independent of other stream handlers setting `KeepAlive::Yes`.
Configuration is one option (pun intended) but adding configuration options is a fight we are not going to win as more use cases come up. I am a fan of thinking in policies and mechanisms.
Yes, that is the idea. I am starting to wonder whether the entire …
Even determining whether a remote supports any of our protocols is difficult. We can only learn that through identify, and if that works, a naive intersection of the protocols will at least yield that one.
From a selfish perspective (gossipsub), it would seem to make sense that each behaviour has the ability to "disable" itself independently of other behaviours (such that, for all intents and purposes, a "DisableBehaviour" event looks the same as a "CloseConnection" at the behaviour level). I guess this is what the KeepAlive is attempting to do. If a peer does something the behaviour doesn't like, it has the ability to drop all substreams and prevent new ones from being created. Other behaviours work independently until all become disabled and the connection gets closed altogether. It's been a while since I looked at the KeepAlive, but from memory it does this, but without the ability to disable re-established substreams?
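A rough sketch of that "disable" idea with made-up types (not the real gossipsub or swarm API): a disabled handler drops its substreams, refuses new ones, and stops asking for the connection, while the connection only closes once no handler on it is still in use.

```rust
// Illustrative model of a per-protocol "disable" switch; not real libp2p API.
struct ProtocolHandler {
    disabled: bool,
    substreams: Vec<String>, // stand-in for live substream state machines
}

impl ProtocolHandler {
    // Disabling drops all substreams and blocks new ones, but leaves the
    // connection itself to the other protocols sharing it.
    fn disable(&mut self) {
        self.disabled = true;
        self.substreams.clear();
    }

    fn accepts_new_substreams(&self) -> bool {
        !self.disabled
    }

    fn wants_connection(&self) -> bool {
        !self.disabled && !self.substreams.is_empty()
    }
}

fn main() {
    let mut gossipsub = ProtocolHandler { disabled: false, substreams: vec!["mesh".into()] };
    let kademlia = ProtocolHandler { disabled: false, substreams: vec!["query".into()] };

    // Gossipsub decides this peer misbehaved and disables itself.
    gossipsub.disable();
    assert!(!gossipsub.accepts_new_substreams());

    // The connection stays up because another protocol still uses it.
    let any_in_use = [&gossipsub, &kademlia].iter().any(|h| h.wants_connection());
    let close_connection = !any_in_use;
    println!("close connection? {close_connection}"); // false
}
```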
Previously, we closed the entire connection upon receiving too many upgrade errors. This is unnecessarily aggressive. For example, an upgrade error may be caused by the remote dropping a stream during the initial handshake, which is completely isolated from other protocols running on the same connection. Instead of closing the connection, we now set `KeepAlive::No`. Related: #3591. Resolves: #3690. Pull-Request: #3625.
Previously, the `libp2p-ping` module came with a policy to close a connection after X failed pings. This is only one of many possible policies for how users might want to do connection management. We remove this policy without a replacement. If users wish to restore this functionality, they can easily implement such a policy themselves: the default value of `max_failures` was 1, so to restore the previous behaviour users can simply close the connection upon the first received ping error. In this same patch, we also simplify the API of `ping::Event` by removing the `ping::Success` layer and instead reporting the RTT to the peer directly. Related: #3591. Pull-Request: #3947.
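As a sketch of how users could restore that policy themselves, assuming a recent libp2p version where `ping::Event` carries a `result: Result<Duration, Failure>` and the swarm is driven as a stream (exact field names and builder APIs vary between releases):

```rust
use libp2p::{futures::StreamExt, ping, swarm::{Swarm, SwarmEvent}};

// User-side replacement for the removed policy: close the connection to a
// peer upon the first ping failure (the old default of `max_failures = 1`).
// `ping::Event` fields are matched with `..` because they differ slightly
// across libp2p versions.
async fn enforce_ping_policy(swarm: &mut Swarm<ping::Behaviour>) {
    loop {
        let event = swarm.select_next_some().await;
        if let SwarmEvent::Behaviour(ping::Event { peer, result, .. }) = event {
            if result.is_err() {
                // Ignore the error if we were already disconnected.
                let _ = swarm.disconnect_peer_id(peer);
            }
        }
    }
}
```

Keeping this in user code rather than inside `libp2p-ping` is exactly the policy-vs-mechanism split discussed above: the library reports ping outcomes, the application decides what they mean for the connection.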
@mxinden I added this to the milestone but I'd also like to make another patch-release of …
To make a reservation with a relay, a user calls `Swarm::listen_on` with an address of the relay, suffixed with a `/p2p-circuit` protocol. Similarly, to establish a circuit to another peer, a user needs to call `Swarm::dial` with such an address. Upon success, the `Swarm` then issues a `SwarmEvent::NewListenAddr` event in case of a successful reservation or a `SwarmEvent::ConnectionEstablished` in case of a successful connect. The story is different for errors. Somewhat counterintuitively, the actual reason for an error during these operations is only reported as `relay::Event`s without a direct correlation to the user's `Swarm::listen_on` or `Swarm::dial` calls. With this PR, we send these errors back "into" the `Transport` and report them as `SwarmEvent::ListenerClosed` or `SwarmEvent::OutgoingConnectionError`. This is conceptually more correct. Additionally, by sending these errors back to the transport, we no longer use `ConnectionHandlerEvent::Close`, which entirely closes the underlying relay connection. In case the connection is not used for something else, it will be closed by the keep-alive algorithm. Resolves: #4717. Related: #3591. Related: #4718. Pull-Request: #4745.
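For illustration, a small sketch of the address construction involved; the relay address below is a placeholder, and a real one would also include the relay's `/p2p/<relay-peer-id>` component:

```rust
use libp2p::multiaddr::{Multiaddr, Protocol};

fn main() {
    // Placeholder relay address; a real one also includes /p2p/<relay-peer-id>.
    let relay_addr: Multiaddr = "/ip4/192.0.2.1/tcp/4001".parse().unwrap();

    // Suffix the relay address with /p2p-circuit to request a reservation
    // (via Swarm::listen_on) or to dial another peer through the relay
    // (via Swarm::dial, usually further suffixed with /p2p/<target-peer-id>).
    let circuit_addr = relay_addr.with(Protocol::P2pCircuit);
    println!("{circuit_addr}");

    // With this change, failures of these operations surface as
    // SwarmEvent::ListenerClosed / SwarmEvent::OutgoingConnectionError
    // instead of only as relay::Event notifications.
}
```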
We refactor the `libp2p-dcutr` API to only emit a single event: whether the hole-punch was successful or not. All other intermediate events are removed. Hole-punching is something that we try to do automatically as soon as we are connected to a peer over a relayed connection. The lack of explicit user intent means any event we emit is at best informational and not a "response" that the user would wait for. Thus, I chose to not expose the details of why the hole-punch failed but return an opaque error. Lastly, this PR also removes the usage of `ConnectionHandlerEvent::Close`. Just because something went wrong during the DCUtR handshake, doesn't mean we should close the relayed connection. Related: #3591. Pull-Request: #4749.
To remove the usages of `ConnectionHandlerEvent::Close` from the relay-server, we unify what used to be called `CircuitFailedReason` and `FatalUpgradeError`. Whilst the errors may be fatal for the particular circuit, they are not necessarily fatal for the entire connection. Related: #3591. Resolves: #4716. Pull-Request: #4718.
Related: #3591. Pull-Request: #3913.
Based on #3353 (comment), I am opening this issue to track the work around removing `ConnectionHandlerEvent::Close`. Closing a connection is the only usage of `ConnectionHandler::Error`, but we've agreed that `ConnectionHandler`s themselves should not be allowed to close connections because they don't fully own them and might disrupt other protocols.

Tasks

- `ConnectionHandlerEvent::Close` #4714
- `OneShotHandler` #4715
- `Transport::{listen_on,dial}` #4745
- `ConnectionHandler`s close connections #4755