
Invalid Swarm behaviour #2242

Closed
AgeManning opened this issue Sep 21, 2021 · 7 comments

Comments

@AgeManning
Contributor

I've been testing some code on the latest master and running into unexpected behaviour from the Swarm.

I've found that the Swarm injects a ConnectionId into the behaviour via inject_connection_established(). In some less common circumstances (which I've not isolated), upon disconnection the Swarm passes a different ConnectionId to inject_connection_closed().

In gossipsub, this causes a panic, as we expect the swarm to inform the behaviour about a previously injected connection, not a new one.
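
To make the failure mode concrete, here is a minimal sketch of the invariant gossipsub relies on. The types are hypothetical stand-ins, not the real libp2p API:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for libp2p's types, just to illustrate the invariant.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ConnectionId(usize);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct PeerId(usize);

#[derive(Default)]
struct Behaviour {
    // Connections the swarm reported via `inject_connection_established`.
    connections: HashMap<ConnectionId, PeerId>,
}

impl Behaviour {
    fn inject_connection_established(&mut self, peer: PeerId, id: ConnectionId) {
        self.connections.insert(id, peer);
    }

    fn inject_connection_closed(&mut self, _peer: PeerId, id: ConnectionId) {
        // Gossipsub (and similar behaviours) assume the swarm only ever
        // closes a connection it previously established, hence the panic
        // on an unknown id.
        self.connections
            .remove(&id)
            .expect("closed connection was never established");
    }
}

fn main() {
    let mut b = Behaviour::default();
    b.inject_connection_established(PeerId(1), ConnectionId(7));
    b.inject_connection_closed(PeerId(1), ConnectionId(7)); // fine
    b.inject_connection_closed(PeerId(1), ConnectionId(8)); // panics: unknown id
}
```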

@AgeManning
Contributor Author

AgeManning commented Sep 23, 2021

Ok, I did a bit of digging.

It looks like in #2191, for new connections beyond a connection limit, we now close the connection. Previously we reported a PendingConnectionError as a PoolEvent.

I think we are then injecting connection-closed events into the behaviour with the ConnectionLimit error whenever a connection attempt exceeds the connection limit. Ultimately, a behaviour will see an inject_connection_closed() without a matching inject_connection_established() for the same id.

If this is the intended behaviour, I'll have to update some assumptions in gossipsub and maybe some other behaviours I've built.
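
If the new reporting were kept, "updating assumptions" would essentially mean dropping that invariant, e.g. (reusing the hypothetical types from the sketch above):

```rust
fn inject_connection_closed(&mut self, _peer: PeerId, id: ConnectionId) {
    // Defensive variant: tolerate a close for a connection the behaviour
    // never saw established, e.g. one rejected by the connection limit.
    if self.connections.remove(&id).is_none() {
        eprintln!("ignoring close for unknown connection {:?}", id);
    }
}
```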

What do you guys think? @thomaseizinger @mxinden

@AgeManning
Contributor Author

AgeManning commented Sep 25, 2021

I've also seen the following panics:

* https://github.com/libp2p/rust-libp2p/blob/master/protocols/identify/src/identify.rs#L277

* https://github.com/libp2p/rust-libp2p/blob/master/core/src/network/peer.rs#L412

So I assume we probably need to change how the connections are reported.

@mxinden
Member

mxinden commented Sep 27, 2021

Thank you @AgeManning for the debugging work. Sorry for the trouble this has caused on your end.

> It looks like in #2191, for new connections beyond a connection limit, we now close the connection. Previously we reported a PendingConnectionError as a PoolEvent.
>
> I think we are then injecting connection-closed events into the behaviour with the ConnectionLimit error whenever a connection attempt exceeds the connection limit. Ultimately, a behaviour will see an inject_connection_closed() without a matching inject_connection_established() for the same id.

This is exactly right! In #2191 I changed how connection limit errors are reported. Long story short, I did so in order to be able to return the handler to the behaviour on connection limit errors. More specifically, in pool.rs, in case the established-limit is reached, I do not report a PoolEvent::PendingConnectionError directly, but instead take a detour through task.rs in order to pick up the handler and return it along with the connection limit error as a PoolEvent::ConnectionClosed.
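
A heavily simplified sketch of the two reporting paths (the variant and field names here are illustrative, not the actual pool.rs definitions):

```rust
#![allow(dead_code)]

// Hypothetical, heavily simplified stand-ins; not the actual pool.rs types.
struct ConnectionLimit {
    limit: u32,
    current: u32,
}

enum PoolEvent<THandler> {
    // Old path: reported straight from `pool.rs`. The handler is already
    // owned by the connection task at this point, so it cannot be returned.
    PendingConnectionError { error: ConnectionLimit },
    // New path (#2191): the error takes a detour through `task.rs`, which
    // owns the handler and hands it back together with the error.
    ConnectionClosed { error: ConnectionLimit, handler: THandler },
}

fn main() {}
```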

> If this is the intended behaviour, I'll have to update some assumptions in gossipsub and maybe some other behaviours I've built.

While returning the handler along with the error was obviously the intention of #2191, triggering a NetworkBehaviour::inject_connection_closed without a NetworkBehaviour::inject_connection_established was not. Not only was it unintended, I also find it counterintuitive. With that in mind, I would very much prefer to fix this within libp2p-core and libp2p-swarm instead of expecting NetworkBehaviour implementors to handle this caveat. I guess you would agree, @AgeManning?

The first solution that comes to my mind, which would retain the intended behaviour of #2191 (returning the handler) while upholding the guarantee that every inject_connection_closed is preceded by an inject_connection_established: merge max_pending_incoming with max_established_incoming, and max_pending_outgoing with max_established_outgoing. Limits would then only need to be checked before the connection upgrade, making it possible to return the handler directly, without the detour through task.rs. I don't see much benefit in having two limits (pending and established) in the first place. I expect the time a connection spends in the pending state to be on the order of milliseconds; it only occupies a pending-limit slot for a short while before trying to acquire an established-limit slot.
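
A rough sketch of such an up-front check (all names hypothetical):

```rust
#![allow(dead_code)]

// With the pending and established limits merged, the limit is enforced in
// `Network::dial` / `Network::accept` while the caller still owns the
// handler, so it can be returned directly on error.
struct ConnectionCounters {
    established_outgoing: u32,
    max_outgoing: u32,
}

struct ConnectionLimit {
    limit: u32,
    current: u32,
}

fn dial<THandler>(
    counters: &ConnectionCounters,
    handler: THandler,
) -> Result<(), (ConnectionLimit, THandler)> {
    if counters.established_outgoing >= counters.max_outgoing {
        // Limit hit before a connection task is ever spawned: no detour
        // through task.rs is needed to recover the handler.
        return Err((
            ConnectionLimit {
                limit: counters.max_outgoing,
                current: counters.established_outgoing,
            },
            handler,
        ));
    }
    // ... otherwise spawn the connection task, handing over `handler` ...
    Ok(())
}

fn main() {}
```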

What do people think? Does anyone need to differentiate between pending and established connection limits?

mxinden added a commit to mxinden/rust-libp2p that referenced this issue Sep 28, 2021
Merge pending and established limits for both incoming and outgoing
connections. More specifically merge
`ConnectionLimits::with_max_pending_incoming` with
`ConnectionLimits::with_max_established_incoming` and
`ConnectionLimits::with_max_pending_outgoing` with
`ConnectionLimits::with_max_established_outgoing`. Connection limits are
checked on `Network::dial` for outgoing and on `Network::accept` for
incoming connections.

This (a) simplifies connection limits from an implementation and user
perspective and (b) simplifies returning a connection handler on limit
error, as limits can only be exceeded at the start of dialing and
accepting. See [1].

[1]: libp2p#2242 (comment)
@mxinden
Member

mxinden commented Sep 28, 2021

#2253 showcases the solution suggested above.
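
For users, only the established limits remain after the merge. A configuration would then look roughly like this (the `with_max_established_*` names come from the commit message above; the import path and `Option<u32>` arguments are assumptions):

```rust
// Hypothetical import path; `ConnectionLimits` has moved between modules
// across releases.
use libp2p::core::connection::ConnectionLimits;

fn main() {
    let limits = ConnectionLimits::default()
        .with_max_established_incoming(Some(50))
        .with_max_established_outgoing(Some(50));
    let _ = limits;
}
```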

@mxinden
Member

mxinden commented Sep 28, 2021

> I've also seen the following panics:
>
> * https://github.com/libp2p/rust-libp2p/blob/master/protocols/identify/src/identify.rs#L277
>
> * https://github.com/libp2p/rust-libp2p/blob/master/core/src/network/peer.rs#L412
>
> So I assume we probably need to change how the connections are reported.

I still need to dig deeper into these failures. I am not yet sure whether they are related.

@AgeManning
Contributor Author

This sounds good to me.

I had a look at the identify panic, and it is definitely related. I assume the core/network/peer one is too, but I didn't dive into it.

mxinden added a commit that referenced this issue Oct 14, 2021
Concurrently dial address candidates within a single dial attempt.

The main motivation for this feature is to increase the success rate of hole
punching (see #1896 (comment) for details). As a nice side effect, it also
improves connection establishment time, as one would expect.

Cleanups and fixes done along the way:

- Merge `pool.rs` and `manager.rs`.

- Instead of manually implementing state machines in `task.rs` use
  `async/await`.

- Fix bug where `NetworkBehaviour::inject_connection_closed` is called without a
  previous `NetworkBehaviour::inject_connection_established` (see
  #2242).

- Return handler to behaviour on incoming connection limit error. Missed in
  #2242.
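
To picture the concurrent-dial part, here is a toy model in plain futures (illustrative names only, nothing from rust-libp2p's internals): all address candidates are raced, the first successful connection wins, and the remaining attempts are dropped.

```rust
use futures::future::select_ok;

// Toy model of dialing several address candidates concurrently within a
// single dial attempt.
async fn dial_any(addrs: Vec<&'static str>) -> Result<String, &'static str> {
    let attempts = addrs.into_iter().map(|addr| Box::pin(dial_one(addr)));
    // `select_ok` resolves with the first `Ok`, or the last error if all fail.
    let (connection, _remaining) = select_ok(attempts).await?;
    Ok(connection)
}

// Fake dial: pretend only ip4 addresses are reachable.
async fn dial_one(addr: &'static str) -> Result<String, &'static str> {
    if addr.starts_with("/ip4/") {
        Ok(format!("connected via {addr}"))
    } else {
        Err("unreachable")
    }
}

fn main() {
    let result = futures::executor::block_on(dial_any(vec![
        "/dns4/example.com/tcp/4001",
        "/ip4/203.0.113.1/tcp/4001",
    ]));
    println!("{result:?}"); // Ok("connected via /ip4/203.0.113.1/tcp/4001")
}
```
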
@AgeManning
Contributor Author

From what I can tell, the latest master has resolved this.
