Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Gossipsub] - Connection keep alive bug-fix #2043

Merged
merged 25 commits into from
May 14, 2021

Conversation

AgeManning
Copy link
Contributor

A 30 second timeout was introduced, which gave newly connected peers 30 seconds in order to establish a long-lived substream.

The timeout was not being updated once substreams were negotiated. Thus, after 30 seconds the gossipsub protocol was setting its keep-alive to expire after the initial 30 second timeout.

This PR updates gossipsub to set the keep alive to Yes once a substream has been established. The keep alive resets to No once there are no more existing substreams as expected.

@Frederik-Baetens
Copy link
Contributor

This branch seems to work quite well for me (after running a cargo update as well)

Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AgeManning for preparing a patch for the two issues.

Do I understand correctly that with this change a connection to a peer is kept alive indefinitely once either an inbound or outbound substream is established and neither the inbound substream is closed nor the outbound substream closed, with the latter not being supported today?

// Currently never used - manual shutdown may implement this in the future
Some(OutboundSubstreamState::_Closing(mut substream)) => {

Does this not lead to an infinite amount of connections on long lived nodes?

I would have expected something along the lines of: Set KeepAlive::Yes as long as there are RPC messages to send. Set KeepAlive::Until with some reasonable timeout once the queue of RPC messages to send is empty. Reconnect to peer in case an idle connection to said peer has been closed and there is a new RPC message to send.

@AgeManning
Copy link
Contributor Author

AgeManning commented Apr 12, 2021

Do I understand correctly that with this change a connection to a peer is kept alive indefinitely once either an inbound or outbound substream is established and neither the inbound substream is closed nor the outbound substream closed, with the latter not being supported today?

More or less. Once an outbound substream is established, it is kept alive indefinitely. If only an inbound substream is set up and it closes, then the keep alive is set to No. IIRC correctly, it used to be symmetric and the keep alive was set to No once both substreams were closed, but there were some errors in setting up substreams which lead to continual infinite failures. Now if there is an error in the outbound substream we drop the connection.

Unlike other libp2p protocols, which negotiate substreams per RPC message, gossipsub requires two long-lived substreams (in and out) for the duration of the connection. All outbound messages get sent through one and all inbound ones get sent through another. If all is working correctly, these should never close (unless the peer disconnects).

I didn't implement an idle time in which we disconnect from peers, rather its up to the user to choose to disconnect from peers when they need. A disconnection removes peers from the mesh and topics etc. If I understand correctly, the suggestion to drop peers after an idle time, would mean that if we didn't send (or receive) a message on some particular topic for a while, we could drop all peers from that topic. When we do want to publish, we would no longer know about the peers that would have been on that topic and so cannot reconnect to them.

Does this not lead to an infinite amount of connections on long lived nodes?

I dont think so. Messages sent to a peer will just use the currently established long-lived substream. If one doesn't exist (i.e first message) it creates one. We also only keep one long-lived inbound substream

@mxinden
Copy link
Member

mxinden commented Apr 13, 2021

I still need to put more thoughts into this, thus please bear with me.

There are 4 actions a ProtocolsHandler can take for connection management:

  • KeepAlive::Yes
  • KeepAlive::Until
  • KeepAlive::No
  • ProtocolsHandlerEvent::Close

I consider there to be two strategies to do connection management:

  • Based on keeping connections alive

    1. Each ProtocolsHandler sets KeepAlive::Yes when it is actively using the connction.
    2. Each ProtocolsHandler sets KeepAlive::Until(X) when it is not actively using the connection but expects to in the next time frame X.
    3. Each ProtocolsHandler sets KeepAlive::No when not requiring the connection.
    4. Each ProtocolsHandler returns ProtocolsHandlerEvent::Close on protocol violations.
    5. There would be one meta NetworkBehaviour (e.g. called ConnectionManager) that manages connection limits, in the sense that it artifically keeps connections alive via setting KeepAlive::Yes in its ProtocolsHandlers.
    6. Connections not kept alive via individual ProtocolHandlers or the meta ConnectionManager NetworkBehaviour are closed by the Swarm.
  • Based on closing connections

    1. Each ProtocolsHandler sets KeepAlive::Yes when it is actively using the connction.
    2. Each ProtocolsHandler sets KeepAlive::Yes when it is not using the connction.
    3. Each ProtocolsHandler returns ProtocolsHandlerEvent::Close on protocol violations.
    4. There would be one meta NetworkBehaviour (e.g. called ConnectionManager) that manages connection limits, in the sense that it explicitly closes alive connections by returning ProtocolsHandlerEvent::Close.

All protocols in rust-libp2p, correct me if I am wrong, follow the strategy based on keeping connections alive except libp2p-gossipsub. Note that with the based on keeping connections alive strategy an individual ProtocolsHandler has more fine grained control over a connection in the sense that it can both indicate that a connection should stay alive (KeepAlive::Yes), not actively close it but allowing it to be closed (KeepAlive::No and KeepAlive::Until) and actively closing it ProtocolsHandlerEvent::Close. On the other side with the based on closing connections strategy an individual ProtocolsHandler can not indicate specifically that a connection should stay alive or that it can be closed as the meta ConnectionManager NetworkBehaviour can not differentiate the two states, as both are represented as KeepAlive::Yes.

Do you see any draw backs in adopting the based on keeping connections alive strategy in libp2p-gossipsub @AgeManning? As far as I can tell, and again I still need to put more thoughts into this, the changes required are not very intrusive.

@AgeManning
Copy link
Contributor Author

Originally floodsub used the "keeping connections alive" strategy. It used short lived substreams per RPC message. This wasn't compatible with go (the specs were pretty bad at the time for gossipsub). I updated to match the go implementation which required long-lived substreams throughout the connection. I guess this point is irrelevant to what you are referring to however, we could still have long-lived substreams and a timeout that drops the connection.

I guess my concern is what inactivity timeout should we set? And should it be specified per topic. If we introduce an inactivity timeout globally, there could be topics users wish to subscribe to, like an "emergency topic" which nodes would send messages very rarely. If we have a fixed timeout, gossipsub would then disconnect from these emergency nodes because no messages were received in some timeout. We could make the global timeout configurable, or even more complicated a per-topic timeout.

My thinking around this has always been that we should persist the connection unless there is an error or a disconnection. I guess its probably wise to add an inactivity timeout, and hope users couple it with ping or something to verify the connection is still active.

If all other protocols have an inactivity timeout, perhaps gossipsub should also. If you agree, I can add it as an option, that defaults to 1min, but users can set it to None to persist connection indefinitely. What do you think?

@mxinden
Copy link
Member

mxinden commented Apr 15, 2021

If all other protocols have an inactivity timeout, perhaps gossipsub should also. If you agree, I can add it as an option, that defaults to 1min, but users can set it to None to persist connection indefinitely. What do you think?

The timeout sounds good to me. While it might be good for the concrete default timeout value to be optional, I don't think the timeout functionality needs to be optional. Imagine the following:

  • Each GossipsubHandler starts off with KeepAlive::Until(Instant::now() + DEFAULT_KEEP_ALIVE) unless the handled peer is in Gossipsub::mesh.
  • When the oubound stream of a GossipsubHandler is in OutboundSubstreamState::PendingSend or OutboundSubstream::PendingFlush keep_alive is set to KeepAlive::Yes.
  • When the oubound stream of GossipsubHandler is in OutboundSubstreamState::WaitingOutput set keep_alive to KeepAlive::Until(Instant::now() + DEFAULT_KEEP_ALIVE) unless the handled peer is in Gossipsub::mesh.
  • When the oubound stream of GossipsubHandler is in OutboundSubstreamState::_Closing set keep_alive to KeepAlive::No.
  • After any kind of successful activity (e.g. sending and receiving), if keep_alive is set to KeepAlive::Until(_), reset the timeout i.e. set keep_alive to KeepAlive::Until(Instant::now() + DEFAULT_KEEP_ALIVE). (We might want to restrict the reset to outgoing actions only to prevent remote nodes from artificially keeping a connection to the local node alive.)
  • If the peer handled by the GossipsubHandler is part of Gossipsub::mesh, set keep_alive to KeepAlive::Yes.
  • When a peer is removed from Gossipsub::mesh set its GossipsubHandler keep_alive to KeepAlive::Until(Instant::now() + DEFAULT_KEEP_ALIVE).

Node connectivity is always assured, as connections to nodes in the mesh are kept alive. Connections required and thus kept alive by other NetworkBehaviours can be used but are not actively kept alive by the Gossipsub NetworkBehaviour. Idle connections not used by either the Gossipsub NetworkBehaviour or any other NetworkBehaviour are not kept alive and thus close eventually.

@AgeManning what do you think? Sorry for the wall of text.

@AgeManning
Copy link
Contributor Author

I think this might be difficult to implement in practice and want to double check the logic we're aiming for here.

Firstly, the mesh and topics a peer is subscribed to is handled in the behaviour and is not really accessible in the ProtocolsHandler (we could of course pass extra messages and things to the handler (everything is do-able)). Secondly, the mesh is (depending on the network) quite small compared to the number of peers we are typically connected to. We only have a mesh for topics we are subscribed to. We keep a list of connected peers and their topics in the behaviour and often move peers in and out of the mesh based on score and if we want to subscribe to a new topic, we can collect peers that are subscribed to that topic and try and graft to them. So the mesh isn't necessarily a static subset of connected peers, but i like the idea that we would maintain connections at least to the peers we know are subscribed to the topics of interest.

Another thought, is that our mesh peers send us messages and we forward to other mesh peers. The peers that are outside the mesh (but still subscribed to the topic) won't often send us messages and we won't often send them messages, except for the gossip. So I'd imagine we'd be often sending messages to mesh peers and they would be kept alive and the other peers are less likely to send/recv messages and we are more likely to timeout with them.

We might want to adjust the timeout based on number of messages we expect and/or gossip time (the time we gossip to our peers on the same topic).

I think its probably just a balance of setting the right DEFAULT_KEEP_ALIVE for the maximum time we expect to not send/recv messages on the topic with the least frequency to maintain a connection. Again, it might be necessary to couple with ping.

Do you think we should try and be fancy and pass information to the peer handlers about the mesh and global peer status to the handler to implement the logic you suggest?

@mxinden
Copy link
Member

mxinden commented Apr 19, 2021

Firstly, the mesh and topics a peer is subscribed to is handled in the behaviour and is not really accessible in the ProtocolsHandler (we could of course pass extra messages and things to the handler (everything is do-able)).

I think passing extra messages is a viable solution. Similar approach is taken in libp2p-relay.

We keep a list of connected peers and their topics in the behaviour and often move peers in and out of the mesh based on score and if we want to subscribe to a new topic, we can collect peers that are subscribed to that topic and try and graft to them. So the mesh isn't necessarily a static subset of connected peers

I don't think peers moving in and out of the mesh is a big deal, as each change to the mesh would only require a message to be passed from the NetworkBehaviour to the corresponding ProtocolHandlers.

Again, it might be necessary to couple with ping.

Can you expand on how ping plays a role here? I don't see how ping could help detect inactive connections. In case you want to detect broken connections with ping, I am all onboard. I do think every deployment of libp2p should run the ping protocol.

Do you think we should try and be fancy and pass information to the peer handlers about the mesh and global peer status to the handler to implement the logic you suggest?

I am not sure what you are referring to with "global peer status". I do think that your initial proposal of only using a timeout mechanism would work in most cases. After all, as you described above, connections to peers in the mesh are heavily used and would thus never time out. Still, to make this more robust, and prevent edge cases like #2034 I would suggest going the extra mile, making sure precious connections to peers in the mesh stay connected, e.g. by having the NetworkBehaviour inform a ProtocolsHandler of the peer status.

I had a short chat with @vyzo to find out how the Golang GossipSub implementation does connection management. Connections to peers in the mesh are tagged as protected to prevent the Golang connection manager from disconnecting them. To other connections decaying tags are added. From a high level perspective, this seems to be in-line with our approach here.

I am sorry for this short bug fix pull request to turn in such a long discussion @AgeManning! Does the above sound good to you?

@AgeManning
Copy link
Contributor Author

Yep this sounds good to me.

I'll make the changes at some point next week :)

Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for investing more time into this @AgeManning!

protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
.push_back(NetworkBehaviourAction::NotifyHandler {
peer_id: peer,
event: Arc::new(InboundHandlerEvent::LeftMesh),
handler: NotifyHandler::Any,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say there are two nodes Node-A and Node-B communicating via two connections Conn-1 and Conn-2. From the perspective of Node-A Node-B joins its mesh. Node-A then sends a InboundHandlerEvent::JoinedMesh to the Conn-1 handler. Later on, again from the perspective of Node-A, Node-B leaves the mesh. Node-A then sends a InboundHandlerEvent::LeftMesh to Conn-2, possible as NotifyHandler::Any is used.

As far as I can tell, the above would lead to a state mismatch. Conn-1 thinks Node-B is still in Node-A's mesh and thus keeps Conn-1 alive. Conn-2 received an InboundHandlerEvent::LeftMesh before previously receiving an InboundHandlerEvent::JoinedMesh.

Off the top of my head to solve this I suggest to extend Gossipsub::peer_protocols not only tracking the protocol (i.e. kind) of a peer, but as well its connections. When joining or leaving, one would always inform the handler of the first connection to the peer. Something along the lines of:

struct Gossipsub {
  // ...
  connected_peers: HashMap<PeerId, Peer>,
  // ...
}

struct Peer {
  connections: Vec<ConnectionId>,
  kind: PeerKind,
}

Note: I am advocating to only inform the first connection, not all connections, to allow idle secondary connections to eventually close. I have not fully thought this through. Let me know if you see problems with this strategy.

Given that I am not deeply familiar with the implementation, I would very much appreciate suggestions from your side @AgeManning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. The whole implementation was built prior to the multiple connection feature. When it was introduced, the NotifyHandler::Any was just added to all the events. Tbh, I've not thought through the implications of multiple connections to a peer. I guess, we just have an extra redundant connections, which off the top of my head doesn't cause any real issues.

However, the state mismatch is an issue. What you propose sounds reasonable to me. I'll add it in.

protocols/gossipsub/src/behaviour.rs Outdated Show resolved Hide resolved
@@ -181,6 +182,13 @@ impl GossipsubConfig {
self.max_transmit_size
}

/// The time a connection is maintained to a peer without being in the mesh and without
/// send/receiving a message from. Peers that idle beyond this timeout are disconnected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// send/receiving a message from. Peers that idle beyond this timeout are disconnected.
/// send/receiving a message from. Connections that idle beyond this timeout are disconnected.

@@ -524,6 +533,14 @@ impl GossipsubConfigBuilder {
self
}

/// The time a connection is maintained to a peer without being in the mesh and without
/// send/receiving a message from. Peers that idle beyond this timeout are disconnected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// send/receiving a message from. Peers that idle beyond this timeout are disconnected.
/// send/receiving a message from. Connections that idle beyond this timeout are disconnected.

@@ -200,6 +216,9 @@ impl ProtocolsHandler for GossipsubHandler {

self.inbound_substreams_created += 1;

// Inbound substream is live set the keep alive
self.keep_alive = KeepAlive::Yes;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting KeepAlive::Yes both when the inbound and the outbound substream is live would lead to idle connections never being closed, no? Is the keep_alive setting in inject_fully_negotiated_inbound and inject_fully_negotiated_outbound needed at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No these are not needed. These are remnants of the first "fix".

protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Show resolved Hide resolved
@@ -947,6 +943,17 @@ where
topic_hash: topic_hash.clone(),
},
);

// If the peer did not previously exist in any mesh, inform the handler
if self.peer_added_to_mesh(&peer_id, topic_hash) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These check if peer was added and if so inform the handler blocks are quite pervasive. I am not familiar enough with the codebase to tell whether they are inherent, or whether there is room for improvement. Would e.g. extracting the mesh tracking logic to a separate component be an option? Each time one instructs the separate component to add a peer to a mesh, it returns (with #[must-use]) whether the corresponding handler needs to be informed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I agree.

I found it difficult to find a good pattern for this. Originally these were mutable functions which handled sending the events, but its difficult to inject them somewhat generically in the places they need to be with a mutable reference to self (even if I don't mutate the fields being borrowed). I resorted to immutable reference to self as a work-around.

Ideally a new structure of the mesh which handles this internally might be a nicer approach, but there are parts of the code that take mutable references to the peers in each mesh and mutate them. I'll have play try and refactor it a little

@AgeManning
Copy link
Contributor Author

I've made the changes you've suggested and performed some simple tests. It seems to work as expected.

It could probably be refactored a little nicer, but it wasn't immediately obvious to me how to.

Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think with the recent commits this is a solid step forward! I only have a couple of smaller comments left.

protocols/gossipsub/src/behaviour.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/behaviour.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/behaviour.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/behaviour.rs Outdated Show resolved Hide resolved
connections: &HashMap<PeerId, PeerConnections>,
) {
// Ensure there is an active connection
let connection_id = match connections.get(&peer_id) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I assuming correctly, that once the last connection to a peer in a mesh is removed, the peer is removed from the mesh? If so, would it not be safe to expect the peer to have a connection?

Suggested change
let connection_id = match connections.get(&peer_id) {
let connection_id = match connections.get(&peer_id).expect("To be connected to peer.") {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is safe.

We remove the connection in inject_connection_closed and then run inject_disconnected. If we were to attempt to inform the handler in inject_disconnected we could panic here. But we don't, so I think we are safe, I can't see any other case where we would not have any connections.

@mxinden - I'll make these changes, please check my logic here tho.

protocols/gossipsub/src/types.rs Outdated Show resolved Hide resolved
@@ -228,7 +229,7 @@ pub struct Gossipsub<

/// A map of peers to their protocol kind. This is to identify different kinds of gossipsub
/// peers.
peer_protocols: HashMap<PeerId, PeerKind>,
peer_protocols: HashMap<PeerId, PeerConnections>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both key and doc comment are off now, right?

How about:

Suggested change
peer_protocols: HashMap<PeerId, PeerConnections>,
/// A set of connected peers, indexed by their [`PeerId`], tracking both the [`PeerKind`] and the set of [`ConnectionId`]s.
connected_peers: HashMap<PeerId, PeerConnections>,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still find the key confusing and would prefer connected_peers. @AgeManning any particular reason why keeping peer_protocols?

@mxinden
Copy link
Member

mxinden commented May 3, 2021

@elenaf9 @Frederik-Baetens could you test whether this most recent version of the patch-set fixes your issue in #2034 and #2036 respectively?

@Frederik-Baetens
Copy link
Contributor

Just tried it, both with and without mdns, and it all seems to work pretty well, thanks a lot for the quick fix! (keep in mind that i haven't tested performance or large deployments)

AgeManning and others added 2 commits May 4, 2021 13:47
Co-authored-by: Max Inden <mail@max-inden.de>
Co-authored-by: Max Inden <mail@max-inden.de>
@AgeManning
Copy link
Contributor Author

Added the extra suggestions. There is an edge case, where a peer connects over two connections 1, and 2. We inform the handler on connection 1 about the various mesh state, then connection 1 drops and we start informing connection 2. If this is possible we can get potentially invalid states in the handler for connection 2.

I think this is not much of a concern, as worst case we keep the peer when we might not want to, or drop it when its in a mesh (but idle).

dependabot bot and others added 5 commits May 4, 2021 14:21
…ibp2p#2045)

Bumps [styfle/cancel-workflow-action](https://github.com/styfle/cancel-workflow-action) from 0.8.0 to 0.9.0.
- [Release notes](https://github.com/styfle/cancel-workflow-action/releases)
- [Commits](styfle/cancel-workflow-action@0.8.0...89f242e)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
With `rand` `v0.8.0` platform support changed [1] due to its upgrade to
`getrandom` `v0.2`. With `getrandom` `v0.2` `wasm32-unknown-unknown` is no
longer supported out of the box:

> This crate fully supports the wasm32-wasi and wasm32-unknown-emscripten
> targets. However, the wasm32-unknown-unknown target is not automatically
> supported since, from the target name alone, we cannot deduce which JavaScript
> interface is in use (or if JavaScript is available at all).
>
> Instead, if the "js" Cargo feature is enabled, this crate will assume that you
> are building for an environment containing JavaScript, and will call the
> appropriate methods. Both web browser (main window and Web Workers) and
> Node.js environments are supported, invoking the methods described above using
> the wasm-bindgen toolchain.
>
> This feature has no effect on targets other than wasm32-unknown-unknown.

This commit drops support for wasm32-unknown-unknown in favor of the two more
specific targets wasm32-wasi and wasm32-unknown-emscripten.

Note on `resolver = "2"`: The new resolver is required to prevent features
being mixed, more specifically to prevent libp2p-noise to build with the
`ring-resolver` feature. See [3] for details.

---

[1] https://github.com/rust-random/rand/blob/master/CHANGELOG.md#platform-support
[2] https://docs.rs/getrandom/0.2.2/getrandom/#webassembly-support
[3] https://doc.rust-lang.org/nightly/cargo/reference/features.html#feature-resolver-version-2
* Update yamux requirement from 0.8.0 to 0.9.0

Updates the requirements on [yamux](https://github.com/paritytech/yamux) to permit the latest version.
- [Release notes](https://github.com/paritytech/yamux/releases)
- [Changelog](https://github.com/paritytech/yamux/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/paritytech/yamux/commits)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Max Inden <mail@max-inden.de>
Co-authored-by: Max Inden <mail@max-inden.de>
dependabot bot and others added 9 commits May 4, 2021 14:21
Bumps [actions/cache](https://github.com/actions/cache) from v2.1.4 to v2.1.5.
- [Release notes](https://github.com/actions/cache/releases)
- [Commits](actions/cache@v2.1.4...1a9e213)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ibp2p#2053)

If you start listening after mdns joined a multicast group, the peers may not
discover eachother until the 5min timeout expires.

Co-authored-by: Max Inden <mail@max-inden.de>
Remove `regex-filter` feature flag thus always enabling
`regex::RegexSubscriptionFilter`.
…ibp2p#2058)

Changes in 45f07bf now seem to append `/p2p/<peer>` to multiaddr passed to
transports. The regex in transport/wasm-ext doesn't support this fully qualified
format. This commit adjusts the regex accordingly.
@mxinden
Copy link
Member

mxinden commented May 4, 2021

Added the extra suggestions. There is an edge case, where a peer connects over two connections 1, and 2. We inform the handler on connection 1 about the various mesh state, then connection 1 drops and we start informing connection 2. If this is possible we can get potentially invalid states in the handler for connection 2.

I think this is not much of a concern, as worst case we keep the peer when we might not want to, or drop it when its in a mesh (but idle).

Can you expand on the concrete race conditions you see here?

One that I can spot is: The local node has two connections to the same remote node. When the remote node joins the mesh, the first connection handler is informed. Then the first connection closes due to some benign reason. While the peer is in the mesh and still connected via the second connection, the second connection handler is not aware of the fact that the node it is handling is in the mesh, and thus closes the connection after the idle timeout.

As far as I can tell the above can be fixed in inject_connection_closed, by checking whether the remote peer is in the mesh and the closed connection was the first of multiple connections and if so, inform the second connection that the peer it is handling is in the mesh.

@@ -228,7 +229,7 @@ pub struct Gossipsub<

/// A map of peers to their protocol kind. This is to identify different kinds of gossipsub
/// peers.
peer_protocols: HashMap<PeerId, PeerKind>,
peer_protocols: HashMap<PeerId, PeerConnections>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still find the key confusing and would prefer connected_peers. @AgeManning any particular reason why keeping peer_protocols?

protocols/gossipsub/src/behaviour.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/behaviour.rs Show resolved Hide resolved
protocols/gossipsub/src/behaviour.rs Outdated Show resolved Hide resolved
@elenaf9
Copy link
Contributor

elenaf9 commented May 4, 2021

Thank you for putting so much effort into the fix! I just tested it and it worked really well, so once it is merged I think #2034 should be fixed. (It has to be noted though that my tests are just manual tests on a smaller project, I currently don't have any written tests for proper testing.)

Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last comment, other than that this looks good to me. Thanks for the continuous work here!

protocols/gossipsub/src/behaviour.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/behaviour.rs Outdated Show resolved Hide resolved
Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me.

Would you mind including the diff below in this pull request? I suggest releasing this as a breaking change (minor bump) given its somewhat intrusive behavioral change. Thoughts?

@AgeManning would you like to have more time testing this change on a live network? Otherwise I am happy to cut rust-libp2p v0.38.0 once this is merged.

diff --git a/Cargo.toml b/Cargo.toml
index f19dd5e0..55d05c8f 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -66,7 +66,7 @@ futures = "0.3.1"
 lazy_static = "1.2"
 libp2p-core = { version = "0.28.3", path = "core",  default-features = false }
 libp2p-floodsub = { version = "0.29.0", path = "protocols/floodsub", optional = true }
-libp2p-gossipsub = { version = "0.30.1", path = "./protocols/gossipsub", optional = true }
+libp2p-gossipsub = { version = "0.31.0", path = "./protocols/gossipsub", optional = true }
 libp2p-identify = { version = "0.29.0", path = "protocols/identify", optional = true }
 libp2p-kad = { version = "0.30.0", path = "protocols/kad", optional = true }
 libp2p-mplex = { version = "0.28.0", path = "muxers/mplex", optional = true }
diff --git a/protocols/gossipsub/CHANGELOG.md b/protocols/gossipsub/CHANGELOG.md
index 7bc1f9aa..f215203f 100644
--- a/protocols/gossipsub/CHANGELOG.md
+++ b/protocols/gossipsub/CHANGELOG.md
@@ -1,3 +1,10 @@
+# 0.31.0 [unreleased]
+
+- Keep connections to peers in a mesh alive. Allow closing idle connections to peers not in a mesh
+  [PR-2043].
+
+[PR-2043]: https://github.com/libp2p/rust-libp2p/pull/2043https://github.com/libp2p/rust-libp2p/pull/2043
+
 # 0.30.1 [2021-04-27]
 
 - Remove `regex-filter` feature flag thus always enabling `regex::RegexSubscriptionFilter` [PR
diff --git a/protocols/gossipsub/Cargo.toml b/protocols/gossipsub/Cargo.toml
index a0ecc7e3..2d6c7c6a 100644
--- a/protocols/gossipsub/Cargo.toml
+++ b/protocols/gossipsub/Cargo.toml
@@ -2,7 +2,7 @@
 name = "libp2p-gossipsub"
 edition = "2018"
 description = "Gossipsub protocol for libp2p"
-version = "0.30.1"
+version = "0.31.0"
 authors = ["Age Manning <Age@AgeManning.com>"]
 license = "MIT"
 repository = "https://github.com/libp2p/rust-libp2p"

@AgeManning
Copy link
Contributor Author

Yeah, I have not done extensive testing on this.
I should find some time to get some testing done early next week. Maybe lets wait for some live-network testing before cutting a new release?

@mxinden
Copy link
Member

mxinden commented May 7, 2021

Maybe lets wait for some live-network testing before cutting a new release?

Sounds good to me. Thanks @AgeManning! Let me know how it goes.

@AgeManning
Copy link
Contributor Author

Initial testing looks like this works fine. :) 🎉

@mxinden
Copy link
Member

mxinden commented May 11, 2021

Initial testing looks like this works fine. :) tada

Great. I will cut a release (likely) tomorrow, unless anything comes up till then.

@mxinden
Copy link
Member

mxinden commented May 12, 2021

I seem to not be allowed to merge master to your branch. Could you either give me permissions, or merge master @AgeManning?

@mxinden mxinden mentioned this pull request May 12, 2021
@AgeManning
Copy link
Contributor Author

Oh sorry. I thought it would default to allow edits from maintainers. Strange it didn't.

I've merged latest master now.

@mxinden mxinden merged commit c5bcada into libp2p:master May 14, 2021
@AgeManning AgeManning deleted the gossip-keep-alive branch June 15, 2021 01:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants