[Gossipsub] - Connection keep alive bug-fix #2043
Conversation
This branch seems to work quite well for me (after running a cargo update as well).
Thanks @AgeManning for preparing a patch for the two issues.
Do I understand correctly that with this change a connection to a peer is kept alive indefinitely once either an inbound or an outbound substream is established, as long as neither the inbound nor the outbound substream is closed, with the latter not being supported today?
rust-libp2p/protocols/gossipsub/src/handler.rs
Lines 488 to 489 in b1b6f2b

// Currently never used - manual shutdown may implement this in the future
Some(OutboundSubstreamState::_Closing(mut substream)) => {
Does this not lead to an unbounded number of connections on long-lived nodes?
I would have expected something along the lines of:

- Set KeepAlive::Yes as long as there are RPC messages to send.
- Set KeepAlive::Until with some reasonable timeout once the queue of RPC messages to send is empty.
- Reconnect to the peer in case an idle connection to said peer has been closed and there is a new RPC message to send.
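For illustration, a minimal sketch of that strategy, assuming a handler with a send_queue of pending RPC messages and a configurable idle_timeout; the field names are illustrative, and the KeepAlive enum below is a stand-in for libp2p's own so the sketch is self-contained:

use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Stand-in for libp2p's `KeepAlive`.
enum KeepAlive {
    Yes,
    Until(Instant),
}

struct Handler {
    send_queue: VecDeque<Vec<u8>>, // queued outbound RPC messages
    idle_timeout: Duration,
    keep_alive: KeepAlive,
}

impl Handler {
    // Called whenever the send queue changes.
    fn update_keep_alive(&mut self) {
        if !self.send_queue.is_empty() {
            // Messages pending: keep the connection open unconditionally.
            self.keep_alive = KeepAlive::Yes;
        } else if matches!(self.keep_alive, KeepAlive::Yes) {
            // Queue just drained: start the idle countdown.
            self.keep_alive = KeepAlive::Until(Instant::now() + self.idle_timeout);
        }
    }
}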
More or less. Once an outbound substream is established, it is kept alive indefinitely. If only an inbound substream is set up and it closes, then the keep-alive is set to No. If I recall correctly, it used to be symmetric and the keep-alive was set to No once both substreams were closed, but there were some errors in setting up substreams which led to continual failures. Now if there is an error in the outbound substream we drop the connection.

Unlike other libp2p protocols, which negotiate substreams per RPC message, gossipsub requires two long-lived substreams (in and out) for the duration of the connection. All outbound messages get sent through one and all inbound ones arrive through the other. If all is working correctly, these should never close (unless the peer disconnects).

I didn't implement an idle time after which we disconnect from peers; rather it's up to the user to choose to disconnect from peers when they need. A disconnection removes peers from the mesh, topics, etc. If I understand correctly, the suggestion to drop peers after an idle time would mean that if we didn't send (or receive) a message on some particular topic for a while, we could drop all peers from that topic. When we do want to publish, we would no longer know about the peers that would have been on that topic and so could not reconnect to them.
I don't think so. Messages sent to a peer will just use the currently established long-lived substream. If one doesn't exist (i.e. for the first message) it creates one. We also only keep one long-lived inbound substream.
I still need to put more thought into this, so please bear with me. There are 4 actions a

I consider there to be two strategies for connection management:

All protocols in rust-libp2p, correct me if I am wrong, follow the strategy based on keeping connections alive (with some exceptions). Do you see any drawbacks in adopting the keeping-connections-alive strategy here?
Originally floodsub used the "keeping connections alive" strategy. It used short-lived substreams per RPC message. This wasn't compatible with the Go implementation (the specs were pretty bad at the time for gossipsub). I updated to match the Go implementation, which required long-lived substreams throughout the connection. I guess this point is irrelevant to what you are referring to, however; we could still have long-lived substreams and a timeout that drops the connection.

I guess my concern is what inactivity timeout should we set? And should it be specified per topic? If we introduce an inactivity timeout globally, there could be topics users wish to subscribe to, like an "emergency topic", on which nodes send messages very rarely. If we have a fixed timeout, gossipsub would then disconnect from these emergency nodes because no messages were received within the timeout. We could make the global timeout configurable, or, more complicated still, per topic. My thinking around this has always been that we should persist the connection unless there is an error or a disconnection. I guess it's probably wise to add an inactivity timeout, and hope users couple it with

If all other protocols have an inactivity timeout, perhaps gossipsub should also. If you agree, I can add it as an option that defaults to 1 min, but users can set it to
The timeout sounds good to me. While it might be good for the concrete default timeout value to be configurable, I don't think the timeout functionality itself needs to be optional. Imagine the following:

- Node connectivity is always assured, as connections to nodes in the mesh are kept alive.
- Connections required and thus kept alive by other

@AgeManning what do you think? Sorry for the wall of text.
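As an illustration of the user-facing side of such an always-on timeout, here is a hedged sketch; the builder method name idle_timeout is an assumption based on this discussion, not a settled API:

use libp2p::gossipsub::GossipsubConfigBuilder;
use std::time::Duration;

fn main() {
    let config = GossipsubConfigBuilder::default()
        // Hypothetical option: close connections idle for longer than one minute.
        .idle_timeout(Duration::from_secs(60))
        .build()
        .expect("valid gossipsub config");
    let _ = config;
}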
I think this might be difficult to implement in practice, and I want to double-check the logic we're aiming for here.

Firstly, the mesh and the topics a peer is subscribed to are handled in the behaviour and are not really accessible in the ProtocolsHandler (we could of course pass extra messages and things to the handler; everything is do-able).

Secondly, the mesh is (depending on the network) quite small compared to the number of peers we are typically connected to. We only have a mesh for topics we are subscribed to. We keep a list of connected peers and their topics in the behaviour and often move peers in and out of the mesh based on score, and if we want to subscribe to a new topic, we can collect peers that are subscribed to that topic and try to graft to them. So the mesh isn't necessarily a static subset of connected peers, but I like the idea that we would maintain connections at least to the peers we know are subscribed to the topics of interest.

Another thought is that our mesh peers send us messages and we forward them to other mesh peers. The peers that are outside the mesh (but still subscribed to the topic) won't often send us messages and we won't often send them messages, except for the gossip. So I'd imagine we'd often be sending messages to mesh peers, and they would be kept alive, while the other peers are less likely to send/receive messages and we are more likely to time out with them. We might want to adjust the timeout based on the number of messages we expect and/or the gossip time (the time we gossip to our peers on the same topic). I think it's probably just a balance of setting the right DEFAULT_KEEP_ALIVE for the maximum time we expect to not send/receive messages on the topic with the least frequency to maintain a connection. Again, it might be necessary to couple this with

Do you think we should try and be fancy and pass information about the mesh and global peer status to the peer handlers to implement the logic you suggest?
I think passing extra messages is a viable solution. A similar approach is taken in libp2p-relay.

I don't think peers moving in and out of the mesh is a big deal, as each change to the mesh would only require a message to be passed from the

Can you expand on how
I am not sure what you are referring to with "global peer status". I do think that your initial proposal of only using a timeout mechanism would work in most cases. After all, as you described above, connections to peers in the mesh are heavily used and would thus never time out. Still, to make this more robust, and to prevent edge cases like #2034, I would suggest going the extra mile, making sure precious connections to peers in the mesh stay connected, e.g. by having the

I had a short chat with @vyzo to find out how the Golang GossipSub implementation does connection management. Connections to peers in the mesh are tagged as protected to prevent the Golang connection manager from disconnecting them. Decaying tags are added to other connections. From a high-level perspective, this seems to be in line with our approach here.

I am sorry that this short bug-fix pull request turned into such a long discussion @AgeManning! Does the above sound good to you?
Yep this sounds good to me. I'll make the changes at some point next week :)
Thanks for investing more time into this @AgeManning!
protocols/gossipsub/src/behaviour.rs (outdated)

.push_back(NetworkBehaviourAction::NotifyHandler {
    peer_id: peer,
    event: Arc::new(InboundHandlerEvent::LeftMesh),
    handler: NotifyHandler::Any,
Say there are two nodes Node-A and Node-B communicating via two connections Conn-1 and Conn-2. From the perspective of Node-A, Node-B joins its mesh. Node-A then sends an InboundHandlerEvent::JoinedMesh to the Conn-1 handler. Later on, again from the perspective of Node-A, Node-B leaves the mesh. Node-A then sends an InboundHandlerEvent::LeftMesh to Conn-2, possible as NotifyHandler::Any is used.

As far as I can tell, the above would lead to a state mismatch: Conn-1 thinks Node-B is still in Node-A's mesh and thus keeps Conn-1 alive, while Conn-2 received an InboundHandlerEvent::LeftMesh without previously receiving an InboundHandlerEvent::JoinedMesh.
Off the top of my head, to solve this I suggest extending Gossipsub::peer_protocols to not only track the protocol (i.e. kind) of a peer, but also its connections. When joining or leaving, one would always inform the handler of the first connection to the peer. Something along the lines of:
struct Gossipsub {
    // ...
    connected_peers: HashMap<PeerId, Peer>,
    // ...
}

struct Peer {
    connections: Vec<ConnectionId>,
    kind: PeerKind,
}
Note: I am advocating to only inform the first connection, not all connections, to allow idle secondary connections to eventually close. I have not fully thought this through. Let me know if you see problems with this strategy.
Given that I am not deeply familiar with the implementation, I would very much appreciate suggestions from your side @AgeManning.
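To make the "inform only the first connection" idea concrete, here is a self-contained sketch; the type aliases stand in for libp2p's PeerId and ConnectionId, and the event queue stands in for NetworkBehaviourAction::NotifyHandler with NotifyHandler::One:

use std::collections::{HashMap, VecDeque};

// Stand-ins for the real libp2p types, to keep the sketch self-contained.
type PeerId = u64;
type ConnectionId = u64;

enum InboundHandlerEvent {
    JoinedMesh,
    LeftMesh,
}

struct Peer {
    connections: Vec<ConnectionId>,
}

struct Gossipsub {
    connected_peers: HashMap<PeerId, Peer>,
    // (peer, connection to notify, event), mirroring NotifyHandler::One.
    events: VecDeque<(PeerId, ConnectionId, InboundHandlerEvent)>,
}

impl Gossipsub {
    // Inform only the handler of the peer's *first* connection, per the note above.
    fn notify_mesh_change(&mut self, peer_id: PeerId, event: InboundHandlerEvent) {
        if let Some(first) = self
            .connected_peers
            .get(&peer_id)
            .and_then(|peer| peer.connections.first())
        {
            self.events.push_back((peer_id, *first, event));
        }
    }
}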
This makes sense to me. The whole implementation was built prior to the multiple-connection feature. When it was introduced, the NotifyHandler::Any was just added to all the events. Tbh, I've not thought through the implications of multiple connections to a peer. I guess we just have extra redundant connections, which off the top of my head don't cause any real issues.
However, the state mismatch is an issue. What you propose sounds reasonable to me. I'll add it in.
protocols/gossipsub/src/config.rs (outdated)

@@ -181,6 +182,13 @@ impl GossipsubConfig {
        self.max_transmit_size
    }

    /// The time a connection is maintained to a peer without being in the mesh and without
    /// send/receiving a message from. Peers that idle beyond this timeout are disconnected.
-    /// send/receiving a message from. Peers that idle beyond this timeout are disconnected.
+    /// send/receiving a message from. Connections that idle beyond this timeout are disconnected.
protocols/gossipsub/src/config.rs (outdated)

@@ -524,6 +533,14 @@ impl GossipsubConfigBuilder {
        self
    }

    /// The time a connection is maintained to a peer without being in the mesh and without
    /// send/receiving a message from. Peers that idle beyond this timeout are disconnected.
-    /// send/receiving a message from. Peers that idle beyond this timeout are disconnected.
+    /// send/receiving a message from. Connections that idle beyond this timeout are disconnected.
protocols/gossipsub/src/handler.rs (outdated)

@@ -200,6 +216,9 @@ impl ProtocolsHandler for GossipsubHandler {

        self.inbound_substreams_created += 1;

        // Inbound substream is live set the keep alive
        self.keep_alive = KeepAlive::Yes;
Setting KeepAlive::Yes both when the inbound and when the outbound substream is live would lead to idle connections never being closed, no? Is the keep_alive setting in inject_fully_negotiated_inbound and inject_fully_negotiated_outbound needed at all?
No, these are not needed. These are remnants of the first "fix".
protocols/gossipsub/src/behaviour.rs (outdated)

@@ -947,6 +943,17 @@ where
                topic_hash: topic_hash.clone(),
            },
        );

        // If the peer did not previously exist in any mesh, inform the handler
        if self.peer_added_to_mesh(&peer_id, topic_hash) {
These "check if peer was added and if so inform the handler" blocks are quite pervasive. I am not familiar enough with the codebase to tell whether they are inherent, or whether there is room for improvement. Would e.g. extracting the mesh-tracking logic into a separate component be an option? Each time one instructs the separate component to add a peer to a mesh, it returns (with #[must_use]) whether the corresponding handler needs to be informed.
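As a rough illustration of such an extracted component — all names are hypothetical, and TopicHash/PeerId are stand-ins for the real libp2p types:

use std::collections::{HashMap, HashSet};

type TopicHash = String;
type PeerId = u64;

#[derive(Default)]
struct MeshTracker {
    // topic -> peers currently in that topic's mesh
    meshes: HashMap<TopicHash, HashSet<PeerId>>,
}

impl MeshTracker {
    /// Adds `peer` to the mesh for `topic`. Returns `true` if the peer was in
    /// no mesh before, i.e. the connection handler must be informed.
    #[must_use]
    fn add(&mut self, topic: TopicHash, peer: PeerId) -> bool {
        let newly_meshed = !self.meshes.values().any(|mesh| mesh.contains(&peer));
        self.meshes.entry(topic).or_default().insert(peer);
        newly_meshed
    }

    /// Removes `peer` from the mesh for `topic`. Returns `true` if the peer is
    /// now in no mesh at all, i.e. the connection handler must be informed.
    #[must_use]
    fn remove(&mut self, topic: &TopicHash, peer: &PeerId) -> bool {
        if let Some(mesh) = self.meshes.get_mut(topic) {
            mesh.remove(peer);
        }
        !self.meshes.values().any(|mesh| mesh.contains(peer))
    }
}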
Yes. I agree.
I found it difficult to find a good pattern for this. Originally these were mutable functions which handled sending the events, but it's difficult to inject them somewhat generically in the places they need to be with a mutable reference to self (even if I don't mutate the fields being borrowed). I resorted to an immutable reference to self as a work-around.

Ideally a new structure for the mesh which handles this internally might be a nicer approach, but there are parts of the code that take mutable references to the peers in each mesh and mutate them. I'll have a play and try to refactor it a little.

I've made the changes you've suggested and performed some simple tests. It seems to work as expected. It could probably be refactored a little more nicely, but it wasn't immediately obvious to me how.
I think with the recent commits this is a solid step forward! I only have a couple of smaller comments left.
protocols/gossipsub/src/behaviour.rs (outdated)

    connections: &HashMap<PeerId, PeerConnections>,
) {
    // Ensure there is an active connection
    let connection_id = match connections.get(&peer_id) {
Am I assuming correctly that once the last connection to a peer in a mesh is removed, the peer is removed from the mesh? If so, would it not be safe to expect the peer to have a connection?
-    let connection_id = match connections.get(&peer_id) {
+    let connection_id = match connections.get(&peer_id).expect("To be connected to peer.") {
I think this is safe. We remove the connection in inject_connection_closed and then run inject_disconnected. If we were to attempt to inform the handler in inject_disconnected we could panic here. But we don't, so I think we are safe; I can't see any other case where we would not have any connections.

@mxinden - I'll make these changes, please check my logic here though.
protocols/gossipsub/src/behaviour.rs (outdated)

@@ -228,7 +229,7 @@ pub struct Gossipsub<

     /// A map of peers to their protocol kind. This is to identify different kinds of gossipsub
     /// peers.
-    peer_protocols: HashMap<PeerId, PeerKind>,
+    peer_protocols: HashMap<PeerId, PeerConnections>,
Both key and doc comment are off now, right? How about:

-    peer_protocols: HashMap<PeerId, PeerConnections>,
+    /// A set of connected peers, indexed by their [`PeerId`], tracking both the [`PeerKind`]
+    /// and the set of [`ConnectionId`]s.
+    connected_peers: HashMap<PeerId, PeerConnections>,
I still find the key confusing and would prefer connected_peers. @AgeManning any particular reason for keeping peer_protocols?
@elenaf9 @Frederik-Baetens could you test whether this most recent version of the patch-set fixes your issues in #2034 and #2036 respectively?
Just tried it, both with and without mdns, and it all seems to work pretty well. Thanks a lot for the quick fix! (Keep in mind that I haven't tested performance or large deployments.)
Added the extra suggestions. There is an edge case, where a peer connects over two connections 1 and 2. We inform the handler on connection 1 about the various mesh states, then connection 1 drops and we start informing connection 2. If this is possible, we can potentially get invalid states in the handler for connection 2. I think this is not much of a concern, as in the worst case we keep the peer when we might not want to, or drop it when it's in a mesh (but idle).
Can you expand on the concrete race conditions you see here? One that I can spot is: the local node has two connections to the same remote node. When the remote node joins the mesh, the first connection handler is informed. Then the first connection closes for some benign reason. While the peer is in the mesh and still connected via the second connection, the second connection handler is not aware of the fact that the node it is handling is in the mesh, and thus closes the connection after the idle timeout. As far as I can tell the above can be fixed in
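For illustration, a hedged sketch of one possible fix along these lines, re-informing a remaining connection's handler when the informed connection closes; all types below are stand-ins and the method name is hypothetical:

use std::collections::HashMap;

type PeerId = u64;
type ConnectionId = u64;

struct Peer {
    connections: Vec<ConnectionId>,
    in_mesh: bool, // whether this peer is currently in any of our meshes
}

struct Gossipsub {
    connected_peers: HashMap<PeerId, Peer>,
}

impl Gossipsub {
    /// Removes `closed` from the peer's connection list. Returns the connection
    /// whose handler should be (re-)informed that the peer is in the mesh, if any.
    fn on_connection_closed(
        &mut self,
        peer_id: PeerId,
        closed: ConnectionId,
    ) -> Option<ConnectionId> {
        let peer = self.connected_peers.get_mut(&peer_id)?;
        peer.connections.retain(|c| *c != closed);
        if peer.in_mesh {
            // E.g. send a JoinedMesh event via NotifyHandler::One to this connection.
            peer.connections.first().copied()
        } else {
            None
        }
    }
}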
Thank you for putting so much effort into the fix! I just tested it and it worked really well, so once it is merged I think #2034 should be fixed. (It has to be noted though that my tests are just manual tests on a smaller project; I currently don't have any written tests for proper testing.)
Last comment; other than that this looks good to me. Thanks for the continued work here!
Changes look good to me.

Would you mind including the diff below in this pull request? I suggest releasing this as a breaking change (minor version bump) given its somewhat intrusive behavioral change. Thoughts?

@AgeManning would you like to have more time testing this change on a live network? Otherwise I am happy to cut rust-libp2p v0.38.0 once this is merged.
diff --git a/Cargo.toml b/Cargo.toml
index f19dd5e0..55d05c8f 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -66,7 +66,7 @@ futures = "0.3.1"
lazy_static = "1.2"
libp2p-core = { version = "0.28.3", path = "core", default-features = false }
libp2p-floodsub = { version = "0.29.0", path = "protocols/floodsub", optional = true }
-libp2p-gossipsub = { version = "0.30.1", path = "./protocols/gossipsub", optional = true }
+libp2p-gossipsub = { version = "0.31.0", path = "./protocols/gossipsub", optional = true }
libp2p-identify = { version = "0.29.0", path = "protocols/identify", optional = true }
libp2p-kad = { version = "0.30.0", path = "protocols/kad", optional = true }
libp2p-mplex = { version = "0.28.0", path = "muxers/mplex", optional = true }
diff --git a/protocols/gossipsub/CHANGELOG.md b/protocols/gossipsub/CHANGELOG.md
index 7bc1f9aa..f215203f 100644
--- a/protocols/gossipsub/CHANGELOG.md
+++ b/protocols/gossipsub/CHANGELOG.md
@@ -1,3 +1,10 @@
+# 0.31.0 [unreleased]
+
+- Keep connections to peers in a mesh alive. Allow closing idle connections to peers not in a mesh
+ [PR-2043].
+
+[PR-2043]: https://github.com/libp2p/rust-libp2p/pull/2043
+
# 0.30.1 [2021-04-27]
- Remove `regex-filter` feature flag thus always enabling `regex::RegexSubscriptionFilter` [PR
diff --git a/protocols/gossipsub/Cargo.toml b/protocols/gossipsub/Cargo.toml
index a0ecc7e3..2d6c7c6a 100644
--- a/protocols/gossipsub/Cargo.toml
+++ b/protocols/gossipsub/Cargo.toml
@@ -2,7 +2,7 @@
name = "libp2p-gossipsub"
edition = "2018"
description = "Gossipsub protocol for libp2p"
-version = "0.30.1"
+version = "0.31.0"
authors = ["Age Manning <Age@AgeManning.com>"]
license = "MIT"
repository = "https://github.com/libp2p/rust-libp2p"
Yeah, I have not done extensive testing on this.
Sounds good to me. Thanks @AgeManning! Let me know how it goes.
Initial testing looks like this works fine. :) 🎉
Great. I will cut a release (likely) tomorrow, unless anything comes up till then. |
I seem to not be allowed to merge
Oh sorry. I thought it would default to allow edits from maintainers. Strange it didn't. I've merged latest master now. |
A 30-second timeout was introduced, giving newly connected peers 30 seconds to establish a long-lived substream. However, the timeout was not being updated once substreams were negotiated, so after the initial 30 seconds the gossipsub protocol would set its keep-alive to expire.
This PR updates gossipsub to set the keep-alive to Yes once a substream has been established. The keep-alive resets to No once there are no more existing substreams, as expected.
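A minimal sketch of that fixed behavior, with stand-in types and illustrative field names (not the actual handler implementation):

// Stand-in for libp2p's `KeepAlive`.
enum KeepAlive {
    Yes,
    No,
}

struct GossipsubHandler {
    inbound_substream: Option<()>,  // Some while the inbound substream lives
    outbound_substream: Option<()>, // Some while the outbound substream lives
    keep_alive: KeepAlive,
}

impl GossipsubHandler {
    // Called after any substream is established or fully closed.
    fn update_keep_alive(&mut self) {
        self.keep_alive =
            if self.inbound_substream.is_some() || self.outbound_substream.is_some() {
                // At least one long-lived substream exists: keep the connection.
                KeepAlive::Yes
            } else {
                // No substreams remain: allow the connection to close.
                KeepAlive::No
            };
    }
}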