Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug where peers connecting to one another simultaneously never establish a pubsub stream #105

Closed
wants to merge 3 commits into from

Conversation

bigs
Copy link

@bigs bigs commented Sep 14, 2018

This patch establishes a failing test and then provides patch for a subtle bug, where two peers connecting to one another at the same time fail to establish an active pubsub stream. This bug is rooted in the fact that both peers tend to get a notification that their OUTBOUND dial succeeds before their peer's INBOUND dial. As a result of the old logic, both peers would then happily throw away the stream they established on their outbound connection to favor the peer's inbound connection, resulting in both peers holding on to bad streams.

This fix is a bit heavy handed, but the logic was the most sound I could muster. My patch changes behavior such that peers favor a connection initiated by the peer with the greatest peer ID (evaluated in integral space).

This branch closes #93.

When initializing a PubSub connection to another peer, we now inspect
the peer ID of the peer that initiated a `Conn`. When encountering
subsequent connections to a peer, we measure (in integral space) the
initiating peer ID of THAT connection against the existing connection.
If, and only if, that peer ID is GREATER THAN the existing initiator
peer ID, we close our old stream and open a stream on that new
connection. This unambiguous logic ensures that both peers maintain the
same stream, instead of closing their stream (the first to notify in a
simultaneous connect scenario) and favoring the other peer's stream,
which the other peer will in turn close.
@bigs
Copy link
Author

bigs commented Sep 14, 2018

in particular, i could see storing an inet.Stream in the rpcpair instead of the peer.ID, though i strayed away from that because stream closing/reset is handled elsewhere in the code.

Copy link
Collaborator

@vyzo vyzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this very ugly...

pubsub.go Outdated
@@ -86,7 +87,7 @@ type PubSub struct {
// eval thunk in event loop
eval chan func()

peers map[peer.ID]chan *RPC
peers map[peer.ID]*rpcpair
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shouldn't need to change this type (and the floodsub/gossipsub code) at all.

@vyzo
Copy link
Collaborator

vyzo commented Sep 14, 2018

Let's just keep the connection with the highest ID in key space.
I don't think we need this rpcpair ugliness - you shouldn't have to change the peers map (and any code in floodsub/gossipsub) at all. The higher ID can be deduced by the key in the map, compared to our own ID.

@bigs bigs added the status/blocked Unable to be worked further until needs are met label Sep 14, 2018
@bigs
Copy link
Author

bigs commented Sep 14, 2018

tagged as blocked until we can decide on a better fix at swarm level

@bigs
Copy link
Author

bigs commented Sep 14, 2018

@Stebalien this is a kind of complicated issue. connection de-duplication only seems to happen at the Host level (i.e. BasicHost#Connect) and it's not race safe at all. it simply queries Connectedness before starting a dial. i'm actually not so sure what would be the best course of action. my gut is shifting towards keeping the fix in floodsub (& generally protocol implementation) territory. users should have some degree of freedom over what connection to a peer they want to use.

@vyzo
Copy link
Collaborator

vyzo commented Sep 14, 2018

Seems it's an instance of this: libp2p/go-libp2p-swarm#79

@Stebalien
Copy link
Member

@vyzo same deal but not quite the same, as far as I know. However, in this case, I think we can just use two streams, right? We could also disambiguate but IMO, two streams is the correct approach.

@bigs
Copy link
Author

bigs commented Sep 14, 2018 via email

@Stebalien
Copy link
Member

Reconnection? In this case, I'd just have both sides use their own stream for sending, that should "just work".

@bigs
Copy link
Author

bigs commented Sep 19, 2018

closing this as it's plain wrong!

@bigs bigs closed this Sep 19, 2018
@ghost ghost removed the in progress label Sep 19, 2018
@vyzo vyzo deleted the bug/sequential-connections branch April 23, 2020 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status/blocked Unable to be worked further until needs are met
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Peers not always joining a topic and/or connecting
3 participants