Simultaneous Dial bug #79
Comments
@vyzo was on the money, this is the same issue affecting pubsub. i think we can take the logic i tried to use at floodsub: keep the connection initiated by the peer with the greater ID |
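A minimal sketch of the "keep the connection initiated by the peer with the greater ID" rule, assuming peer IDs can be compared as plain strings. `PeerID` and `keepConn` are hypothetical names for illustration, not part of go-libp2p-swarm:

```go
package main

import "fmt"

// PeerID is a hypothetical stand-in for a peer identifier; the only
// assumption is that IDs compare as ordinary strings.
type PeerID string

// keepConn reports whether a connection between local and remote should be
// kept under the rule "keep the connection initiated by the peer with the
// greater ID". weDialed is true when the local peer initiated the connection.
// Both ends evaluate the same connection with the same initiator, so they
// reach the same verdict without exchanging any extra messages.
func keepConn(local, remote PeerID, weDialed bool) bool {
	initiator, other := remote, local
	if weDialed {
		initiator, other = local, remote
	}
	return initiator > other
}

func main() {
	a, b := PeerID("QmAAA"), PeerID("QmBBB")
	// Two simultaneous connections exist: one dialed by a, one dialed by b.
	fmt.Println("a keeps its own dial:", keepConn(a, b, true))
	fmt.Println("a keeps b's dial:    ", keepConn(a, b, false))
	fmt.Println("b keeps its own dial:", keepConn(b, a, true))
	fmt.Println("b keeps a's dial:    ", keepConn(b, a, false))
}
```

Since the decision depends only on the two IDs and the direction, each side can drop the losing connection locally and still agree with the other side on the survivor.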
@Stebalien i've been trying to investigate this, but i can't seem to figure out where the connection is being cancelled... pretty sure it's not a dial cancel, as i've piped swarm full of logs and haven't seen it reported in my test case, but i do see an error trying to |
Here: Line 218 in 47c2aca
Basically, when we get a connection and add it to the swarm, we cancel in-progress dials. |
@Stebalien turns out the connection close is initiated in this line https://github.com/libp2p/go-libp2p-swarm/blob/master/swarm_conn.go#L99 "duplicate stream initiated" in this case edit: ah, so in this case it's coming from yamux. i'm going to pick this up tomorrow. it's a few different stream session problems. |
to clarify, the reason this was so hard to track was because the connection is being closed due to muxing failures sometimes, and can do so in at least two ways 😱 |
Oh. That's a totally different issue. That's libp2p/go-tcp-transport#21. Basically, we're creating one connection but both sides think they own the connection. Then, they both try creating streams. This issue royally sucks. We used to avoid this by just having both sides hang waiting for the other to send |
can we solve that with `Direction`?
…On Tue, Sep 18, 2018 at 20:03 Steven Allen ***@***.***> wrote:
Oh. That's a totally different issue. That's libp2p/go-tcp-transport#21. Basically, we're creating *one* connection but both sides think they own the connection. Then, they both try creating streams. This issue royally *sucks*. We used to avoid this by just having both sides hang waiting for the other to send /multistream/1.0 (because we used to have the server send first and, in this case, both sides think they have the client). However, that just meant that both sides hung for a while.
|
Been reading through the issues referenced here; quite an elusive scenario. IIUC, both sides of the connection end up thinking they're the dialer, and carrying out the handshake ceremonies (security, muxing) under the assumption they're the initiator/client. Couple of questions here:
For example, if this is the first message in the security sequence: "HANDSHAKE_INIT", we could prefix it with a 0x00 or a 0xFF reporting what the sender thinks they are, e.g. 0x00 initiator, 0xFF receiver. If both parties believe they're the initiator (0x00), they would fall back to a transport-specific heuristic to resolve the conflict, like the highest [IP address, port] tuple becoming the initiator if we're using TCP/IP (where the problem is seen). Just some brainstorming, I hope I'm not adding noise. |
Let's move this discussion over to the relevant issue to avoid splintering it. |
We fixed this in #174 |
When a peer successfully dials us, we cancel all in-progress dials to them. This way, we avoid doing any unnecessary work opening multiple connections.
Unfortunately, it turns out that the receiving side of the connection "finishes" before the dialing side. This means that, in rare cases, two nodes can dial each other, finish accepting the dial from the other end, and then cancel the outbound dials (closing the established connections).
One solution is to just not cancel outbound dials when we successfully accept an inbound dial. IIRC, this is what we did before the refactor. However, we'll end up doing a bunch of additional work.
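To make the two policies concrete, here is a toy sketch of dial bookkeeping, not the swarm's actual code. `dialTracker`, `startDial`, `onInboundConn`, and `finishDial` are hypothetical names, and the `cancelOnInbound` flag stands in for the choice between cancelling and not cancelling outbound dials when an inbound connection is accepted:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// dialTracker is a hypothetical registry of in-progress outbound dials,
// keyed by peer, each with the cancel function for its context.
type dialTracker struct {
	mu              sync.Mutex
	dials           map[string]context.CancelFunc
	cancelOnInbound bool // policy: cancel outbound dials on inbound accept?
}

func newDialTracker(cancelOnInbound bool) *dialTracker {
	return &dialTracker{
		dials:           make(map[string]context.CancelFunc),
		cancelOnInbound: cancelOnInbound,
	}
}

// startDial registers an outbound dial and returns the context it runs under.
func (t *dialTracker) startDial(peer string) context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	t.mu.Lock()
	t.dials[peer] = cancel
	t.mu.Unlock()
	return ctx
}

// finishDial releases the bookkeeping for a dial that has completed.
func (t *dialTracker) finishDial(peer string) {
	t.mu.Lock()
	cancel, ok := t.dials[peer]
	delete(t.dials, peer)
	t.mu.Unlock()
	if ok {
		cancel()
	}
}

// onInboundConn is called when an inbound connection from peer is accepted.
// Under the cancelling policy it aborts any outbound dial to that peer; if
// both nodes do this at once, each may tear down the other's connection.
func (t *dialTracker) onInboundConn(peer string) {
	if t.cancelOnInbound {
		t.finishDial(peer)
	}
}

func main() {
	for _, policy := range []bool{true, false} {
		t := newDialTracker(policy)
		ctx := t.startDial("peerB")
		t.onInboundConn("peerB")
		fmt.Printf("cancelOnInbound=%v, outbound dial error: %v\n", policy, ctx.Err())
		t.finishDial("peerB")
	}
}
```

With `cancelOnInbound` set to false the outbound dial runs to completion, which matches the trade-off described above: no race, at the cost of possibly opening a redundant connection.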
To reproduce, repeatedly run the TestConnectionCollision test in go-libp2p-kad-dht.