This repository has been archived by the owner on May 26, 2022. It is now read-only.

Simultaneous Dial bug #79

Closed
Stebalien opened this issue Jun 16, 2018 · 10 comments
Labels
kind/bug (A bug in existing code, including security flaws) · need/community-input (Needs input from the wider community) · status/in-progress (In progress)

Comments

@Stebalien
Member

When a peer successfully dials us, we cancel all in-progress dials to them. This way, we avoid doing any unnecessary work opening multiple connections.

Unfortunately, it turns out that the receiving side of the connection "finishes" before the dialing side. This means that, in rare cases, two nodes can dial each other, each finish accepting the dial from the other end, and then cancel their outbound dials (closing the established connections).

One solution is to just not cancel outbound dials when we successfully accept an inbound dial. IIRC, this is what we did before the refactor. However, we'll end up doing a bunch of additional work.
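For reference, here's a minimal, self-contained sketch of the cancel-on-accept policy and why it races (hypothetical names, not the actual swarm code):

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// dialTracker mimics (in spirit only) the dial bookkeeping described above:
// each in-flight outbound dial gets a cancelable context, and accepting an
// inbound connection from the same peer cancels it.
type dialTracker struct {
	mu    sync.Mutex
	dials map[string]context.CancelFunc // peer ID -> cancel for in-flight dial
}

func (t *dialTracker) startDial(ctx context.Context, peer string) context.Context {
	dctx, cancel := context.WithCancel(ctx)
	t.mu.Lock()
	t.dials[peer] = cancel
	t.mu.Unlock()
	return dctx
}

// onInboundConn is the problematic step: the inbound connection "wins" and the
// outbound dial is torn down, even if it has already produced a connection.
func (t *dialTracker) onInboundConn(peer string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if cancel, ok := t.dials[peer]; ok {
		cancel()
		delete(t.dials, peer)
	}
}

func main() {
	t := &dialTracker{dials: make(map[string]context.CancelFunc)}

	// We start dialing peer B...
	dctx := t.startDial(context.Background(), "QmPeerB")

	// ...but B's simultaneous dial to us is accepted first, so we cancel ours.
	t.onInboundConn("QmPeerB")

	// If B does the same with its inbound connection from us, both freshly
	// established connections end up closed.
	fmt.Println("outbound dial canceled:", dctx.Err() == context.Canceled)
}
```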


To reproduce, repeatedly run the TestConnectionCollision test in go-libp2p-kad-dht.

@bigs
Contributor

bigs commented Sep 18, 2018

@vyzo was on the money, this is the same issue affecting pubsub. i think we can take the logic i tried to use in floodsub: keep the connection initiated by the peer with the greater ID
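roughly, a toy helper to illustrate that tie-break (hypothetical name, not the actual floodsub code):

```go
// keepLocallyInitiated reports whether, given one connection in each
// direction, we keep the connection we initiated: the connection initiated by
// the peer with the (lexically) greater ID wins.
func keepLocallyInitiated(localID, remoteID string) bool {
	return localID > remoteID
}
```

both sides can evaluate this locally with no extra round trips, since each already knows both peer IDs.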

@bigs
Contributor

bigs commented Sep 18, 2018

@Stebalien i've been trying to investigate this, but i can't figure out where the connection is being cancelled... pretty sure it's not a dial cancel, as i've piped swarm full of logs and haven't seen one reported in my test case. i do, however, see an error on the listen side when calling Accept, saying the connection has been closed

@Stebalien
Member Author

Here:

s.dsync.CancelDial(p)

Basically, when we get a connection and add it to the swarm, we cancel in-progress dials.

@bigs
Contributor

bigs commented Sep 18, 2018

@Stebalien turns out the connection close is initiated in this line https://github.com/libp2p/go-libp2p-swarm/blob/master/swarm_conn.go#L99

"duplicate stream initiated" in this case

edit: ah, so in this case it's coming from yamux. i'm going to pick this up tomorrow. it's a few different stream session problems.

@bigs
Contributor

bigs commented Sep 18, 2018

to clarify, the reason this was so hard to track down is that the connection is sometimes being closed due to muxing failures, and that can happen in at least two ways 😱

@Stebalien
Member Author

Oh. That's a totally different issue. That's libp2p/go-tcp-transport#21. Basically, we're creating one connection but both sides think they own the connection. Then, they both try creating streams. This issue royally sucks. We used to avoid this by just having both sides hang waiting for the other to send /multistream/1.0 (because we used to have the server send first and, in this case, both sides think they're the client). However, that just meant that both sides hung for a while.
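To make the stream-collision part concrete, here's a toy model of why both sides acting as the initiator breaks the muxer (this is not the yamux API, just an illustration of the odd/even stream-ID convention yamux-style muxers use):

```go
package main

import "fmt"

// muxEnd is a toy stand-in for one end of a multiplexed connection: the
// initiator allocates odd stream IDs, the responder even ones.
type muxEnd struct {
	isInitiator bool
	next        uint32          // next stream ID this end will allocate
	owned       map[uint32]bool // IDs this end has allocated itself
}

func newEnd(initiator bool) *muxEnd {
	start := uint32(2) // responders allocate even IDs
	if initiator {
		start = 1 // initiators allocate odd IDs
	}
	return &muxEnd{isInitiator: initiator, next: start, owned: map[uint32]bool{}}
}

func (m *muxEnd) openStream() uint32 {
	id := m.next
	m.next += 2
	m.owned[id] = true
	return id
}

// receiveNewStream rejects a new-stream frame whose ID falls in our own
// allocation space, i.e. the peer is using the same role as us.
func (m *muxEnd) receiveNewStream(id uint32) error {
	if m.owned[id] || (m.isInitiator == (id%2 == 1)) {
		return fmt.Errorf("colliding/duplicate stream id %d", id)
	}
	return nil
}

func main() {
	a, b := newEnd(true), newEnd(true) // both ends wrongly think they dialed
	id := a.openStream()
	fmt.Println("B handling A's stream:", b.receiveNewStream(id)) // collision
}
```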

@bigs
Contributor

bigs commented Sep 19, 2018 via email

@raulk
Member

raulk commented Sep 19, 2018

Been reading through the issues referenced here; quite an elusive scenario. IIUC, both sides of the connection end up thinking they're the dialer, and carrying out the handshake ceremonies (security, muxing) under the assumption they're the initiator/client.

Couple of questions here:

  • is there some trick we can play with syscalls after we get the TCP socket to check if it was indeed opened as a result of our connect, or if it pre-dated our request? e.g. some kind of flag, or even checking timestamps, or sequences, or something?
  • how do other systems solve this issue? I would imagine it being pretty common across TCP applications.
  • can we add a phase in the upgrader that does not incur an extra exchange of packets (e.g. like the "iamclient" solution described in Simultaneous open go-tcp-transport#21 would do), but instead piggybacks on the next message that would be sent anyway as part of the handshake? Instead of explicitly setting the direction of the connection, we would rely on that mechanism to report back to us.

For example, if the first message in the security sequence is "HANDSHAKE_INIT", we could prefix it with a 0x00 or a 0xFF indicating what the sender thinks they are (0x00 = initiator, 0xFF = receiver), yielding 0x00HANDSHAKE_INIT. This mechanism would act as a proxy in the Read(), stripping off the extra byte and informing the next layer (security, muxer) to change its initial assumption, or handing the message off as-is if the assumption stood.

If both parties believe they're the initiator (0x00) they would fall back to a transport-specific heuristic to resolve the conflict, like the highest [IP address, port] tuple becoming the initiator if we're using TCP/IP (where the problem is seen).
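Very roughly, the piggyback could look something like this (a hypothetical wrapper for illustration only, not an existing libp2p API; roleConn and its fields are made-up names):

```go
package simopen

import (
	"io"
	"net"
)

const (
	roleInitiator byte = 0x00
	roleResponder byte = 0xFF
)

// roleConn wraps a transport connection and piggybacks a single "who do I
// think I am" byte on the first message in each direction.
type roleConn struct {
	net.Conn
	localRole  byte
	sentRole   bool
	gotRole    bool
	remoteRole byte
}

// Write prepends our role byte to the very first outbound bytes; after that it
// is a plain pass-through, so no extra round trip is introduced.
func (c *roleConn) Write(p []byte) (int, error) {
	if !c.sentRole {
		c.sentRole = true
		if _, err := c.Conn.Write([]byte{c.localRole}); err != nil {
			return 0, err
		}
	}
	return c.Conn.Write(p)
}

// Read strips the peer's role byte off the front of the first inbound data and
// hands everything else through unchanged.
func (c *roleConn) Read(p []byte) (int, error) {
	if !c.gotRole {
		var b [1]byte
		if _, err := io.ReadFull(c.Conn, b[:]); err != nil {
			return 0, err
		}
		c.remoteRole = b[0]
		c.gotRole = true
	}
	return c.Conn.Read(p)
}

// needTieBreak reports whether both ends claimed the initiator role, in which
// case a transport-specific rule (e.g. comparing [IP address, port] tuples on
// TCP) would decide who actually acts as the initiator.
func (c *roleConn) needTieBreak() bool {
	return c.gotRole && c.localRole == roleInitiator && c.remoteRole == roleInitiator
}
```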

Just some brainstorming, I hope I'm not adding noise.

@raulk added the need/community-input and status/in-progress labels on Sep 19, 2018
@Stebalien
Member Author

Let's move this discussion over to the relevant issue to avoid splintering it.

@Stebalien
Member Author

We fixed this in #174
