This repository has been archived by the owner on May 26, 2022. It is now read-only.

Simultaneous Dial bug #79

Closed
Stebalien opened this issue Jun 16, 2018 · 10 comments
Labels
kind/bug (A bug in existing code, including security flaws) · need/community-input (Needs input from the wider community) · status/in-progress (In progress)

Comments

@Stebalien
Member

When a peer successfully dials us, we cancel all in-progress dials to them. This way, we avoid doing any unnecessary work opening multiple connections.

Unfortunately, it turns out that the receiving side of the connection "finishes" before the dialing side. This means that, in rare cases, two nodes can dial each other, each finish accepting the dial from the other end, and then cancel their outbound dials (closing the established connections).

One solution is to just not cancel outbound dials when we successfully accept an inbound dial. IIRC, this is what we did before the refactor. However, we'll end up doing a bunch of additional work.
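For reference, here's a minimal, self-contained sketch of the cancel-on-accept policy and why it races (hypothetical names, not the actual swarm code):

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// dialTracker mimics (in spirit only) the dial bookkeeping described above:
// each in-flight outbound dial gets a cancelable context, and accepting an
// inbound connection from the same peer cancels it.
type dialTracker struct {
	mu    sync.Mutex
	dials map[string]context.CancelFunc // peer ID -> cancel for in-flight dial
}

func (t *dialTracker) startDial(ctx context.Context, peer string) context.Context {
	dctx, cancel := context.WithCancel(ctx)
	t.mu.Lock()
	t.dials[peer] = cancel
	t.mu.Unlock()
	return dctx
}

// onInboundConn is the problematic step: the inbound connection "wins" and the
// outbound dial is torn down, even if it has already produced a connection.
func (t *dialTracker) onInboundConn(peer string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if cancel, ok := t.dials[peer]; ok {
		cancel()
		delete(t.dials, peer)
	}
}

func main() {
	t := &dialTracker{dials: make(map[string]context.CancelFunc)}

	// We start dialing peer B...
	dctx := t.startDial(context.Background(), "QmPeerB")

	// ...but B's simultaneous dial to us is accepted first, so we cancel ours.
	t.onInboundConn("QmPeerB")

	// If B does the same with its inbound connection from us, both freshly
	// established connections end up closed.
	fmt.Println("outbound dial canceled:", dctx.Err() == context.Canceled)
}
```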


To reproduce, repeatedly run the TestConnectionCollision test in go-libp2p-kad-dht.

@bigs
Contributor

bigs commented Sep 18, 2018

@vyzo was on the money, this is the same issue affecting pubsub. i think we can take the logic i tried to use in floodsub: keep the connection initiated by the peer with the greater ID
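roughly, a toy helper to illustrate that tie-break (hypothetical name, not the actual floodsub code):

```go
// keepLocallyInitiated reports whether, given one connection in each
// direction, we keep the connection we initiated: the connection initiated by
// the peer with the (lexically) greater ID wins.
func keepLocallyInitiated(localID, remoteID string) bool {
	return localID > remoteID
}
```

both sides can evaluate this locally with no extra round trips, since each already knows both peer IDs.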

@bigs
Contributor

bigs commented Sep 18, 2018

@Stebalien i've been trying to investigate this, but i can't figure out where the connection is being cancelled... pretty sure it's not a dial cancel, as i've piped swarm full of logs and haven't seen one reported in my test case. i do, however, see an error on the listen side when calling Accept, saying the connection has been closed

@Stebalien
Member Author

Here:

s.dsync.CancelDial(p)

Basically, when we get a connection and add it to the swarm, we cancel in-progress dials.

@bigs
Contributor

bigs commented Sep 18, 2018

@Stebalien turns out the connection close is initiated in this line https://github.com/libp2p/go-libp2p-swarm/blob/master/swarm_conn.go#L99

"duplicate stream initiated" in this case

edit: ah, so in this case it's coming from yamux. i'm going to pick this up tomorrow. it's a few different stream session problems.

@bigs
Contributor

bigs commented Sep 18, 2018

to clarify, the reason this was so hard to track down is that the connection is sometimes being closed due to muxing failures, and that can happen in at least two ways 😱

@Stebalien
Member Author

Oh. That's a totally different issue. That's libp2p/go-tcp-transport#21. Basically, we're creating one connection but both sides think they own the connection. Then, they both try creating streams. This issue royally sucks. We used to avoid this by just having both sides hang waiting for the other to send /multistream/1.0 (because we used to have the server send first and, in this case, both sides think they're the client). However, that just meant that both sides hung for a while.
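To make the stream-collision part concrete, here's a toy model of why both sides acting as the initiator breaks the muxer (this is not the yamux API, just an illustration of the odd/even stream-ID convention yamux-style muxers use):

```go
package main

import "fmt"

// muxEnd is a toy stand-in for one end of a multiplexed connection: the
// initiator allocates odd stream IDs, the responder even ones.
type muxEnd struct {
	isInitiator bool
	next        uint32          // next stream ID this end will allocate
	owned       map[uint32]bool // IDs this end has allocated itself
}

func newEnd(initiator bool) *muxEnd {
	start := uint32(2) // responders allocate even IDs
	if initiator {
		start = 1 // initiators allocate odd IDs
	}
	return &muxEnd{isInitiator: initiator, next: start, owned: map[uint32]bool{}}
}

func (m *muxEnd) openStream() uint32 {
	id := m.next
	m.next += 2
	m.owned[id] = true
	return id
}

// receiveNewStream rejects a new-stream frame whose ID falls in our own
// allocation space, i.e. the peer is using the same role as us.
func (m *muxEnd) receiveNewStream(id uint32) error {
	if m.owned[id] || (m.isInitiator == (id%2 == 1)) {
		return fmt.Errorf("colliding/duplicate stream id %d", id)
	}
	return nil
}

func main() {
	a, b := newEnd(true), newEnd(true) // both ends wrongly think they dialed
	id := a.openStream()
	fmt.Println("B handling A's stream:", b.receiveNewStream(id)) // collision
}
```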

@bigs
Contributor

bigs commented Sep 19, 2018 via email

@raulk
Member

raulk commented Sep 19, 2018

Been reading through the issues referenced here; quite an elusive scenario. IIUC, both sides of the connection end up thinking they're the dialer, and carrying out the handshake ceremonies (security, muxing) under the assumption they're the initiator/client.

Couple of questions here:

  • is there some trick we can play with syscalls after we get the TCP socket to check if it was indeed opened as a result of our connect, or if it pre-dated our request? e.g. some kind of flag, or even checking timestamps, or sequences, or something?
  • how do other systems solve this issue? I would imagine it being pretty common across TCP applications.
  • can we add a phase in the upgrader that does not incur an extra exchange of packets (e.g. like the "iamclient" solution described in Simultaneous open go-tcp-transport#21 would do), but instead piggybacks on the next message that would be sent anyway as part of the handshake? Instead of explicitly setting the direction of the connection, we would rely on that mechanism to report back to us.

For example, if the first message in the security sequence is "HANDSHAKE_INIT", we could prefix it with a 0x00 or a 0xFF indicating what the sender thinks they are (0x00 = initiator, 0xFF = receiver), yielding 0x00HANDSHAKE_INIT. This mechanism would act as a proxy in the Read(), stripping off the extra byte and informing the next layer (security, muxer) to change its initial assumption, or handing the message off as-is if the assumption stood.

If both parties believe they're the initiator (0x00) they would fall back to a transport-specific heuristic to resolve the conflict, like the highest [IP address, port] tuple becoming the initiator if we're using TCP/IP (where the problem is seen).
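Very roughly, the piggyback could look something like this (a hypothetical wrapper for illustration only, not an existing libp2p API; roleConn and its fields are made-up names):

```go
package simopen

import (
	"io"
	"net"
)

const (
	roleInitiator byte = 0x00
	roleResponder byte = 0xFF
)

// roleConn wraps a transport connection and piggybacks a single "who do I
// think I am" byte on the first message in each direction.
type roleConn struct {
	net.Conn
	localRole  byte
	sentRole   bool
	gotRole    bool
	remoteRole byte
}

// Write prepends our role byte to the very first outbound bytes; after that it
// is a plain pass-through, so no extra round trip is introduced.
func (c *roleConn) Write(p []byte) (int, error) {
	if !c.sentRole {
		c.sentRole = true
		if _, err := c.Conn.Write([]byte{c.localRole}); err != nil {
			return 0, err
		}
	}
	return c.Conn.Write(p)
}

// Read strips the peer's role byte off the front of the first inbound data and
// hands everything else through unchanged.
func (c *roleConn) Read(p []byte) (int, error) {
	if !c.gotRole {
		var b [1]byte
		if _, err := io.ReadFull(c.Conn, b[:]); err != nil {
			return 0, err
		}
		c.remoteRole = b[0]
		c.gotRole = true
	}
	return c.Conn.Read(p)
}

// needTieBreak reports whether both ends claimed the initiator role, in which
// case a transport-specific rule (e.g. comparing [IP address, port] tuples on
// TCP) would decide who actually acts as the initiator.
func (c *roleConn) needTieBreak() bool {
	return c.gotRole && c.localRole == roleInitiator && c.remoteRole == roleInitiator
}
```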

Just some brainstorming, I hope I'm not adding noise.

@raulk added the need/community-input and status/in-progress labels on Sep 19, 2018
@Stebalien
Member Author

Let's move this discussion over to the relevant issue to avoid splintering it.

@Stebalien
Member Author

We fixed this in #174
