Skip to content
This repository has been archived by the owner on May 26, 2022. It is now read-only.

dialer v1 may not clean up succeeded but unselected connections #123

Closed
raulk opened this issue May 14, 2019 · 10 comments
Closed

dialer v1 may not clean up succeeded but unselected connections #123

raulk opened this issue May 14, 2019 · 10 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@raulk
Copy link
Member

raulk commented May 14, 2019

@vyzo's investigations on the relay infrastructure yielded a good lead on why the connection manager may be closing less connections than expected: we may actually be holding more than one connection per peer.

Going through dialer v1, there is one edge case where this can happen, and I don't think it's so much of an edge, but rather a hot path:

  1. The peer has multiple publicly diallable addresses that would all succeed (e.g. TCP, QUIC, WSS).
  2. We dial all those addresses in parallel -- the default dials per peer limit allows it (8).
  3. As soon as we hit the first success (e.g. TCP), we return that connection.
  4. While we cancel inflight dial jobs (by cancelling the context), there may be extra dials that have already succeeded waiting to be consumed.
  5. We don't drain and close those unselected dials (dialer v2 does), and the result is a lingering phantom connection that takes up a socket (as we see in netdata) but is never enrolled in the swarm.
@vyzo vyzo added the kind/bug A bug in existing code (including security flaws) label May 14, 2019
@raulk
Copy link
Member Author

raulk commented May 14, 2019

This bug may underlie the effects observed in #79.

@vyzo
Copy link
Contributor

vyzo commented May 14, 2019

Printing out multiple connections during trimming reveals that there are peers with multiple TCP connections to the relay (up to 6 observed).

Sample:

May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.004 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/154.160.27.129/tcp/18090 /ip4/154.160.25.148/tcp/4176] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.004 ERROR    connmgr: peer has abnormal number of conns: 3 [/ip4/182.253.8.0/tcp/2188 /ip4/182.253.8.0/tcp/4357 /ip4/182.253.8.0/tcp/2412] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.005 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/79.126.161.236/tcp/4001 /ip4/79.126.224.125/tcp/4001] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.005 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/125.125.117.202/tcp/34210 /ip4/183.142.215.7/tcp/34210] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.005 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/180.65.39.78/tcp/4020 /ip4/180.65.39.78/tcp/4010] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.006 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/179.28.74.210/tcp/4001 /ip4/186.51.28.158/tcp/4001] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.006 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/180.249.38.54/tcp/49141 /ip4/180.249.38.54/tcp/49040] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.006 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/178.221.185.54/tcp/64167 /ip4/178.221.185.54/tcp/63844] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.006 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/197.207.217.195/tcp/25519 /ip4/105.105.105.69/tcp/10795] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.006 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/188.27.151.106/tcp/4001 /ip4/188.213.217.242/tcp/4001] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.006 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/13.251.211.185/tcp/28376 /ip4/13.251.211.185/tcp/28814] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.007 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/203.175.73.93/tcp/18568 /ip4/203.175.73.93/tcp/18094] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.007 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/105.100.85.136/tcp/4001 /ip4/105.100.82.159/tcp/4001] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.007 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/87.119.67.234/tcp/45233 /ip4/87.119.67.234/tcp/45451] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.008 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/14.192.210.13/tcp/6033 /ip4/14.192.214.149/tcp/2029] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.008 ERROR    connmgr: peer has abnormal number of conns: 4 [/ip4/77.242.91.19/tcp/4001 /ip4/77.242.91.19/tcp/54133 /ip4/77.242.91.19/tcp/54134 /ip4/77.242.91.19/tcp/54135] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.009 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/77.138.206.30/tcp/62265 /ip4/77.138.206.30/tcp/4001] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.009 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/219.133.100.5/tcp/54730 /ip4/219.133.100.5/tcp/56767] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.018 ERROR    connmgr: peer has abnormal number of conns: 5 [/ip4/78.111.185.56/tcp/37849 /ip4/78.111.185.56/tcp/44174 /ip4/78.111.185.56/tcp/25639 /ip4/78.111.185.56/tcp/35809 /ip4/78.111.185.56/tcp/23529] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.018 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/71.14.218.66/tcp/4001 /ip4/71.14.218.66/tcp/1398] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.018 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/167.56.65.44/tcp/4001 /ip4/179.26.210.254/tcp/4001] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.019 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/152.167.254.120/tcp/1046 /ip4/152.167.254.19/tcp/1045] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.019 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/213.149.51.183/tcp/36839 /ip4/213.149.51.183/tcp/36769] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.021 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/190.104.20.246/tcp/26369 /ip4/190.104.20.246/tcp/30208] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.021 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/109.242.249.176/tcp/27453 /ip4/79.107.242.223/tcp/27453] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.022 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/186.179.100.123/tcp/39245 /ip4/186.179.100.123/tcp/39690] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.022 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/51.52.205.63/tcp/4001 /ip4/51.52.205.63/tcp/60007] connmgr.go:250
May 14 15:20:35 ip-172-31-46-15 relay.sh[25826]: 15:20:35.023 ERROR    connmgr: peer has abnormal number of conns: 2 [/ip4/175.137.11.153/tcp/5001 /ip4/175.137.11.153/tcp/4001] connmgr.go:250

@vyzo
Copy link
Contributor

vyzo commented May 14, 2019

Adding some logs to catch the duplicate connections when they happen, the plot thickens: the duplicate connections are very close to each other temporaly:

May 14 16:41:53 ip-172-31-46-15 relay.sh[4348]: 16:41:53.122 ERROR    connmgr: duplicate connection from QmQMhoaEzFCrmjXiwSAfmbha848o9vPcRsL73Z1csUtGzu; time since first connection: 104.549567ms connmgr.go:398
May 14 16:41:54 ip-172-31-46-15 relay.sh[4348]: 16:41:54.909 ERROR    connmgr: duplicate connection from QmXNkKtsTPgfRD445TuESPnM6WKPRQNHRCtqs3sme8cJir; time since first connection: 3.41757597s connmgr.go:398
May 14 16:41:54 ip-172-31-46-15 relay.sh[4348]: 16:41:54.964 ERROR    connmgr: duplicate connection from QmQMhoaEzFCrmjXiwSAfmbha848o9vPcRsL73Z1csUtGzu; time since first connection: 1.94637047s connmgr.go:398
May 14 16:41:55 ip-172-31-46-15 relay.sh[4348]: 16:41:55.883 ERROR    connmgr: duplicate connection from QmQCmCTq2LJjddx4fG7xGYbDiLdHLLqkB8prfxnF57dEs3; time since first connection: 2.489301788s connmgr.go:398
May 14 16:41:57 ip-172-31-46-15 relay.sh[4348]: 16:41:57.418 ERROR    connmgr: duplicate connection from QmNpRfabJSJaDKMkwsFT5bcWfxPq8XmkNVRYziNnKfdvxP; time since first connection: 6.476632634s connmgr.go:398
May 14 16:41:57 ip-172-31-46-15 relay.sh[4348]: 16:41:57.493 ERROR    connmgr: duplicate connection from QmQMhoaEzFCrmjXiwSAfmbha848o9vPcRsL73Z1csUtGzu; time since first connection: 4.475285063s connmgr.go:398
May 14 16:41:57 ip-172-31-46-15 relay.sh[4348]: 16:41:57.858 ERROR    connmgr: duplicate connection from QmQMhoaEzFCrmjXiwSAfmbha848o9vPcRsL73Z1csUtGzu; time since first connection: 4.840202151s connmgr.go:398
May 14 16:41:57 ip-172-31-46-15 relay.sh[4348]: 16:41:57.901 ERROR    connmgr: duplicate connection from QmQMhoaEzFCrmjXiwSAfmbha848o9vPcRsL73Z1csUtGzu; time since first connection: 4.883070173s connmgr.go:398
May 14 16:41:57 ip-172-31-46-15 relay.sh[4348]: 16:41:57.914 ERROR    connmgr: duplicate connection from QmQMhoaEzFCrmjXiwSAfmbha848o9vPcRsL73Z1csUtGzu; time since first connection: 4.896559644s connmgr.go:398
May 14 16:41:58 ip-172-31-46-15 relay.sh[4348]: 16:41:58.942 ERROR    connmgr: duplicate connection from QmQMhoaEzFCrmjXiwSAfmbha848o9vPcRsL73Z1csUtGzu; time since first connection: 5.924698024s connmgr.go:398
May 14 16:41:59 ip-172-31-46-15 relay.sh[4348]: 16:41:59.918 ERROR    connmgr: duplicate connection from QmQMhoaEzFCrmjXiwSAfmbha848o9vPcRsL73Z1csUtGzu; time since first connection: 6.900675387s connmgr.go:398
May 14 16:42:01 ip-172-31-46-15 relay.sh[4348]: 16:42:01.132 ERROR    connmgr: duplicate connection from QmQMhoaEzFCrmjXiwSAfmbha848o9vPcRsL73Z1csUtGzu; time since first connection: 8.114761803s connmgr.go:398
May 14 16:42:01 ip-172-31-46-15 relay.sh[4348]: 16:42:01.170 ERROR    connmgr: duplicate connection from QmRgH6RskpiLdSmncji2CDdYF8Y14Pwgp7U2ZFpXPv8vj5; time since first connection: 10.139670763s connmgr.go:398
May 14 16:42:01 ip-172-31-46-15 relay.sh[4348]: 16:42:01.202 ERROR    connmgr: duplicate connection from QmQMhoaEzFCrmjXiwSAfmbha848o9vPcRsL73Z1csUtGzu; time since first connection: 8.184803385s connmgr.go:398
May 14 16:42:02 ip-172-31-46-15 relay.sh[4348]: 16:42:02.402 ERROR    connmgr: duplicate connection from QmQMhoaEzFCrmjXiwSAfmbha848o9vPcRsL73Z1csUtGzu; time since first connection: 9.384375046s connmgr.go:398
...

@raulk
Copy link
Member Author

raulk commented May 14, 2019

Update: this was a false alarm, we do close unselected connections. respCh is not a buffered channel, so even if multiple connections are established at the same time, we’ll return the first one and the context cancellation would win over sending to respCh from the extra dialer goroutines that have successful connections.

@vyzo
Copy link
Contributor

vyzo commented May 14, 2019

We still have no explanation as to why all these duplicate connections.
Some peers appear to be particularly bad at it, the QmQMhoa... peer has multiple dupes in the first few seconds of operation.

@Stebalien
Copy link
Member

@vyzo could be js? Not sure. However, we never close fully established connections. Instead, we try to just pick one connection and keep using it. Ideally, we'd have some background logic to close idle, duplicate connections. Well, ideally we'd have migration...

@Stebalien
Copy link
Member

Stebalien commented May 14, 2019

Note: the dialer definitely throws all but one connection away.
Note 2: Are these connections actually live? IIRC, we need better keep alive.

@vyzo
Copy link
Contributor

vyzo commented May 14, 2019

So when the dialer throws away connections it doesn't want, does it also close them?

@Stebalien
Copy link
Member

It does unless we have both a connection and a non-nil error... see executeDial at the end of limiter.go in go-libp2p-swarm.

@raulk
Copy link
Member Author

raulk commented May 15, 2019

The dialer behaves well in this regard.

@raulk raulk closed this as completed May 15, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
kind/bug A bug in existing code (including security flaws)
Projects
None yet
Development

No branches or pull requests

3 participants