
Connectivity issue between validator and sentry on 0.8.23 #7198

Closed
tomaka opened this issue Sep 24, 2020 · 9 comments
Labels
I3-bug The node fails to follow expected behavior.

Comments

@tomaka
Contributor

tomaka commented Sep 24, 2020

@ddorgan has reported this issue happening on our nodes. It seems that validators and sentries on Polkadot 0.8.23 (but also 0.8.24) have difficulties connecting to each other, specifically when using PeerIds of the new format (12...). It works fine when using PeerIds of the old format (Qm...).
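For reference, the two formats are different multihash encodings of the node's identity: a legacy Qm... PeerId is the base58 encoding of a sha2-256 multihash of the public key, while a 12D3Koo... PeerId embeds the (protobuf-encoded) Ed25519 public key directly via the identity multihash. A minimal sketch that tells the two apart by the multihash prefix, assuming the third-party bs58 crate (used here only for illustration, it is not part of this codebase):

```rust
// Sketch only. Assumes the `bs58` crate for base58 decoding.

/// Report which multihash a base58-encoded PeerId uses.
/// 0x12 0x20 prefix => sha2-256 digest of the public key ("Qm..." form).
/// 0x00 prefix      => identity multihash embedding the public key ("12D3Koo..." form).
fn peer_id_kind(peer_id: &str) -> Result<&'static str, bs58::decode::Error> {
    let bytes = bs58::decode(peer_id).into_vec()?;
    Ok(match bytes.first().copied() {
        Some(0x12) => "old format: sha2-256 multihash (Qm...)",
        Some(0x00) => "new format: identity multihash (12D3Koo...)",
        _ => "unknown multihash code",
    })
}

fn main() {
    // PeerIds taken from the logs below; a Qm... id would hit the sha2-256 branch.
    for id in [
        "12D3KooWBJD4UJbUFasdRZDJ9UjJXQywVetX3AL1GYsdjbgWDCqz",
        "12D3KooWGKWWpmaKxtj97FXEaBznrNta1Kpvk2XjNg7bp6pzLQKz",
    ] {
        println!("{}: {:?}", id, peer_id_kind(id));
    }
}
```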

Here is an example log found on a sentry concerning the validator:

Sep 23 16:13:36 kusama-sentry-ae1-1 kusama[9336]: 2020-09-23 16:13:36.145 tokio-runtime-worker TRACE sub-libp2p  Libp2p => Dialing(PeerId("12D3KooWBJD4UJbUFasdRZDJ9UjJXQywVetX3AL1GYsdjbgWDCqz"))
Sep 23 16:13:42 kusama-sentry-ae1-1 kusama[9336]: 2020-09-23 16:13:42.340 tokio-runtime-worker DEBUG sub-libp2p  Libp2p => Dial failure for PeerId("12D3KooWBJD4UJbUFasdRZDJ9UjJXQywVetX3AL1GYsdjbgWDCqz")
Sep 23 16:13:42 kusama-sentry-ae1-1 kusama[9336]: 2020-09-23 16:13:42.340 tokio-runtime-worker DEBUG sub-libp2p  PSM <= Dropped(PeerId("12D3KooWBJD4UJbUFasdRZDJ9UjJXQywVetX3AL1GYsdjbgWDCqz"))
Sep 23 16:13:42 kusama-sentry-ae1-1 kusama[9336]: 2020-09-23 16:13:42.340 tokio-runtime-worker TRACE sub-libp2p  Libp2p => Failed to reach PeerId("12D3KooWBJD4UJbUFasdRZDJ9UjJXQywVetX3AL1GYsdjbgWDCqz") through "/ip4/10.15.0.16/tcp/30333": Pending connection: Invalid peer ID.

Similarly, the logs found on the validator concerning the sentry:

Sep 23 16:23:25 kusama-validator-ae1-2 kusama[7855]: 2020-09-23 16:23:25.892 tokio-runtime-worker TRACE sub-libp2p  Libp2p => Reach failure for Some(PeerId("12D3KooWGKWWpmaKxtj97FXEaBznrNta1Kpvk2XjNg7bp6pzLQKz")) through "/ip4/10.15.0.4/tcp/30333": InvalidPeerId
Sep 23 16:23:25 kusama-validator-ae1-2 kusama[7855]: 2020-09-23 16:23:25.893 tokio-runtime-worker DEBUG sub-libp2p  Libp2p => Dial failure for PeerId("12D3KooWGKWWpmaKxtj97FXEaBznrNta1Kpvk2XjNg7bp6pzLQKz")
Sep 23 16:23:25 kusama-validator-ae1-2 kusama[7855]: 2020-09-23 16:23:25.893 tokio-runtime-worker DEBUG sub-libp2p  PSM <= Dropped(PeerId("12D3KooWGKWWpmaKxtj97FXEaBznrNta1Kpvk2XjNg7bp6pzLQKz"))
Sep 23 16:23:25 kusama-validator-ae1-2 kusama[7855]: 2020-09-23 16:23:25.893 tokio-runtime-worker TRACE sub-libp2p  Libp2p => Failed to reach PeerId("12D3KooWGKWWpmaKxtj97FXEaBznrNta1Kpvk2XjNg7bp6pzLQKz") through "/ip4/10.15.0.4/tcp/30333": Pending connection: Invalid peer ID.

The PeerId in question indeed matches the one self-reported by the target node.
(i.e. the validator is indeed 12D3KooWGKWWpmaKxtj97FXEaBznrNta1Kpvk2XjNg7bp6pzLQKz, and the sentry is indeed 12D3KooWBJD4UJbUFasdRZDJ9UjJXQywVetX3AL1GYsdjbgWDCqz)

The IP addresses also match the ones the target nodes are listening on.
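For context on what the "Invalid peer ID" error means here: when a node dials a peer with an expected PeerId (which is what --reserved-nodes sets up), the connection is only accepted if the identity proven by the remote's public key during the security handshake matches that expectation. A conceptual sketch of that check; the types and names below are illustrative stand-ins, not the actual libp2p API:

```rust
// Conceptual sketch only; `PeerId`, `PublicKey` and `DialError` here are
// illustrative stand-ins, not the real libp2p types.
#[derive(Debug, Clone, PartialEq, Eq)]
struct PeerId(String);

struct PublicKey(Vec<u8>);

impl PublicKey {
    /// Derive a PeerId from the public key authenticated during the Noise handshake.
    /// (The real code encodes the key as a multihash; elided here.)
    fn to_peer_id(&self) -> PeerId {
        PeerId(bs58::encode(&self.0).into_string())
    }
}

#[derive(Debug)]
enum DialError {
    /// Corresponds to the "Pending connection: Invalid peer ID" log lines above:
    /// the remote authenticated as `actual`, but we dialed expecting `expected`.
    InvalidPeerId { expected: PeerId, actual: PeerId },
}

/// The dial only succeeds if the authenticated identity matches the expected one.
fn check_dialed_peer(expected: &PeerId, remote_key: &PublicKey) -> Result<(), DialError> {
    let actual = remote_key.to_peer_id();
    if &actual == expected {
        Ok(())
    } else {
        Err(DialError::InvalidPeerId { expected: expected.clone(), actual })
    }
}
```

The puzzling part in the logs above is that the expected and self-reported PeerIds do match, yet the dial still fails with this error.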

tomaka added the I3-bug label on Sep 24, 2020
@tomaka
Contributor Author

tomaka commented Sep 24, 2020

I observed in a recent burn-in that we indeed get far more InvalidPeerId dialing errors, and the changes in the PR don't really explain such an increase.

@tomaka
Contributor Author

tomaka commented Sep 24, 2020

Also found in the validator's logs:

Sep 23 16:24:19 kusama-validator-ae1-2 kusama[7855]: 2020-09-23 16:24:19.907 tokio-runtime-worker TRACE sub-libp2p  Libp2p => IncomingConnection(/ip4/10.15.0.16/tcp/30333,/ip4/10.15.0.4/tcp/56720))
Sep 23 16:24:20 kusama-validator-ae1-2 kusama[7855]: 2020-09-23 16:24:20.036 tokio-runtime-worker DEBUG sub-libp2p  Libp2p => Connection (ConnectionId(TaskId(970)),Listener { local_addr: "/ip4/10.15.0.16/tcp/30333", send_back_addr: "/ip4/10.15.0.4/tcp/56720" }) to 12D3KooWFdBXp9j63KZYbBNmAL293hy1mvViyWCFQYvNaRetaLRY established.
Sep 23 16:24:20 kusama-validator-ae1-2 kusama[7855]: 2020-09-23 16:24:20.037 tokio-runtime-worker DEBUG sub-libp2p  Libp2p => Connected(12D3KooWFdBXp9j63KZYbBNmAL293hy1mvViyWCFQYvNaRetaLRY, Listener { local_addr: "/ip4/10.15.0.16/tcp/30333", send_back_addr: "/ip4/10.15.0.4/tcp/56720" }): Incoming connection
Sep 23 16:24:20 kusama-validator-ae1-2 kusama[7855]: 2020-09-23 16:24:20.037 tokio-runtime-worker DEBUG sub-libp2p  PSM <= Incoming(12D3KooWFdBXp9j63KZYbBNmAL293hy1mvViyWCFQYvNaRetaLRY, IncomingIndex(182)).
Sep 23 16:24:20 kusama-validator-ae1-2 kusama[7855]: 2020-09-23 16:24:20.037 tokio-runtime-worker TRACE sub-libp2p  Libp2p => Connected(PeerId("12D3KooWFdBXp9j63KZYbBNmAL293hy1mvViyWCFQYvNaRetaLRY"))
Sep 23 16:24:20 kusama-validator-ae1-2 kusama[7855]: 2020-09-23 16:24:20.037 tokio-runtime-worker DEBUG sub-libp2p  PSM => Reject(IncomingIndex(182), PeerId("12D3KooWFdBXp9j63KZYbBNmAL293hy1mvViyWCFQYvNaRetaLRY")): Rejecting connections.
Sep 23 16:24:20 kusama-validator-ae1-2 kusama[7855]: 2020-09-23 16:24:20.037 tokio-runtime-worker DEBUG sub-libp2p  Handler(PeerId("12D3KooWFdBXp9j63KZYbBNmAL293hy1mvViyWCFQYvNaRetaLRY")) <= Disable
Sep 23 16:24:20 kusama-validator-ae1-2 kusama[7855]: 2020-09-23 16:24:20.079 tokio-runtime-worker DEBUG sub-libp2p  Libp2p => Connection (ConnectionId(TaskId(970)),Listener { local_addr: "/ip4/10.15.0.16/tcp/30333", send_back_addr: "/ip4/10.15.0.4/tcp/56720" }) to 12D3KooWFdBXp9j63KZYbBNmAL293hy1mvViyWCFQYvNaRetaLRY closed.
Sep 23 16:24:20 kusama-validator-ae1-2 kusama[7855]: 2020-09-23 16:24:20.079 tokio-runtime-worker DEBUG sub-libp2p  Libp2p => Disconnected(12D3KooWFdBXp9j63KZYbBNmAL293hy1mvViyWCFQYvNaRetaLRY): Was disabled.
Sep 23 16:24:20 kusama-validator-ae1-2 kusama[7855]: 2020-09-23 16:24:20.079 tokio-runtime-worker TRACE sub-libp2p  Libp2p => Disconnected(PeerId("12D3KooWFdBXp9j63KZYbBNmAL293hy1mvViyWCFQYvNaRetaLRY"), Some(IO(Custom { kind: Other, error: "yamux error: connection is closed" })))

10.15.0.4 is the IP address of the sentry, whose PeerId is normally 12D3KooWGKWWpmaKxtj97FXEaBznrNta1Kpvk2XjNg7bp6pzLQKz. This PeerId is the one the sentry itself reports to the telemetry.

There might be an issue in how the Noise handshake interprets the PeerId sent by the remote.

@romanb
Contributor

romanb commented Sep 24, 2020

Just a few questions and comments off the top of my head:

  1. Which substrate commits / versions do 0.8.23 and 0.8.24 correspond to, and which version is run by the validator and the sentry, respectively? The headline suggests that even 0.8.23 <-> 0.8.23 shows this behaviour? It is still a bit unclear what code is running on these two problematic nodes.

  2. What does it mean to "use peer IDs of the new format" w.r.t. polkadot/substrate? Does this imply specific configuration options or just the use of specific versions?

  3. Is there a change in behaviour over time as the nodes are running?

  4. Is it known whether this happened before or only after #7076 (No longer actively open legacy substreams)? I guess both #7076 and #7077 (Upgrade to libp2p-0.28) are in 0.8.24 but not in 0.8.23?

> 10.15.0.4 is the IP address of the sentry, whose PeerId is normally 12D3KooWGKWWpmaKxtj97FXEaBznrNta1Kpvk2XjNg7bp6pzLQKz. This PeerId is the one the sentry itself reports to the telemetry.

So you're seeing connections being made in the logs to an IP address and a different peer ID than what is supposed to be the "correct" peer ID associated with that address?

As a first thought, it seems likely that legacy peer IDs can be floating around the DHT for quite a while, not just in routing tables but also in the form of previously stored records for the same "authority id" by the authority discovery. After all, until just recently connecting to such peer IDs worked fine and there was no reason to evict them. If there is no validator who still publishes its legacy peer ID in the context of authority discovery, the records should eventually be replaced in / evicted from the DHT, of course. If this behaviour / these errors are dependent on the current contents of the DHT, this can be difficult to reproduce.

> There might be an issue in how the Noise handshake interprets the PeerId sent by the remote.

What do you suspect goes wrong in the noise handshake w.r.t. "interpreting the peer ID"?

@tomaka
Contributor Author

tomaka commented Sep 24, 2020

> Which substrate commits / versions do 0.8.23 and 0.8.24 correspond to, and which version is run by the validator and the sentry, respectively? The headline suggests that even 0.8.23 <-> 0.8.23 shows this behaviour? It is still a bit unclear what code is running on these two problematic nodes.

The logs that I showed above happen with 0.8.23 <-> 0.8.23.

The connectivity issue was initially detected when attempting to update the two nodes to 0.8.24. As part of the update attempt, the PeerIds were updated. The update was then reverted, and it was later discovered that the connectivity issue was actually caused by updating PeerIds.
In other words, 0.8.24 <-> 0.8.24 also doesn't work, presumably because of the same problem.

> What does it mean to "use peer IDs of the new format" w.r.t. polkadot/substrate? Does this imply specific configuration options or just the use of specific versions?

I'm referring to the --reserved-nodes option, which lets you specify an IP+PeerId pair of a node to always try to stay connected to. The sentry has the validator's identity as a reserved node, and vice versa.
Additionally, the validator node is in --reserved-only mode, meaning that it will quickly close connections coming from nodes whose PeerId wasn't passed through --reserved-nodes.
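For reference, each --reserved-nodes entry is a multiaddr that ends in the expected peer identity, e.g. /ip4/10.15.0.4/tcp/30333/p2p/12D3KooWGKWWpmaKxtj97FXEaBznrNta1Kpvk2XjNg7bp6pzLQKz, and the node splits it into the address to dial plus the PeerId it expects the remote to authenticate as. A minimal sketch of that split, assuming the 2020-era libp2p API where Protocol::P2p carries a Multihash (newer libp2p versions differ):

```rust
use libp2p::multiaddr::{Multiaddr, Protocol};
use libp2p::PeerId;

/// Split a reserved-node multiaddr into (address to dial, expected PeerId).
fn split_reserved_node(addr_with_peer: &str) -> Result<(Multiaddr, PeerId), String> {
    let mut addr: Multiaddr = addr_with_peer
        .parse()
        .map_err(|e| format!("invalid multiaddr: {:?}", e))?;
    // The trailing /p2p/<PeerId> component carries the identity we expect
    // the remote to prove during the handshake.
    match addr.pop() {
        Some(Protocol::P2p(multihash)) => {
            let peer_id = PeerId::from_multihash(multihash)
                .map_err(|_| "trailing /p2p component is not a valid PeerId".to_string())?;
            Ok((addr, peer_id))
        }
        _ => Err("address does not end in /p2p/<PeerId>".to_string()),
    }
}

fn main() {
    // Sentry address and PeerId taken from the logs above.
    let (addr, peer_id) = split_reserved_node(
        "/ip4/10.15.0.4/tcp/30333/p2p/12D3KooWGKWWpmaKxtj97FXEaBznrNta1Kpvk2XjNg7bp6pzLQKz",
    )
    .unwrap();
    println!("dial {} expecting {}", addr, peer_id);
}
```

With --reserved-only on top of that, the validator additionally rejects any incoming peer whose PeerId is not in this reserved set, which matches the "Rejecting connections" lines in the logs above.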

I haven't investigated what was actually exchanged on the wire.

> Is there a change in behaviour over time as the nodes are running?

We didn't try for more than approximately 20 minutes. During these 20 minutes, the two nodes constantly tried to connect to each other and never succeeded.

> Is it known whether this happened before or only after #7076? I guess both #7076 and #7077 are in 0.8.24 but not in 0.8.23?

#7076 isn't merged yet. The problem happens without this PR, and it's unclear whether this PR exacerbates the problem or if it was already there before.

#7077 wasn't in 0.8.23, so the problem apparently happens both before and after #7077.

> So you're seeing connections being made in the logs to an IP address and a different peer ID than what is supposed to be the "correct" peer ID associated with that address?

Indeed!

> What do you suspect goes wrong in the noise handshake w.r.t. "interpreting the peer ID"?

This was mostly a hypothesis that could explain it, but I don't have any reason to believe that this in particular is the cause.

@tomaka
Contributor Author

tomaka commented Sep 25, 2020

The case of the burn-in of #7076 is interesting, and seems to show some issues as well.

Here is the number of errors when the node (with #7076) dials out to another node:
In green, the number of InvalidPeerId errors per minute.
The deployment happened on September 21st around 12:00, which is when the number suddenly rises. Before this, the node was using Polkadot 0.7.23.
On the 23rd around 17:00, the node was restarted. I've verified that the commit on the node indeed remained the same.

[graph: InvalidPeerId dial errors per minute]

As you can see, the number suddenly dropped after the restart.

Here is the number of disconnects per second:
In blue, the number of disconnects caused by the other side closing the connection.

[graph: disconnects per second]

The reason why the remote closes the connection is unknown, but I do suspect InvalidPeerId errors are involved.

@tomaka
Contributor Author

tomaka commented Sep 25, 2020

It's unclear to me which configuration flags the node had during the various phases.

@tomaka
Contributor Author

tomaka commented Sep 25, 2020

It seems that between the 21st and 23rd, the node had --reserved-nodes CLI options, similar to the situation earlier in the issue.
After the 23rd, these flags were removed.
The node has just now been restarted with --reserved-nodes flags again, and the graphs now look similar to the period between the 21st and the 23rd.

@tomaka
Contributor Author

tomaka commented Sep 28, 2020

We just ran an experiment: rather than generating a new configuration with PeerIds in the new format, we took the existing configuration of a node (using only old PeerIds) and manually replaced the old PeerIds with new ones. The outcome is that the number of peers is the same as before, and the rate of InvalidPeerId errors is the same as before.

Therefore I'd conclude that it's not a bug in the networking code but a problem in how the new configuration was generated.

@tomaka
Contributor Author

tomaka commented Sep 30, 2020

False alarm: the --node-key option was passed to the node after the update to 0.8.24, while it wasn't before.

tomaka closed this as completed on Sep 30, 2020