Connectivity issue between validator and sentry on 0.8.23 #7198
Comments
I observed in a recent burnin that we indeed have way more […]
Also found in the validator's logs:
There might be an issue in how the Noise handshake interprets the peer ID.
Just a few questions and comments off the top of my head:
So you're seeing connections being made in the logs to an IP address and a different peer ID than what is supposed to be the "correct" peer ID associated with that address? As a first thought, it seems likely that legacy peer IDs can be floating around the DHT for quite a while, not just in routing tables but also in the form of previously stored records for the same "authority id" by the authority discovery. After all, until just recently connecting to such peer IDs worked fine and there was no reason to evict them. If there is no validator who still publishes its legacy peer ID in the context of authority discovery, the records should eventually be replaced in / evicted from the DHT, of course. If this behaviour / these errors are dependent on the current contents of the DHT, this can be difficult to reproduce.
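To make the stale-record scenario above concrete, here is a minimal sketch (not Substrate's actual authority-discovery code; `has_mismatched_peer_id`, the address, and the IP are made up for illustration, and the exact libp2p API differs between versions) of how an address whose trailing `/p2p/...` component disagrees with the expected peer ID could be detected and skipped:

```rust
use libp2p::multiaddr::{Multiaddr, Protocol};
use libp2p::PeerId;

/// Hypothetical helper: true if `addr` carries a `/p2p/<peer id>` component
/// that does not match the peer ID we expect for this authority.
fn has_mismatched_peer_id(addr: &Multiaddr, expected: &PeerId) -> bool {
    addr.iter().any(|component| match component {
        // In the libp2p versions of that era, `Protocol::P2p` carries a raw
        // multihash rather than a `PeerId`.
        Protocol::P2p(hash) => PeerId::from_multihash(hash)
            .map(|found| found != *expected)
            .unwrap_or(true),
        _ => false,
    })
}

fn main() {
    // Peer IDs taken from the logs quoted in this issue; the IP and port are
    // made up for illustration.
    let expected: PeerId = "12D3KooWGKWWpmaKxtj97FXEaBznrNta1Kpvk2XjNg7bp6pzLQKz"
        .parse()
        .expect("valid peer id");
    let addr: Multiaddr = "/ip4/10.0.0.1/tcp/30333/p2p/12D3KooWBJD4UJbUFasdRZDJ9UjJXQywVetX3AL1GYsdjbgWDCqz"
        .parse()
        .expect("valid multiaddr");
    println!("stale record: {}", has_mismatched_peer_id(&addr, &expected));
}
```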
What do you suspect goes wrong in the Noise handshake w.r.t. "interpreting the peer ID"?
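(As a rough illustration of the check in question — this is not the actual libp2p-noise code, and `remote_matches_dialed` is a made-up name:)

```rust
use libp2p::identity::PublicKey;
use libp2p::PeerId;

/// Illustrative only: after the Noise handshake the dialer learns the remote's
/// public key, derives a PeerId from it, and compares it against the PeerId it
/// originally dialed. Whether that comparison treats the legacy (Qm...) and
/// new (12D3KooW...) encodings of the same key as equal is precisely the kind
/// of detail being questioned here.
fn remote_matches_dialed(remote_key: PublicKey, dialed: &PeerId) -> bool {
    PeerId::from(remote_key) == *dialed
}
```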
The logs that I showed above happen with 0.8.23 <-> 0.8.23. The connectivity issue was initially detected when attempting to update the two nodes to 0.8.24. As part of the update attempt, the […]
I'm referring to the […]. I haven't investigated what was actually exchanged on the wire.
We didn't try for more than approximately 20 minutes. During these 20 minutes, the two nodes constantly tried to connect to each other and never succeeded.
#7076 isn't merged yet. The problem happens without that PR, and it's unclear whether it exacerbates the problem or whether the problem was already there before. #7077 wasn't in 0.8.23, so the problem apparently happens both before and after #7077.
Indeed!
This was mostly a hypothesis that could explain it, but I don't have any reason to believe that this in particular is the cause.
The case of the burnin of #7076 is interesting, and seems to show some issues as well. Here is the number of errors when the node (with #7076) dials out to another node: […] As you can see, the number suddenly dropped after the restart. Here is the number of disconnects per second: […] The reason why the remote closes the connection is unknown, but I do suspect […]
It's unclear to me which configuration flags the node had during the various phases.
It seems that between the 21st and 23rd, the node had […]
We just did an experiment. Rather than generating a new configuration with […]. Therefore I'd consider that this is not a bug in the networking code but a problem when generating the new configuration.
False alarm, the […]
@ddorgan has reported this issue happening on our nodes. It seems that validators and sentries on Polkadot 0.8.23 (but also 0.8.24) have difficulties connecting to each other, specifically when using `PeerId`s of the new format (`12...`). It works fine when using old `PeerId`s (`Qm...`).
Here is an example log found on a sentry concerning the validator:
Similarly, the logs found on the validator concerning the sentry:
The `PeerId` in question indeed matches the one self-reported by the target node (i.e. the validator is indeed `12D3KooWGKWWpmaKxtj97FXEaBznrNta1Kpvk2XjNg7bp6pzLQKz`, and the sentry is indeed `12D3KooWBJD4UJbUFasdRZDJ9UjJXQywVetX3AL1GYsdjbgWDCqz`). The IP addresses also match the IP address the target node is listening on.
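For anyone unfamiliar with the two textual forms, a minimal sketch (assuming a recent rust-libp2p; the exact API differs between versions) of how a node's `PeerId` is derived from its network key:

```rust
use libp2p::identity::Keypair;
use libp2p::PeerId;

fn main() {
    // In practice the key comes from the node's persisted network key file;
    // a freshly generated one is used here just for illustration.
    let keypair = Keypair::generate_ed25519();
    let peer_id = PeerId::from(keypair.public());

    // With a recent rust-libp2p this prints the new "12D3KooW..." form, i.e.
    // the Ed25519 public key embedded via the identity multihash. Older
    // releases hashed the key with SHA-256 instead, giving the legacy
    // "Qm..." form, which is why both forms can refer to the same node.
    println!("local peer id: {}", peer_id);
}
```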