-
Notifications
You must be signed in to change notification settings - Fork 725
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correct multiple dial bug #5113
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
ENRs to dial added in
self.peers_to_dial.push(peer); |
poped in
if let Some(enr) = self.peers_to_dial.pop() { |
The extra condition prevents to dial ENRs that are currently in the "pending to dial" state. Once picked by the network behaviour, and in ConnectingType::Dialing
status, can the same ENR be re-dialed?
Yeah good question. The logic is a bit convoluted. When we discover, we call the Before adding it to the list, we check its state here: https://github.com/sigp/lighthouse/blob/stable/beacon_node/lighthouse_network/src/peer_manager/mod.rs#L409 Which makes sure its not in the dialing state: https://github.com/sigp/lighthouse/blob/stable/beacon_node/lighthouse_network/src/peer_manager/peerdb.rs#L121 |
Just to elaborate, The reason its done like this, is because libp2p keeps its own internal state of peer status. Historically we could not access this, so Lighthouse keeps its own track of peer connection status. To keep things consistent, we never update our lighthouse state, until we know for sure the call to libp2p has been made. So in this particular instance, there is a period of where lighthouse wants to dial, but we haven't informed libp2p yet, so we haven't updated our lighthouse state. And there is a small race condition in this intermediate state. At some point in the future, it would be nice to remove the logic around Lighthouse tracking a peer's state and use solely libp2ps. |
I realised we'd log that we were dialing when we could reject it a bit later. This logic had changed recently, so I've improved the logging and modified the logic slightly to only count peers as dialing if they are in the correct state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice pick, makes sense to nest the logs on the dial_peer conditional. One minor nit on the function doc
Co-authored-by: Lion - dapplion <35266934+dapplion@users.noreply.github.com>
* Dialing the same peer-id error fix * Improve dialing logging * Update beacon_node/lighthouse_network/src/peer_manager/mod.rs Co-authored-by: Lion - dapplion <35266934+dapplion@users.noreply.github.com> --------- Co-authored-by: Lion - dapplion <35266934+dapplion@users.noreply.github.com>
Issue Addressed
There is a race-condition where discovery queries pile up and return the same peer. Before the peer manager has a chance to dial peers, the queue can build up multiple peers to attempt to dial.
This removes the chance of duplication of a peer in the dial queue and will therefore prevent the error that can result when we attempt to dial a peer twice.