Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct multiple dial bug #5113

Merged
merged 3 commits into from
Jan 23, 2024
Merged

Conversation

AgeManning
Copy link
Member

Issue Addressed

There is a race-condition where discovery queries pile up and return the same peer. Before the peer manager has a chance to dial peers, the queue can build up multiple peers to attempt to dial.

This removes the chance of duplication of a peer in the dial queue and will therefore prevent the error that can result when we attempt to dial a peer twice.

@AgeManning AgeManning added ready-for-review The code is ready for review v4.6.0 ETA Q1 2024 labels Jan 23, 2024
Copy link
Collaborator

@dapplion dapplion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

ENRs to dial added in

poped in

if let Some(enr) = self.peers_to_dial.pop() {

The extra condition prevents to dial ENRs that are currently in the "pending to dial" state. Once picked by the network behaviour, and in ConnectingType::Dialing status, can the same ENR be re-dialed?

@AgeManning
Copy link
Member Author

Yeah good question. The logic is a bit convoluted.

When we discover, we call the dial_peer() function.

https://github.com/sigp/lighthouse/blob/stable/beacon_node/lighthouse_network/src/peer_manager/mod.rs#L341

Before adding it to the list, we check its state here: https://github.com/sigp/lighthouse/blob/stable/beacon_node/lighthouse_network/src/peer_manager/mod.rs#L409

Which makes sure its not in the dialing state: https://github.com/sigp/lighthouse/blob/stable/beacon_node/lighthouse_network/src/peer_manager/peerdb.rs#L121

@AgeManning
Copy link
Member Author

Just to elaborate,

The reason its done like this, is because libp2p keeps its own internal state of peer status. Historically we could not access this, so Lighthouse keeps its own track of peer connection status.

To keep things consistent, we never update our lighthouse state, until we know for sure the call to libp2p has been made. So in this particular instance, there is a period of where lighthouse wants to dial, but we haven't informed libp2p yet, so we haven't updated our lighthouse state. And there is a small race condition in this intermediate state.

At some point in the future, it would be nice to remove the logic around Lighthouse tracking a peer's state and use solely libp2ps.

@AgeManning
Copy link
Member Author

I realised we'd log that we were dialing when we could reject it a bit later.

This logic had changed recently, so I've improved the logging and modified the logic slightly to only count peers as dialing if they are in the correct state.

Copy link
Collaborator

@dapplion dapplion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice pick, makes sense to nest the logs on the dial_peer conditional. One minor nit on the function doc

Co-authored-by: Lion - dapplion <35266934+dapplion@users.noreply.github.com>
@AgeManning AgeManning merged commit a36a12a into sigp:unstable Jan 23, 2024
2 checks passed
danielramirezch pushed a commit to danielramirezch/lighthouse that referenced this pull request Feb 14, 2024
* Dialing the same peer-id error fix

* Improve dialing logging

* Update beacon_node/lighthouse_network/src/peer_manager/mod.rs

Co-authored-by: Lion - dapplion <35266934+dapplion@users.noreply.github.com>

---------

Co-authored-by: Lion - dapplion <35266934+dapplion@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-review The code is ready for review v4.6.0 ETA Q1 2024
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants