-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Garbage collect incompatible peers in Host::run() #5624
Conversation
Garbage collect peers in Host::run() which are deemed incompatible - an incompatible peer is one with which we will never be able to connect to successfully peer with (e.g. on a different chain or network, not running any capabilities).
libp2p/Peer.cpp
Outdated
case IncompatibleProtocol: | ||
case UnexpectedIdentity: | ||
case UserReason: | ||
return numeric_limits<unsigned>::max(); | ||
case TooManyPeers: | ||
return 25 * (m_failedAttempts + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m_failedAttempts
seems to affect only the value of fallbackSeconds currently. Maybe we should use it now to retry several times and then go to "critical error, disconnect" state.
(at least for some cases of failures)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gumb0 Good idea - what do you think of this being taken care of in another PR? I'd like to limit the amount of changes I make to the peer gc logic in this PR so if something ends up breaking it will be easier to debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok for another PR, but it seems to be the matter of only adding condition like m_failedAttempts >= 20
to uselessPeer
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional thought is that we could make all this change more conservative if we leave fallbackSeconds()
as it was before (so that we do the same reconnects as before) and have just this check for failed attempts count in uselessPeer
(plus the new check for handshake failures)
This way it would reconnect with the same intervals for each case as before, but stop after limited number of attempts.
But I'm fine with it if you think it's better to immediately stop in some cases.
Rather than use a magic number to indicate a useless peer (numeric_limits<unsigned>::max()), add a function to Peer (Peer::uselessPeer) which can be called to detect this. Also update Peer::fallbackSeconds() to return a number which is large enough to prevent us from realistically ever connecting to a useless peer but which is small enough to not overflow when added to Peer::m_lastAttempted. Also, log a message with error verbosity in Host::handshakeFailed() if a peer is not found rather than asserting, since while we never expect users to hit this case, the EIP8 handshake tests can.
Update Peer::uselessPeer() to check peer type and return false if it's a required peer, since in this case the user obviously wants to try to stay connected to this peer.
Codecov Report
@@ Coverage Diff @@
## master #5624 +/- ##
==========================================
+ Coverage 62.62% 62.63% +<.01%
==========================================
Files 350 350
Lines 29690 29742 +52
Branches 3344 3350 +6
==========================================
+ Hits 18593 18628 +35
- Misses 9889 9901 +12
- Partials 1208 1213 +5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments and a small bug in the loop iterating over peers.
libp2p/Host.h
Outdated
std::shared_ptr<Peer> peer(NodeID const& _n) const; | ||
|
||
/// Set a handshake failure reason for a peer | ||
void handshakeFailed(NodeID const& _n, HandshakeFailureReason _r); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe better onHandshakeFailed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I would make it public and declare close to startPeerSession
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gumb0 : Why make this public, does it make sense to expose the concept of a handshake to consumers of Host?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well it looks to me like a callback similar to startPeerSession
, the callback called by RLPXHandshake
when handshake is finished. One is for success another one is for failure.
It works as private, because RLPXHandshake
is a friend of Host
, but idealluy we should get rid of this friend declarations at some point.
Exposing it to the clients of Host
is of course not great, but the proper way to deal with it could be to create a separate interface with these callbacks only, don't expose it to Host
clients, but pass it only to RLPXHandshake
. That's a bit complicated change, at least it's not for this PR.
(In other words, we won't make it much worse, because startPeerSession
is already public)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well it looks to me like a callback similar to
startPeerSession
, the callback called byRLPXHandshake
when handshake is finished. One is for success another one is for failure.It works as private, because
RLPXHandshake
is a friend ofHost
, but idealluy we should get rid of this friend declarations at some point.Exposing it to the clients of
Host
is of course not great, but the proper way to deal with it could be to create a separate interface with these callbacks only, don't expose it toHost
clients, but pass it only toRLPXHandshake
. That's a bit complicated change, at least it's not for this PR.(In other words, we won't make it much worse, because
startPeerSession
is already public)
Ah that makes sense, thank you for clarifying! 😄 I'll make the change before merging.
libp2p/Peer.cpp
Outdated
case UselessPeer: | ||
case IncompatibleProtocol: | ||
case UnexpectedIdentity: | ||
case UserReason: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about UserReason
- this could in theory be any kind of reason specific to subprotocol, including some temporary reasons, maybe it makes sense to try to reconnect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose to treat UserReason as terminal because we only use it in 2 cases - if status validation fails:
aleth/libethereum/BlockChainSync.cpp
Lines 190 to 202 in 505aead
std::string disconnectReason; | |
if (peerSessionInfo->clientVersion.find("/v0.7.0/") != string::npos) | |
disconnectReason = "Blacklisted client version."; | |
else | |
disconnectReason = _peer.validate( | |
host().chain().genesisHash(), host().protocolVersion(), host().networkId()); | |
if (!disconnectReason.empty()) | |
{ | |
LOG(m_logger) << "Peer " << _peer.id() << " not suitable for sync: " << disconnectReason; | |
m_host.capabilityHost().disconnect(_peer.id(), p2p::UserReason); | |
return; | |
} |
Here's where we perform the actual validation in EthereumPeer::validate:
aleth/libethereum/EthereumPeer.cpp
Lines 58 to 68 in 505aead
if (m_networkId != _hostNetworkId) | |
error << "Network identifier mismatch. Host network id: " << _hostNetworkId | |
<< ", peer network id: " << m_networkId; | |
else if (m_protocolVersion != _hostProtocolVersion) | |
error << "Protocol version mismatch. Host protocol version: " << _hostProtocolVersion | |
<< ", peer protocol version: " << m_protocolVersion; | |
else if (m_genesisHash != _hostGenesisHash) | |
error << "Genesis hash mismatch. Host genesis hash: " << _hostGenesisHash.abridged() | |
<< ", peer genesis hash: " << m_genesisHash.abridged(); | |
else if (m_asking != Asking::State && m_asking != Asking::Nothing) | |
error << "Peer banned for unexpected status message."; |
- If there's a bug in our network code and Session read validation fails:
Lines 409 to 418 in 505aead
else if (_length != _expected) { // with static m_data-sized buffer this shouldn't happen unless there's a regression // sec recommends checking anyways (instead of assert) LOG(m_netLoggerError) << "Error reading - TCP read buffer length differs from expected frame size (" << _length << " != " << _expected << ")"; disconnect(UserReason); return false; }
You bring up a good point though, technically it can be any reason specific to a subprotocol and other clients can also choose to send it to us for reasons specific to their implementation. I think that rather than treating it as immediately critical we should use it in some sort of reconnection count threshold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gumb0 I've decided to change UserReason
-> UselessPeer
when status message validation fails for a peer (since if that happens the peer is effectively useless to us i.e. we can't sync with it). I've also added in logic in Peer::isUseless
to take the number of failed connection attempts into account when the last disconnect isn't critical.
Status validation failing means we have no chance of syncing with the peer (e.g. it's running on a different network) so a UselessPeer disconnect reason makes more sense than UserReason. This also has the benefit of status failure validation being treated as a critical disconnect and these nodes being gc'd in Host::run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, feel free to make onHandshakeFailed
public, if you agree with my reasoning, but it's not critical
Various minor changes in RLPxHandshake, Host, and Peer classes: * Initialize RLPxHandshake::m_failureReason in ctor * Reduce redundant code in RLPxHandshake which sets the last failure reason when TCP errors occur * Rename Host::handshakeFailed * Fix m_peers iteration bug in Host * Make Host::onHandshakeFailed public (we want to avoid using friend classes where possible) * Rename Peers::uselessPeer * Update deprecated comment in Peer * Take # of failed connection attempts into account in Peer function which determines if instance is useless or not Move fallback seconds computation for default case to anonymous namespace function
Garbage collect peers in
Host::run()
which are deemed incompatible - an incompatible peer is one with which we will never be able to successfully peer with (e.g. on a different chain or network, not running any capabilities).