Fix multiple Ping and assertion failure in Discovery #5483
Codecov Report
@@ Coverage Diff @@
## master #5483 +/- ##
==========================================
- Coverage 61.9% 61.89% -0.02%
==========================================
Files 345 345
Lines 28702 28725 +23
Branches 3266 3266
==========================================
+ Hits 17768 17779 +11
- Misses 9768 9778 +10
- Partials 1166 1168 +2
The assert failure still happens here.
Still crashes :)
This still doesn't fix assertion failure, just improves handling of attempts to Ping multiple times. But it gets quite confusing and difficult to keep in mind all the possible cases. I'm thinking now that
libp2p/NodeTable.cpp
Outdated
@@ -325,6 +331,9 @@ void NodeTable::ping(NodeEntry const& _nodeEntry, boost::optional<NodeID> const&
if (_ec || m_timers.isStopped())
    return;

if (contains(m_sentPings, _nodeEntry.id))
Should we maybe log something here?
libp2p/NodeTable.cpp
Outdated
if (contains(m_sentPings, _nodeEntry.id))
    return;

NodeIPEndpoint src;
(Nit) Can be combined onto one line?
libp2p/NodeTable.cpp
Outdated
PingNode p(src, _nodeEntry.endpoint);
p.ts = nextRequestExpirationTime();
auto const pingHash = p.sign(m_secret);
LOG(m_logger) << p.typeName() << " to " << _nodeEntry.id << "@" << p.destination;
Can we just log the _nodeEntry here to get both the ID and the destination endpoint?
It turned out to be quite a lot of change, but I think it hasn't gotten more complicated, at least.
Don't put not-yet-validated nodes there, nor the ones that don't fit into the bucket and are replacement nodes for evicted ones. Replacement nodes are kept only in the m_sentPings items.
3ef9cf6 to 0daaaa5
Rebased and addressed @halfalicious's comments
libp2p/NodeTable.h
Outdated
@@ -126,14 +126,17 @@ class NodeTable : UDPSocketEvents
/// Called by implementation which provided handler to process NodeEntryAdded/NodeEntryDropped events. Events are coalesced by type whereby old events are ignored.
void processEvents();

/// Add node to the list of all nodes and ping it to trigger the endpoint proof.
/// Starts async node adding tot the node table by pinging it to trigger the endpoint proof.
Typo in comment ("tot"), and I think it could be rephrased a little, e.g. "Starts async add of node to the node table..."
libp2p/NodeTable.h
Outdated
///
/// @return True if the node has been added.
/// @return True if the node id valid.
Grammar: “If the node id is valid”
What were the cases where we would send double pings? And why would we hit the assertion?
Cases when we sent double pings:
The assertion failure:
}

void NodeTable::noteActiveNode(Public const& _pubk, bi::udp::endpoint const& _endpoint)
void NodeTable::noteActiveNode(shared_ptr<NodeEntry> _nodeEntry, bi::udp::endpoint const& _endpoint)
The changes in this method are only:
- shared_ptr<NodeEntry> instead of Public parameter
- insert into m_allNodes together with inserting into the bucket
Tests seem to be fixed now, I think it's ready for review
// Don't sent Ping if one is already sent
if (contains(m_sentPings, _node.id))
{
    LOG(m_logger) << "Ignoring request to ping " << _node << ", because it's already pinged";
This is being logged quite a lot I should say
I think we benefit from more logging since it makes it easier to track what's going on and detect bugs, but wading through the log spew can be challenging. It would be nice to eventually have some additional log levels so we could have more targeted logging... In the meantime, I think we should keep/add logs unless they significantly impact the readability of the Aleth logs.
Separating logs into INFO and DEBUG levels would be good. In INFO only actual changes (node added, node replaced by) and/or stats (X new nodes discovered).
The rest should go to DEBUG. Later we can consider splitting DEBUG into DEBUG and TRACE.
For aleth-bootnode the networking INFO level should be enabled by default.
For aleth all networking logs should be disabled.
> For aleth-bootnode the networking INFO level should be enabled by default.
Agreed, filed #5499 to track this
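To make the proposed INFO/DEBUG split concrete, here is a minimal sketch of level-based filtering. It is not Aleth's actual logging API: `LogLevel`, `NetLogger`, `makeBootnodeLogger`, and `makeAlethLogger` are hypothetical names introduced only for illustration, and the defaults simply mirror the ones suggested in this thread.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical levels, ordered from least to most verbose.
enum class LogLevel { Info = 0, Debug = 1, Trace = 2 };

// Hypothetical logger, not Aleth's real logging API.
class NetLogger
{
public:
    NetLogger(std::ostream& _sink, bool _enabled, LogLevel _maxLevel)
      : m_sink(_sink), m_enabled(_enabled), m_maxLevel(_maxLevel)
    {}

    // Writes the message only if logging is enabled and the level is allowed.
    void log(LogLevel _level, std::string const& _message)
    {
        if (m_enabled && _level <= m_maxLevel)
            m_sink << _message << '\n';
    }

private:
    std::ostream& m_sink;
    bool m_enabled;
    LogLevel m_maxLevel;
};

// Defaults as proposed in the thread: INFO for aleth-bootnode, nothing for aleth.
inline NetLogger makeBootnodeLogger(std::ostream& _sink)
{
    return NetLogger(_sink, true, LogLevel::Info);
}

inline NetLogger makeAlethLogger(std::ostream& _sink)
{
    return NetLogger(_sink, false, LogLevel::Info);
}

// Usage: a bootnode-style logger keeps state changes and drops DEBUG chatter.
inline void exampleUsage()
{
    std::ostringstream out;
    NetLogger bootnode = makeBootnodeLogger(out);
    bootnode.log(LogLevel::Info, "Node added");
    bootnode.log(LogLevel::Debug, "Ignoring request to ping, already pinged");
    assert(out.str() == "Node added\n");
}
```

With this shape, the "Ignoring request to ping" message discussed above would be logged at DEBUG and would not show up in a default aleth-bootnode run.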
    entry = createNodeEntry(_node, 0, 0);
else
    entry = it->second;
needToPing = (it == m_allNodes.end() || !it->second->hasValidEndpointProof());
Should we also check m_sentPings? What if we sent a ping recently but just haven't received a response yet?
Yeah, good point, it would improve things, but unfortunately it's not easy to do it here, because m_sentPings is accessed only from the network thread...
We could either make addNode asynchronous; or we could create another private method for adding nodes from the packet handlers (but it would be very similar to addNode); or maybe we could check before sending Ping in ping() whether the node already happened to be validated, and then skip the additional Ping.
All this seems quite ugly to me for now. I'd suggest first observing how often we get double pings because of this problem, and then deciding whether it makes sense to complicate the code...
Can we address this by adding the ping information in schedulePing() before we actually schedule the ping? One danger with doing this is that the ping might not actually execute until after the UDP datagram time to live (1 minute) has passed, which means that the ping would already be considered expired when it gets sent out over the wire. I don't know much about boost deadline timers, but I don't think this can realistically happen, even if a lot of boost deadline timers expire around the same time (you'd have to execute a lot of handlers to consume 1 minute of wall clock time, and I don't think we create that many deadline timers). Another possible way to address this would be via something like m_queuedPings: we could check both this and m_sentPings before deciding whether to queue a new ping.
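A rough sketch of the m_queuedPings idea, to show what "check both" could look like. The class name `PingBook` and its methods are hypothetical, and it makes one assumption that differs from the current code: both sets are guarded by a single mutex so that callers on any thread can consult them, whereas in this PR m_sentPings is touched only from the network thread.

```cpp
#include <cassert>
#include <mutex>
#include <set>
#include <string>

// Hypothetical stand-in for the real NodeID type.
using NodeID = std::string;

// Assumption: one mutex guards both sets; the PR keeps m_sentPings
// network-thread-only, so this is a different trade-off.
class PingBook
{
public:
    // Any thread. Returns true if the caller should schedule a new Ping.
    bool tryQueue(NodeID const& _id)
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        if (m_sentPings.count(_id))
            return false;  // Ping already sent, Pong not received yet
        return m_queuedPings.insert(_id).second;  // false if already queued
    }

    // Network thread: the queued Ping is being sent now.
    void markSent(NodeID const& _id)
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        auto const erased = m_queuedPings.erase(_id);
        assert(erased == 1);  // only queued pings may be sent
        (void)erased;         // silence unused-variable warning when NDEBUG is set
        m_sentPings.insert(_id);
    }

    // Network thread: Pong received or the Ping expired.
    void markDone(NodeID const& _id)
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        m_sentPings.erase(_id);
    }

private:
    std::mutex m_mutex;
    std::set<NodeID> m_queuedPings;
    std::set<NodeID> m_sentPings;
};
```

In this sketch addNode would call `tryQueue` and only schedule a ping when it returns true; whether the extra lock is acceptable is exactly the kind of complication the previous comment suggests measuring first.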
ping(*entry);
if (needToPing)
{
    LOG(m_logger) << "Pending " << _node;
This log and "Adding node " are redundant. How about we log either "Pending" or "Added"?
LOG(m_logger) << "Skipping making node with unallowed endpoint active. Node " << _pubk
              << "@" << _endpoint;
LOG(m_logger) << "Skipping making node with unallowed endpoint active. Node "
              << _nodeEntry->id << "@" << _endpoint;
You can log _nodeEntry here.

LOG(m_logger) << "Active node " << _nodeEntry->id << '@' << _endpoint;
You can log _nodeEntry here.
Filed #5500 to track this
Addresses #5471 and probably some things from #5484
Summary of changes:
- Check for an already sent Ping (by looking at m_sentPings) before sending another one.
- The ping method is split into ping and schedulePing: when we are already in the network thread, we can directly access m_sentPings without an additional m_timers.schedule(0, ...). So now, when we need to ping from the network thread, we just call the ping method without scheduling.
- m_allNodes now contains only the nodes from the node table buckets.
  - Nodes that are not yet validated (Pong not received yet) and nodes that didn't fit into the bucket and are waiting for an older node to be evicted are not put into m_allNodes.
  - Nodes are added to m_allNodes only in noteActiveNode, together with being put into the bucket.
  This allows us not to care about erasing from m_allNodes the nodes that didn't get validated or are being thrown away when eviction ends with the old node answering.
  (Maybe m_allNodes should be renamed now.)
- Replacement nodes are kept in m_sentPings items (as a shared_ptr<NodeEntry> replacementNodeEntry member). They are dropped when we erase from m_sentPings in the Pong handler.
- NodeIPEndpoint is kept in m_sentPings now, because this data is sent to us in Ping (or in Neighbours), and when we receive it from an unknown node, we have to save it somewhere until the moment we add it to the node table and to m_allNodes.
  (Actually, I think only the TCP port number is important to save; the rest is overwritten from the actual source of the UDP packet anyway. But I expect this part to change somewhat when addressing the problem described in Discovery: additional security check before sending Neighbours #5455 (comment).
  I expect m_sentPings to have the UDP endpoint as a map key instead of NodeID in the future.)
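The ping / schedulePing split and the m_sentPings check from the summary can be illustrated with a simplified model. This is only a sketch under stated assumptions, not the code from this PR: `PingTracker`, `SentPing`, `runNetworkTasks`, and `packetsSent` are hypothetical names, NodeID is reduced to a string, and the network thread is modelled as a plain task queue instead of m_timers.schedule(0, ...).

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the real NodeID type.
using NodeID = std::string;

// Hypothetical record of an outstanding Ping.
struct SentPing
{
    std::chrono::steady_clock::time_point sentAt;
};

class PingTracker
{
public:
    // Callable from outside the network thread: defers the work to it.
    // The task queue stands in for m_timers.schedule(0, ...).
    void schedulePing(NodeID const& _id)
    {
        m_networkTasks.push_back([this, _id] { ping(_id); });
    }

    // Network thread only, because it touches m_sentPings directly.
    // Returns true if a Ping was actually sent.
    bool ping(NodeID const& _id)
    {
        assert(!_id.empty());  // an empty id would be a caller bug
        if (m_sentPings.count(_id))
            return false;  // don't send Ping if one is already sent
        m_sentPings.emplace(_id, SentPing{std::chrono::steady_clock::now()});
        ++m_packetsSent;
        return true;
    }

    // Pong handler: the outstanding Ping is forgotten.
    void onPong(NodeID const& _id) { m_sentPings.erase(_id); }

    // Runs the queued tasks, standing in for the network thread's event loop.
    void runNetworkTasks()
    {
        std::vector<std::function<void()>> tasks;
        tasks.swap(m_networkTasks);
        for (auto& task : tasks)
            task();
    }

    int packetsSent() const { return m_packetsSent; }

private:
    std::map<NodeID, SentPing> m_sentPings;
    std::vector<std::function<void()>> m_networkTasks;
    int m_packetsSent = 0;
};
```

In this model, two schedulePing calls for the same node result in a single packet once the queued tasks run, and a new Ping goes out only after onPong has cleared the entry. Code that already runs on the network thread, such as packet handlers, would call ping directly.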