This repository has been archived by the owner on Oct 28, 2021. It is now read-only.

Fix multiple Ping and assertion failure in Discovery #5483

Merged
merged 9 commits into master from you-only-ping-once on Feb 19, 2019

Conversation

gumb0
Member

@gumb0 gumb0 commented Feb 11, 2019

Addresses #5471 and probably some things from #5484

Summary of changes:

  • checking whether we've already sent a Ping (by looking at m_sentPings) before sending another one
  • the ping method is split into ping and schedulePing - when we are already on the network thread, we can access m_sentPings directly without an additional m_timers.schedule(0, ...). So now, when we need to ping from the network thread, we just call ping without any scheduling (see the sketch after this list)
  • m_allNodes now contains only the nodes that are in the node table buckets.
  • So the nodes that are still pending (no Pong received yet), and the nodes that didn't fit into the bucket and are waiting for an older node to be evicted, are not put into m_allNodes
  • Nodes are added to m_allNodes only in noteActiveNode, at the same time as they are put into the bucket.
    This means we no longer have to erase from m_allNodes the nodes that didn't get validated, or the ones that are thrown away when eviction ends with the old node answering.
    (maybe m_allNodes should be renamed now)
  • replacement nodes are remembered only in the m_sentPings items (as a shared_ptr<NodeEntry> replacementNodeEntry member). They are dropped when we erase the entry from m_sentPings in the Pong handler.
  • we have to save the node's NodeIPEndpoint in m_sentPings now, because this data is sent to us in Ping (or in Neighbours), and when we receive it from an unknown node we have to keep it somewhere until the moment we add the node to the node table and to m_allNodes.
    (actually I think only the TCP port number is important to save, the rest is overwritten from the actual source of the UDP packet anyway. But I expect this part to change somewhat when addressing the problem described in Discovery: additional security check before sending Neighbours #5455 (comment).
    I expect m_sentPings to eventually have the UDP endpoint as the map key instead of NodeID)
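
A rough sketch of the resulting ping/schedulePing split (illustrative only, not the exact diff: the m_sentPings value type, the send call, and the replacement-node argument are simplified here):

```cpp
// Sketch, not the actual aleth implementation. Assumes ping() runs only on
// the network thread, so m_sentPings can be read and written directly.
void NodeTable::ping(NodeEntry const& _nodeEntry)
{
    if (contains(m_sentPings, _nodeEntry.id))
        return;  // a Ping to this node is already in flight

    NodeIPEndpoint src;  // our own endpoint
    PingNode p(src, _nodeEntry.endpoint);
    p.ts = nextRequestExpirationTime();
    auto const pingHash = p.sign(m_secret);
    LOG(m_logger) << p.typeName() << " to " << _nodeEntry;
    m_socket->send(p);  // simplified send call

    // Remember the outstanding Ping; the Pong handler erases this entry.
    // The saved NodeIPEndpoint and the optional replacement NodeEntry live
    // only inside this map entry.
    m_sentPings[_nodeEntry.id] = {std::chrono::steady_clock::now(), pingHash,
        _nodeEntry.endpoint, nullptr /* replacementNodeEntry */};
}

void NodeTable::schedulePing(NodeEntry const& _nodeEntry)
{
    // Called when we are not on the network thread: hop onto it, then ping.
    m_timers.schedule(0, [this, _nodeEntry](boost::system::error_code const& _ec) {
        if (_ec || m_timers.isStopped())
            return;
        ping(_nodeEntry);
    });
}
```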

@codecov-io

codecov-io commented Feb 11, 2019

Codecov Report

Merging #5483 into master will decrease coverage by 0.01%.
The diff coverage is 85.21%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5483      +/-   ##
==========================================
- Coverage    61.9%   61.89%   -0.02%     
==========================================
  Files         345      345              
  Lines       28702    28725      +23     
  Branches     3266     3266              
==========================================
+ Hits        17768    17779      +11     
- Misses       9768     9778      +10     
- Partials     1166     1168       +2

@chfast
Member

chfast commented Feb 12, 2019

The assert failure still happens here.

@chfast
Member

chfast commented Feb 12, 2019

Still crashes :)

@gumb0
Member Author

gumb0 commented Feb 12, 2019

This still doesn't fix the assertion failure; it just improves the handling of attempts to Ping multiple times.

But it gets quite confusing and difficult to keep all the possible cases in mind. I'm now thinking that m_allNodes shouldn't contain both the pending nodes and the nodes already in the node table. Maybe we'll create the NodeEntry only when it's actually added to the node table. Then there won't be a need for this assert.

@@ -325,6 +331,9 @@ void NodeTable::ping(NodeEntry const& _nodeEntry, boost::optional<NodeID> const&
if (_ec || m_timers.isStopped())
return;

if (contains(m_sentPings, _nodeEntry.id))
Contributor


Should we maybe log something here?

if (contains(m_sentPings, _nodeEntry.id))
return;

NodeIPEndpoint src;
Contributor


(Nit) Can be combined onto one line?

PingNode p(src, _nodeEntry.endpoint);
p.ts = nextRequestExpirationTime();
auto const pingHash = p.sign(m_secret);
LOG(m_logger) << p.typeName() << " to " << _nodeEntry.id << "@" << p.destination;
Contributor


Can we just log the _nodeEntry here to get both the ID and the destination endpoint?

@gumb0
Member Author

gumb0 commented Feb 14, 2019

It turned out to be quite a lot of change, but I think it hasn't got any more complicated, at least.

@gumb0 gumb0 changed the title from "Fix multiple Ping" to "Fix multiple Ping and assertion failure in Discovery" on Feb 14, 2019
Don't put not-yet-validated nodes there, nor the ones that don't fit into the bucket and are replacement nodes for evicted ones. Replacement nodes are kept only in the m_sentPings items.
@gumb0 gumb0 force-pushed the you-only-ping-once branch from 3ef9cf6 to 0daaaa5 on February 14, 2019 17:34
@gumb0
Member Author

gumb0 commented Feb 14, 2019

Rebased and addressed @halfalicious's comments

@@ -126,14 +126,17 @@ class NodeTable : UDPSocketEvents
/// Called by implementation which provided handler to process NodeEntryAdded/NodeEntryDropped events. Events are coalesced by type whereby old events are ignored.
void processEvents();

/// Add node to the list of all nodes and ping it to trigger the endpoint proof.
/// Starts async node adding tot the node table by pinging it to trigger the endpoint proof.
Contributor


Typo in the comment ("tot"), and I think it could be rephrased a little bit, e.g. Starts async add of node to the node table...

///
/// @return True if the node has been added.
/// @return True if the node id valid.
Contributor


Grammar: “If the node is valid”

@halfalicious
Contributor

halfalicious commented Feb 14, 2019

What were the cases where we would send double pings? And why would we hit the assertion?

@gumb0
Member Author

gumb0 commented Feb 15, 2019

Cases where we sent double pings:

  • We receive a Neighbours packet that has several nodes which should go into the same bucket. The bucket is full, so when trying to add the first new node we start eviction by pinging the oldest node of the bucket. Before the Pong arrives, we add the subsequent new nodes, and this leads to pinging the oldest node again.
    (after the fix, the attempts to add the subsequent new nodes are ignored)
  • Sometimes we've already scheduled a ping (with m_timers.schedule(0, ...)); if another request to add the same node comes immediately after, we schedule another ping (i.e. we don't look into io_service to check whether we already plan to ping soon). I believe this can still sometimes happen after these fixes, but it should be less frequent after the split into ping/schedulePing (regular ping now doesn't do any scheduling with m_timers)

The assertion failure:
The reason was m_allNodes and m_sentPings getting out of sync, i.e. m_sentPings (in the Pong handler) contained the ID of some node that had already been erased from m_allNodes.
I think this could happen when we ping a node that is already considered a "replacement node" (one that we already tried to add to the bucket, but the bucket was full, so we pinged the oldest node instead), but then the Pong from the oldest node arrives and we erase the replacement node from m_allNodes before the Pong of the replacement node arrives.
The fix is simply to not put replacement nodes into m_allNodes at all (see the sketch below).
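
Roughly, the Pong path then looks like this (a simplified sketch; handlePong, nodeEntryFor, and the fields of the m_sentPings entry are assumed names here, not the exact implementation):

```cpp
// Sketch of the Pong handling after this change. The replacement node lives
// only inside the m_sentPings entry, so nothing has to be cleaned up in
// m_allNodes when the old node answers in time.
void NodeTable::handlePong(Pong const& _pong, bi::udp::endpoint const& _from)
{
    auto const it = m_sentPings.find(_pong.sourceid);
    if (it == m_sentPings.end())
        return;  // unsolicited Pong, ignore

    // Endpoint proof succeeded: only now does the node go into the bucket
    // and into m_allNodes, via noteActiveNode.
    noteActiveNode(nodeEntryFor(_pong.sourceid, it->second.endpoint), _from);

    // Erasing the entry also drops its replacementNodeEntry (if any); a
    // pending replacement never has to be removed from m_allNodes because
    // it was never added there in the first place.
    m_sentPings.erase(it);
}
```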

}

void NodeTable::noteActiveNode(Public const& _pubk, bi::udp::endpoint const& _endpoint)
void NodeTable::noteActiveNode(shared_ptr<NodeEntry> _nodeEntry, bi::udp::endpoint const& _endpoint)
Member Author


The changes in this method are only:

  • shared_ptr<NodeEntry> parameter instead of Public.
  • insert into m_allNodes together with inserting into the bucket (see the sketch below)
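
In outline (a compressed sketch of the shape described above; bucketFor is a hypothetical helper, and the endpoint refresh, endpoint checks, and the bucket-full/eviction path are left out):

```cpp
// Sketch only -- eviction, distance bookkeeping and endpoint checks omitted.
void NodeTable::noteActiveNode(shared_ptr<NodeEntry> _nodeEntry,
    bi::udp::endpoint const& _endpoint)
{
    // _endpoint would be used to refresh the node's observed UDP address.

    // The node enters m_allNodes at the same moment it enters its bucket,
    // so the two containers can no longer drift out of sync.
    bucketFor(_nodeEntry).nodes.push_back(_nodeEntry);
    m_allNodes.insert({_nodeEntry->id, _nodeEntry});
}
```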

@gumb0
Member Author

gumb0 commented Feb 15, 2019

Tests seem to be fixed now; I think it's ready for review.

@gumb0 gumb0 removed the in progress label Feb 15, 2019
@gumb0 gumb0 requested a review from halfalicious February 15, 2019 13:18
@gumb0 gumb0 requested a review from chfast February 15, 2019 13:18
// Don't send Ping if one is already sent
if (contains(m_sentPings, _node.id))
{
LOG(m_logger) << "Ignoring request to ping " << _node << ", because it's already pinged";
Member Author


This is being logged quite a lot, I should say.

Contributor

@halfalicious halfalicious Feb 15, 2019


I think we benefit from more logging since it makes it easier to track what's going on and detect bugs, but wading through the log spew can be challenging. It would be nice to eventually have some additional log levels so we could have more targeted logging...in the meantime I think we should keep/add logs unless they significantly impact the readability of the Aleth logs.

Member


Separating logs into INFO and DEBUG levels would be good. In INFO only actual changes (node added, node replaced by) and/or stats (X new nodes discovered).
The rest should go to DEBUG. Later we can consider splitting DEBUG into DEBUG and TRACE.

For aleth-bootnode the networking INFO level should be enabled by default.
For aleth all networking logs should be disabled.

Contributor


For aleth-bootnode the networking INFO level should be enabled by default.

Agreed, filed #5499 to track this

entry = createNodeEntry(_node, 0, 0);
else
entry = it->second;
needToPing = (it == m_allNodes.end() || !it->second->hasValidEndpointProof());
Contributor


Should we also check m_sentPings? What if we sent a ping recently but just haven't received a response yet?

Member Author


yeah good point, it would improve things, but unfortunately it's not easy to do it here, because m_sentPings is accessed only from the network thread...

Member Author


We could either make addNode asynchronous; or we could create another private method for adding nodes from the packet handlers (but it would be very similar to addNode); or maybe we could check in ping(), before sending the Ping, whether the node has already happened to be validated, and then skip the additional Ping.

All of this seems quite ugly to me for now; I'd suggest first observing how often we get double pings because of this problem, and then deciding whether it makes sense to complicate the code...
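
For illustration, the third option could look roughly like this (hypothetical, not part of this PR; it reuses the hasValidEndpointProof check that addNode already does, but on the network thread inside ping):

```cpp
// Hypothetical extra guard at the top of ping(), on the network thread.
void NodeTable::ping(NodeEntry const& _nodeEntry)
{
    // If the node got validated between the addNode call and this scheduled
    // ping running, there is no endpoint proof left to trigger.
    auto const it = m_allNodes.find(_nodeEntry.id);
    if (it != m_allNodes.end() && it->second->hasValidEndpointProof())
        return;

    if (contains(m_sentPings, _nodeEntry.id))
        return;

    // ... build, sign and send the PingNode as before ...
}
```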

@halfalicious
Contributor

halfalicious commented Feb 16, 2019

* Sometimes we've already scheduled a ping (with `m_timers.schedule(0, ...)`); if another request to add the same node comes immediately after, we schedule another ping (i.e. we don't look into `io_service` to check whether we already plan to ping soon). I believe this can still sometimes happen after these fixes, but it should be less frequent after the split into `ping/schedulePing` (regular `ping` now doesn't do any scheduling with `m_timers`)

Can we address this by adding the ping information in schedulePing() before we actually schedule the ping? One danger of doing this is that the ping might not actually execute until after the UDP datagram time to live (1 minute), which would mean the ping is already considered expired when it gets sent out over the wire. I don't know much about Boost deadline timers, but I don't think this can realistically happen, even if there are a lot of deadline timers expiring around the same time (since you'd have to execute a lot of handlers to consume 1 minute of wall-clock time, and I don't think we create that many deadline timers).

Another possible way to address this would be via something like m_queuedPings, and we could check both this and m_sentPings before deciding if we should queue a new ping.
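
A rough sketch of that idea (m_queuedPings is hypothetical; it would need the same thread-safety care as m_sentPings, which is part of what makes this non-trivial):

```cpp
// Hypothetical: track pings that are scheduled but not yet sent, so a second
// request for the same node doesn't schedule a duplicate timer callback.
void NodeTable::schedulePing(NodeEntry const& _nodeEntry)
{
    if (contains(m_queuedPings, _nodeEntry.id) || contains(m_sentPings, _nodeEntry.id))
        return;  // already queued or already on the wire

    m_queuedPings.insert(_nodeEntry.id);
    m_timers.schedule(0, [this, _nodeEntry](boost::system::error_code const& _ec) {
        m_queuedPings.erase(_nodeEntry.id);
        if (_ec || m_timers.isStopped())
            return;
        ping(_nodeEntry);  // the node moves into m_sentPings here
    });
}
```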

ping(*entry);
if (needToPing)
{
LOG(m_logger) << "Pending " << _node;
Member


This log and "Adding node" are redundant. How about we log either "Pending" or "Added"?


LOG(m_logger) << "Skipping making node with unallowed endpoint active. Node " << _pubk
<< "@" << _endpoint;
LOG(m_logger) << "Skipping making node with unallowed endpoint active. Node "
<< _nodeEntry->id << "@" << _endpoint;
Member


You can log _nodeEntry here.


LOG(m_logger) << "Active node " << _nodeEntry->id << '@' << _endpoint;
Member


You can log _nodeEntry here.

@chfast chfast merged commit b09978b into master Feb 19, 2019
@chfast chfast deleted the you-only-ping-once branch February 19, 2019 11:36
@halfalicious
Contributor

* Sometimes we've already scheduled a ping (with `m_timers.schedule(0, ...)`); if another request to add the same node comes immediately after, we schedule another ping (i.e. we don't look into `io_service` to check whether we already plan to ping soon). I believe this can still sometimes happen after these fixes, but it should be less frequent after the split into `ping/schedulePing` (regular `ping` now doesn't do any scheduling with `m_timers`)

Can we address this by adding the ping information in schedulePing() before we actually schedule the ping? One danger of doing this is that the ping might not actually execute until after the UDP datagram time to live (1 minute), which would mean the ping is already considered expired when it gets sent out over the wire. I don't know much about Boost deadline timers, but I don't think this can realistically happen, even if there are a lot of deadline timers expiring around the same time (since you'd have to execute a lot of handlers to consume 1 minute of wall-clock time, and I don't think we create that many deadline timers).

Another possible way to address this would be via something like m_queuedPings, and we could check both this and m_sentPings before deciding if we should queue a new ping.

Filed #5500 to track this
