This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

Multiple improvements to discovery ping handling #8771

Merged

5chdn merged 8 commits into openethereum:master from jimpo's discovery-tracking branch on Jul 11, 2018

Conversation

jimpo
Contributor

@jimpo jimpo commented Jun 3, 2018

This addresses some of the problems listed in #8757. The PR is rather large, so it is best reviewed commit by commit and could be split into multiple PRs if reviewers would like.

There are a few behavior changes:

  • Nodes are only added to the k-buckets after responding to a PING
  • The echo hash on PONG messages is verified against the request packet
  • FIND_NODE request timeouts are handled similarly to PING timeouts
  • Selection of the last seen node in a k-bucket is improved
  • A node is allowed 5 consecutive request timeouts before being evicted from a k-bucket

There are more improvements to come (the k-buckets are still not limited to k entries), but I think this is a step in the right direction.
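As a rough sketch of the first two points above, in-flight pings can be tracked per node, keyed by node ID, storing the hash of the sent packet and a deadline. The names and the timeout value below are illustrative, not necessarily the ones used in this PR:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Illustrative aliases; the real code uses the crate's NodeId and H256 types.
type NodeId = [u8; 64];
type PacketHash = [u8; 32];

// Hypothetical timeout; the actual constant lives in the discovery module.
const PING_TIMEOUT: Duration = Duration::from_millis(300);

struct PendingPing {
    echo_hash: PacketHash, // hash of the PING we sent, expected back in the PONG
    deadline: Instant,     // after this point the request counts as timed out
}

#[derive(Default)]
struct Discovery {
    in_flight_pings: HashMap<NodeId, PendingPing>,
}

impl Discovery {
    /// Record a sent PING; the node is only bucketed once a matching PONG arrives.
    fn on_ping_sent(&mut self, to: NodeId, echo_hash: PacketHash) {
        let pending = PendingPing {
            echo_hash,
            deadline: Instant::now() + PING_TIMEOUT,
        };
        self.in_flight_pings.insert(to, pending);
    }

    /// Check a PONG against the in-flight PING; returns true if the sender
    /// may now be added to its k-bucket.
    fn on_pong(&mut self, from: &NodeId, echoed: &PacketHash) -> bool {
        // A PONG only counts if we have a PING in flight to this node and the
        // echoed hash matches the packet we actually sent.
        let matches = self
            .in_flight_pings
            .get(from)
            .map_or(false, |pending| pending.echo_hash == *echoed);
        if matches {
            self.in_flight_pings.remove(from);
        }
        matches
    }
}
```

A PONG whose echo hash does not match the recorded one is simply ignored, so only nodes that actually answered our PING make it into a k-bucket.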

@parity-cla-bot

It looks like @jimpo signed our Contributor License Agreement. 👍

Many thanks,

Parity Technologies CLA Bot

@debris debris added A0-pleasereview 🤓 Pull request needs code review. M4-core ⛓ Core client code / Rust. labels Jun 4, 2018
@debris debris requested review from twittner and tomaka June 4, 2018 08:23
@5chdn 5chdn added this to the 1.12 milestone Jun 4, 2018
@jimpo jimpo force-pushed the discovery-tracking branch 2 times, most recently from 1230ef0 to bd208f3 Compare June 5, 2018 16:46
Contributor

@twittner twittner left a comment


Thanks! The changes look good to me.

send_queue: VecDeque<Datagramm>,
check_timestamps: bool,
adding_nodes: Vec<NodeEntry>,
ip_filter: IpFilter,
request_backoff: Vec<Duration>,
Contributor

What is the motivation for NodeBucket::request_backoff? Why not declare REQUEST_BACKOFF as a [Duration; 4]?

Contributor Author

It's on Discovery, not NodeBucket. And the motivation is just injecting dependencies explicitly. It's nice because the backoff can be tweaked in tests and stuff (which I make use of).

Contributor

It's on Discovery, not NodeBucket.

Of course, sorry. Typo.

And the motivation is just injecting dependencies explicitly.

I was just thinking that since Discovery::new refers to const REQUEST_BACKOFF anyway, only to map its integers to Durations, why not declare REQUEST_BACKOFF itself as a &'static [Duration] and simply assign the reference to Discovery::request_backoff. After all, client code cannot change the backoff setting, and in test modules one can still define other backoff sequences and overwrite the default. But it was just a quick thought and certainly not important.

Contributor Author

Added 364c8e1. Let me know if you prefer that or want me to revert. Shouldn't make any difference performance-wise because Discovery is only instantiated once.
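For reference, the shape discussed above would be roughly the following; the concrete durations are placeholders, not the values used in the PR:

```rust
use std::time::Duration;

// Placeholder values; the real schedule lives in the discovery module.
const REQUEST_BACKOFF: &[Duration] = &[
    Duration::from_secs(1),
    Duration::from_secs(4),
    Duration::from_secs(16),
    Duration::from_secs(64),
];

struct Discovery {
    request_backoff: &'static [Duration],
}

impl Discovery {
    fn new() -> Self {
        // Tests can still construct a Discovery with a different schedule.
        Discovery { request_backoff: REQUEST_BACKOFF }
    }
}
```

Since Discovery is only instantiated once, the choice between a slice reference and a Vec is purely about clarity, not performance.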

self.adding_nodes = nodes;
self.update_new_nodes();
pub fn add_node_list(&mut self, mut nodes: Vec<NodeEntry>) {
for node in nodes.drain(..) {
Contributor

No need to drain because nodes is moved here. for node in nodes { .. } is sufficient.

Contributor Author

Will do. I assume the drain can be removed from init_node_list as well? (which is why I did it here).

};
packet[32..(32 + 65)].clone_from_slice(&signature[..]);
let signed_hash = keccak(&packet[32..]);
packet[0..32].clone_from_slice(&signed_hash);
Contributor

I think you can use copy_from_slice.
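A minimal illustration of the suggestion; for a Copy element type such as u8 the two methods behave the same, copy_from_slice just avoids going through Clone:

```rust
fn main() {
    let signature = [0u8; 65]; // placeholder signature bytes
    let mut packet = vec![0u8; 32 + 65 + 1];

    // Both methods panic if the two slices differ in length;
    // copy_from_slice is the usual choice for Copy types like u8.
    packet[32..(32 + 65)].copy_from_slice(&signature[..]);
    assert_eq!(&packet[32..(32 + 65)], &signature[..]);
}
```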

}

/// Add a list of known nodes to the table.
pub fn init_node_list(&mut self, mut nodes: Vec<NodeEntry>) {
for n in nodes.drain(..) {
for n in nodes {
Contributor

The nodes parameter here and in add_node_list does not need to be declared as mut.

@jimpo jimpo force-pushed the discovery-tracking branch 3 times, most recently from d9a4ad9 to b8581b3 Compare June 12, 2018 02:16
@5chdn 5chdn requested a review from twittner July 2, 2018 12:06
@5chdn
Contributor

5chdn commented Jul 2, 2018

@twittner @tomaka please use the files > review > approve functionality to review a PR :)

rlp.begin_list(c.len());
for n in 0 .. c.len() {
rlp.begin_list(4);
c[n].endpoint.to_rlp(&mut rlp);
rlp.append(&c[n].id);
}
append_expiration(&mut rlp);
Contributor

This may be a stupid question but isn't this a modification of the protocol?
(because if so, what about compatibility?)

Contributor Author

@jimpo jimpo Jul 2, 2018

No, this should not affect behavior. The change is that the expiry timestamp used to be appended in send_packet and now it is appended to the RLP payload before calling that method.
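In other words, the payload still carries the same expiry field; only the place where it is appended has moved. A sketch of what appending an expiry to an RlpStream can look like, assuming a hypothetical 20-second validity window rather than whatever constant the code actually uses:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

use rlp::RlpStream;

// Hypothetical validity window for outgoing packets.
const EXPIRY: Duration = Duration::from_secs(20);

/// Append a UNIX-timestamp expiry as the last item of the payload.
fn append_expiration(rlp: &mut RlpStream) {
    let timestamp = (SystemTime::now() + EXPIRY)
        .duration_since(UNIX_EPOCH)
        .expect("system clock after 1970")
        .as_secs();
    rlp.append(&timestamp);
}
```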

fn assemble_packet(packet_id: u8, bytes: &[u8], secret: &Secret) -> Result<Bytes, Error> {
let mut packet = Bytes::with_capacity(bytes.len() + 32 + 65 + 1);
packet.extend_from_slice(&[0; 32 + 65][..]);
packet.extend_from_slice(&[packet_id][..]);
Contributor

No need to create temporaries:

packet.resize(32 + 65, 0);
packet.put_u8(packet_id);
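In context, assuming packet is the Vec<u8>-backed Bytes used above, the buffer set-up without temporary slices might look like the sketch below; push stands in for put_u8 (which would need the bytes::BufMut trait in scope), and the signing and hashing steps of assemble_packet are omitted:

```rust
fn assemble_packet_prefix(packet_id: u8, payload: &[u8]) -> Vec<u8> {
    let mut packet = Vec::with_capacity(payload.len() + 32 + 65 + 1);
    // Reserve room for the 32-byte hash and 65-byte signature up front,
    // then append the packet id and payload; no temporary arrays needed.
    packet.resize(32 + 65, 0);
    packet.push(packet_id);
    packet.extend_from_slice(payload);
    packet
}
```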

} else { false };

if remove_from_bucket {
let id_hash = keccak(&node_id);
Contributor

Why not move this block directly after entry.remove();?

if !is_expected {
debug!(target: "discovery", "Got unexpected Neighbors from {:?}", &from);
return Ok(None);
}
Contributor

I don't quite understand response_count. If we receive two responses with 10 results each, this would only consider the first one and ignore the second response? I am probably thick, but I would be grateful if you could explain the idea a bit more. Also, would datagram duplication not interfere with response_counts? (Not sure how much of a problem this would be in practice.)

Contributor Author

@jimpo jimpo Jul 6, 2018

Yeah, this is not really ideal, but I don't have a better solution. Since the NEIGHBORS packet does not have any ID or hash that can be used to associate it with a particular request, the best we can do is ensure there is at most one FIND_NODE in flight to any node at all times and associate the response based on sender. The sender is not supposed to send more than k=16 results to any query, so if they do, this code will ignore the second 10 result packet in your example. As far as packet duplication, if it is a concern, we could dedup the results based on node ID and track unique results in the response count.

I'm open to suggestions here.
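If duplicated datagrams turned out to matter in practice, the dedup idea mentioned above could be sketched roughly like this (type and field names are illustrative, not the ones in the PR):

```rust
use std::collections::HashSet;

// Illustrative alias; the real code uses the crate's NodeId type.
type NodeId = [u8; 64];

const BUCKET_SIZE: usize = 16; // k: maximum results expected per FIND_NODE query

struct PendingFindNode {
    // Node IDs already seen in NEIGHBORS responses for this request, so that
    // duplicated datagrams do not inflate the response count.
    seen: HashSet<NodeId>,
}

impl PendingFindNode {
    /// Record a batch of results; returns true once k unique nodes have
    /// arrived and the pending request can be cleared.
    fn on_results(&mut self, results: &[NodeId]) -> bool {
        for id in results {
            self.seen.insert(*id);
        }
        self.seen.len() >= BUCKET_SIZE
    }
}
```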

jimpo added 8 commits July 6, 2018 17:52

  • discovery: Refactor packet creation into its own function. This function is useful inside unit tests.
  • discovery: Only add nodes to routing table after receiving pong. Previously the discovery algorithm would add nodes to the routing table before confirming that the endpoint is participating in the protocol. This now tracks in-flight pings and adds to the routing table only after receiving a response.
  • discovery: Track expiration of pings to non-yet-in-bucket nodes. Now that we may ping nodes before adding to a k-bucket, the timeout tracking must be separate from BucketEntry.
  • discovery: Verify echo hash on pong packets. Stores packet hash with in-flight requests and matches with pong response.
  • discovery: Retry failed pings with exponential backoff. UDP packets may get dropped, so instead of immediately booting nodes that fail to respond to a ping, retry 4 times with exponential backoff.
@5chdn 5chdn requested a review from tomaka July 9, 2018 06:52
}
};
if entry.get().response_count == BUCKET_SIZE {
entry.remove();
Contributor

If a node returns fewer than BUCKET_SIZE results, its entry will not be removed here and will instead time out. This will eventually cause the node to be removed altogether, no?

Contributor Author

Yeah, ideally the packet would have an indication that it's the end of a series of NEIGHBORS packets, but since it doesn't I think this is the best strategy. From my reading, Geth has the same behavior (https://github.com/ethereum/go-ethereum/blob/master/p2p/discover/udp.go#L329) of triggering a timeout if it receives fewer than k results (and more than k would be incorrect).

I'm assuming that this won't be too much of a problem in practice because 1) you only return <16 if there's <16 nodes in the whole routing table (which I suppose is possible on a small network) 2) there are 4 retries and if any of them are a PING, it will clear the failure count. In a future PR, it would probably be good to put something in to ensure we PING nodes that have failed a few FIND_NODE requests.

Another alternative would be to keep the PendingRequest alive but not treat it as a timeout if it clears with <16 responses. On the other hand, I think it's more likely in that case that we stay connected to a high latency node that continues to give incomplete NEIGHBORS responses without ever booting them.
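A sketch of the timeout path described above: a failure counter on the bucket entry, a retry delay taken from the backoff schedule, and eviction once the schedule is exhausted (so the fifth consecutive timeout with a four-entry schedule removes the node). Names are illustrative:

```rust
use std::time::Duration;

struct BucketEntry {
    fail_count: usize, // consecutive request timeouts; reset on a valid PONG
}

/// Decide what to do when a request to a bucketed node times out.
/// `backoff` is the retry schedule (e.g. the REQUEST_BACKOFF slice discussed
/// earlier). Returns the delay before re-pinging the node, or None if the
/// node should now be evicted from its k-bucket.
fn on_request_timeout(entry: &mut BucketEntry, backoff: &[Duration]) -> Option<Duration> {
    if entry.fail_count < backoff.len() {
        let delay = backoff[entry.fail_count];
        entry.fail_count += 1;
        Some(delay)
    } else {
        None
    }
}
```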

Contributor

Another alternative would be to keep the PendingRequest alive but not treat it as a timeout if it clears with <16 responses.

This is what I had in mind, but I agree that the approach chosen here is probably better in practice.

@5chdn 5chdn added the A1-onice 🌨 Pull request is reviewed well, but should not yet be merged. label Jul 10, 2018
@5chdn 5chdn modified the milestones: 2.0, 2.1 Jul 10, 2018
@5chdn 5chdn removed the A1-onice 🌨 Pull request is reviewed well, but should not yet be merged. label Jul 11, 2018

@5chdn 5chdn added B0-patchthis A8-looksgood 🦄 Pull request is reviewed well. and removed A0-pleasereview 🤓 Pull request needs code review. labels Jul 11, 2018
@5chdn 5chdn merged commit 01f825b into openethereum:master Jul 11, 2018
5chdn pushed a commit that referenced this pull request Jul 12, 2018
* discovery: Only add nodes to routing table after receiving pong.

Previously the discovery algorithm would add nodes to the routing table
before confirming that the endpoint is participating in the protocol. This
now tracks in-flight pings and adds to the routing table only after receiving
a response.

* discovery: Refactor packet creation into its own function.

This function is useful inside unit tests.

* discovery: Additional testing for new add_node behavior.

* discovery: Track expiration of pings to non-yet-in-bucket nodes.

Now that we may ping nodes before adding to a k-bucket, the timeout tracking
must be separate from BucketEntry.

* discovery: Verify echo hash on pong packets.

Stores packet hash with in-flight requests and matches with pong response.

* discovery: Track timeouts on FIND_NODE requests.

* discovery: Retry failed pings with exponential backoff.

UDP packets may get dropped, so instead of immediately booting nodes that fail
to respond to a ping, retry 4 times with exponential backoff.

* !fixup Use slice instead of Vec for request_backoff.
@5chdn 5chdn mentioned this pull request Jul 12, 2018
5chdn added a commit that referenced this pull request Jul 17, 2018
* parity-version: betalize 2.0

* Multiple improvements to discovery ping handling (#8771)

* discovery: Only add nodes to routing table after receiving pong.

Previously the discovery algorithm would add nodes to the routing table
before confirming that the endpoint is participating in the protocol. This
now tracks in-flight pings and adds to the routing table only after receiving
a response.

* discovery: Refactor packet creation into its own function.

This function is useful inside unit tests.

* discovery: Additional testing for new add_node behavior.

* discovery: Track expiration of pings to non-yet-in-bucket nodes.

Now that we may ping nodes before adding to a k-bucket, the timeout tracking
must be separate from BucketEntry.

* discovery: Verify echo hash on pong packets.

Stores packet hash with in-flight requests and matches with pong response.

* discovery: Track timeouts on FIND_NODE requests.

* discovery: Retry failed pings with exponential backoff.

UDP packets may get dropped, so instead of immediately booting nodes that fail
to respond to a ping, retry 4 times with exponential backoff.

* !fixup Use slice instead of Vec for request_backoff.

* Add separate database directory for light client (#8927) (#9064)

* Add separate default DB path for light client (#8927)

* Improve readability

* Revert "Replace `std::env::home_dir` with `dirs::home_dir` (#9077)" (#9097)

* Revert "Replace `std::env::home_dir` with `dirs::home_dir` (#9077)"

This reverts commit 7e77932.

* Restore some of the changes

* Update parity-common

* Offload cull to IoWorker. (#9099)

* Fix work-notify. (#9104)

* Update hidapi, fixes #7542 (#9108)

* docker: add cmake dependency (#9111)

* Update light client hardcoded headers (#9098)

* Insert Kovan hardcoded headers until #7690241

* Insert Kovan hardcoded headers until block 7690241

* Insert Ropsten hardcoded headers until #3612673

* Insert Mainnet hardcoded headers until block 5941249

* Make sure to produce full blocks. (#9115)

* Insert ETC (classic) hardcoded headers until block #6170625 (#9121)

* fix verification in ethcore-sync collect_blocks (#9135)

* Completely remove all dapps struct from rpc (#9107)

* Completely remove all dapps struct from rpc

* Remove unused pub use

* `evm bench` fix broken dependencies (#9134)

* `evm bench` use valid dependencies

Benchmarks of the `evm` used stale versions of a couple of crates that
this commit fixes!

* fix warnings

* Update snapcraft.yaml (#9132)
@jimpo jimpo deleted the discovery-tracking branch July 18, 2018 16:27
@jimpo jimpo mentioned this pull request Sep 12, 2018