Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1. Fix some address crawler timing issues #3293

Merged
merged 12 commits into from
Jan 4, 2022
Merged

Conversation

teor2345
Copy link
Contributor

@teor2345 teor2345 commented Dec 24, 2021

Motivation

Zebra's address crawler is fragile - small changes can cause significant regressions.

This PR deals with a few different sources of that instability.

Solution

  • Fix a race between the first heartbeat and getaddr requests
  • Temporarily reduce the getaddr fanout to 1, until we implement Limit fanouts to the number of active peers, Credit: Equilibrium #2214
  • Make sure finished requests are always sent as soon as they are ready
  • Simplify the Connection awaiting request state machine
  • Improve connection metrics and logging

Review

This PR is ready for review.

Let's assign a reviewer when people are back from leave.

Reviewer Checklist

  • Code implements Specs and Designs

I've tested this PR locally, and the crawler and address book metrics look a lot better. The number of peer addresses rises rapidly, and gets full after a few hours. And Zebra doesn't hang any more, even after a few days.

It's hard to test this PR, because the timing issues are subtle, and don't show up in some network environments. We should have better tests after we implement #1592.

@teor2345 teor2345 added P-Medium C-security Category: Security issues I-hang A Zebra component stops responding to requests I-slow Problems with performance or responsiveness I-invalid-data Zebra relies on invalid or untrusted data, or sends invalid data A-network Area: Network protocol updates or fixes A-diagnostics Area: Diagnosing issues or monitoring performance labels Dec 24, 2021
@teor2345 teor2345 self-assigned this Dec 24, 2021
@codecov
Copy link

codecov bot commented Dec 24, 2021

Codecov Report

Merging #3293 (84535f2) into main (9b12716) will decrease coverage by 0.02%.
The diff coverage is 8.69%.

@@            Coverage Diff             @@
##             main    #3293      +/-   ##
==========================================
- Coverage   77.25%   77.23%   -0.03%     
==========================================
  Files         265      265              
  Lines       31380    31445      +65     
==========================================
+ Hits        24244    24287      +43     
- Misses       7136     7158      +22     

@teor2345
Copy link
Contributor Author

teor2345 commented Dec 28, 2021

Bumping this up to high priority, because this fixes some occasional CI failures, and it might conflict with other PRs.

@teor2345 teor2345 added the I-integration-fail Continuous integration fails, including build and test failures label Dec 28, 2021
@teor2345
Copy link
Contributor Author

I've tested this PR locally, and the crawler and address book metrics look a lot better. The number of peer addresses rises rapidly, and gets full after a few hours. And Zebra doesn't hang any more, even after a few days.

@teor2345 teor2345 force-pushed the fix-address-crawler-timing branch from efd6bb0 to fa44edf Compare December 30, 2021 03:30
@teor2345 teor2345 enabled auto-merge (squash) January 3, 2022 07:58
@teor2345 teor2345 requested a review from dconnolly January 4, 2022 01:14
Comment on lines +991 to +993
// If the heartbeat is delayed, also delay all future heartbeats.
// (Shorter heartbeat intervals just add load, without any benefit.)
interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Delay);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines +425 to +433
Message::Block(block) => format!(
"block {{ height: {}, hash: {} }}",
block
.coinbase_height()
.as_ref()
.map(|h| h.0.to_string())
.unwrap_or_else(|| "None".into()),
block.hash(),
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines +82 to +94
// Display heights for single-block responses (which Zebra requests and expects)
Response::Blocks(blocks) if blocks.len() == 1 => {
let block = blocks.first().expect("len is 1");
format!(
"Block {{ height: {}, hash: {} }}",
block
.coinbase_height()
.as_ref()
.map(|h| h.0.to_string())
.unwrap_or_else(|| "None".into()),
block.hash(),
)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@teor2345 teor2345 merged commit 469fa6b into main Jan 4, 2022
Copy link
Contributor

@dconnolly dconnolly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@teor2345 teor2345 deleted the fix-address-crawler-timing branch January 4, 2022 23:43
@mpguerra mpguerra mentioned this pull request Jan 27, 2022
49 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-diagnostics Area: Diagnosing issues or monitoring performance A-network Area: Network protocol updates or fixes C-security Category: Security issues I-hang A Zebra component stops responding to requests I-integration-fail Continuous integration fails, including build and test failures I-invalid-data Zebra relies on invalid or untrusted data, or sends invalid data I-slow Problems with performance or responsiveness
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants