Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Further improved availability recovery #3711

Merged
merged 20 commits into from
Aug 27, 2021
Merged

Conversation

eskimor
Copy link
Member

@eskimor eskimor commented Aug 24, 2021

  • Don't fetch more chunks than needed, except when we have a non zero error rate.
  • Properly track undead (soft cancelled) requests in order to make sure to always have enough "live" requests in flight.
  • Add metrics that should be useful

@eskimor eskimor added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Aug 24, 2021
node/primitives/src/lib.rs Outdated Show resolved Hide resolved
while let Some(request_result) =
self.requesting_chunks.next().timeout(MAX_CHUNK_WAIT).await.flatten()
self.requesting_chunks.next_with_timeout(TIMEOUT_START_NEW_REQUESTS).await
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

@ordian ordian added this to the v0.9.10 milestone Aug 27, 2021
@ordian ordian merged commit 8fab1d8 into master Aug 27, 2021
@ordian ordian deleted the rk-availabilty-recovery-v2 branch August 27, 2021 16:59
ordian added a commit that referenced this pull request Aug 27, 2021
* master:
  Further improved availability recovery (#3711)
  node/service: Update finality target to fix disputes tests (#3732)
// 2. We request more chunks to make up for it
// 3. Bandwidth is spread out even more, so we get even more timeouts
// 4. We request more chunks to make up for it ...
let max_requests_boundary = std::cmp::min(N_PARALLEL, threshold);
Copy link
Contributor

@rphmeier rphmeier Aug 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side note: we should revisit N_PARALLEL. Perhaps this is something the lower-level networking code should care about and not the higher-level.

registry,
)?,
time_chunk_request: prometheus::register(
prometheus::Histogram::with_opts(prometheus::HistogramOpts::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the default buckets definitely good enough for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - I will check!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants