Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix poll_ready usage for the Inbound service #1620

Merged
merged 15 commits into from
Feb 2, 2021

Conversation

teor2345
Copy link
Contributor

@teor2345 teor2345 commented Jan 22, 2021

Motivation

The Inbound service misuses poll_ready, filling up buffers, and potentially causing hangs.

See #1593 for a general explanation.

Solution

Make sure each poll_ready is followed by a call:

The code in this pull request has:

  • Documentation Comments
  • Manual Testing

Review

@yaahc and @oxarbitrage might want to look at this PR to see some poll_ready example code.

Related Issues

Solves some of the hangs from #1435

Follow Up Work

Do the rest of #1593

@teor2345 teor2345 added C-bug Category: This is a bug A-rust Area: Updates to Rust code C-cleanup Category: This is a cleanup labels Jan 22, 2021
@teor2345 teor2345 added this to the 2021 Sprint 1 milestone Jan 22, 2021
@teor2345 teor2345 requested review from yaahc and oxarbitrage January 22, 2021 13:14
@teor2345 teor2345 self-assigned this Jan 22, 2021
Copy link
Contributor

@yaahc yaahc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. I left one suggestion that I'm not sure will work but that I think is worth looking into.

zebrad/src/components/inbound.rs Outdated Show resolved Hide resolved
zebra-state/src/service.rs Outdated Show resolved Hide resolved
Uses the `ServiceExt::oneshot` design pattern from ZcashFoundation#1593.
Use `ServiceExt::oneshot` to perform state requests.

Explain that `ServiceExt::call_all` calls `poll_ready` internally.
Document a state service invariant imposed by `ServiceExt::call_all`.
ServiceExt::call_all leaks Tower::Buffer reservations, so we can't use
it in Zebra.

Instead, use a loop in the returned future.

See ZcashFoundation#1593 for details.
And log an info-level message as a diagnostic, in case setup takes a
long time.
This change encodes a bunch of invariants in the type system,
and adds explicit failure states for:
* a closed oneshot,
* bugs in the initialization code.
@mpguerra mpguerra linked an issue Jan 25, 2021 that may be closed by this pull request
1 task
@teor2345 teor2345 requested a review from yaahc January 29, 2021 03:10
@teor2345
Copy link
Contributor Author

@yaahc @oxarbitrage I think we're pretty close to the final version of this fix

@teor2345
Copy link
Contributor Author

The windows failure is an unrelated linker error #1651

@mpguerra mpguerra modified the milestones: 2021 Sprint 1, 2021 Sprint 2 Jan 29, 2021
@teor2345
Copy link
Contributor Author

teor2345 commented Feb 1, 2021

I'm re-running CI to test that windows passes after the latest fixes.

@oxarbitrage
Copy link
Contributor

You need to rebase the PR this the windows test to pass.

@teor2345
Copy link
Contributor Author

teor2345 commented Feb 1, 2021

You need to rebase the PR this the windows test to pass.

GitHub runs CI on a merge commit with the latest main branch, but sometimes it only makes a new merge when you push new commits.

@teor2345
Copy link
Contributor Author

teor2345 commented Feb 2, 2021

The macOS testnet failure is a known issue.

Copy link
Contributor Author

@teor2345 teor2345 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's what we need to revert from this PR, based on the latest Buffer/Batch constraints in #1593.

Comment on lines +248 to +251
// Correctness:
//
// We can't use `call_all` here, because it leaks buffer slots:
// https://github.com/tower-rs/tower/blob/master/tower/src/util/call_all/common.rs#L112
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is true, but for slightly different reasons:

  • call_all can hold one buffer slot per clone for a long time
  • we're not sure when the returned future will complete
  • we don't limit how many returned futures can be concurrently running

So we still can't use call_all.

zebrad/src/components/inbound.rs Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-rust Area: Updates to Rust code C-bug Category: This is a bug C-cleanup Category: This is a cleanup
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants