router: fixing a watermark bug for streaming retries #10866

Merged 3 commits on Apr 22, 2020

@alyssawilk alyssawilk commented Apr 20, 2020

Fixes an issue where, if a retry was attempted while the upstream connection was watermark-overrun, data might spool upstream but reading from downstream would not resume. This is a preexisting design flaw that only manifests now that we have streaming retries. When the whole request has been read before the retry, the loop that unwinds pauses between requests takes care of it. With streaming retries, the retry times out rather than succeeding if the upstream buffer limit is smaller than the downstream buffer limit and the upstream buffer backs up due to upstream slowness.

Risk Level: Medium (watermarks)
Testing: new unit test
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
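
The failure mode described above can be sketched with a toy model. This is only an illustration under stated assumptions, not the code in this PR: `DownstreamReader`, `UpstreamAttempt`, and `readsEnabledAfterRetry` are hypothetical names, and the sketch assumes that each upstream attempt counts the pauses it has imposed on the downstream and unwinds them when it is torn down for a retry.

```cpp
#include <cstddef>

// Toy model only: these names are hypothetical and are not Envoy's classes.
// The downstream side resumes reading only when every pause has been undone.
class DownstreamReader {
public:
  void pause() { ++pause_count_; }
  void resume() {
    if (pause_count_ > 0) {
      --pause_count_;
    }
  }
  bool readingEnabled() const { return pause_count_ == 0; }

private:
  std::size_t pause_count_ = 0;
};

// One upstream attempt. It pauses the downstream reader when its buffer
// crosses the high watermark and resumes it when the buffer drains.
class UpstreamAttempt {
public:
  explicit UpstreamAttempt(DownstreamReader& reader) : reader_(reader) {}

  // Unwind any pauses this attempt still holds, so that a retry replacing it
  // does not inherit a downstream that stays paused forever.
  ~UpstreamAttempt() {
    while (outstanding_pauses_ > 0) {
      onBelowLowWatermark();
    }
  }

  void onAboveHighWatermark() {
    reader_.pause();
    ++outstanding_pauses_;
  }

  void onBelowLowWatermark() {
    if (outstanding_pauses_ > 0) {
      reader_.resume();
      --outstanding_pauses_;
    }
  }

private:
  DownstreamReader& reader_;
  std::size_t outstanding_pauses_ = 0;
};

// Simulates a first attempt that backs up and is abandoned for a retry, then
// reports whether downstream reads are enabled once the retry is in place.
bool readsEnabledAfterRetry() {
  DownstreamReader reader;
  {
    UpstreamAttempt first(reader);
    first.onAboveHighWatermark(); // upstream buffer overrun, downstream paused
  } // first attempt destroyed here, as if it had been reset for a retry
  UpstreamAttempt retry(reader);
  return reader.readingEnabled();
}
```

Without the unwinding loop in the destructor, the pause taken by the first attempt would never be released, which corresponds to the "reading from downstream would not resume" symptom.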
ggreenway previously approved these changes Apr 20, 2020
Buffer::OwnedImpl data(std::string(640000, 'a'));
codec_client_->sendData(encoder_decoder.first, data, false);
if (paused(test_server_)) {
  usleep(5000);
Contributor

Instead of sleeping, are there any metrics we can watch (rx bytes on downstream, tx bytes on upstream maybe?) to detect when the desired state is achieved?

Contributor Author

Unfortunately I think it's a similar race there. I think if I stopped doing this for both H1 and H2 I could probably get something hard-coded that worked at least on a given platform. Given this is testing in-Envoy retry logic and is pretty protocol agnostic I think that'd be sufficient for an integration test. Would you prefer that?

Contributor

I'm always a fan of not using sleep in tests whenever possible. If it looks viable, and it'll be less racy than sleep, I'd prefer that. If it ends up being just as racy, don't bother.
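
For reference, the alternative to a fixed sleep discussed in this thread is usually a bounded poll on some observable condition. The following is a minimal sketch of that pattern, not an Envoy API: `waitForCondition` is a hypothetical helper, and the condition passed to it stands in for whatever metric check the test would use. As noted in the thread, it only removes the race if the observed metric actually reflects the desired state.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical test helper (not an Envoy API): poll `condition` until it
// returns true or `timeout` elapses. Returns whether the condition was met.
bool waitForCondition(const std::function<bool()>& condition,
                      std::chrono::milliseconds timeout,
                      std::chrono::milliseconds poll_interval = std::chrono::milliseconds(1)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!condition()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false; // gave up: the desired state was never observed
    }
    std::this_thread::sleep_for(poll_interval);
  }
  return true;
}
```

A test would call it with a lambda that reads the relevant counter and fail the test if it returns false, so a slow machine waits longer instead of silently racing.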

@ggreenway
Contributor

Also, you have a real failure in the coverage build.

@ggreenway
Contributor

/wait

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@alyssawilk
Contributor Author

Yeah, after spending over an hour I can't get a non-flaky version that's not terrible. I'll just stick with the unit test :-/

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@alyssawilk alyssawilk merged commit 38fb390 into envoyproxy:master Apr 22, 2020
penguingao pushed a commit to penguingao/envoy that referenced this pull request Apr 22, 2020
Signed-off-by: pengg <pengg@google.com>
spenceral added a commit to spenceral/envoy that referenced this pull request Apr 23, 2020
Signed-off-by: Spencer Lewis <slewis@squareup.com>
* master: (46 commits)
  allow specifying the API version of bootstrap from the command line (envoyproxy#10803)
  config: adding connect matcher (unused) (envoyproxy#10894)
  Add missing dependency on `assert.h` (envoyproxy#10918)
  Lower heap and disk space used by kafka tests (envoyproxy#10915)
  [tools] handle commits merged without PR in deprecated script (envoyproxy#10723)
  tools: including working tree in modified_since_last_github_commit.sh diff. (envoyproxy#10911)
  rocketmq_proxy: implement rocketmq proxy
  [docs] PR template to include commit message (envoyproxy#10900)
  docs: breaking long word to stop content overflow. (envoyproxy#10880)
  Delete legacy connection pool code. (envoyproxy#10881)
  wasm: clarify how configuration is passed (envoyproxy#10782)
  issue template: clarify security/crash reporting (envoyproxy#10885)
  api/faq: add entry on incremental xDS. (envoyproxy#10876)
  router: retry overloaded requests (envoyproxy#10847)
  Remove inclusion of pthread.h, not needed for linux compilation (envoyproxy#10895)
  request_id: Add option to always set request id in response (envoyproxy#10808)
  xray: Use correct types for segment document output (envoyproxy#10834)
  router: fixing a watermark bug for streaming retries (envoyproxy#10866)
  http: auditing Path() calls for safety with Pathless CONNECT (envoyproxy#10851)
  Remove hardcoded type urls Part.2 (envoyproxy#10848)
  ...
@alyssawilk alyssawilk deleted the retry_with_overrun branch August 27, 2020 16:33
3 participants