Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RetryPolicy on VirtualHost for upstream transient connection issues #1267

Conversation

norbjd
Copy link
Contributor

@norbjd norbjd commented Jun 23, 2024

Changes

  • 🎁 Add RetryPolicy on VirtualHost for upstream transient connection issues

/kind enhancement


For context, we are operating kourier "at scale":

  • thousands of knative services
  • hundreds of requests/second
  • a lot of activity with new ksvc created, updated, deleted continuously

In the gateway access logs, we sometimes see clients requests ending up in 503. These 503 are always mostly accompanied by UC response flag, meaning (from the docs):

Upstream connection termination in addition to 503 response code.

Connections can be terminated for many reasons, but most of the time, these are linked to transient TCP failures (connect timeout, reset, disconnect, etc.). As of now, the only way to deal with these 503 UC errors is to retry on the caller side, which is not really convenient - and not always possible - for callers.

In order to increase robustness and handle these specific transient cases, Envoy allows setting retry policies at VirtualHost level and/or Route level. By default, these retry policies are not configured by Kourier, so this is why Envoy throws directly 503s to the caller, sadly.

This PR configures the RetryPolicy at the VirtualHost level (because every VH has multiple routes, it would be cumbersome to define it on every route). The conditions to retry, as explained in Envoy docs, covers all transient upstream connection failures:

  • reset: Envoy will attempt a retry if the upstream server does not respond at all (disconnect/reset/read timeout.)
  • connect-failure: Envoy will attempt a retry if a request is failed because of a connection failure to the upstream server (connect timeout, etc.). [...] NOTE: A connection failure/timeout is a the TCP level, not the request level

Note 1: BTW, this is also what istio seems to do by default: https://istio.io/latest/docs/concepts/traffic-management/#retries, https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPRetry, but I'm not really familiar with it, so I can just trust the docs and what I find on the web.

Note 2: It might be tempting to retry on every upstream errors (e.g. 5xx), but it is probably not a good idea, as "real" 5xx sent by the users applications (in the ksvc) might not be retriable and can cause more harm if retried. Here, we will just focus on TCP connection errors.


Regarding the changes made on the PR itself: I'm pretty sure setting the retry policy for every VirtualHost can't be harmful; but, as I don't know your opinion on this, I've hidden it behind an option (WithRetryOnTransientUpstreamFailure(), using option pattern).

For now, the option is always on (pkg/generator/ingress_translator.go), but if you prefer, I can easily make it configurable through Kourier configmap (e.g. if retry-on-upstream-transient-failures: true in the config, call the option; otherwise, bypass it). When adding this option, I have also changed NewVirtualHostWithExtAuthz, because I didn't want to add ifs everywhere, and managing this with options is far more convenient.

So, from there, there are 2 paths I can take:

  1. configure the retry by default: in that case, I can completely remove the "options" pattern, and just configure the RetryPolicy in NewVirtualHost method (and NewVirtualHostWithExtAuthz, by extension)
  2. make the retry configurable by the user: in that case, I will keep the "options" pattern, but I will need to change translateIngress signature (and all methods calling it) to include the WithRetryOnTransientUpstreamFailure() option only if the user have opted-in in kourier config

Solution 1 is easier to implement but does not guarantee side-effects (just adding retries in case of TCP connection issues should be fine though...), while solution 2 allows to be more configurable and retry can stay disabled by default.

Tell me what you think. Thanks 🙏

Release Note

Add retry policy on every virtual host for upstream transient connection issues

Docs

N/A

Copy link

knative-prow bot commented Jun 23, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: norbjd
Once this PR has been reviewed and has the lgtm label, please assign davidhadas for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 23, 2024
Copy link

codecov bot commented Jun 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 62.50%. Comparing base (593ddde) to head (c48df14).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1267      +/-   ##
==========================================
+ Coverage   62.31%   62.50%   +0.19%     
==========================================
  Files          24       24              
  Lines        1632     1635       +3     
==========================================
+ Hits         1017     1022       +5     
+ Misses        553      552       -1     
+ Partials       62       61       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@norbjd
Copy link
Contributor Author

norbjd commented Jun 23, 2024

/retest

@norbjd
Copy link
Contributor Author

norbjd commented Jun 23, 2024

Note: I don't know how to write an integration test for this... I've tried many things to kill TCP connections, mimic a flaky network, but can't have something working consistently. I guess network issues happen when we expect them the least 😅 The closest I've found to "cut" connections is to use network policies, but alas it's not supported on kind (kubernetes-sigs/kind#842) so it won't help in our case.

@norbjd norbjd marked this pull request as draft June 26, 2024 14:41
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 26, 2024
@norbjd
Copy link
Contributor Author

norbjd commented Jun 26, 2024

Putting back in draft because I need to test thoroughly to ensure it works as expected, sorry for the ping!

@norbjd
Copy link
Contributor Author

norbjd commented Jul 27, 2024

Closing, because I cannot guarantee that the retry set in this MR works properly 😕 Sorry again for the premature ping to reviewers 🙇

@norbjd norbjd closed this Jul 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/enhancement size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant