Add RetryPolicy on VirtualHost for upstream transient connection issues #1267
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1267      +/-   ##
==========================================
+ Coverage   62.31%   62.50%   +0.19%
==========================================
  Files          24       24
  Lines        1632     1635       +3
==========================================
+ Hits         1017     1022       +5
+ Misses        553      552       -1
+ Partials       62       61       -1
/retest
Note: I don't know how to write an integration test for this. I've tried many ways to kill TCP connections and mimic a flaky network, but couldn't get anything to work consistently. I guess network issues happen when we expect them the least 😅 The closest I've found to "cutting" connections is to use network policies, but alas it's not supported on
Putting this back in draft because I need to test it thoroughly to ensure it works as expected, sorry for the ping!
Closing, because I cannot guarantee that the retry set in this PR works properly 😕 Sorry again for the premature ping to reviewers 🙇
Changes
/kind enhancement
For context, we are operating kourier "at scale":
In the gateway access logs, we sometimes see client requests ending in a 503. These 503s are almost always accompanied by the `UC` response flag, which the Envoy docs describe as an upstream connection termination in addition to the 503 response code.
Connections can be terminated for many reasons, but most of the time the cause is a transient TCP failure (connect timeout, reset, disconnect, etc.). As of now, the only way to deal with these 503 UC errors is to retry on the caller side, which is not really convenient, and not always possible, for callers.
In order to increase robustness and handle these specific transient cases, Envoy allows setting retry policies at the `VirtualHost` level and/or the `Route` level. By default, Kourier does not configure these retry policies, which is why Envoy returns the 503 directly to the caller.
This PR configures the `RetryPolicy` at the `VirtualHost` level (because every VH has multiple routes, it would be cumbersome to define it on every route). The conditions to retry, as explained in the Envoy docs, cover all transient upstream connection failures:
Note 1: BTW, this is also what Istio seems to do by default: https://istio.io/latest/docs/concepts/traffic-management/#retries, https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPRetry, but I'm not really familiar with it, so I can only trust the docs and what I find on the web.
Note 2: It might be tempting to retry on every upstream error (e.g. 5xx), but it is probably not a good idea: "real" 5xx responses sent by the users' applications (in the ksvc) might not be retriable and could cause more harm if retried. Here, we just focus on TCP connection errors.
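To make the distinction in Note 2 concrete, here is a minimal Go sketch. It is not the code from this PR: `RetryPolicy` is a simplified stand-in for Envoy's route retry policy message, and the condition list (`reset`, `connect-failure`) and the retry count of 2 are assumptions chosen for illustration. The helper only checks that a policy stays limited to connection-level conditions and does not include conditions that would also retry application-generated responses.

```go
package main

import (
	"fmt"
	"strings"
)

// RetryPolicy is a simplified stand-in for Envoy's route retry policy;
// only the two fields relevant to this discussion are modeled.
type RetryPolicy struct {
	RetryOn    string // comma-separated Envoy retry conditions
	NumRetries uint32
}

// transientRetryPolicy returns a policy limited to connection-level failures.
// The conditions and the retry count are assumed values, not taken from the PR.
func transientRetryPolicy() RetryPolicy {
	conditions := []string{"reset", "connect-failure"}
	return RetryPolicy{RetryOn: strings.Join(conditions, ","), NumRetries: 2}
}

// retriesApplicationErrors reports whether the policy contains a condition
// that would also retry responses produced by the application itself.
func retriesApplicationErrors(p RetryPolicy) bool {
	for _, c := range strings.Split(p.RetryOn, ",") {
		switch strings.TrimSpace(c) {
		case "5xx", "gateway-error", "retriable-status-codes":
			return true
		}
	}
	return false
}

func main() {
	p := transientRetryPolicy()
	fmt.Printf("retry_on=%s num_retries=%d\n", p.RetryOn, p.NumRetries)
	fmt.Println("retries application errors:", retriesApplicationErrors(p))
}
```

A check like `retriesApplicationErrors` could guard against someone later widening the policy to `5xx` without noticing the consequence described above.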
Regarding the changes made in the PR itself: I'm pretty sure setting the retry policy for every `VirtualHost` can't be harmful; but, as I don't know your opinion on this, I've hidden it behind an option (`WithRetryOnTransientUpstreamFailure()`, using the option pattern).
For now, the option is always on (`pkg/generator/ingress_translator.go`), but if you prefer, I can easily make it configurable through the Kourier configmap (e.g. if `retry-on-upstream-transient-failures: true` is in the config, call the option; otherwise, bypass it). When adding this option, I also changed `NewVirtualHostWithExtAuthz`, because I didn't want to add `if`s everywhere, and managing this with options is far more convenient.
So, from there, there are 2 paths I can take:
1. Always set the `RetryPolicy` in the `NewVirtualHost` method (and `NewVirtualHostWithExtAuthz`, by extension)
2. Change the `translateIngress` signature (and all methods calling it) to include the `WithRetryOnTransientUpstreamFailure()` option only if the user has opted in through the Kourier config
Solution 1 is easier to implement but does not guarantee the absence of side effects (just adding retries in case of TCP connection issues should be fine though...), while solution 2 is more configurable and lets retries stay disabled by default.
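For readers unfamiliar with the option pattern, the two paths can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `VirtualHost` and `RetryPolicy` are simplified stand-ins for the Envoy types, the `NewVirtualHost` signature is reduced to the minimum, `optionsFromConfig` is a hypothetical helper, and the retry conditions and count are assumed values. Only the option name and the configmap key come from the description above.

```go
package main

import "fmt"

// RetryPolicy and VirtualHost are simplified stand-ins for the Envoy types.
type RetryPolicy struct {
	RetryOn    string
	NumRetries uint32
}

type VirtualHost struct {
	Name        string
	Domains     []string
	RetryPolicy *RetryPolicy
}

// VirtualHostOption mutates a VirtualHost during construction.
type VirtualHostOption func(*VirtualHost)

// WithRetryOnTransientUpstreamFailure attaches a retry policy restricted to
// connection-level failures. Conditions and count are assumed values.
func WithRetryOnTransientUpstreamFailure() VirtualHostOption {
	return func(vh *VirtualHost) {
		vh.RetryPolicy = &RetryPolicy{RetryOn: "reset,connect-failure", NumRetries: 2}
	}
}

// NewVirtualHost builds a virtual host and applies the options in order.
func NewVirtualHost(name string, domains []string, opts ...VirtualHostOption) *VirtualHost {
	vh := &VirtualHost{Name: name, Domains: domains}
	for _, opt := range opts {
		opt(vh)
	}
	return vh
}

// optionsFromConfig (hypothetical) illustrates path 2: the option is added
// only when the configmap key is explicitly set to "true".
func optionsFromConfig(cfg map[string]string) []VirtualHostOption {
	var opts []VirtualHostOption
	if cfg["retry-on-upstream-transient-failures"] == "true" {
		opts = append(opts, WithRetryOnTransientUpstreamFailure())
	}
	return opts
}

func main() {
	// Path 1: the option is always passed.
	always := NewVirtualHost("a", []string{"a.example.com"}, WithRetryOnTransientUpstreamFailure())
	// Path 2: opt-in through configuration; disabled when the key is absent.
	optIn := NewVirtualHost("b", []string{"b.example.com"}, optionsFromConfig(map[string]string{})...)
	fmt.Println(always.RetryPolicy != nil, optIn.RetryPolicy != nil)
}
```

With options, the constructor body stays free of conditionals: path 2 only changes which options the caller passes, which is why the signature change propagates up to `translateIngress` rather than into the virtual host builders.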
Tell me what you think. Thanks 🙏
Release Note
Docs
N/A