Add RetryPolicy on VirtualHost for upstream transient connection issues #1267

norbjd · 2024-06-23T16:21:36Z

Changes

🎁 Add RetryPolicy on VirtualHost for upstream transient connection issues

/kind enhancement

For context, we are operating kourier "at scale":

thousands of knative services
hundreds of requests/second
a lot of activity with new ksvc created, updated, deleted continuously

In the gateway access logs, we sometimes see clients requests ending up in 503. These 503 are always mostly accompanied by UC response flag, meaning (from the docs):

Upstream connection termination in addition to 503 response code.

Connections can be terminated for many reasons, but most of the time, these are linked to transient TCP failures (connect timeout, reset, disconnect, etc.). As of now, the only way to deal with these 503 UC errors is to retry on the caller side, which is not really convenient - and not always possible - for callers.

In order to increase robustness and handle these specific transient cases, Envoy allows setting retry policies at VirtualHost level and/or Route level. By default, these retry policies are not configured by Kourier, so this is why Envoy throws directly 503s to the caller, sadly.

This PR configures the RetryPolicy at the VirtualHost level (because every VH has multiple routes, it would be cumbersome to define it on every route). The conditions to retry, as explained in Envoy docs, covers all transient upstream connection failures:

reset: Envoy will attempt a retry if the upstream server does not respond at all (disconnect/reset/read timeout.)

connect-failure: Envoy will attempt a retry if a request is failed because of a connection failure to the upstream server (connect timeout, etc.). [...] NOTE: A connection failure/timeout is a the TCP level, not the request level

Note 1: BTW, this is also what istio seems to do by default: https://istio.io/latest/docs/concepts/traffic-management/#retries, https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPRetry, but I'm not really familiar with it, so I can just trust the docs and what I find on the web.

Note 2: It might be tempting to retry on every upstream errors (e.g. 5xx), but it is probably not a good idea, as "real" 5xx sent by the users applications (in the ksvc) might not be retriable and can cause more harm if retried. Here, we will just focus on TCP connection errors.

Regarding the changes made on the PR itself: I'm pretty sure setting the retry policy for every VirtualHost can't be harmful; but, as I don't know your opinion on this, I've hidden it behind an option (WithRetryOnTransientUpstreamFailure(), using option pattern).

For now, the option is always on (pkg/generator/ingress_translator.go), but if you prefer, I can easily make it configurable through Kourier configmap (e.g. if retry-on-upstream-transient-failures: true in the config, call the option; otherwise, bypass it). When adding this option, I have also changed NewVirtualHostWithExtAuthz, because I didn't want to add ifs everywhere, and managing this with options is far more convenient.

So, from there, there are 2 paths I can take:

configure the retry by default: in that case, I can completely remove the "options" pattern, and just configure the RetryPolicy in NewVirtualHost method (and NewVirtualHostWithExtAuthz, by extension)
make the retry configurable by the user: in that case, I will keep the "options" pattern, but I will need to change translateIngress signature (and all methods calling it) to include the WithRetryOnTransientUpstreamFailure() option only if the user have opted-in in kourier config

Solution 1 is easier to implement but does not guarantee side-effects (just adding retries in case of TCP connection issues should be fine though...), while solution 2 allows to be more configurable and retry can stay disabled by default.

Tell me what you think. Thanks 🙏

Release Note

Add retry policy on every virtual host for upstream transient connection issues

Docs

N/A

knative-prow · 2024-06-23T16:21:40Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: norbjd
Once this PR has been reviewed and has the lgtm label, please assign davidhadas for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pkg/generator/ingress_translator.go

codecov · 2024-06-23T16:51:11Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 62.50%. Comparing base (593ddde) to head (c48df14).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1267      +/-   ##
==========================================
+ Coverage   62.31%   62.50%   +0.19%     
==========================================
  Files          24       24              
  Lines        1632     1635       +3     
==========================================
+ Hits         1017     1022       +5     
+ Misses        553      552       -1     
+ Partials       62       61       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

pkg/envoy/api/virtual_host.go

norbjd · 2024-06-23T17:48:36Z

/retest

norbjd · 2024-06-23T17:56:02Z

Note: I don't know how to write an integration test for this... I've tried many things to kill TCP connections, mimic a flaky network, but can't have something working consistently. I guess network issues happen when we expect them the least 😅 The closest I've found to "cut" connections is to use network policies, but alas it's not supported on kind (kubernetes-sigs/kind#842) so it won't help in our case.

norbjd · 2024-06-26T14:42:18Z

Putting back in draft because I need to test thoroughly to ensure it works as expected, sorry for the ping!

norbjd · 2024-07-27T16:41:06Z

Closing, because I cannot guarantee that the retry set in this MR works properly 😕 Sorry again for the premature ping to reviewers 🙇

Add RetryPolicy on VirtualHost for transient connection issues

c7c1285

knative-prow bot added the kind/enhancement label Jun 23, 2024

knative-prow bot requested review from izabelacg and skonto June 23, 2024 16:21

knative-prow bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 23, 2024

norbjd commented Jun 23, 2024

View reviewed changes

pkg/generator/ingress_translator.go Show resolved Hide resolved

norbjd added 3 commits June 23, 2024 18:26

Fix build issues

905f86e

Fix tests build issues

cadcd14

Forgot to remove a parameter

711cb36

Lint

c48df14

norbjd commented Jun 23, 2024

View reviewed changes

pkg/envoy/api/virtual_host.go Show resolved Hide resolved

norbjd marked this pull request as draft June 26, 2024 14:41

knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 26, 2024

norbjd closed this Jul 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add RetryPolicy on VirtualHost for upstream transient connection issues #1267

Add RetryPolicy on VirtualHost for upstream transient connection issues #1267

norbjd commented Jun 23, 2024

knative-prow bot commented Jun 23, 2024

codecov bot commented Jun 23, 2024 •

edited

Loading

norbjd commented Jun 23, 2024

norbjd commented Jun 23, 2024

norbjd commented Jun 26, 2024

norbjd commented Jul 27, 2024

Add RetryPolicy on VirtualHost for upstream transient connection issues #1267

Add RetryPolicy on VirtualHost for upstream transient connection issues #1267

Conversation

norbjd commented Jun 23, 2024

Changes

knative-prow bot commented Jun 23, 2024

codecov bot commented Jun 23, 2024 • edited Loading

Codecov Report

norbjd commented Jun 23, 2024

norbjd commented Jun 23, 2024

norbjd commented Jun 26, 2024

norbjd commented Jul 27, 2024

codecov bot commented Jun 23, 2024 •

edited

Loading