Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configurable Retries in HTTPRoute #1731

Closed
robscott opened this issue Feb 15, 2023 · 24 comments · Fixed by #3199 or #3301
Closed

Configurable Retries in HTTPRoute #1731

robscott opened this issue Feb 15, 2023 · 24 comments · Fixed by #3199 or #3301
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. kind/gep PRs related to Gateway Enhancement Proposal(GEP) priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Milestone

Comments

@robscott
Copy link
Member

What would you like to be added:
I would like to be able to configure the following in HTTPRoute:

  1. The max number of times to retry a request
  2. The reason(s) and/or status codes a request should be retried
  3. The timeout for each retry attempt

I believe all 3 of these would be implementable for Envoy based implementations, the first 2 would be implementable for HAProxy based implementations, unclear what would be implementable for NGINX or others (cc @pleshakov @shaneutt).

Why this is needed:
This is a common feature request and represents a concept that would likely get tied to a variety of custom policies if we did not include it in the main API.

@robscott robscott added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 15, 2023
@shaneutt shaneutt added this to the v0.7.0 milestone Feb 15, 2023
@shaneutt shaneutt added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Feb 15, 2023
@Xunzhuo
Copy link
Member

Xunzhuo commented Mar 8, 2023

/assign
/remove-help

@k8s-ci-robot k8s-ci-robot removed the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Mar 8, 2023
@kflynn
Copy link
Contributor

kflynn commented Mar 9, 2023

@robscott, on the mesh front, Linkerd actually doesn't support the fixed-number-of-retries concept. Instead there's a retry budget: if your retry budget is e.g. 20%, then as long as not more than 20% of your request volume is retries, Linkerd can continue retrying.

It would be lovely to be able to configure that form of retry, too, without having to resort to crazy custom stuff. Just to complicate your life. 😉

@bowei
Copy link
Contributor

bowei commented Mar 10, 2023

That's a very interesting approach. Is the 20% calculated "globally" or on a-per proxy basis? This may be difficult for more distributed proxy systems to do for a global percentage.

@dprotaso
Copy link
Contributor

@kflynn does setting 0% disable retries?

@ramaraochavali
Copy link

https://www.envoyproxy.io/docs/envoy/v1.26.1/api-v3/config/cluster/v3/circuit_breaker.proto#envoy-v3-api-field-config-cluster-v3-circuitbreakers-thresholds-retry-budget - Envoy also supports retry budgets. If you do not specify any value for this, it disables retries

@robscott
Copy link
Member Author

robscott commented May 8, 2023

@ramaraochavali thanks for the reference! I'd assumed this would be implemented by RetryPolicy in xDS which does not seem to require retry budgets, but I'm also far from an Envoy expert so may be missing some nuance here.

@ramaraochavali
Copy link

@shaneutt shaneutt added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels May 18, 2023
@shaneutt shaneutt removed this from the v0.8.0 milestone May 18, 2023
@frankbu
Copy link
Contributor

frankbu commented Jun 23, 2023

Although configuring retry budgets and other circuit breaking mechanisms would be useful to support, I don't think they are typically configurable on the individual route (vs service) level. Since the title of this issue is about configuring retries on HTTPRoute specifically, I wonder if it make sense to start with a GEP that only addresses retry configuration on an HTTPRoute and then handle retry budgets, etc. (probably using policy attachments instead of explicit fields in the API) in a separate GEP.

@robscott If you want to assign the issue to me, I can put together a first pass GEP to discuss this further.

@robscott
Copy link
Member Author

robscott commented Jun 23, 2023

Thanks @frankbu!

Although configuring retry budgets and other circuit breaking mechanisms would be useful to support, I don't think they are typically configurable on the individual route (vs service) level.

I'm certainly biased because GCP load balancers configure retries at the routing layer. My understanding is that both HAProxy and Envoy are also capable of this, but definitely could be wrong on either of those.

I wonder if it make sense to start with a GEP that only addresses retry configuration on an HTTPRoute and then handle retry budgets, etc

That approach makes sense to me. In general, we want to include concepts in the API that are portable and have a path for >50% of implementations to support. I think what you've recommended starting with would meet that criteria, but I'm not sure retry budgets have the same portability right now (again could be wrong on that).

I think a GEP is a great idea here, and similar to timeouts and session affinity, it would likely be helpful to provide an overview of the current state of the world in that GEP before going too far with details. Thanks for volunteering to help out with this!

/assign @frankbu

@kflynn
Copy link
Contributor

kflynn commented Jun 24, 2023

Linkerd configures retries at the route, too, FWIW...

@pleshakov
Copy link
Contributor

NGINX supports retries per route too with proxy_next_upstream* directives

@robscott
Copy link
Member Author

@pleshakov would proxy_next_upstream_timeout be similar to the backendRequest timeout proposed by GEP 1742? (https://gateway-api.sigs.k8s.io/geps/gep-1742/#timeout-values)

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 23, 2024
@frankbu frankbu removed their assignment Jan 31, 2024
@lubronzhan
Copy link

lubronzhan commented Feb 18, 2024

Question, are we gonna implement per route way?

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 19, 2024
@mikemorris
Copy link
Contributor

Discussed this during a meeting of Gateway API maintainers at KubeCon EU 2024 as a potential priority for Gateway API v1.2, first step would be a memorandum GEP documenting existing configuration and behavior across a range of implementations.

/remove-lifecycle rotten
/assign @mikemorris

@achetronic
Copy link

Confirmed as a needed feature. This is affecting us too (using Istio as implementation). How is the status of this on May? :)

@shaneutt
Copy link
Member

shaneutt commented Jun 5, 2024

Thanks for confirming your desire for this feature.

How is the status of this on May? :)

Can you help me to better understand this? I'm uncertain what is meant 🤔

@mikemorris
Copy link
Contributor

mikemorris commented Jun 5, 2024

We're proposing this for inclusion in the Gateway API v1.2 scope as an experimental feature. I'm not quite sure of the expected release date for v1.2, but I'd expect roughly fall/late Q3 2024.

@achetronic
Copy link

achetronic commented Jun 10, 2024

Sorry for the delay, guys

Thanks for confirming your desire for this feature.

How is the status of this on May? :)

Can you help me to better understand this? I'm uncertain what is meant 🤔

@shaneutt I meant "how is the current status for that?" sorry for the miss-explanation

We're proposing this for inclusion in the Gateway API v1.2 scope as an experimental feature. I'm not quite sure of the expected release date for v1.2, but I'd expect roughly fall/late Q3 2024.

Thank you for the info! I will check it

@robscott
Copy link
Member Author

If anyone's interested in seeing this in scope for v1.2, please upvote and/or comment on Mike's v1.2 scoping proposal.

@achetronic
Copy link

If anyone's interested in seeing this in scope for v1.2, please upvote and/or comment on Mike's v1.2 scoping proposal.

Done! thank you for clarifying this

@robscott robscott added the kind/gep PRs related to Gateway Enhancement Proposal(GEP) label Jun 11, 2024
@robscott robscott added this to the v1.2.0 milestone Jun 11, 2024
@robscott
Copy link
Member Author

robscott commented Aug 2, 2024

/reopen to track lifecycle of GEP

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. kind/gep PRs related to Gateway Enhancement Proposal(GEP) priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Projects
Status: Experimental