Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

subgraph retry policy and circuit breaking #338

Closed
Geal opened this issue Jan 18, 2022 · 4 comments · Fixed by #2006
Closed

subgraph retry policy and circuit breaking #338

Geal opened this issue Jan 18, 2022 · 4 comments · Fixed by #2006
Assignees

Comments

@Geal
Copy link
Contributor

Geal commented Jan 18, 2022

we should be able to specify a timeout for queries to a subgraph, and different techniques to retry queries or stop sending traffic temporarily to a subgraph.

While those options are set globally for a subgraph, we should take care of maintaining a list of hosts for that subgraph with query timing and success statistics per host.

@abernix
Copy link
Member

abernix commented Feb 14, 2022

There are some existing Tower layers that might be worth looking at here. They may or may not be suitable, be GraphQL-aware, particularly if there are non-retryable properties. (In those cases, perhaps we can wrap one of the Tower layers that offers functionality and provide it the knowledge it needs)

(This could also relate to the fetcher Error-handling conversation.)

@abernix abernix removed the triage label May 23, 2022
@abernix abernix changed the title per subgraph timeout and retry policy / circuit breaker subgraph retry policy and circuit breaking Aug 12, 2022
@abernix
Copy link
Member

abernix commented Aug 12, 2022

Follows up #1347

@Geal Geal assigned Geal and unassigned Geal Oct 25, 2022
@Geal
Copy link
Contributor Author

Geal commented Oct 26, 2022

adding a retry would fix #1956

@Geal
Copy link
Contributor Author

Geal commented Oct 26, 2022

So I looked into the tower retry layer, and along with the issue of recognizing if a query is retryable (easy to decide on HTTP errors, less so for graphql ones), I am encountering architecture issues with plugins, the same I had in #1889: the layer will need the underlying service to be clonable.

In that PR, I solved it by moving the query deduplication layer out of the traffic shaping plugin, and instead applying it directly at the subgraph service creation. Then plugins like the traffic shaping are applied above that, taking as argument and returning a (non clonable) BoxService. I was able in that PR to keep the deduplication configuration in the traffic shaping configuration though.

I could do the same for retries, but we will encounter issues with plugin ordering:

  • a subgraph request would go through -> timeout -> rate limit -> -> retry -> deduplication -> subgraph (don't know where circuit breaking would fit here)
  • while what we actually need is -> timeout -> deduplication -> retry -> rate limit -> circuit breaking -> subgraph

So I would like to remove the subgraph plugin part of the traffic shaping plugin and move it to the subgraph service, or better, add a more generic internal trait that can be implemented by some plugins, to add layers to the subgraph and other services, without working on a BoxService. Again, that would still use the same traffic shaping configuration section, but dispatch the code elsewhere

@BrynCooke BrynCooke modified the milestones: v1-NEXT, v1.5.0 Dec 2, 2022
@garypen garypen added this to the v1.5.0 milestone Dec 5, 2022
@BrynCooke BrynCooke modified the milestone: v1.5.0 Dec 5, 2022
tinnou pushed a commit to Netflix-Skunkworks/router that referenced this issue Oct 16, 2023
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants