grpc: switch servers and retry on error #15892

Merged on Jan 5, 2023 (2 commits)

@boxofrad boxofrad commented Jan 4, 2023

Description

This is the OSS portion of enterprise PR 3822, which has already been reviewed thoroughly there.

It adds a custom gRPC balancer that replicates the router's server cycling behavior. It also enables automatic retries for RESOURCE_EXHAUSTED errors, which we now get for free.
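For readers unfamiliar with the cycling behavior, here is a minimal, stdlib-only sketch of the idea: pin calls to one server and move to the next when an error is witnessed. This is not the balancer code from this PR. The `cycler` type, its methods, and `errRateLimited` are hypothetical names, and the plain round-robin rotation is an assumption; the real balancer may choose the next server differently.

```go
package main

import (
	"errors"
	"fmt"
)

// errRateLimited stands in for a gRPC RESOURCE_EXHAUSTED status (assumption).
var errRateLimited = errors.New("resource exhausted")

// cycler is a hypothetical, simplified model of server cycling: it pins
// all calls to one server and moves to the next one when an error is seen.
type cycler struct {
	servers []string
	current int
}

// Current returns the server that calls are currently pinned to.
func (c *cycler) Current() string { return c.servers[c.current] }

// WitnessError rotates to the next server, wrapping around at the end.
// A nil error, or a list with fewer than two servers, changes nothing.
func (c *cycler) WitnessError(err error) {
	if err == nil || len(c.servers) < 2 {
		return
	}
	c.current = (c.current + 1) % len(c.servers)
}

func main() {
	c := &cycler{servers: []string{"127.0.0.1:9102", "127.0.0.1:9101"}}
	fmt.Println("before:", c.Current())
	c.WitnessError(errRateLimited)
	fmt.Println("after:", c.Current())
}
```

In this model the switch happens once per witnessed error, so a healthy server keeps receiving all traffic until it fails.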

Testing & Reproduction steps

The balancer package has unit tests that spin up real gRPC servers and clients.

Also manually tested in enterprise by:

  • Hacking the partition read endpoint to return RESOURCE_EXHAUSTED if req.Name != NodeName
  • Running two servers and a client agent
  • Running consul partition read server1 and consul partition read server2
  • Watching the agent logs and observing:
2022-12-09T12:15:45.850Z [TRACE] agent.grpc.balancer: witnessed RPC error: target=consul://dc1.7105466e-b7d7-8240-a500-88a14461aa49/server.dc1 server=dc1-127.0.0.1:9102 error="rpc error: code = ResourceExhausted desc = you got rate limited, my dude"
2022-12-09T12:15:45.850Z [DEBUG] agent.grpc.balancer: switching server: target=consul://dc1.7105466e-b7d7-8240-a500-88a14461aa49/server.dc1 from=dc1-127.0.0.1:9102 to=dc1-127.0.0.1:9101
2022-12-09T12:15:45.850Z [TRACE] agent.grpc.balancer: sub-connection state changed: target=consul://dc1.7105466e-b7d7-8240-a500-88a14461aa49/server.dc1 server=dc1-127.0.0.1:9101 state=CONNECTING
2022-12-09T12:15:45.851Z [TRACE] agent.grpc.balancer: sub-connection state changed: target=consul://dc1.7105466e-b7d7-8240-a500-88a14461aa49/server.dc1 server=dc1-127.0.0.1:9101 state=READY
2022-12-09T12:16:12.049Z [ERROR] agent.http: Request error: method=GET url=/v1/partition/foo from=127.0.0.1:62096 error="Partition not found for \"server2\""

Note that the balancer automatically switched connections and retried against the other server 🙌🏻
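The manual test above can be approximated without gRPC. The sketch below is only an illustration under stated assumptions: `fakeServer`, `readPartition`, `readWithRetry`, and `errResourceExhausted` are hypothetical names, and the retry loop is hand-written here, whereas in the PR the retry is automatic. The real endpoint also goes on to look up the partition, which this sketch skips.

```go
package main

import (
	"errors"
	"fmt"
)

// errResourceExhausted stands in for gRPC's RESOURCE_EXHAUSTED status code.
var errResourceExhausted = errors.New("resource exhausted")

// fakeServer mimics the hacked endpoint: it rejects any request whose
// name differs from its own node name.
type fakeServer struct{ nodeName string }

func (s fakeServer) readPartition(name string) (string, error) {
	if name != s.nodeName {
		return "", errResourceExhausted
	}
	return "handled by " + s.nodeName, nil
}

// readWithRetry tries each server at most once, starting at index start,
// and moves on to the next server only for errResourceExhausted.
func readWithRetry(servers []fakeServer, start int, name string) (string, error) {
	lastErr := errors.New("no servers")
	for i := 0; i < len(servers); i++ {
		s := servers[(start+i)%len(servers)]
		out, err := s.readPartition(name)
		if err == nil {
			return out, nil
		}
		if !errors.Is(err, errResourceExhausted) {
			return "", err // not retryable in this sketch
		}
		lastErr = err
	}
	return "", lastErr
}

func main() {
	servers := []fakeServer{{nodeName: "server1"}, {nodeName: "server2"}}
	// The first attempt hits server1 and is rejected; the retry reaches server2.
	fmt.Println(readWithRetry(servers, 0, "server2"))
}
```

Bounding the loop to one attempt per server keeps the sketch from spinning forever when every server is rate limiting.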

@boxofrad boxofrad requested a review from jmurret January 4, 2023 11:27

boxofrad commented Jan 4, 2023

Hey @jmurret 👋🏻

Not sure if you've seen one of these manual ENT→OSS PRs before? Shout if you need any direction 🙇🏻
