Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go-discover is only used during initial client introduction #17872

Open
tgross opened this issue Jul 10, 2023 · 1 comment
Open

go-discover is only used during initial client introduction #17872

tgross opened this issue Jul 10, 2023 · 1 comment

Comments

@tgross
Copy link
Member

tgross commented Jul 10, 2023

While we were working on #16490 @schmichael and I had a discussion about how we might improve the current process for server discovery and failover with respect to go-discover and Consul. Brainstorming some potential improvements:

  1. Punt Consul discovery to go-discover and call back into go-discover whenever a client can't find a single healthy server. (Probably more effort than its worth and significant backward compat risks.)
  2. Persist the server list to client state for faster registrations after node restarts. (This requires ensuring we don't increase our likelihood of hitting an error + long retry which might increase the number of down nodes due to restarts in stead of decreasing that!)
  3. Clients could request servers extend their heartbeat on graceful shutdown (2x the default? a bit risky)
  4. Actually document/design client disco/retry logic?

(4) is appealing as a bare minimum because our current logic predates RFCs and accreted new features in hopes just a bit more code would make it more robust. It's left us in a situation where there's some potential for weird emergent behavior, and it's all very difficult to debug, explain, or test for correctness in suboptimal conditions.

The existing code conflates some distinct properties and operations:

  1. "Consul Discovery" is just kind of sprinkled throughout... mostly I think with the intention of using it when we've run out of other options, but I don't know how precise that is and we may be placing far more trust in Consul's availability and correctness than we should!
  2. go-discover runs once concurrent with Client startup, so I think it races with Consul Discovery and could overwrite a perfectly good server list with a stale one? Again... hard to reason about since there's so many concurrent operations.
  3. Our discovery is concurrent with registration is concurrent with other RPCs and that causes tons of complexity: core: enforce strict steps for clients reconnect #15808

This needs some discussion and design.

@tgross
Copy link
Member Author

tgross commented Oct 26, 2023

Adding a note here that we're also adding go-netaddrs support in #18745.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants