-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement endpoint groups #5548
Conversation
20e4953
to
6a6bdab
Compare
One downside of using load balancing from gRPC is that dns resolution is done once during startup and is cached for a very long time, potentially forever. Addresses will be re-resolved when the downstream target goes away. This can be problematic if the endpoing group is scaled out and new targets are added. The recommended workaround here is to set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for this! ✨
Some comments!
serviceConfig := ` | ||
{ | ||
"loadBalancingPolicy":"round_robin", | ||
"retryPolicy": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this should be configurable via flags?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this could be a good idea. The new endpoint config would also help here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add this flag in this PR, with this particular config as the default? 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This policy is really an implementation detail, I don't think it makes sense to expose it to the end user. The retry always has to be there so that the client will re-resolve dns endpoints when the one it's connected to goes away. This is how HA is implemented in gRPC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see!
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward? This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
I am not sure what the status of the LFX project is, but this could be a short term solution that gets us unblocked until we have a better implementation. |
f7e061e
to
7a6c1e3
Compare
cmd/thanos/query.go
Outdated
@@ -11,6 +11,8 @@ | |||
"strings" | |||
"time" | |||
|
|||
"google.golang.org/grpc" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typecheck:
other declaration of grpc
ℹ️ Learn about @sonatype-lift commands
You can reply with the following commands. For example, reply with @sonatype-lift ignoreall to leave out all findings.
Command | Usage |
---|---|
@sonatype-lift ignore |
Leave out the above finding from this PR |
@sonatype-lift ignoreall |
Leave out all the existing findings from this PR |
@sonatype-lift exclude <file|issue|path|tool> |
Exclude specified file|issue|path|tool from Lift findings by updating your config.toml file |
Note: When talking to LiftBot, you need to refresh the page to see its response.
Click here to add LiftBot to another repo.
Was this a good recommendation?
[ 🙁 Not relevant ] - [ 😕 Won't fix ] - [ 😑 Not critical, will fix ] - [ 🙂 Critical, will fix ] - [ 😊 Critical, fixing now ]
Hello 👋 Looks like there was no activity on this amazing PR for the last 30 days. |
Currently very interested in this work. What are the blockers for this being merged? |
There was an LFX project that was a superset of this functionality: #5505 Maybe @saswatamcode or @matej-g have some information on how far that got and whether it makes sense to continue working on this PR. |
Thanks for the link! |
Yes, we made some progress on that end, but I think there were some more items left as well. This is also related to #2600 I'll try to organize them a bit, and move this forward! 🙂 |
I see that HA endpoints are out of scope of that proposal: https://github.com/thanos-io/thanos/pull/5505/files#diff-5dad1d444b473dcd0b72f4770b3ba03089499cfaf027205c83914f63124644e5R38. Are you looking to cover them in your work? |
2d49543
to
8c75c4a
Compare
@SuperQ brought this feature up yesterday during contributor hours. Do we want to proceed with something like this until we have more extensive endpoint configuration? It would be great to not have to run envoy for load balancing gRPC connections since it's another component in the query path that can break. cc @saswatamcode @bwplotka |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I think it would be great to proceed with this until we have a full-fledged endpoint configuration! 💪🏻
Just some minor comments on this!
serviceConfig := ` | ||
{ | ||
"loadBalancingPolicy":"round_robin", | ||
"retryPolicy": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add this flag in this PR, with this particular config as the default? 🙂
cmd/thanos/query.go
Outdated
@@ -128,6 +128,9 @@ func registerQuery(app *extkingpin.App) { | |||
endpoints := extkingpin.Addrs(cmd.Flag("endpoint", "Addresses of statically configured Thanos API servers (repeatable). The scheme may be prefixed with 'dns+' or 'dnssrv+' to detect Thanos API servers through respective DNS lookups."). | |||
PlaceHolder("<endpoint>")) | |||
|
|||
endpointGroups := extkingpin.Addrs(cmd.Flag("endpoint-group", "DNS name of statically configured Thanos API server groups (repeatable). Targets resolved from the DNS name will be queried in a round-robin instead of a fanout manner. This flag should be used when connecting a Thanos Query to HA groups of Thanos components."). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
endpointGroups := extkingpin.Addrs(cmd.Flag("endpoint-group", "DNS name of statically configured Thanos API server groups (repeatable). Targets resolved from the DNS name will be queried in a round-robin instead of a fanout manner. This flag should be used when connecting a Thanos Query to HA groups of Thanos components."). | |
endpointGroups := extkingpin.Addrs(cmd.Flag("endpoint-group", "DNS name of statically configured Thanos API server groups (repeatable). Targets resolved from the DNS name will be queried in a round-robin manner by default instead of a fanout manner. This flag should be used when connecting a Thanos Query to HA groups of Thanos components."). |
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
8c75c4a
to
d51fd71
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I marked these flags as experimental, there could be some hidden dragons that we'll uncover over time. |
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
7343447
to
7fe57a1
Compare
@@ -128,6 +128,9 @@ func registerQuery(app *extkingpin.App) { | |||
endpoints := extkingpin.Addrs(cmd.Flag("endpoint", "Addresses of statically configured Thanos API servers (repeatable). The scheme may be prefixed with 'dns+' or 'dnssrv+' to detect Thanos API servers through respective DNS lookups."). | |||
PlaceHolder("<endpoint>")) | |||
|
|||
endpointGroups := extkingpin.Addrs(cmd.Flag("endpoint-group", "Experimental: DNS name of statically configured Thanos API server groups (repeatable). Targets resolved from the DNS name will be queried in a round-robin, instead of a fanout manner. This flag should be used when connecting a Thanos Query to HA groups of Thanos components."). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HA groups of Thanos components.
means store only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be any component, querier, store, exemplar etc.. as long as they implement the same APIs. So a group should not have a mix of components.
* Implement endpoint groups Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Fix imports Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Remove stray grpc option Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Mark as experimental and regenerate docs Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> --------- Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
* Implement endpoint groups Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Fix imports Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Remove stray grpc option Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Mark as experimental and regenerate docs Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> --------- Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
This draft PR illustrates how an HA group of endpoints can be configured in Thanos Query.
Looking for feedback on the approach!
Fixes #5335
Changes
Add the flags
endpoint-group
andendpoint-group-strict
to Thanos Query for load balancing instead of fanout.Verification