Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement endpoint groups #5548

Merged
merged 4 commits into from
Feb 9, 2023
Merged

Conversation

fpetkovski
Copy link
Contributor

@fpetkovski fpetkovski commented Jul 28, 2022

This draft PR illustrates how an HA group of endpoints can be configured in Thanos Query.

Looking for feedback on the approach!

Fixes #5335

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Add the flags endpoint-group and endpoint-group-strict to Thanos Query for load balancing instead of fanout.

Verification

@fpetkovski fpetkovski force-pushed the ha-endpoints branch 3 times, most recently from 20e4953 to 6a6bdab Compare July 28, 2022 08:20
@fpetkovski fpetkovski marked this pull request as draft July 28, 2022 08:21
@fpetkovski
Copy link
Contributor Author

One downside of using load balancing from gRPC is that dns resolution is done once during startup and is cached for a very long time, potentially forever. Addresses will be re-resolved when the downstream target goes away.

This can be problematic if the endpoing group is scaled out and new targets are added. The recommended workaround here is to set max_connection_age to something like 5m which will cause periodic dns resolution, but also a complete recreation of all connections to the endpoint group.

@fpetkovski
Copy link
Contributor Author

cc @saswatamcode @SrushtiSapkale

saswatamcode
saswatamcode previously approved these changes Aug 1, 2022
Copy link
Member

@saswatamcode saswatamcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this! ✨

Some comments!

serviceConfig := `
{
"loadBalancingPolicy":"round_robin",
"retryPolicy": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should be configurable via flags?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this could be a good idea. The new endpoint config would also help here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add this flag in this PR, with this particular config as the default? 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This policy is really an implementation detail, I don't think it makes sense to expose it to the end user. The retry always has to be there so that the client will re-resolve dns endpoints when the one it's connected to goes away. This is how HA is implemented in gRPC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see!

pkg/query/endpointset.go Show resolved Hide resolved
@stale
Copy link

stale bot commented Oct 1, 2022

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Oct 1, 2022
@fpetkovski
Copy link
Contributor Author

I am not sure what the status of the LFX project is, but this could be a short term solution that gets us unblocked until we have a better implementation.

@@ -11,6 +11,8 @@
"strings"
"time"

"google.golang.org/grpc"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typecheck:
other declaration of grpc


ℹ️ Learn about @sonatype-lift commands

You can reply with the following commands. For example, reply with @sonatype-lift ignoreall to leave out all findings.

Command Usage
@sonatype-lift ignore Leave out the above finding from this PR
@sonatype-lift ignoreall Leave out all the existing findings from this PR
@sonatype-lift exclude <file|issue|path|tool> Exclude specified file|issue|path|tool from Lift findings by updating your config.toml file

Note: When talking to LiftBot, you need to refresh the page to see its response.
Click here to add LiftBot to another repo.


Was this a good recommendation?
[ 🙁 Not relevant ] - [ 😕 Won't fix ] - [ 😑 Not critical, will fix ] - [ 🙂 Critical, will fix ] - [ 😊 Critical, fixing now ]

@stale
Copy link

stale bot commented Jan 7, 2023

Hello 👋 Looks like there was no activity on this amazing PR for the last 30 days.
Do you mind updating us on the status? Is there anything we can help with? If you plan to still work on it, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next week, this issue will be closed (we can always reopen a PR if you get back to this!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jan 7, 2023
@trevorriles
Copy link

Currently very interested in this work. What are the blockers for this being merged?

@stale stale bot removed the stale label Jan 23, 2023
@fpetkovski
Copy link
Contributor Author

There was an LFX project that was a superset of this functionality: #5505

Maybe @saswatamcode or @matej-g have some information on how far that got and whether it makes sense to continue working on this PR.

@trevorriles
Copy link

There was an LFX project that was a superset of this functionality: #5505

Maybe @saswatamcode or @matej-g have some information on how far that got and whether it makes sense to continue working on this PR.

Thanks for the link!

@saswatamcode
Copy link
Member

Yes, we made some progress on that end, but I think there were some more items left as well. This is also related to #2600

I'll try to organize them a bit, and move this forward! 🙂

@fpetkovski
Copy link
Contributor Author

I see that HA endpoints are out of scope of that proposal: https://github.com/thanos-io/thanos/pull/5505/files#diff-5dad1d444b473dcd0b72f4770b3ba03089499cfaf027205c83914f63124644e5R38. Are you looking to cover them in your work?

@fpetkovski
Copy link
Contributor Author

@SuperQ brought this feature up yesterday during contributor hours. Do we want to proceed with something like this until we have more extensive endpoint configuration? It would be great to not have to run envoy for load balancing gRPC connections since it's another component in the query path that can break. cc @saswatamcode @bwplotka

saswatamcode
saswatamcode previously approved these changes Feb 9, 2023
Copy link
Member

@saswatamcode saswatamcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I think it would be great to proceed with this until we have a full-fledged endpoint configuration! 💪🏻

Just some minor comments on this!

serviceConfig := `
{
"loadBalancingPolicy":"round_robin",
"retryPolicy": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add this flag in this PR, with this particular config as the default? 🙂

@@ -128,6 +128,9 @@ func registerQuery(app *extkingpin.App) {
endpoints := extkingpin.Addrs(cmd.Flag("endpoint", "Addresses of statically configured Thanos API servers (repeatable). The scheme may be prefixed with 'dns+' or 'dnssrv+' to detect Thanos API servers through respective DNS lookups.").
PlaceHolder("<endpoint>"))

endpointGroups := extkingpin.Addrs(cmd.Flag("endpoint-group", "DNS name of statically configured Thanos API server groups (repeatable). Targets resolved from the DNS name will be queried in a round-robin instead of a fanout manner. This flag should be used when connecting a Thanos Query to HA groups of Thanos components.").
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
endpointGroups := extkingpin.Addrs(cmd.Flag("endpoint-group", "DNS name of statically configured Thanos API server groups (repeatable). Targets resolved from the DNS name will be queried in a round-robin instead of a fanout manner. This flag should be used when connecting a Thanos Query to HA groups of Thanos components.").
endpointGroups := extkingpin.Addrs(cmd.Flag("endpoint-group", "DNS name of statically configured Thanos API server groups (repeatable). Targets resolved from the DNS name will be queried in a round-robin manner by default instead of a fanout manner. This flag should be used when connecting a Thanos Query to HA groups of Thanos components.").

cmd/thanos/query.go Outdated Show resolved Hide resolved
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
saswatamcode
saswatamcode previously approved these changes Feb 9, 2023
Copy link
Member

@saswatamcode saswatamcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@fpetkovski
Copy link
Contributor Author

I marked these flags as experimental, there could be some hidden dragons that we'll uncover over time.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
@saswatamcode saswatamcode enabled auto-merge (squash) February 9, 2023 08:44
@saswatamcode saswatamcode merged commit 6ac440a into thanos-io:main Feb 9, 2023
@@ -128,6 +128,9 @@ func registerQuery(app *extkingpin.App) {
endpoints := extkingpin.Addrs(cmd.Flag("endpoint", "Addresses of statically configured Thanos API servers (repeatable). The scheme may be prefixed with 'dns+' or 'dnssrv+' to detect Thanos API servers through respective DNS lookups.").
PlaceHolder("<endpoint>"))

endpointGroups := extkingpin.Addrs(cmd.Flag("endpoint-group", "Experimental: DNS name of statically configured Thanos API server groups (repeatable). Targets resolved from the DNS name will be queried in a round-robin, instead of a fanout manner. This flag should be used when connecting a Thanos Query to HA groups of Thanos components.").
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HA groups of Thanos components. means store only?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be any component, querier, store, exemplar etc.. as long as they implement the same APIs. So a group should not have a mix of components.

ngraham20 pushed a commit to ngraham20/thanos that referenced this pull request Apr 17, 2023
* Implement endpoint groups

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Fix imports

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Remove stray grpc option

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Mark as experimental and regenerate docs

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

---------

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
ngraham20 pushed a commit to ngraham20/thanos that referenced this pull request Apr 17, 2023
* Implement endpoint groups

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Fix imports

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Remove stray grpc option

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Mark as experimental and regenerate docs

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

---------

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Load balancing for Store Gateways
4 participants