Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loki: Add a ring to the query scheduler to allow discovery via the ring as an alternative to DNS #4424

Merged
merged 10 commits into from
Oct 21, 2021

Conversation

slim-bean
Copy link
Collaborator

@slim-bean slim-bean commented Oct 7, 2021

To support this change, several packages were forked from upstream Cortex. This is intended to be a transient fork to validate these changes in Loki before upstreaming.

The first commit is forking the files, the second commit includes all the changes.

The newly added ring watcher leverages the existing infrastructure which used DNS for finding the scheduler, there are some interfaces and config options I re-used for the the new ring watcher which are now not the most consistently named.

For example, the poll interval on the ring still uses the existing DNS named config options for polling, e.x. dns_lookup_duration

I debated introducing new config options but currently am of the opinion this just makes the config more confusing and cumbersome. Instead explaining the dual functionality in the documentation.

Checklist

  • Documentation added
  • Tests updated

@slim-bean slim-bean requested a review from a team as a code owner October 7, 2021 01:44
Copy link
Contributor

@jeschkies jeschkies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea but there seems to be an awful lot of copy pasta. What is the reason not to add it to Cortex?

@@ -0,0 +1,75 @@
package frontend
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the package path stuttering? Why not pkg/loki/frontend/config.go?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is done mostly for simplicity, lokifrontend/config.go already existed and hopefully when we upstream these changes and removed the forked code it can just remain the way it was.

)

// GrpcRoundTripper is similar to http.RoundTripper, but works with HTTP requests converted to protobuf messages.
type GrpcRoundTripper interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something we could add to dskit instead?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could eventually be moved to dskit (as it was taken verbatim from cortex), but I don't think that should be in scope for this PR.

@@ -0,0 +1,352 @@
package v1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the change to the Cortex version?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand the question.


"github.com/cortexproject/cortex/pkg/frontend/v2/frontendv2pb"
"github.com/cortexproject/cortex/pkg/querier/stats"
"github.com/cortexproject/cortex/pkg/ring"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using ring from dskit?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because Loki hasnt been updated to use the ring from dskit as a whole, that should be done as a separate step

Copy link
Contributor

@KMiller-Grafana KMiller-Grafana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking my approval of the docs part of this PR.
At some point, it'd be super to correct all the grammar and punctuation and improve the wording of all the configuration knobs. We'll do that in the code and then autogenerate this list of knobs.

@slim-bean
Copy link
Collaborator Author

I like the idea but there seems to be an awful lot of copy pasta. What is the reason not to add it to Cortex?

I tried to explain in the description, copying is intentional to prove the concept before upstreaming, also with the refactoring taking place with dskit it would be difficult to make this change upstream and revendor it within the timeframe I'd like to see this working for Loki.

Copy link
Contributor

@KMiller-Grafana KMiller-Grafana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review: I've added quite a few suggestions. They are not required, but would be very nice in the documentation for this PR.

docs/sources/configuration/_index.md Outdated Show resolved Hide resolved
docs/sources/configuration/_index.md Outdated Show resolved Hide resolved
docs/sources/configuration/_index.md Outdated Show resolved Hide resolved
docs/sources/configuration/_index.md Outdated Show resolved Hide resolved
docs/sources/configuration/_index.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@trevorwhitney trevorwhitney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, excited to get this merged!

My biggest question is around the change to module dependencies. I'm curious why we have to make the Querier and Frontend dependent on the Scheduler? If we need to do this, there are some configurations we're accounting for in other places that are no longer possible. It also seems to remove the need for the v1 Frontend altogether?

pkg/loki/loki.go Outdated
@@ -432,16 +433,16 @@ func (t *Loki) setupModuleManager() error {
Distributor: {Ring, Server, Overrides, TenantConfigs},
Store: {Overrides},
Ingester: {Store, Server, MemberlistKV, TenantConfigs},
Querier: {Store, Ring, Server, IngesterQuerier, TenantConfigs},
Querier: {QueryScheduler, Store, Ring, Server, IngesterQuerier, TenantConfigs},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is confusing to me. Do we in fact always want to spin up a scheduler when we spin up a querier? I'm missing the technical reason why, as it seems like the worker config can take a nil for the query schedule ReadRing. Is it an order thing, where when we do have both a querier and a scheduler we need to make sure the scheduler starts up first?

By making this change is seems like we would prevent the need to check for cfg.FrontendAddress as you would always have a scheduler, so you would probably want to either use the scheduler ring or scheduler address on localhost if none other was specified.

This change seems to make it impossible to have a frontend + querier configuration, or to have frontend + scheduler pods be separate from querier pods?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, does that mean querierRunningStandalone will never return true anymore?

pkg/loki/loki.go Outdated
QueryFrontendTripperware: {Server, Overrides, TenantConfigs},
QueryFrontend: {QueryFrontendTripperware},
QueryScheduler: {Server, Overrides},
QueryFrontend: {QueryScheduler, QueryFrontendTripperware},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again here, I'm confused why the scheduler needs to become a dependency of the frontend?

@@ -416,7 +417,7 @@ func (t *Loki) initQueryFrontend() (_ services.Service, err error) {
FrontendV1: t.Cfg.Frontend.FrontendV1,
FrontendV2: t.Cfg.Frontend.FrontendV2,
DownstreamURL: t.Cfg.Frontend.DownstreamURL,
}, disabledShuffleShardingLimits{}, t.Cfg.Server.GRPCListenPort, util_log.Logger, prometheus.DefaultRegisterer)
}, t.queryScheduler.ReadRing(), disabledShuffleShardingLimits{}, t.Cfg.Server.GRPCListenPort, util_log.Logger, prometheus.DefaultRegisterer)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this call site is getting pretty big, how would you feel about pulling frontend.CombinedFrontendConfig into a local variable and/or moving the arguments to their own lines?

// is configured, Loki will default to using the frontend on localhost on it's own GRPC listening port.
if (*cfg.QuerierWorkerConfig).FrontendAddress == "" && (*cfg.QuerierWorkerConfig).SchedulerAddress == "" {
if cfg.SchedulerRing == nil && (*cfg.QuerierWorkerConfig).FrontendAddress == "" && (*cfg.QuerierWorkerConfig).SchedulerAddress == "" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

depending on how we resolve the dependency changes (ie. if the querier really does depend on the scheduler), then we may want to set the scheduler address instead of the frontend address on this condition.

Comment on lines +76 to +82
if r.Worker.FrontendAddress == "" &&
r.Worker.SchedulerAddress == "" &&
r.Frontend.FrontendV2.SchedulerAddress == "" {
r.QueryScheduler.UseSchedulerRing = true
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add a quick test for this in config_wrapper_test.go?

@@ -87,5 +95,6 @@ func applyMemberlistConfig(r *ConfigWrapper) {
r.Ingester.LifecyclerConfig.RingConfig.KVStore.Store = memberlistStr
r.Distributor.DistributorRing.KVStore.Store = memberlistStr
r.Ruler.Ring.KVStore.Store = memberlistStr
r.QueryScheduler.SchedulerRing.KVStore.Store = memberlistStr
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add a quick test for this in config_wrapper_test.go? there's a few other memberlist tests so maybe just adding to an existing for copy/pasing?

@trevorwhitney
Copy link
Collaborator

@slim-bean I reworked the module dependencies, and have been testing locally where I have the read and write targets, and everything is working. I want to try a local microservices setup with this code to make sure that still works, and maybe a single binary as well. I also re-based on master so I need to fix some of the uses of the wrong log library, and then I'll push up the changes.

@trevorwhitney
Copy link
Collaborator

@slim-bean ok, now I think it's actually ready to go, got the nil pointer issue resolved when running the querier or frontend without a co-located scheduler. I'll babysit this through the CI checks, but then we should be good for some more reviews.

Copy link
Collaborator

@trevorwhitney trevorwhitney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving my own changes to the module dependency issue, but would like a review from others.

slim-bean and others added 10 commits October 21, 2021 14:15
…worker and querier workers to find the scheduler address as an alternative to using DNS
…ince #4312 was merged after the original fork was created.
Signed-off-by: Trevor Whitney <trevorjwhitney@gmail.com>
Signed-off-by: Trevor Whitney <trevorjwhitney@gmail.com>
A few doc fixes

Co-authored-by: Karen Miller <84039272+KMiller-Grafana@users.noreply.github.com>
@slim-bean slim-bean merged commit 2b5f300 into main Oct 21, 2021
@slim-bean slim-bean deleted the ring-scheduler branch October 21, 2021 20:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants