discovery: preserve results from other resolve calls #7886

GiedriusS · 2024-11-05T14:49:48Z

Properly preserve results from other resolve calls. The code assumes that resolve() is always called with the same addresses, but that is not true with gRPC and `--endpoint-group`. Without this fix, multiple resolves can happen at the same time, and some of the callers will not be able to retrieve their results, leading to random errors.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

yeya24 · 2024-11-19T00:35:32Z

pkg/discovery/dns/grpc.go

@@ -68,7 +73,7 @@ func (r *resolver) ResolveNow(_ grpcresolver.ResolveNowOptions) {}
 func (r *resolver) resolve() error {
 	ctx, cancel := context.WithTimeout(r.ctx, r.interval)
 	defer cancel()
-	return r.provider.Resolve(ctx, []string{r.target})
+	return r.provider.Resolve(ctx, []string{r.target}, false)


Hi @GiedriusS, can you please help me understand this change? This is the only place where we set it to false. Why is that the case?
Can we just set it to true everywhere?

GiedriusS · 2024-11-19

No, we cannot, because in gRPC the same resolver is reused between `--endpoint-group`s. In other words, Build() above is called from multiple places in gRPC, but they all reuse the same resolver. They first resolve and then fetch the values from the cache. If we flush here, some of the results are lost and the Query component will not connect to some of the endpoints.

Some addresses can be shared between `--endpoint-group`s, which is why I opted to reuse the same resolver.
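To make the point above concrete, here is a minimal sketch of a cache shared by two callers. `toyProvider`, its `lookup` field, the `group-a`/`group-b` names, and the port are hypothetical illustrations, not the Thanos `dns.Provider`; only the idea of a `flushOld` argument is taken from the diff.

```go
package main

import (
	"fmt"
	"sort"
)

// toyProvider is a hypothetical stand-in for a resolver cache that several
// callers share. It is not the Thanos implementation.
type toyProvider struct {
	resolved map[string][]string
	lookup   func(addr string) []string
}

// Resolve looks up every address in addrs and stores the result.
// With flushOld set, entries for addresses not in addrs are dropped.
func (p *toyProvider) Resolve(addrs []string, flushOld bool) {
	next := map[string][]string{}
	if !flushOld {
		for k, v := range p.resolved {
			next[k] = v
		}
	}
	for _, a := range addrs {
		next[a] = p.lookup(a)
	}
	p.resolved = next
}

// Addresses returns all cached endpoints in sorted order.
func (p *toyProvider) Addresses() []string {
	var out []string
	for _, v := range p.resolved {
		out = append(out, v...)
	}
	sort.Strings(out)
	return out
}

func newToyProvider() *toyProvider {
	return &toyProvider{
		resolved: map[string][]string{},
		// Assumed lookup: every name resolves to one made-up endpoint.
		lookup: func(addr string) []string { return []string{addr + ":10901"} },
	}
}

func main() {
	// Two endpoint groups resolve through the same provider, one after the other.
	flushing := newToyProvider()
	flushing.Resolve([]string{"group-a"}, true)
	flushing.Resolve([]string{"group-b"}, true)
	fmt.Println(flushing.Addresses()) // [group-b:10901]

	preserving := newToyProvider()
	preserving.Resolve([]string{"group-a"}, false)
	preserving.Resolve([]string{"group-b"}, false)
	fmt.Println(preserving.Addresses()) // [group-a:10901 group-b:10901]
}
```

With flushing enabled, the second call wipes out what the first caller resolved, which is the lost-endpoint symptom described in the reply.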

yeya24

@GiedriusS
I don't know if it is intended, but the behavior here is different when there is a resolve error.

Let's say that, in the worst case, DNS resolution fails for all addresses. The previously resolved addresses are {"A": ["1"], "B": ["2"]}. It now tries to resolve addresses B and C.

Before the change, the resolved results would be {"B": ["2"], "C": nil}.

After this change, the resolved results are {"A": ["1"], "B": ["2"]}, because it only keeps the previous records, and flushOld makes no difference when there is an error.

Is this intended? I expected flushOld to change the resolved addresses only when there is no error, but this implementation also changes the behavior in the error case.
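The scenario in this comment can be replayed with a small toy model. `resolveBefore`, `resolveAfter`, `records`, and `failingLookup` are hypothetical names written only to reproduce the two outcomes quoted above; they are not copied from Thanos, and how a partial failure is handled in `resolveAfter` is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

type records map[string][]string

type lookupFunc func(addr string) ([]string, error)

// failingLookup simulates the worst case from the comment: every lookup fails.
func failingLookup(addr string) ([]string, error) {
	return nil, errors.New("lookup failed for " + addr)
}

// resolveBefore models the described pre-change outcome: the result holds only
// the requested addresses, and a failed lookup falls back to the previous
// value for that address.
func resolveBefore(prev records, addrs []string, lookup lookupFunc) records {
	next := records{}
	for _, a := range addrs {
		ips, err := lookup(a)
		if err != nil {
			next[a] = prev[a] // nil when the address was never resolved
			continue
		}
		next[a] = ips
	}
	return next
}

// resolveAfter models the described post-change outcome: previous records are
// kept, successful lookups overwrite them, and old records are flushed only
// when flushOld is set and no lookup failed (assumed).
func resolveAfter(prev records, addrs []string, flushOld bool, lookup lookupFunc) records {
	fresh := records{}
	failed := false
	for _, a := range addrs {
		ips, err := lookup(a)
		if err != nil {
			failed = true
			continue
		}
		fresh[a] = ips
	}
	next := records{}
	if !flushOld || failed {
		for k, v := range prev {
			next[k] = v
		}
	}
	for k, v := range fresh {
		next[k] = v
	}
	return next
}

func main() {
	prev := records{"A": {"1"}, "B": {"2"}}
	addrs := []string{"B", "C"}

	fmt.Println(resolveBefore(prev, addrs, failingLookup))
	fmt.Println(resolveAfter(prev, addrs, true, failingLookup))
	fmt.Println(resolveAfter(prev, addrs, false, failingLookup))
}
```

In this model the first call drops A and records C with no addresses, while both `resolveAfter` calls return the previous records unchanged, so the value of `flushOld` has no visible effect once every lookup fails.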

pull-request-size bot added the size/L label Nov 5, 2024

niaurys approved these changes Nov 6, 2024

GiedriusS merged commit df3df36 into main Nov 6, 2024
22 checks passed

GiedriusS deleted the endpointgroup_fix branch November 6, 2024 07:46

yeya24 reviewed Nov 19, 2024

yeya24 mentioned this pull request Nov 19, 2024

Upgrade thanos version to fix store gateway buf reuse issue cortexproject/cortex#6346

Merged

yeya24 reviewed Nov 19, 2024

BrewTestBot mentioned this pull request Nov 25, 2024

thanos 0.37.0 Homebrew/homebrew-core#198927

Merged
