Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Close block series client at the end to not reuse chunk buf #7915

Merged
merged 2 commits into from
Nov 18, 2024

Conversation

yeya24
Copy link
Contributor

@yeya24 yeya24 commented Nov 17, 2024

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

In Cortex we run some query fuzz tests to make sure query results are compatible with latest Cortex or Prometheus release.

I noticed a strange query failure when comparing Cortex query results with latest Prometheus v2.55.1 result https://github.com/cortexproject/cortex/actions/runs/11858364637/job/33061424720?pr=6340. Cortex Block is loaded in Store Gateway and the same block is loaded in Prometheus. I was trying to find out the issue and reproduce it locally and found out that sometimes (1 run out of 1000 ish) query results from Cortex only return 2 series rather than expected 3 series.

I added some logs on to print out the chunk content for the series it misses. It seems that the chunk buf gets reused even before the series response sent over gRPC.

// This is `blockSeriesClient.Recv()`. Chunk content is correct now.
receive label {__name__="test_series_b", status_code="502"}, chunk content: [0 13 198 140 227 134 231 100 64 28 0 0 0 0 0 0 224 212 3 212 39 183 3 109 11 75 96 125 20 209 107 19 219 193 183 5 164]

// Inside `bucket_store_merge_all` span, when the series response is going to be send to client. Content already changed.
send series labels {__name__="test_series_b", status_code="502"}, chunk content: [20 104 90 71 254 224 13 188 45 45 193 244 83 69 180 79 66 141 10 116 40 208 182 11 244 20 52 20 116 20 52 20 244 20 52 20 116]

// What Cortex Querier received at client side
get series into it, content: [20 104 90 71 254 224 13 188 45 45 193 244 83 69 180 79 66 141 10 116 40 208 182 11 244 20 52 20 116 20 52 20 244 20 52 20 116]

The issue here seems with https://github.com/thanos-io/thanos/pull/7821/files#diff-3e2896fafa6ff73509c77df2c4389b68828e02575bb4fb78b6c34bcfb922a7ceR3357, the block chunk reader is closed when the loser tree closes certain response series set. This doesn't guarantee the block chunk reader is closed at the end of the Series call because when a response series set is exhausted it will be closed first at https://github.com/thanos-io/thanos/blob/main/pkg/losertree/tree.go#L66 and then the chunk buffer can be reused.

Changes

Since the original idea of #7821 is to fix an issue for the in process store client, I revert the code for the block series client back. Now it closes block series client at the end of the Series function to make sure chunk buffere is not reused before the function returns.

Verification

I don't have a unit test to reproduce this bug but I was just re-running the same test case I use to reproduce the fuzzy test bug. Without this bug fix the test failed constantly. Not every run but 10000 queries probably fail 5 times. With the fix it always succeed.

	rt := &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		DialContext: (&net.Dialer{
			Timeout:   10 * time.Minute,
			KeepAlive: 30 * time.Second,
		}).DialContext,
		TLSHandshakeTimeout: 10 * time.Second,
	}
	c, _ := api.NewClient(api.Config{Address: "http://localhost:8001/prometheus", RoundTripper: rt})
	a := v1.NewAPI(c)
	ctx := context.Background()
	q := `({__name__="test_series_a"} / atanh({__name__="test_series_b",series="1"}))`
	for i := 0; i < 10000; i++ {
		res, _, err := a.QueryRange(ctx, q, v1.Range{Start: time.Unix(1731815570, 0), End: time.Unix(1731819170, 0), Step: time.Second * 60})
		if err != nil {
			continue
		}
		m := res.(model.Matrix)
		require.Len(t, m, 3)
	}

Signed-off-by: Ben Ye <benye@amazon.com>
Signed-off-by: Ben Ye <benye@amazon.com>
@yeya24
Copy link
Contributor Author

yeya24 commented Nov 17, 2024

This probably also fix #7883 (comment) since we revert back the previous change to always close the block series reader.

@@ -1572,6 +1572,8 @@ func (s *BucketStore) Series(req *storepb.SeriesRequest, seriesSrv storepb.Store
tenant,
)

defer blockClient.Close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think thats essentially how LabelValues and LabelNames do it too right now!

@MichaHoffmann
Copy link
Contributor

2 Approvals, only doc check failing, ill merge; this should go into 0.37 rc

@MichaHoffmann MichaHoffmann enabled auto-merge (squash) November 18, 2024 11:36
@MichaHoffmann MichaHoffmann merged commit f998fc5 into thanos-io:main Nov 18, 2024
21 of 22 checks passed
@yeya24 yeya24 deleted the close-block-client-at-end branch November 19, 2024 04:56
@jnyi jnyi mentioned this pull request Nov 19, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants