Store gateway overfetches chunks and series data #6421

Open
yeya24 opened this issue Jun 7, 2023 · 4 comments



yeya24 commented Jun 7, 2023

Is your proposal related to a problem?

Cortex uses the Thanos Store Gateway to query data on S3. With #6352, we can now inspect how much data the Store Gateway fetches per request. Here are two examples:

querier-764d7bf7f7-5sssq querier ts=2023-06-06T23:04:35.420994833Z caller=spanlogger.go:87 level=info msg="store gateway series request stats" instance=10.0.38.155:9095 queryable_chunk_bytes_fetched=194 queryable_data_bytes_fetched=265 blocks_queried=1 series_merged_count=1 chunks_merged_count=1 postings_touched=1 postings_touched_size_sum=28 postings_to_fetch=0 postings_fetched=0 postings_fetch_count=0 postings_fetched_size_sum=0 series_touched=6 series_touched_size_sum=720 series_fetched=0 series_fetch_count=0 series_fetched_size_sum=0 chunks_touched=1 chunks_touched_size_sum=164 chunks_fetched=1 chunks_fetch_count=1 chunks_fetched_size_sum=16000 data_downloaded_size_sum=16748
querier-764d7bf7f7-x2cb8 querier ts=2023-06-06T23:02:23.414267429Z caller=spanlogger.go:87 level=info msg="store gateway series request stats" instance=10.0.66.121:9095 queryable_chunk_bytes_fetched=2823750 queryable_data_bytes_fetched=3108190 blocks_queried=1 series_merged_count=2500 chunks_merged_count=2500 postings_touched=10 postings_touched_size_sum=2806 postings_to_fetch=0 postings_fetched=0 postings_fetch_count=0 postings_fetched_size_sum=0 series_touched=50000 series_touched_size_sum=6578711 series_fetched=0 series_fetch_count=0 series_fetched_size_sum=0 chunks_touched=2500 chunks_touched_size_sum=2747250 chunks_fetched=2500 chunks_fetch_count=11 chunks_fetched_size_sum=66166754 data_downloaded_size_sum=72748271

What I found is that the Store Gateway usually overfetches chunks and series. In the first example, the total downloaded size is 16748 bytes, of which 16000 bytes are chunk data. However, the chunk data actually touched is only 164 bytes. The overfetched data is simply discarded afterwards. The same problem shows up in the second log line: the chunk data actually touched is only about 2.7MB (2747250 bytes), but the Store Gateway fetched about 66MB (66166754 bytes) of chunk data.
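
To make the scale of the overfetch concrete, here is a minimal sketch that computes the fetched-to-touched ratio and the discarded bytes from the `chunks_touched_size_sum` and `chunks_fetched_size_sum` values in the two log lines above. The `fetchStats` type and helper names are made up for illustration and are not part of Thanos or Cortex.

```go
package main

import "fmt"

// fetchStats is a hypothetical holder for two counters from the log lines,
// both in bytes.
type fetchStats struct {
	touched uint64 // chunks_touched_size_sum
	fetched uint64 // chunks_fetched_size_sum
}

// overfetchRatio returns fetched/touched, or 0 when nothing was touched.
func overfetchRatio(s fetchStats) float64 {
	if s.touched == 0 {
		return 0
	}
	return float64(s.fetched) / float64(s.touched)
}

// wastedBytes returns the bytes that were downloaded but never used.
func wastedBytes(s fetchStats) uint64 {
	if s.fetched < s.touched {
		return 0
	}
	return s.fetched - s.touched
}

func main() {
	// Values copied from the two log lines in this issue.
	examples := []fetchStats{
		{touched: 164, fetched: 16000},
		{touched: 2747250, fetched: 66166754},
	}
	for _, s := range examples {
		fmt.Printf("touched=%d fetched=%d ratio=%.1fx wasted=%d\n",
			s.touched, s.fetched, overfetchRatio(s), wastedBytes(s))
	}
}
```

In both cases the large majority of the downloaded chunk bytes are thrown away.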

The issue here is that there is no way to know in advance how big a chunk or series is, so the Store Gateway estimates the size: https://github.com/thanos-io/thanos/blob/main/pkg/store/bucket.go#L77 The estimated chunk size is 16K and the estimated series size is 64KB. These values might make sense in some situations, but in our real production blocks the actual sizes are much lower than the estimates, which means we are wasting a lot of resources fetching unused data.

Describe the solution you'd like

There are several ways to solve this:

  1. Include the series and chunk sizes in the index header so that we know the exact data length we need to fetch from the object store.
  2. Make the estimated chunk and series sizes configurable so we can tune them based on real data.
  3. https://github.com/thanos-io/thanos/blob/main/pkg/block/index.go#L212 For blocks produced by the compactor, GatherIndexHealthStats checks the index file and collects some stats. Currently it collects the average, min, and max chunk size. I think we can collect series size as well. The collected stats can be written to the meta.json file or to another file associated with that block in object storage. At query time, the max series size and max chunk size can be loaded and used in place of the hardcoded estimates.
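
Options 2 and 3 both boil down to replacing the hardcoded estimates with a better number when one is available. A rough sketch of that selection logic follows. The `blockSizeStats` type and `fetchSizes` function are hypothetical, not the Thanos API; the defaults are the 16K and 64KB values quoted above (16000 matches the single-chunk fetch size in the first log line).

```go
package main

import "fmt"

const (
	// Defaults quoted in this issue: 16K per chunk, 64KB per series.
	defaultEstimatedChunkSize  = 16000
	defaultEstimatedSeriesSize = 64 * 1024
)

// blockSizeStats is a hypothetical per-block record of observed maximum sizes,
// for example gathered when the compactor verifies the index.
type blockSizeStats struct {
	MaxChunkSize  int64
	MaxSeriesSize int64
}

// fetchSizes returns the chunk and series read sizes to use for a block.
// It falls back to the static estimates when stats are missing or zero.
func fetchSizes(stats *blockSizeStats) (chunk, series int64) {
	chunk, series = defaultEstimatedChunkSize, defaultEstimatedSeriesSize
	if stats == nil {
		return chunk, series
	}
	if stats.MaxChunkSize > 0 {
		chunk = stats.MaxChunkSize
	}
	if stats.MaxSeriesSize > 0 {
		series = stats.MaxSeriesSize
	}
	return chunk, series
}

func main() {
	// Block without stats: the static estimates are used.
	c, s := fetchSizes(nil)
	fmt.Println("no stats:", c, s)

	// Block with stats (made-up numbers for illustration).
	c, s = fetchSizes(&blockSizeStats{MaxChunkSize: 1200, MaxSeriesSize: 300})
	fmt.Println("with stats:", c, s)
}
```

Falling back to the defaults keeps older blocks, which have no recorded stats, working unchanged.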

yeya24 commented Jun 7, 2023

Another thing to note is that series size usually varies with the block range. Thanos can have large blocks with a 14d time range, while in Cortex we usually use a 1d block range.
This makes each series in a Cortex block smaller, because it contains fewer chunks than a series in a Thanos block does. So I think it makes more sense not to hardcode these values.


yeya24 commented Jun 12, 2023

Though we have #6426, it is still not perfect because we need to set a static value.

What I am thinking now is to include the chunk size and series size stats in the meta.json file of each block.
After compaction, we verify the index for the block, and at that point we can attach this information to the meta.json file.

The values we mainly need are MaxChunkSize and MaxSeriesSize.
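
As a rough illustration of this idea, the sketch below computes the two maxima from a list of observed sizes and round-trips them through a JSON document. The struct layout and JSON field names (`index_stats`, `max_chunk_size`, `max_series_size`) are assumptions made for this example and may not match what the real meta.json uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// indexStats holds the two values named above. The JSON field names are
// hypothetical.
type indexStats struct {
	MaxChunkSize  int64 `json:"max_chunk_size,omitempty"`
	MaxSeriesSize int64 `json:"max_series_size,omitempty"`
}

// blockMeta is a heavily simplified stand-in for a block's meta.json.
type blockMeta struct {
	ULID       string     `json:"ulid"`
	IndexStats indexStats `json:"index_stats"`
}

// maxSize returns the largest value in sizes, or 0 for an empty slice.
func maxSize(sizes []int64) int64 {
	var m int64
	for _, s := range sizes {
		if s > m {
			m = s
		}
	}
	return m
}

// parseMeta decodes a meta document. Missing stats decode as zero, which a
// reader can treat as "unknown, use the static estimate".
func parseMeta(data []byte) (blockMeta, error) {
	var m blockMeta
	err := json.Unmarshal(data, &m)
	return m, err
}

func main() {
	// Made-up sizes, as if collected while verifying the index after compaction.
	meta := blockMeta{
		ULID: "example-block",
		IndexStats: indexStats{
			MaxChunkSize:  maxSize([]int64{164, 1100, 980}),
			MaxSeriesSize: maxSize([]int64{120, 240}),
		},
	}
	out, err := json.Marshal(meta)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))

	back, err := parseMeta(out)
	if err != nil {
		panic(err)
	}
	fmt.Println(back.IndexStats.MaxChunkSize, back.IndexStats.MaxSeriesSize)
}
```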

yeya24 closed this as completed Jun 12, 2023
yeya24 reopened this Jun 12, 2023

yeya24 commented Jun 16, 2023

I think we can close this one with the latest changes to the meta file.

yeya24 closed this as completed Jun 16, 2023
yeya24 reopened this Jun 18, 2023

yeya24 commented Jun 18, 2023

I think the issue of data overfetching is still unresolved.
We have the gap-based partitioner https://github.com/thanos-io/thanos/blob/main/pkg/store/bucket.go#L2845 and its gap size is very big (512KB) so that small get-range requests are merged into one large get-range request, which reduces API calls.

The gap size might be too big in some situations. In particular, if the requested data is relatively sparse, we overfetch a lot.

Let's say we want to fetch [10, 100] and [500KB, 501KB] for the first chunk. In this case, the partitioner makes us fetch the whole [10, 501KB] range, and about 500KB of that is discarded because it contains nothing we need.

One way to reduce overfetching is to reduce the partitioner gap size. For example, with a gap size of 256KB the two ranges would not be merged. The downside of a smaller gap size is more requests going to the object store.

Things are worse with the caching bucket. The caching bucket caches subranges, with 16KiB as the subrange size. I think the 500KB of data in the middle will also end up cached in memcached, even though we may never need to read it.
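
The trade-off can be sketched with a simplified gap-based merge. This is an illustration of the technique written for this comment, not the Thanos implementation; the `byteRange` type and function names are made up. Ranges are half-open byte offsets and must be sorted by start.

```go
package main

import "fmt"

// byteRange is a half-open byte range [start, end).
type byteRange struct{ start, end uint64 }

// partition merges sorted ranges: a range joins the previous part when the
// gap between them is at most maxGap bytes.
func partition(ranges []byteRange, maxGap uint64) []byteRange {
	var parts []byteRange
	for _, r := range ranges {
		if n := len(parts); n > 0 && parts[n-1].end+maxGap >= r.start {
			if r.end > parts[n-1].end {
				parts[n-1].end = r.end
			}
			continue
		}
		parts = append(parts, r)
	}
	return parts
}

// totalBytes sums the length of all parts, i.e. the bytes actually requested.
func totalBytes(parts []byteRange) uint64 {
	var n uint64
	for _, p := range parts {
		n += p.end - p.start
	}
	return n
}

func main() {
	// The two ranges from the example: [10, 100] and [500KB, 501KB].
	wanted := []byteRange{{10, 100}, {500 * 1024, 501 * 1024}}

	for _, gap := range []uint64{512 * 1024, 256 * 1024} {
		parts := partition(wanted, gap)
		fmt.Printf("gap=%d requests=%d bytes=%d\n", gap, len(parts), totalBytes(parts))
	}
}
```

With the 512KB gap the sketch issues one request covering roughly 500KB; with the 256KB gap it issues two requests covering about 1KB in total.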
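
To get a feel for the possible cache impact, the sketch below counts how many fixed-size subranges a byte range overlaps, using the 16KiB subrange size mentioned above. Whether the caching bucket really stores every one of these subranges is my assumption, as noted; the helper is hypothetical and only does the arithmetic.

```go
package main

import "fmt"

// subrangeSize is the 16KiB subrange size mentioned in this comment.
const subrangeSize = 16 * 1024

// subrangesCovered returns how many size-aligned subranges overlap the
// half-open byte range [start, end).
func subrangesCovered(start, end, size uint64) uint64 {
	if end <= start || size == 0 {
		return 0
	}
	first := start / size
	last := (end - 1) / size
	return last - first + 1
}

func main() {
	// Merged request from the example: [10, 501KB).
	merged := subrangesCovered(10, 501*1024, subrangeSize)

	// The two ranges we actually need, counted separately.
	needed := subrangesCovered(10, 100, subrangeSize) +
		subrangesCovered(500*1024, 501*1024, subrangeSize)

	fmt.Printf("merged request touches %d subranges, needed data touches %d\n", merged, needed)
}
```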
