[v23.2.x] cloud_storage: use remote index in cloud timequery #13105

Merged
merged 5 commits into from
Aug 31, 2023



@VladLazar VladLazar commented Aug 30, 2023

Backport of #13011

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Improvements

  • Timequeries (i.e. ListOffsets requests) that land in the cloud log now use an index to speed up the search
    and reduce the number of hydrated bytes required to serve the query. On average, a timequery now downloads about a quarter of the data it did before (with the default segment and chunk sizes).

Vlad Lazar added 5 commits August 30, 2023 10:58
A lookup by timestamp is added to the remote index. It has the same
semantics as the other lookup methods. If the index does not include the
time index (i.e. it was created with serde version 1), the search comes
up empty.

(cherry picked from commit c1d0314)
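
The lookup described in this commit could be sketched roughly as follows. This is a minimal Python illustration, not the actual implementation: `IndexEntry`, `RemoteIndex`, `find_timestamp`, and the field names are hypothetical. The "strictly smaller timestamp" rule is an assumption based on the wording of the next commit message.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class IndexEntry:
    # Hypothetical entry layout: a sampled offset, its byte position in the
    # segment file, and the timestamp (ms) recorded at that sample.
    kafka_offset: int
    file_pos: int
    timestamp_ms: int


@dataclass
class RemoteIndex:
    # Entries are assumed to be sorted by non-decreasing timestamp.
    entries: List[IndexEntry] = field(default_factory=list)
    # Assumption: an index created with serde version 1 has no time index.
    has_time_index: bool = True

    def find_timestamp(self, target_ms: int) -> Optional[IndexEntry]:
        """Return the last entry whose timestamp is strictly smaller than
        target_ms, or None if there is no such entry or no time index."""
        if not self.has_time_index or not self.entries:
            return None
        timestamps = [e.timestamp_ms for e in self.entries]
        # bisect_left returns how many timestamps are < target_ms.
        i = bisect_left(timestamps, target_ms)
        return self.entries[i - 1] if i > 0 else None
```

Returning `None` for a v1 index mirrors the "search comes up empty" behaviour, so the caller can fall back to scanning from the start.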
This commit updates the timequery read path to skip up to the first
index entry with a timestamp smaller than the searched one. The result
is that timequeries will hydrate/materialize a maximum of two chunks
(two because the chunk boundaries don't always line up with the entries
in the index).

If the index is not present, or was originally created in v1, the search
starts from the first chunk as it did previously.

(cherry picked from commit da8533b)
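
To illustrate why starting from an index entry bounds the number of hydrated chunks, here is a small Python sketch. The function name, the byte positions, and the 16 MiB chunk size used in the usage lines are assumptions chosen for illustration, not values taken from this PR.

```python
from typing import Optional


def chunks_to_hydrate(index_pos: Optional[int], target_pos: int, chunk_size: int) -> range:
    """Chunk numbers a forward scan must hydrate to reach target_pos.

    index_pos is the byte position returned by the index lookup, or None
    when the index is missing or has no timestamps; in that case the scan
    starts from the first chunk.
    """
    start = 0 if index_pos is None else index_pos
    first_chunk = start // chunk_size
    last_chunk = target_pos // chunk_size
    return range(first_chunk, last_chunk + 1)


MiB = 1 << 20
chunk_size = 16 * MiB                    # hypothetical chunk size
target_pos = 48 * MiB + 512 * 1024       # where the wanted record lives
index_pos = 47 * MiB                     # nearest preceding index entry

with_index = list(chunks_to_hydrate(index_pos, target_pos, chunk_size))
without_index = list(chunks_to_hydrate(None, target_pos, chunk_size))
```

In this example the index entry falls in chunk 2 and the record in chunk 3, so two chunks are hydrated; without the index the scan touches chunks 0 through 3. As long as consecutive index entries are closer together than one chunk, the span from entry to record crosses at most one chunk boundary.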
A shard-level metric is added to track the total number of chunks that
were hydrated (i.e. downloaded).

(cherry picked from commit b78d711)
This commit makes a couple of changes to the timequery tests:
1. Run timequery on more offsets (10 for each of the 12 segments)
2. Check that a maximum of two chunks are downloaded by any given
   timequery
3. Use the admin api to get the precise boundary between the cloud and
   local log. Previously, it was estimated based on record size.

(cherry picked from commit 6d507cd)
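
The "maximum of two chunks" check can be pictured as a before/after comparison of the hydrated-chunks metric. The helper below is a hedged sketch only: `run_query` and `read_hydrated_chunks` are hypothetical callables standing in for whatever the real test uses to issue a timequery and read the metric.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def assert_max_chunks_per_query(
    run_query: Callable[[], T],
    read_hydrated_chunks: Callable[[], int],
    limit: int = 2,
) -> T:
    """Run one query and assert it hydrated at most `limit` chunks.

    The counter is assumed to be monotonically increasing, so the
    difference between two reads is the cost of the query in between.
    """
    before = read_hydrated_chunks()
    result = run_query()
    hydrated = read_hydrated_chunks() - before
    assert hydrated <= limit, f"timequery hydrated {hydrated} chunks, limit is {limit}"
    return result
```

Measuring a delta rather than an absolute value keeps each query's check independent of the queries that ran before it.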
This commit refactors the handling of the index search result. If a new
result type is introduced, its author will be reminded to handle it by
the assertions.

(cherry picked from commit de5aa53)
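
The general pattern behind this refactor, handling every known result kind explicitly and asserting on anything else, might look like the sketch below. The enum members and the function are hypothetical and do not reflect the real result type in the code base.

```python
from enum import Enum, auto


class IndexSearchResult(Enum):
    # Hypothetical result kinds, for illustration only.
    FOUND = auto()
    NOT_FOUND = auto()
    NO_TIME_INDEX = auto()


def scan_start_position(kind: IndexSearchResult, indexed_pos: int = 0) -> int:
    """Pick the byte position at which the segment scan should begin."""
    if kind is IndexSearchResult.FOUND:
        return indexed_pos
    if kind in (IndexSearchResult.NOT_FOUND, IndexSearchResult.NO_TIME_INDEX):
        # No usable index entry: fall back to the start of the segment.
        return 0
    # Reached only if a new result kind is added without a handler above.
    raise AssertionError(f"unhandled index search result: {kind!r}")
```

Failing loudly on an unknown kind means a newly added result type cannot silently fall through to a default branch.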
@VladLazar
Contributor Author

Will merge after #13103 merges as that one is big and would be annoying to rebase.

@VladLazar
Contributor Author

/ci-repeat

@VladLazar
Contributor Author

Triggered ci-repeat for a rebase

@VladLazar VladLazar merged commit 0f28eb9 into redpanda-data:v23.2.x Aug 31, 2023
@BenPope BenPope added this to the v23.2.8 milestone Sep 1, 2023