cloud_storage: enable prefetching chunks #10950
Conversation
Force-pushed 227bde3 to 81cb604
/ci-repeat 10
/ci-repeat 10
Force-pushed a474c52 to 8ea1ca4
/ci-repeat 5
Looks good. Needs a rebase though.
@@ -1573,6 +1573,12 @@ configuration::configuration()
      {model::cloud_storage_chunk_eviction_strategy::eager,
       model::cloud_storage_chunk_eviction_strategy::capped,
       model::cloud_storage_chunk_eviction_strategy::predictive})
    , cloud_storage_chunk_prefetch(
nit: perhaps we could bound this property between [0, segment_size / chunk_size]? The lower limit (0) can be enforced via bounded_property, but the upper limit is dynamic and has to be done somewhere else.
I'll try to address this in a future PR
+1 to the lower bound, but I'm curious what good the upper bound would do -- in the worst case, a poorly configured cluster would just download the entire segment in segment_size / chunk_size pieces anyway, right?
Also, it may be challenging to reconcile this with the fact that each topic may have a different segment size.
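For readers following the thread, the suggested bound could look something like this minimal sketch (illustrative names, not code from this PR): clamp the configured prefetch count to the number of chunks a segment can actually contain.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: a segment of segment_size bytes holds at most
// ceil(segment_size / chunk_size) chunks, so any larger prefetch value
// is equivalent to "download the whole segment".
static size_t clamp_prefetch(
  size_t configured, size_t segment_size, size_t chunk_size) {
    const size_t max_chunks = (segment_size + chunk_size - 1) / chunk_size;
    return std::min(configured, max_chunks);
}
```

As the comment below notes, the upper bound is dynamic because segment size is a per-topic property, so a static clamp like this would have to be applied at read time rather than at configuration time.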
// makes an HTTP GET call and E is also prefetched. So a total of two calls
// are made for the five chunks (ignoring any cache evictions during the
// process).
if (const auto status = co_await _cache.is_cached(path_to_start);
Each is_cached call maps to an access sys call. We should be able to answer these queries by using the access time tracker in cache_service. I'll try to include this in #10855.
Force-pushed f34de26 to 8245b13
Looks good, but raced with #11287 which added cache lookups for chunks (which this PR also adds).
ss::future<ss::temporary_buffer<char>> get() override {
    return _stream.read_up_to(_upto);
}
Can read_up_to return with a shorter read than requested? (Looking at the impl, I don't think so.) If it can, _upto should probably be updated in here.
It actually looks like it can return a shorter buffer, nice catch:
} else if (_buf.size() <= n) {
// easy case: steal buffer, return to caller
return make_ready_future<tmp_buf>(std::move(_buf));
}
Fixed
I wonder if read_exactly is a better fit here.
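For context on the read_up_to vs read_exactly distinction: exact-read semantics can be emulated over an API that may return short reads by looping until the requested count is reached or EOF. A sketch using std::istream as a stand-in for the seastar stream (not the actual seastar API):

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Sketch: gather `upto` bytes even if the underlying stream returns short
// reads, stopping early only at EOF. A subsequent call on a drained stream
// returns an empty string, which is how EOF is signalled.
static std::string read_fully(std::istream& in, size_t upto) {
    std::string out(upto, '\0');
    size_t got = 0;
    while (got < upto) {
        in.read(out.data() + got, static_cast<std::streamsize>(upto - got));
        const auto n = static_cast<size_t>(in.gcount());
        if (n == 0) {
            break; // EOF: return whatever was gathered
        }
        got += n;
    }
    out.resize(got);
    return out;
}
```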
Force-pushed 85668de to 68ef97e
auto it = chunks.find(start);
auto n_it = std::next(it);

for (size_t i = 0; i < prefetch + 1 && it != chunks.end(); ++i) {
    auto start = it->first;
    std::optional<chunk_start_offset_t> end = std::nullopt;
    if (n_it != chunks.end()) {
        end = n_it->first - 1;
    }
    _chunks[start] = end;
    if (n_it == chunks.end()) {
        break;
    }
    it++;
    n_it++;
}
nit: this seems a bit non-trivial. Could you comment what this is doing, either here or in the header? Also, could you mention what it means for an end offset to be nullopt?
Added
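The loop under discussion can be sketched in isolation like so (illustrative names; a std::map stands in for the chunk index). It collects up to prefetch + 1 [start, end] pairs; a nullopt end marks the final chunk of the segment, whose end offset is only known from the segment size.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <utility>
#include <vector>

using offset_t = uint64_t;

// Sketch of the range-building logic: walk the chunk-start map from `start`,
// recording each chunk's inclusive end as one byte before the next chunk's
// start. The last chunk in the map gets nullopt (end unknown from the map).
static std::vector<std::pair<offset_t, std::optional<offset_t>>> build_range(
  const std::map<offset_t, int>& chunks, offset_t start, size_t prefetch) {
    std::vector<std::pair<offset_t, std::optional<offset_t>>> out;
    auto it = chunks.find(start);
    for (size_t i = 0; i < prefetch + 1 && it != chunks.end(); ++i, ++it) {
        auto next = std::next(it);
        std::optional<offset_t> end;
        if (next != chunks.end()) {
            end = next->first - 1; // ends just before the next chunk starts
        }
        out.emplace_back(it->first, end);
    }
    return out;
}
```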
class segment_chunk_range {
public:
I might be missing something, but given the public API of this, could this be a deque of pairs?
Do you mean for storing the data inside this class (instead of a map), or for replacing this class entirely?
If the former: probably yes, it can be a deque of pairs, and since we only ever iterate over it and never look up keys, it should be marginally faster. But I don't see that as enough of a speedup to change it (traversal through a tree should still be pretty fast).
If the latter: this class provides some convenience methods for calculating the bounds of the range, to decide how much space to reserve in cache etc. It could maybe be done by free functions accepting the deque of pairs, but I prefer a class.
ss::future<ss::temporary_buffer<char>> get() override {
    const auto buf = co_await _stream.read_up_to(_upto);
    if (buf.size() < _upto) {
        _upto -= buf.size();
    }
    co_return buf;
}
What prevents _upto from underflowing? Is it possible for us to read more than the stream size in the first call? If so, would it make sense for subsequent calls to check _upto == 0 and return empty? Or is this only expected to be called once?
I have switched to read_exactly; I think it is more appropriate here. It will only return less than _upto bytes in case of EOF, in which case the next call to our get will return an empty buffer, signalling EOF correctly.
src/v/cloud_storage/remote_segment.h
@@ -299,6 +299,19 @@ class remote_segment final {

    std::optional<segment_chunks> _chunks_api;
    std::optional<offset_index::coarse_index_t> _coarse_index;

    class consume_stream {
nit: can we make the name a bit more descriptive? Maybe chunk_caching_stream_consumer or something?
changed
chunk_start_offset_t start, std::optional<chunk_start_offset_t> end) {
remote_segment::consume_stream::consume_stream(
  remote_segment& remote_segment, segment_chunk_range range)
  : _segment{remote_segment}
Seems out of place?
Force-pushed 988b7fa to c210f2b
Looks good to me, but the current version doesn't build.
While reading through this again I realised there's a tradeoff between prefetch and the cache space reservation. If the cache is nearly full, then prefetching is actually counter-productive to the overall system. Maybe we could check how full the cache is before creating the chunk_data_source_impl and override the prefetch to 0 if it's over 90% utilised or so.
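The utilisation gate suggested here might be sketched like this (assumed names and a 90% example threshold, not the actual implementation):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: disable prefetch when the cache is nearly full,
// since prefetched chunks would immediately add eviction pressure.
static size_t effective_prefetch(
  size_t configured, uint64_t cache_used, uint64_t cache_capacity) {
    const double utilisation = cache_capacity == 0
                                 ? 1.0
                                 : static_cast<double>(cache_used)
                                     / static_cast<double>(cache_capacity);
    // 0.9 is the example threshold floated in the review, not a real default.
    return utilisation > 0.9 ? 0 : configured;
}
```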
Force-pushed 2b5bafb to a322dc0
/ci-repeat
self._set_params_and_start_redpanda(
-    cloud_storage_cache_chunk_size=1048576 * 8)
+    cloud_storage_cache_chunk_size=1048576 * 8,
+    cloud_storage_chunk_prefetch=prefetch)
Would be nice to check that prefetching is actually happening.
While trying to add a test for this, I ran into a bug in the coarse index generation code. Creating a separate PR to fix that, once that is merged I will add the test to this PR.
draft PR #11705 for fixing the bug
added the test tests.cloud_storage_chunk_read_path_test.CloudStorageChunkReadTest.test_prefetch_chunks now; it will require a few ci-repeats to see if its parameters are correct.
LGTM, one small issue
  , _range{std::move(range)} {}

ss::future<uint64_t>
operator()(uint64_t size, ss::input_stream<char> stream) {
the stream is not closed
fixed
Force-pushed 5691a15 to 150d4d0
/ci-repeat
Controls the number of chunks prefetched when a chunk is downloaded. A count of 0 means no extra chunk will be downloaded. A count of 1 means for every downloaded chunk, the next chunk is also prefetched.
A chunk range covers a series of contiguous chunks, which can be used to prefetch. The idea when prefetching is that we make a single HTTP call for a byte range which contains more than one chunk, but write the data to disk as separate chunk files. The range utility introduced here takes a start offset, a prefetch count, and a map of chunk starts, and enables iteration over the chunks until the prefetch is satisfied, also allowing access to the last offset, which is used to download the byte range.
A bounded stream wraps an input stream and only consumes up to a certain offset from the underlying input stream. It is intended to be used with chunk prefetch, where a single input stream is used to read the response from an HTTP GET call, and bounded streams are created on it to read up to chunk boundaries to be written to disk.
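The bounded-stream idea can be sketched with a plain std::istream stand-in (illustrative, not the seastar API): each reader consumes at most its bound, so several readers in sequence can split one response body at chunk boundaries.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Sketch: wrap a shared stream and never consume past `remaining` bytes.
// An empty return means this reader's bound is exhausted (or EOF).
struct bounded_reader {
    std::istream& in;
    size_t remaining;

    std::string next(size_t want) {
        const size_t n = std::min(want, remaining);
        std::string buf(n, '\0');
        in.read(buf.data(), static_cast<std::streamsize>(n));
        buf.resize(static_cast<size_t>(in.gcount()));
        remaining -= buf.size();
        return buf;
    }
};
```

One underlying stream can then be handed to a sequence of such readers, one per chunk, leaving the stream positioned at the next chunk boundary after each reader drains.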
The remote object expects a consumer which will take an input stream and do something with it (usually write the response to disk). The consumer is supposed to be re-entrant since it can be called multiple times by remote. This change adds a stream consumer which accepts an http response stream and a chunk range, and creates chunk files to put in cache from the stream containing potentially multiple contiguous chunks. It effectively breaks up the stream into chunks before putting the files in cache.
Allow overriding prefetch size per reader. The main purpose of this is to allow disabling prefetch (by setting it to zero) for timequeries.
When a segment is hydrated, we are guaranteed to be in legacy mode, which means that we were not able to download an index from cloud storage. The change cleans up the code path in segment hydration, so that if a segment is downloaded, an index is always created on the fly from the segment data.
Chunk prefetching allows fetching more than one segment chunk at once for a performance improvement. The HTTP call to fetch data includes a single byte range covering the original chunk plus the prefetch, and the response is written to disk as individual chunk files.
Note on potential inefficiency/wasted effort: with these changes, consider a set of contiguous chunks A,B,C,D,E,F. With a prefetch of 4, when A is downloaded, B,C,D,E are also downloaded. These are, however, not hydrated or materialized; the chunk files are kept on disk. When/if a request for hydrating B is issued and it is still on disk, it is directly materialized. If not, it is re-downloaded and then materialized. In the event that B is deleted for some reason (cache eviction) while C,D,E are still on disk, when B is re-hydrated, C,D,E will also be downloaded again, and the chunk files will be overwritten.
With some extra code this could be avoided by only selectively downloading missing chunks from the prefetch list; this implementation takes the simpler approach and just downloads all chunks from B onwards again, assuming that the next few chunks are also absent.
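The selective-download alternative described above could be sketched as follows (illustrative names; on_disk stands in for a cache lookup, not a real API in this PR):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Sketch: filter a prefetch range down to only the chunk start offsets
// that are not already present on disk, so cached chunks are not refetched.
static std::vector<uint64_t> missing_chunks(
  const std::vector<uint64_t>& range,
  const std::function<bool(uint64_t)>& on_disk) {
    std::vector<uint64_t> need;
    for (auto off : range) {
        if (!on_disk(off)) {
            need.push_back(off); // only these would be downloaded
        }
    }
    return need;
}
```

The complication, as noted, is that the surviving chunks may not be contiguous, so the single byte-range GET would have to be split into several requests.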
Fixes #11028
Backports Required
Release Notes