cloud_storage: cache efficiency improvements #10855

VladLazar · 2023-05-18T13:14:28Z

This PR makes a couple of efficiency improvements to the cloud storage cache trimming process:

Stop tracking segment indices and transaction manifests in the access time tracker.
Stop including segment indices and transaction manifests in the report generated by the cache walk. Note that they are still included in the reported total size for the cache.

Both changes rely on the fact that segment indices and transaction manifests are cleaned up together with their respective segments or chunks, so they do not need to be tracked individually.

Fixes #10719

Backports Required

Release Notes

none

This commit changes the `access_time_tracker` used by the cloud storage cache to ignore transaction manifest and index files. Since these file types are removed along with their corresponding segments or chunks, tracking them has no value. Note that removing these file types from the tracker is still allowed: the tracker stores only hashes of the paths, not the paths themselves, so existing entries for them cannot be dropped upon deserialisation.

src/v/cloud_storage/recursive_directory_walker.cc

jcsp · 2023-05-18T13:31:12Z

src/v/cloud_storage/access_time_tracker.cc

@@ -155,9 +163,14 @@ ss::future<> access_time_tracker::read(ss::input_stream<char>& in) {

 void access_time_tracker::add_timestamp(
  std::string_view key, std::chrono::system_clock::time_point ts) {
+    if (!should_track(key)) {


Should we also check in the remove_timestamp path, to avoid doing an O(log N) search for things we know we never added to begin with? Or does the cache just never call remove_timestamp for non-segment files?

VladLazar · 2023-05-19

I allowed removal of tx and index keys on purpose. Since we only serialise the hash, we cannot drop them when deserialising. When the tracker is hydrated, it will contain such keys, which are only ever removed by calls to access_time_tracker::trim. The idea is that on the start-up clean-up we do not apply the filtering, and so trim away all of the entries we don't need. For subsequent clean-ups, the filtering is applied.
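The asymmetry described above (filter on add, no filter on remove) can be sketched as follows. This is a minimal illustration, not the code from the PR: `tracker_sketch`, `load_hashed`, and the use of `std::hash` are assumptions made for the example, and the `.tx` / `.index` suffixes are taken from the filter shown in the cache_service.cc hunk.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <map>
#include <string_view>

// Hypothetical stand-in for access_time_tracker, for illustration only.
class tracker_sketch {
public:
    using time_point = std::chrono::system_clock::time_point;

    // Index files and transaction manifests are not tracked.
    static bool should_track(std::string_view key) {
        return !(has_suffix(key, ".tx") || has_suffix(key, ".index"));
    }

    // Filtered: ancillary files never enter the table through this path.
    void add_timestamp(std::string_view key, time_point ts) {
        if (!should_track(key)) {
            return;
        }
        _table[hash(key)] = ts;
    }

    // Unfiltered on purpose: a hydrated tracker may still hold hashes of
    // .tx/.index paths, and those entries must remain removable.
    void remove_timestamp(std::string_view key) { _table.erase(hash(key)); }

    // Models hydration: only the hash is available, so the original path
    // cannot be inspected and no filtering is possible.
    void load_hashed(std::size_t key_hash, time_point ts) {
        _table[key_hash] = ts;
    }

    std::size_t size() const { return _table.size(); }

    // Assumption: std::hash stands in for the hash used by the real tracker.
    static std::size_t hash(std::string_view key) {
        return std::hash<std::string_view>{}(key);
    }

private:
    static bool has_suffix(std::string_view s, std::string_view suffix) {
        return s.size() >= suffix.size()
               && s.substr(s.size() - suffix.size()) == suffix;
    }

    std::map<std::size_t, time_point> _table;
};

// Usage: new .tx entries are ignored, stale hydrated ones can be removed.
inline void tracker_sketch_example() {
    tracker_sketch t;
    const auto now = std::chrono::system_clock::now();
    t.add_timestamp("0-1-v1.log", now);
    t.add_timestamp("0-1-v1.log.tx", now); // ignored by should_track
    assert(t.size() == 1);
    t.load_hashed(tracker_sketch::hash("0-1-v1.log.tx"), now); // stale entry
    assert(t.size() == 2);
    t.remove_timestamp("0-1-v1.log.tx"); // removal is not filtered
    assert(t.size() == 1);
}
```

Filtering in `remove_timestamp` as well would save a lookup, but it would also leave hydrated `.tx` and `.index` hashes with no way out of the table.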

This commit updates the code that walks the cloud storage cache to ignore the segment indices and transaction manifests as they are ignored by the subsequent code that performs the deletions. Note that their sizes are still included in the running total.
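A rough sketch of that behaviour, assuming an in-memory list of files instead of a real directory walk: every file contributes to the running total, but only files accepted by the filter are reported as deletion candidates. `file_entry`, `walk_result_sketch`, `walk_sketch`, and `is_trim_candidate` are hypothetical names; the result fields mirror the names bound in the cache_service.cc hunk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical in-memory stand-in for a directory entry.
struct file_entry {
    std::string path;
    std::uint64_t size;
};

using filter_type = std::function<bool(std::string_view)>;

struct walk_result_sketch {
    std::uint64_t cache_size{0};
    std::size_t filtered_out_files{0};
    std::vector<file_entry> candidates_for_deletion;
};

inline bool has_suffix(std::string_view s, std::string_view suffix) {
    return s.size() >= suffix.size()
           && s.substr(s.size() - suffix.size()) == suffix;
}

// Accepts everything except transaction manifests and segment indices.
inline bool is_trim_candidate(std::string_view path) {
    return !(has_suffix(path, ".tx") || has_suffix(path, ".index"));
}

// Passing no filter reports every file as a candidate.
inline walk_result_sketch walk_sketch(
  const std::vector<file_entry>& files,
  const std::optional<filter_type>& filter = std::nullopt) {
    walk_result_sketch result;
    for (const auto& f : files) {
        // Sizes are always counted, even for filtered-out files.
        result.cache_size += f.size;
        if (filter && !(*filter)(f.path)) {
            ++result.filtered_out_files;
            continue;
        }
        result.candidates_for_deletion.push_back(f);
    }
    return result;
}

// Usage: the total covers all three files, the candidate list only the log.
inline void walk_sketch_example() {
    const std::vector<file_entry> files{
      {"0-1-v1.log", 100}, {"0-1-v1.log.tx", 10}, {"0-1-v1.log.index", 5}};
    const auto r = walk_sketch(files, filter_type{is_trim_candidate});
    assert(r.cache_size == 115);
    assert(r.filtered_out_files == 2);
    assert(r.candidates_for_deletion.size() == 1);
}
```

Calling `walk_sketch(files)` without a filter models an unfiltered walk, in which the `.tx` and `.index` files show up as candidates too.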

VladLazar · 2023-05-19T16:10:52Z

Failures are:

CI Failure Software caused connection abort in AvailabilityTests.test_recovery_after_catastrophic_failure #10602
CI Failure (Controller logs are not the same) in ConfigurationUpdateTest.test_two_nodes_update #10867
CI Failure (Consumer failed to consume up to offsets) in PartitionMoveInterruption.test_cancellations_interrupted_with_restarts #10674

VladLazar · 2023-05-24T11:05:14Z

/ci-repeat

andijcr · 2023-05-24T13:43:25Z

src/v/cloud_storage/recursive_directory_walker.cc

@@ -72,16 +79,20 @@ struct walk_accumulator {
    const access_time_tracker& tracker;
    bool seen_dentries{false};
    std::deque<ss::sstring> dirlist;
+    std::optional<recursive_directory_walker::filter_type> filter;


nit: maybe a default value for filter like [](auto&&){ return true; } could simplify the code. It could be a default parameter for the constructor.
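One way to read this suggestion, sketched under assumptions: instead of holding a `std::optional` filter and checking it at every use, the accumulator takes an accept-everything filter as a default constructor argument. `accumulator_sketch` and `accepts` are hypothetical names, and the `filter_type` alias here is a guess at the shape of `recursive_directory_walker::filter_type`.

```cpp
#include <cassert>
#include <functional>
#include <string_view>
#include <utility>

// Assumed shape of the walker's filter type.
using filter_type = std::function<bool(std::string_view)>;

// Hypothetical, reduced version of walk_accumulator.
struct accumulator_sketch {
    // The default filter accepts every path, so callers that do not care
    // about filtering pass nothing and no optional check is needed.
    explicit accumulator_sketch(
      filter_type f = [](std::string_view) { return true; })
      : filter(std::move(f)) {}

    bool accepts(std::string_view path) const { return filter(path); }

    filter_type filter;
};

// Usage: default construction accepts all paths, a custom filter narrows it.
inline void accumulator_sketch_example() {
    accumulator_sketch accept_all;
    assert(accept_all.accepts("0-1-v1.log.tx"));

    accumulator_sketch no_tx([](std::string_view p) {
        return !(p.size() >= 3 && p.substr(p.size() - 3) == ".tx");
    });
    assert(!no_tx.accepts("0-1-v1.log.tx"));
    assert(no_tx.accepts("0-1-v1.log"));
}
```

The trade-off is that an always-present filter can no longer distinguish "no filter supplied" from "filter that accepts everything", which only matters if the walker needs to report that distinction.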

VladLazar · 2023-06-02T14:05:21Z

/ci-repeat

VladLazar · 2023-06-13T13:39:19Z

/ci-repeat

VladLazar · 2023-06-14T16:21:06Z

Failures are:

andrwng · 2023-06-14T18:20:01Z

src/v/cloud_storage/access_time_tracker.h

+    /// We do not wish to track index files and transaction manifests
+    /// as they are just an appendage to segment/chunk files and are
+    /// purged along with them.
+    bool should_track(std::string_view key) const;


nit: could be static

andrwng · 2023-06-14T18:27:02Z

src/v/cloud_storage/cache_service.cc

+    auto [walked_cache_size, filtered_out_files, candidates_for_deletion, _]
+      = co_await _walker.walk(
+        _cache_dir.native(), _access_time_tracker, [](std::string_view path) {
+            return !(
+              std::string_view(path).ends_with(".tx")
+              || std::string_view(path).ends_with(".index"));


Now that we're not explicitly trimming these, we should take care that our behavior copes well with a crash. I think right now if we delete the log file and crash before we get to the ancillary files, they'll stick around forever. We should probably reverse the order to delete tx and index first.

VladLazar · 2023-06-15

The start-up trim does not apply the filtering precisely for this reason. If we crash in between, we'll still attempt to clean up index and tx files on start-up.

github-actions bot added the area/redpanda label May 18, 2023

VladLazar changed the title ~~Cache efficieny improvements 10719~~ cloud_storage: cache efficieny improvements May 18, 2023

jcsp reviewed May 18, 2023

src/v/cloud_storage/recursive_directory_walker.cc (outdated)

jcsp reviewed May 18, 2023

src/v/cloud_storage/recursive_directory_walker.cc

jcsp reviewed May 18, 2023

VladLazar changed the title ~~cloud_storage: cache efficieny improvements~~ cloud_storage: cache efficiency improvements May 18, 2023

VladLazar force-pushed the cache-efficieny-improvements-10719 branch from 9f9147e to 5526081 Compare May 19, 2023 13:26

VladLazar requested a review from jcsp May 19, 2023 13:27

VladLazar marked this pull request as ready for review May 19, 2023 16:10

VladLazar requested a review from Lazin May 23, 2023 08:41

andijcr reviewed May 24, 2023

VladLazar mentioned this pull request May 29, 2023

cloud_storage: enable prefetching chunks #10950

Merged

7 tasks

VladLazar requested a review from abhijat June 6, 2023 17:05

andrwng reviewed Jun 14, 2023

andijcr self-requested a review June 14, 2023 19:57

andijcr approved these changes Jun 16, 2023

andrwng approved these changes Jun 16, 2023

piyushredpanda merged commit f806734 into redpanda-data:dev Jun 16, 2023
