
cloud_storage: Implement "carryover" cache trimming mechanism #18056

Merged
merged 10 commits into from
Apr 29, 2024

Conversation

Lazin
Contributor

@Lazin Lazin commented Apr 24, 2024

The carryover is a collection of deletion candidates found during the previous trim. The trim collects the full list of objects in the directory and sorts them in LRU order. Then it deletes the first N objects. We save the first M objects that remain in the list after the trim; these objects are deletion candidates.

During the next trim the cache service first uses the carryover list to perform a quick cache trim. This trim doesn't need a directory scan, and it can quickly decrease the bytes/objects counts so that other readers can reserve space successfully. The trim doesn't delete objects from the carryover list blindly: it compares the access time recorded in the carryover list to the access time stored in the accesstime_tracker. If the times differ, the object was accessed since the last trim, and the trim does not delete it during this phase.

A new configuration option, cloud_storage_cache_trim_carryover, is added. This option sets a limit on the size of the carryover list, which is stored on shard 0. The default value is 512. We store a file path for every object, so this list shouldn't be too big. Even a relatively small carryover list can make a difference and prevent readers from being blocked.
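The two-phase check described above can be sketched roughly as follows. This is a minimal illustration, not the actual Redpanda implementation: `carryover_entry`, `carryover_trim`, and the plain `std::unordered_map` standing in for the access-time tracker are hypothetical names.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type: a deletion candidate saved from the previous trim,
// together with the access time observed at that point.
struct carryover_entry {
    std::string path;
    uint64_t atime; // access time recorded during the previous trim
};

// Fast trim pass: delete only objects whose access time is unchanged since
// the previous directory walk. An object with a different atime was accessed
// in the meantime and must be kept. Returns the paths that were evicted.
std::vector<std::string> carryover_trim(
  const std::vector<carryover_entry>& carryover,
  const std::unordered_map<std::string, uint64_t>& atime_tracker) {
    std::vector<std::string> evicted;
    for (const auto& e : carryover) {
        auto it = atime_tracker.find(e.path);
        // Unknown to the tracker, or atime unchanged: safe to delete.
        if (it == atime_tracker.end() || it->second == e.atime) {
            evicted.push_back(e.path);
        }
    }
    return evicted;
}
```

The key invariant is that the fast pass only trusts stale information when the tracker confirms nothing has changed; anything touched since the previous walk is left for the next full trim.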

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

Improvements

  • Improve cloud storage cache to prevent readers from being blocked during cache eviction.

@Lazin Lazin requested a review from a team as a code owner April 24, 2024 18:38
@Lazin Lazin marked this pull request as draft April 24, 2024 18:38
@hcoyote

hcoyote commented Apr 24, 2024

cloud_storage_cache_trim_carryover ... this will end up effectively being another hard value we have to pay attention to. Should this be percentage based on max object count in the cache instead?

Comment on lines 2337 to 2342
"cloud_storage_cache_trim_carryover",
"The cache performs a recoursive directory walk during the cache trim. "
"The information obtained during the walk can be carried over to the "
"next trim operation. This parameter sets a limit on number of objects "
"that can be carried over from one trim to next. This allows cache to "
"quickly unblock readers before starting the directory walk.",

Suggested change
"cloud_storage_cache_trim_carryover",
"The cache performs a recoursive directory walk during the cache trim. "
"The information obtained during the walk can be carried over to the "
"next trim operation. This parameter sets a limit on number of objects "
"that can be carried over from one trim to next. This allows cache to "
"quickly unblock readers before starting the directory walk.",
"cloud_storage_cache_trim_carryover",
"The cache performs a recursive directory inspection during the cache trim. "
"The information obtained during the inspection can be carried over to the "
"next trim operation. This parameter sets a limit on the number of objects "
"that can be carried over from one trim to next, and allows cache to "
"quickly unblock readers before starting the directory inspection.",

Was not clear on your meaning of "walk" here...please let me know if "inspection" is more apt.

Contributor Author

applied this

auto max_carryover_files = config::shard_local_cfg()
.cloud_storage_cache_trim_carryover.value()
.value_or(0);
fragmented_vector<file_list_item> tmp;
Contributor

Would reserving some memory for tmp here be beneficial?
Something like tmp.reserve(std::min(max_carryover_files, candidates.size() - candidate_i))?

Contributor Author

fixed

if (it == _last_trim_carryover->end()) {
_last_trim_carryover = std::nullopt;
} else {
fragmented_vector<file_list_item> tmp;
Contributor

Worth having tmp.reserve() before std::copy() as well?

Contributor Author

done

@vbotbuildovich
Collaborator

vbotbuildovich commented Apr 24, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/48232#018f11a1-cce4-4fd0-9085-015378c8263b:

"rptest.tests.rbac_upgrade_test.UpgradeMigrationCreatingDefaultRole.test_rbac_migration"

new failures in https://buildkite.com/redpanda/redpanda/builds/48232#018f11aa-531c-4182-b73c-633a06b7d407:

"rptest.tests.rbac_upgrade_test.UpgradeMigrationCreatingDefaultRole.test_rbac_migration"

new failures in https://buildkite.com/redpanda/redpanda/builds/48345#018f1b68-184e-4018-8472-0f4d12a34f7b:

"rptest.tests.cloud_storage_chunk_read_path_test.CloudStorageChunkReadTest.test_read_when_cache_smaller_than_segment_size"

new failures in https://buildkite.com/redpanda/redpanda/builds/48345#018f1b88-2f6f-4993-ac6c-28e59025df88:

"rptest.tests.test_si_cache_space_leak.ShadowIndexingCacheSpaceLeakTest.test_si_cache.message_size=10000.num_messages=100000.concurrency=2"

new failures in https://buildkite.com/redpanda/redpanda/builds/48384#018f1ef5-e475-4844-a04a-348bd1ad0b66:

"rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.num_to_upgrade=0.with_tiered_storage=True"

"that can be carried over from one trim to next. This allows cache to "
"quickly unblock readers before starting the directory walk.",
{.needs_restart = needs_restart::no, .visibility = visibility::tunable},
512)
Contributor

Perhaps this value should be larger. On systems where we are reaching the object-count limit we have upwards of 100k objects.

Not for this PR but maybe we can optimistically use much more available units from materialized resources and then trim this list to release units on demand if the readers need it, like a ballast file.

Contributor Author

In some clusters we allow 500K objects. The purpose of this list is to unblock enough hydrations to allow progress during the trim. It's not supposed to replace the trim altogether. Only to use data that we got during the previous trim opportunistically.

Comment on lines +1305 to +1337
if (
is_trim_exempt(file_stat.path)
|| std::string_view(file_stat.path).ends_with(tmp_extension)) {
continue;
Contributor

Without digging in, this reads like we could end up in a situation where all the oldest files are trim-exempt or tmp files, in which case the carryover list doesn't help at all. I don't think that's the case, since this only refers to in-progress files or the access tracker, but perhaps it's still worth considering skipping them when creating the carryover list in the first place?

Comment on lines 1405 to 1424
vlog(cst_log.debug, "Carryover trim list is empty");
}
Contributor

Is this only deferring the problem? Once the carryover list is empty, won't we run into the same read-time walk?

I'm wondering if it would be a big lift to trim the carryover and return, but if we didn't previously have enough space to reserve, also trigger a regular trim in the background to repopulate the carryover list, so in the happy path there may be just the first read-time walk.

Contributor

FWIW I'd be happy if this solves everything, given background work seems much more complicated. I'm just trying to understand if this fixes things enough to avoid the issues we were seeing in the cluster.

Contributor Author

This should solve the problem. What will happen is that one hydration will be blocked until the trim completes; other hydrations will be quickly unblocked or never block at all.
It makes sense to push the process to the background. I may do this in this PR or in a follow-up; it probably makes sense to limit the scope of this PR.

@Lazin Lazin force-pushed the fix/non-blocking-trim branch from b3c46e9 to b0377e4 Compare April 26, 2024 12:55
Add new parameter that controls cache carryover behavior.
@Lazin Lazin force-pushed the fix/non-blocking-trim branch from b0377e4 to 5baa478 Compare April 26, 2024 15:04
Lazin added 7 commits April 26, 2024 15:47
The "carryover" behavior allows the cache to use information from the
previous trim to quickly trim the cache without scanning the whole
directory. This allows the cache to avoid blocking readers.

In a situation when the cache contains a very large number of files, the
recursive directory walk can take a few minutes. We don't allow the
number of objects stored in the cache to overshoot, so all readers
are blocked until the walk is finished.

This commit adds a new "carryover" trim mechanism which runs before
the normal trim and uses information obtained during the previous fast
or full trim to delete some objects without walking the directory tree.
Change the configuration parameter and treat the value as the number of
bytes that we can use to store carryover data.
Reserve memory units for the carryover mechanism in
materialized_resources.
@Lazin Lazin force-pushed the fix/non-blocking-trim branch from 5baa478 to d81bbcc Compare April 26, 2024 16:11
@Lazin Lazin marked this pull request as ready for review April 26, 2024 16:11
@Lazin
Contributor Author

Lazin commented Apr 26, 2024

cloud_storage_cache_trim_carryover ... this will end up effectively being another hard value we have to pay attention to. Should this be percentage based on max object count in the cache instead?

It has a reasonable default. You'd only ever want to set it in order to disable the feature.

In case the carryover trim was able to release enough space, start the trim in
the background and return early. This unblocks the hydration that
reserved space and triggered the trim. We need to run the normal trim anyway
to avoid the corner case when the carryover list becomes empty and we have
to block readers for the duration of the full trim.
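The flow in this commit message can be sketched in miniature. Everything below is a hypothetical illustration, kept single-threaded for clarity: the real code uses Seastar futures, and `cache_sim`, `trim_result`, and the hard-coded 100 bytes freed by the fast pass are invented for the example. The "background" full trim is modeled as a callable handed back to the caller.

```cpp
#include <functional>

// Outcome of a trim request (illustrative names only).
struct trim_result {
    bool unblocked_early;                  // fast carryover pass freed enough space
    std::function<void()> background_work; // full trim, to run off the hot path
};

struct cache_sim {
    long free_bytes = 0;
    bool full_trim_ran = false;

    // Pretend the carryover pass always frees 100 bytes without a directory walk.
    long carryover_trim() { return 100; }
    void full_trim() { full_trim_ran = true; }

    trim_result trim(long required_bytes) {
        free_bytes += carryover_trim(); // fast pass first: no directory walk
        if (free_bytes >= required_bytes) {
            // Enough space already: return early so the hydration that
            // triggered the trim is unblocked, and hand back the full trim
            // to run in the background and repopulate the carryover list.
            return {true, [this] { full_trim(); }};
        }
        // Not enough space: fall back to a blocking full trim.
        full_trim();
        return {false, [] {}};
    }
};
```

In the real implementation the background work would be scheduled on the reactor rather than returned to the caller; the point is only that the caller's reservation succeeds before the expensive walk starts.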
@Lazin Lazin force-pushed the fix/non-blocking-trim branch from d81bbcc to 9f1b51b Compare April 27, 2024 08:32
@vbotbuildovich
Collaborator

vbotbuildovich commented Apr 27, 2024

@Lazin
Contributor Author

Lazin commented Apr 27, 2024

CI failure is #18121
I believe it's unrelated. Restarted CI.

@Lazin
Contributor Author

Lazin commented Apr 29, 2024

/ci-repeat 5

@Lazin Lazin merged commit 789c6ca into redpanda-data:dev Apr 29, 2024
17 checks passed
@vbotbuildovich
Collaborator

/backport v24.1.x

@vbotbuildovich
Collaborator

/backport v23.3.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-18056-v23.3.x-62 remotes/upstream/v23.3.x
git cherry-pick -x 57985b74ff30f1cb8a27f06a4cbb688b2201b569 940fcd4d802e1f3665243f681d58b323c9e27ece 5ece9d401917d929a8d5e986b6aabacfd54884c1 fdf34981fac71b011a3c47e3adf86006e6d9da09 125659467d743a3821516b975efe4724bdd24da0 e1c30bc6e18a08de9fb74d4331450488642b5b53 0f84fdbbbecd80221ee199de0850a00f16de0789 6b57e0c167eb4da482b5122f482f3135f070ab02 61a09b47099e2422b86c401e11f5e5c6cfba9157 9f1b51b6a7e3455f3bc1c25965f3d19bb657dada

Workflow run logs.
