cloud_storage: Implement "carryover" cache trimming mechanism #18056
Conversation
src/v/config/configuration.cc
Outdated
"cloud_storage_cache_trim_carryover",
"The cache performs a recoursive directory walk during the cache trim. "
"The information obtained during the walk can be carried over to the "
"next trim operation. This parameter sets a limit on number of objects "
"that can be carried over from one trim to next. This allows cache to "
"quickly unblock readers before starting the directory walk.",
Suggested change:

"cloud_storage_cache_trim_carryover",
"The cache performs a recursive directory inspection during the cache trim. "
"The information obtained during the inspection can be carried over to the "
"next trim operation. This parameter sets a limit on the number of objects "
"that can be carried over from one trim to next, and allows cache to "
"quickly unblock readers before starting the directory inspection.",
Was not clear on your meaning of "walk" here...please let me know if "inspection" is more apt.
applied this
auto max_carryover_files = config::shard_local_cfg()
                             .cloud_storage_cache_trim_carryover.value()
                             .value_or(0);
fragmented_vector<file_list_item> tmp;
Would reserving some memory for `tmp` here be beneficial? Something like `tmp.reserve(std::min(max_carryover_files, candidates.size() - candidate_i))`?
fixed
if (it == _last_trim_carryover->end()) {
    _last_trim_carryover = std::nullopt;
} else {
    fragmented_vector<file_list_item> tmp;
Worth having `tmp.reserve()` before `std::copy()` as well?
done
new failures in https://buildkite.com/redpanda/redpanda/builds/48232#018f11a1-cce4-4fd0-9085-015378c8263b:
new failures in https://buildkite.com/redpanda/redpanda/builds/48232#018f11aa-531c-4182-b73c-633a06b7d407:
new failures in https://buildkite.com/redpanda/redpanda/builds/48345#018f1b68-184e-4018-8472-0f4d12a34f7b:
new failures in https://buildkite.com/redpanda/redpanda/builds/48345#018f1b88-2f6f-4993-ac6c-28e59025df88:
new failures in https://buildkite.com/redpanda/redpanda/builds/48384#018f1ef5-e475-4844-a04a-348bd1ad0b66:
src/v/config/configuration.cc
Outdated
"that can be carried over from one trim to next. This allows cache to "
"quickly unblock readers before starting the directory walk.",
{.needs_restart = needs_restart::no, .visibility = visibility::tunable},
512)
Perhaps this value should be larger. On systems where we are hitting the object-count limit, we have upwards of 100k objects.
Not for this PR but maybe we can optimistically use much more available units from materialized resources and then trim this list to release units on demand if the readers need it, like a ballast file.
In some clusters we allow 500K objects. The purpose of this list is to unblock enough hydrations to allow progress during the trim. It's not supposed to replace the trim altogether. Only to use data that we got during the previous trim opportunistically.
if (
  is_trim_exempt(file_stat.path)
  || std::string_view(file_stat.path).ends_with(tmp_extension)) {
    continue;
Without digging in, this reads like we could end up in a situation where all the oldest files are trim-exempt or tmp files, in which case the carryover list doesn't help at all. I don't think that's the case, since this only refers to in-progress files or the access tracker, but perhaps it's worth considering skipping them when creating the carryover list in the first place?
src/v/cloud_storage/cache_service.cc
Outdated
vlog(cst_log.debug, "Carryover trim list is empty");
}
Is this only deferring the problem? Once the carryover list is empty, won't we run into the same read-time walk?
I'm wondering if it would be a big lift to trim the carryover and return, but if we didn't previously have enough space to reserve, also trigger a regular trim in the background to repopulate the carryover list, so in the happy path there may be just the first read-time walk.
FWIW I'd be happy if this solves everything, given background work seems much more complicated. I'm just trying to understand if this fixes things enough to avoid the issues we were seeing in the cluster.
This should solve the problem. What will happen is one hydration will be blocked until the trim completes. Other hydrations will be quickly unblocked or never block at all.
It makes sense to push the process to the background. I may do this in this PR or in a followup. It probably makes sense to limit the scope of this PR.
b3c46e9 to b0377e4 (Compare)
Add new parameter that controls cache carryover behavior.
b0377e4 to 5baa478 (Compare)
The "carryover" behavior allows the cache to use information from the previous trim to quickly trim the cache without scanning the whole directory. This allows the cache to avoid blocking readers. When the cache contains a very large number of files, the recursive directory walk can take a few minutes. We're not allowing the number of objects stored in the cache to overshoot, so all readers are blocked until the walk is finished. This commit adds a new "carryover" trim mechanism which runs before the normal trim and uses information obtained during the previous fast or full trim to delete some objects without walking the directory tree.
Change the configuration parameter and treat the value as the number of bytes that we can use to store carryover data.
Reserve memory units for the carryover mechanism in materialized_resources.
5baa478 to d81bbcc (Compare)
It has a reasonable default. You only ever want to use it to disable the feature.
If the carryover trim was able to release enough space, start the trim in the background and return early. This unblocks the hydration that reserved space and triggered the trim. We still need to run the normal trim to avoid the corner case where the carryover list becomes empty and we have to block readers for the duration of the full trim.
d81bbcc to 9f1b51b (Compare)
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/48384#018f1eed-35d4-4d0c-b1a2-a778a6bf7875
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/48384#018f1ef5-e478-445b-b4e9-2034d976a087
CI failure is #18121
/ci-repeat 5
/backport v24.1.x |
/backport v23.3.x |
Failed to create a backport PR to v23.3.x branch. I tried:
The `carryover` is a collection of deletion candidates found during the previous trim. The trim collects the full list of objects in the directory and sorts them in LRU order. Then it deletes the first N objects. We're saving the first M objects that remain in the list after the trim. These objects are deletion candidates.

During the next trim, the cache service first uses the `carryover` list to do a quick cache trim. This trim doesn't need a directory scan, and it can quickly decrease the bytes/objects counts so other readers can reserve space successfully. The trim doesn't delete objects from the `carryover` list blindly. It compares the access time recorded in the carryover list to the access time stored in the `access_time_tracker`. If the times differ, the object was accessed since the list was built, and the trim does not delete it during this phase.
`cloud_storage_cache_trim_carryover` is added. This config option sets the limit on the size of the carryover list, which is stored on shard 0. The default value is 512. We are storing a file path for every object, so this list shouldn't be too big. Even a relatively small carryover list might make a difference and prevent readers from being blocked.

Backports Required
Release Notes
Improvements