
cloud_storage: Implement "carryover" cache trimming mechanism #18056

Merged
merged 10 commits into from
Apr 29, 2024

Conversation

Lazin
Contributor

@Lazin Lazin commented Apr 24, 2024

The carryover is a collection of deletion candidates found during the previous trim. The trim collects the full list of objects in the directory and sorts them in LRU order. Then it deletes the first N objects. We save the first M objects that remain in the list after the trim; these objects are deletion candidates.

During the next trim the cache service first uses the carryover list to perform a quick cache trim. This trim doesn't need a directory scan, and it can quickly decrease the bytes/objects counts so that other readers can reserve space successfully. The trim doesn't delete objects from the carryover list blindly: it compares the access time recorded in the carryover list to the access time stored in the accesstime_tracker. If the times differ, the object was accessed since the last trim, and the trim does not delete it during this phase.

A new configuration option, cloud_storage_cache_trim_carryover, is added. This option sets a limit on the size of the carryover list, which is stored on shard 0. The default value is 512. We store a file path for every object, so this list shouldn't be too big. Even a relatively small carryover list can make a difference and prevent readers from being blocked.
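The two-phase check described above can be sketched roughly as follows. This is a minimal illustration, not the actual Redpanda implementation: `carryover_entry`, `carryover_trim`, and the plain `std::unordered_map` standing in for the access-time tracker are hypothetical names.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type: a deletion candidate saved from the previous trim,
// together with the access time observed at that point.
struct carryover_entry {
    std::string path;
    uint64_t atime; // access time recorded during the previous trim
};

// Fast trim pass: delete only objects whose access time is unchanged since
// the previous directory walk. An object with a different atime was accessed
// in the meantime and must be kept. Returns the paths that were evicted.
std::vector<std::string> carryover_trim(
  const std::vector<carryover_entry>& carryover,
  const std::unordered_map<std::string, uint64_t>& atime_tracker) {
    std::vector<std::string> evicted;
    for (const auto& e : carryover) {
        auto it = atime_tracker.find(e.path);
        // Unknown to the tracker, or atime unchanged: safe to delete.
        if (it == atime_tracker.end() || it->second == e.atime) {
            evicted.push_back(e.path);
        }
    }
    return evicted;
}
```

The key invariant is that the fast pass only trusts stale information when the tracker confirms nothing has changed; anything touched since the previous walk is left for the next full trim.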

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

Improvements

  • Improve cloud storage cache to prevent readers from being blocked during cache eviction.

@Lazin Lazin requested a review from a team as a code owner April 24, 2024 18:38
@Lazin Lazin marked this pull request as draft April 24, 2024 18:38
@hcoyote

hcoyote commented Apr 24, 2024

cloud_storage_cache_trim_carryover ... this will end up effectively being another hard value we have to pay attention to. Should this be percentage based on max object count in the cache instead?

Comment on lines 2337 to 2342
"cloud_storage_cache_trim_carryover",
"The cache performs a recoursive directory walk during the cache trim. "
"The information obtained during the walk can be carried over to the "
"next trim operation. This parameter sets a limit on number of objects "
"that can be carried over from one trim to next. This allows cache to "
"quickly unblock readers before starting the directory walk.",

Suggested change
"cloud_storage_cache_trim_carryover",
"The cache performs a recoursive directory walk during the cache trim. "
"The information obtained during the walk can be carried over to the "
"next trim operation. This parameter sets a limit on number of objects "
"that can be carried over from one trim to next. This allows cache to "
"quickly unblock readers before starting the directory walk.",
"cloud_storage_cache_trim_carryover",
"The cache performs a recursive directory inspection during the cache trim. "
"The information obtained during the inspection can be carried over to the "
"next trim operation. This parameter sets a limit on the number of objects "
"that can be carried over from one trim to next, and allows cache to "
"quickly unblock readers before starting the directory inspection.",

Was not clear on your meaning of "walk" here...please let me know if "inspection" is more apt.

Contributor Author

applied this

auto max_carryover_files = config::shard_local_cfg()
.cloud_storage_cache_trim_carryover.value()
.value_or(0);
fragmented_vector<file_list_item> tmp;
Contributor

Would reserving some memory for tmp here be beneficial?
Something like tmp.reserve(std::min(max_carryover_files, candidates.size() - candidate_i))?

Contributor Author

fixed

if (it == _last_trim_carryover->end()) {
_last_trim_carryover = std::nullopt;
} else {
fragmented_vector<file_list_item> tmp;
Contributor

Worth having tmp.reserve() before std::copy() as well?

Contributor Author

done

@vbotbuildovich
Collaborator

vbotbuildovich commented Apr 24, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/48232#018f11a1-cce4-4fd0-9085-015378c8263b:

"rptest.tests.rbac_upgrade_test.UpgradeMigrationCreatingDefaultRole.test_rbac_migration"

new failures in https://buildkite.com/redpanda/redpanda/builds/48232#018f11aa-531c-4182-b73c-633a06b7d407:

"rptest.tests.rbac_upgrade_test.UpgradeMigrationCreatingDefaultRole.test_rbac_migration"

new failures in https://buildkite.com/redpanda/redpanda/builds/48345#018f1b68-184e-4018-8472-0f4d12a34f7b:

"rptest.tests.cloud_storage_chunk_read_path_test.CloudStorageChunkReadTest.test_read_when_cache_smaller_than_segment_size"

new failures in https://buildkite.com/redpanda/redpanda/builds/48345#018f1b88-2f6f-4993-ac6c-28e59025df88:

"rptest.tests.test_si_cache_space_leak.ShadowIndexingCacheSpaceLeakTest.test_si_cache.message_size=10000.num_messages=100000.concurrency=2"

new failures in https://buildkite.com/redpanda/redpanda/builds/48384#018f1ef5-e475-4844-a04a-348bd1ad0b66:

"rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.num_to_upgrade=0.with_tiered_storage=True"

"that can be carried over from one trim to next. This allows cache to "
"quickly unblock readers before starting the directory walk.",
{.needs_restart = needs_restart::no, .visibility = visibility::tunable},
512)
Contributor

Perhaps this value should be larger. On systems where we are reaching the object-count limit we have upwards of 100k objects.

Not for this PR but maybe we can optimistically use much more available units from materialized resources and then trim this list to release units on demand if the readers need it, like a ballast file.

Contributor Author

In some clusters we allow 500K objects. The purpose of this list is to unblock enough hydrations to allow progress during the trim. It's not supposed to replace the trim altogether. Only to use data that we got during the previous trim opportunistically.

Comment on lines +1305 to +1337
if (
is_trim_exempt(file_stat.path)
|| std::string_view(file_stat.path).ends_with(tmp_extension)) {
continue;
Contributor

Without digging in, this reads like we could end up in a situation where all the oldest files are trim-exempt or tmp files, in which case the carryover list doesn't help at all. I don't think that's the case, since this only refers to in-progress files or the access tracker, but perhaps it's still worth considering skipping them when creating the carryover list in the first place?

Comment on lines 1405 to 1424
vlog(cst_log.debug, "Carryover trim list is empty");
}
Contributor

Is this only deferring the problem? Once the carryover list is empty, won't we run into the same read-time walk?

I'm wondering if it would be a big lift to trim the carryover and return, but if we didn't previously have enough space to reserve, also trigger a regular trim in the background to repopulate the carryover list, so in the happy path there may be just the first read-time walk.

Contributor

FWIW I'd be happy if this solves everything, given background work seems much more complicated. I'm just trying to understand if this fixes things enough to avoid the issues we were seeing in the cluster.

Contributor Author

This should solve the problem. What will happen is that one hydration will be blocked until the trim completes; other hydrations will be quickly unblocked or never block at all.
It makes sense to push the process to the background. I may do this in this PR or in a follow-up; it probably makes sense to limit the scope of this PR.

@Lazin Lazin force-pushed the fix/non-blocking-trim branch from b3c46e9 to b0377e4 Compare April 26, 2024 12:55
Add new parameter that controls cache carryover behavior.
@Lazin Lazin force-pushed the fix/non-blocking-trim branch from b0377e4 to 5baa478 Compare April 26, 2024 15:04
Lazin added 7 commits April 26, 2024 15:47
The "carryover" behavior allows the cache to use information from the
previous trim to quickly trim the cache without scanning the whole
directory. This allows the cache to avoid blocking readers.

In a situation when the cache contains a very large number of files, the
recursive directory walk can take a few minutes. We don't allow the
number of objects stored in the cache to overshoot, so all readers
are blocked until the walk is finished.

This commit adds a new "carryover" trim mechanism which runs before
the normal trim and uses information obtained during the previous fast
or full trim to delete some objects without walking the directory tree.
Change the configuration parameter and treat the value as the number of
bytes that we can use to store carryover data.
Reserve memory units for the carryover mechanism in
materialized_resources.
@Lazin Lazin force-pushed the fix/non-blocking-trim branch from 5baa478 to d81bbcc Compare April 26, 2024 16:11
@Lazin Lazin marked this pull request as ready for review April 26, 2024 16:11
@Lazin
Contributor Author

Lazin commented Apr 26, 2024

cloud_storage_cache_trim_carryover ... this will end up effectively being another hard value we have to pay attention to. Should this be percentage based on max object count in the cache instead?

It has a reasonable default. You'd only ever want to set it in order to disable the feature.

In case the carryover trim was able to release enough space, start the trim in
the background and return early. This unblocks the hydration that
reserved space and triggered the trim. We need to run the normal trim anyway
to avoid the corner case when the carryover list becomes empty and we have
to block readers for the duration of the full trim.
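The flow in this commit message can be sketched in miniature. Everything below is a hypothetical illustration, kept single-threaded for clarity: the real code uses Seastar futures, and `cache_sim`, `trim_result`, and the hard-coded 100 bytes freed by the fast pass are invented for the example. The "background" full trim is modeled as a callable handed back to the caller.

```cpp
#include <functional>

// Outcome of a trim request (illustrative names only).
struct trim_result {
    bool unblocked_early;                  // fast carryover pass freed enough space
    std::function<void()> background_work; // full trim, to run off the hot path
};

struct cache_sim {
    long free_bytes = 0;
    bool full_trim_ran = false;

    // Pretend the carryover pass always frees 100 bytes without a directory walk.
    long carryover_trim() { return 100; }
    void full_trim() { full_trim_ran = true; }

    trim_result trim(long required_bytes) {
        free_bytes += carryover_trim(); // fast pass first: no directory walk
        if (free_bytes >= required_bytes) {
            // Enough space already: return early so the hydration that
            // triggered the trim is unblocked, and hand back the full trim
            // to run in the background and repopulate the carryover list.
            return {true, [this] { full_trim(); }};
        }
        // Not enough space: fall back to a blocking full trim.
        full_trim();
        return {false, [] {}};
    }
};
```

In the real implementation the background work would be scheduled on the reactor rather than returned to the caller; the point is only that the caller's reservation succeeds before the expensive walk starts.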
@Lazin Lazin force-pushed the fix/non-blocking-trim branch from d81bbcc to 9f1b51b Compare April 27, 2024 08:32
@vbotbuildovich
Collaborator

vbotbuildovich commented Apr 27, 2024

@Lazin
Contributor Author

Lazin commented Apr 27, 2024

CI failure is #18121
I believe it's unrelated. Restarted CI.

@Lazin
Contributor Author

Lazin commented Apr 29, 2024

/ci-repeat 5

@Lazin Lazin merged commit 789c6ca into redpanda-data:dev Apr 29, 2024
17 checks passed
@vbotbuildovich
Collaborator

/backport v24.1.x

@vbotbuildovich
Collaborator

/backport v23.3.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-18056-v23.3.x-62 remotes/upstream/v23.3.x
git cherry-pick -x 57985b74ff30f1cb8a27f06a4cbb688b2201b569 940fcd4d802e1f3665243f681d58b323c9e27ece 5ece9d401917d929a8d5e986b6aabacfd54884c1 fdf34981fac71b011a3c47e3adf86006e6d9da09 125659467d743a3821516b975efe4724bdd24da0 e1c30bc6e18a08de9fb74d4331450488642b5b53 0f84fdbbbecd80221ee199de0850a00f16de0789 6b57e0c167eb4da482b5122f482f3135f070ab02 61a09b47099e2422b86c401e11f5e5c6cfba9157 9f1b51b6a7e3455f3bc1c25965f3d19bb657dada

Workflow run logs.
