Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[storage]: add local retention reclaimable to partition health report #12737

Merged
merged 7 commits into from
Aug 14, 2023

Conversation

dotnwat
Copy link
Member

@dotnwat dotnwat commented Aug 11, 2023

The partition balancer needs to know how much of a partition's size is easily reclaimable. That is, the amount of data that has grown best-effort above local retention threshold. This will be used to adjust the balancer's view of how much free space is available on a node.

Note: this is needs to be backported, but we should wait until we've accumulated changes and tests related to updates of the partition balancer as well. These are being track in https://github.com/redpanda-data/core-internal/issues/719

Fixes: https://github.com/redpanda-data/core-internal/issues/725

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

  • none

@dotnwat dotnwat marked this pull request as ready for review August 11, 2023 04:06
src/v/cluster/health_monitor_types.h Show resolved Hide resolved
src/v/storage/disk_log_impl.cc Outdated Show resolved Hide resolved
@dotnwat dotnwat force-pushed the health-report-reclaimable-space branch from 778d4f9 to 4738a9a Compare August 11, 2023 18:42
src/v/storage/disk_log_impl.cc Outdated Show resolved Hide resolved
src/v/cloud_storage/tests/cloud_storage_e2e_test.cc Outdated Show resolved Hide resolved
andrwng
andrwng previously approved these changes Aug 12, 2023
src/v/cloud_storage/tests/cloud_storage_e2e_test.cc Outdated Show resolved Hide resolved
src/v/cluster/health_monitor_types.h Show resolved Hide resolved
Comment on lines +1532 to +1533
return do_truncate_prefix(cfg)
.then([this] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting there's still room for a race where the reported size is lower than the reclaimable size, I think. Since size_bytes gets updated in do_truncate() and there is this scheduling point here.

It seems hard to be robust, so it's probably fine leaving this, but with the caveat that partition balancing should sanitize these values before making decisions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems hard to be robust, so it's probably fine leaving this, but with the caveat that partition balancing should sanitize these values before making decisions.

yeh. i added a comment to the interface. flagging directly to @ztlpn for visibility.

src/v/storage/disk_log_impl.cc Outdated Show resolved Hide resolved
Seems to have been partially cleaned up in
redpanda-data#9121.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
The additional point where data is updated seems to have been overlooked
in redpanda-data@26fb548#diff-708b13fb33ad235d5c420e38131e7188daeee60ff6a4e1054ed480a36142ccb2

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
When reclaimable space is calculated the size above local retention is
cached so that it is available to the health report subsystem.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Tests that the reclaimable local size from cloud storage topic is
reflected back through health report.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
This happens in several places so factoring it out into a reusable
method.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
@dotnwat dotnwat force-pushed the health-report-reclaimable-space branch from 2888c16 to fa9fb6d Compare August 14, 2023 19:11
@dotnwat
Copy link
Member Author

dotnwat commented Aug 14, 2023

force push changes

  • use boost_require_eventually @andrwng
  • remove debugging log message @ztlpn
  • add clarifying comment about possible size inconsistency in report @andrwng
  • factor out gc configuration calculation to be reusable @ztlpn

@dotnwat dotnwat merged commit 1a7605f into redpanda-data:dev Aug 14, 2023
@ztlpn
Copy link
Contributor

ztlpn commented Sep 16, 2023

@dotnwat should we backport it to 23.2.x, WDYT?

@dotnwat
Copy link
Member Author

dotnwat commented Sep 17, 2023

@ztlpn yes. in the PR description I wrote

Note: this is needs to be backported, but we should wait until we've accumulated changes and tests related to updates of the partition balancer as well. These are being track in https://github.com/redpanda-data/core-internal/issues/719

I have on my todo list here to ask you about this. Perhaps we should discuss how to proceed on the balancer? I'm also happy to backport this PR now if it helps.

@ztlpn
Copy link
Contributor

ztlpn commented Sep 26, 2023

/backport v23.2.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v23.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-12737-v23.2.x-381 remotes/upstream/v23.2.x
git cherry-pick -x cfd6325a03a00b61dcd5936b287edccf27636808 3981467e2e71c1083e3c149c7a53bca2b52b1a4c dd8fbc8006c10590154bf87bed53a2c5340ca730 fb8657bb1c6612f612b56baeb2bc4be98e84b336 ac6a307cab222b0cd23c6b6412f612bc5d31391b 9f6ada40508135423addf1ea2e5ed0a004f2cd90 fa9fb6d834fd8c192988d0d3775d9f787bfef17f

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants