[storage]: add local retention reclaimable to partition health report #12737

dotnwat · 2023-08-11T01:35:32Z

The partition balancer needs to know how much of a partition's size is easily reclaimable, that is, how much data has been allowed to grow, on a best-effort basis, above the local retention threshold. This value will be used to adjust the balancer's view of how much free space is available on a node.
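
As a rough illustration of how a balancer might fold reclaimable space into its view of a node, here is a minimal sketch. The types `partition_report` and `node_disk` and the function `effective_free_bytes` are hypothetical names made up for this example; they are not the actual Redpanda health report types or balancer code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical, simplified per-partition entry of a health report.
struct partition_report {
    uint64_t size_bytes;             // total local size of the partition
    uint64_t reclaimable_size_bytes; // bytes above local retention
};

// Hypothetical per-node disk summary.
struct node_disk {
    uint64_t total_bytes;
    uint64_t free_bytes;
};

// Free space as the balancer might see it: actual free space plus space
// that could be reclaimed by trimming partitions to local retention.
inline uint64_t effective_free_bytes(
  const node_disk& disk, const std::vector<partition_report>& partitions) {
    uint64_t reclaimable = 0;
    for (const auto& p : partitions) {
        reclaimable += p.reclaimable_size_bytes;
    }
    // Reclaimable space cannot exceed what is actually in use on the disk.
    const uint64_t used = disk.total_bytes >= disk.free_bytes
                            ? disk.total_bytes - disk.free_bytes
                            : 0;
    reclaimable = std::min(reclaimable, used);
    return disk.free_bytes + reclaimable;
}
```

Clamping the sum to the used space keeps the adjusted value from exceeding the disk capacity if the reported numbers are momentarily inconsistent.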

Note: this needs to be backported, but we should wait until we've accumulated changes and tests related to updates of the partition balancer as well. These are being tracked in https://github.com/redpanda-data/core-internal/issues/719

Fixes: https://github.com/redpanda-data/core-internal/issues/725

Backports Required

Release Notes

none

src/v/cluster/health_monitor_types.h

src/v/storage/disk_log_impl.cc

src/v/cloud_storage/tests/cloud_storage_e2e_test.cc

src/v/cluster/health_monitor_types.h

andrwng · 2023-08-12T00:47:05Z

src/v/storage/disk_log_impl.cc

+        return do_truncate_prefix(cfg)
+          .then([this] {


Just noting that I think there's still room for a race where the reported size is lower than the reclaimable size, since size_bytes gets updated in do_truncate() and there is a scheduling point here.

It seems hard to be robust, so it's probably fine leaving this, but with the caveat that partition balancing should sanitize these values before making decisions.

> It seems hard to be robust, so it's probably fine leaving this, but with the caveat that partition balancing should sanitize these values before making decisions.

yeh. i added a comment to the interface. flagging directly to @ztlpn for visibility.
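
The sanitization suggested in this thread could look roughly like the following sketch. It assumes the consumer only needs to guarantee that the reclaimable size never exceeds the reported size; `partition_sizes`, `sanitize`, and `non_reclaimable_bytes` are hypothetical names for illustration and are not taken from the PR.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical pair of values as they might arrive in a health report.
struct partition_sizes {
    uint64_t size_bytes;
    uint64_t reclaimable_size_bytes;
};

// Because the two values may be sampled at different points in time,
// clamp the reclaimable size so that it never exceeds the total size.
inline partition_sizes sanitize(partition_sizes s) {
    s.reclaimable_size_bytes = std::min(s.reclaimable_size_bytes, s.size_bytes);
    return s;
}

// Bytes that would remain after reclaiming. Safe from unsigned underflow
// because the input is sanitized first.
inline uint64_t non_reclaimable_bytes(const partition_sizes& s) {
    const auto clean = sanitize(s);
    return clean.size_bytes - clean.reclaimable_size_bytes;
}
```

Without the clamp, subtracting an oversized reclaimable value from an unsigned size would wrap around to a very large number.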

dotnwat · 2023-08-14T19:14:07Z

Force-push changes:

- use boost_require_eventually @andrwng
- remove debugging log message @ztlpn
- add clarifying comment about possible size inconsistency in report @andrwng
- factor out gc configuration calculation to be reusable @ztlpn

ztlpn · 2023-09-16T01:07:38Z

@dotnwat should we backport it to 23.2.x, WDYT?

dotnwat · 2023-09-17T15:00:33Z

@ztlpn yes. in the PR description I wrote

> Note: this needs to be backported, but we should wait until we've accumulated changes and tests related to updates of the partition balancer as well. These are being tracked in https://github.com/redpanda-data/core-internal/issues/719

I have it on my todo list to ask you about this. Perhaps we should discuss how to proceed on the balancer? I'm also happy to backport this PR now if it helps.

ztlpn · 2023-09-26T14:36:59Z

/backport v23.2.x

vbotbuildovich · 2023-09-26T14:38:02Z

Failed to create a backport PR to v23.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-12737-v23.2.x-381 remotes/upstream/v23.2.x
git cherry-pick -x cfd6325a03a00b61dcd5936b287edccf27636808 3981467e2e71c1083e3c149c7a53bca2b52b1a4c dd8fbc8006c10590154bf87bed53a2c5340ca730 fb8657bb1c6612f612b56baeb2bc4be98e84b336 ac6a307cab222b0cd23c6b6412f612bc5d31391b 9f6ada40508135423addf1ea2e5ed0a004f2cd90 fa9fb6d834fd8c192988d0d3775d9f787bfef17f

Workflow run logs.

[v23.2.x] Backport of #12544 #12596 #12644 #12586 #12737 #12368 #13480

github-actions bot added the area/redpanda label Aug 11, 2023

dotnwat marked this pull request as ready for review August 11, 2023 04:06

dotnwat requested review from andrwng, ztlpn, mmaslankaprv and VladLazar August 11, 2023 04:18

andrwng reviewed Aug 11, 2023

src/v/cluster/health_monitor_types.h (resolved)

src/v/storage/disk_log_impl.cc (outdated, resolved)

dotnwat force-pushed the health-report-reclaimable-space branch from 778d4f9 to 4738a9a on August 11, 2023 18:42

ztlpn reviewed Aug 11, 2023

src/v/storage/disk_log_impl.cc (outdated, resolved)

src/v/cloud_storage/tests/cloud_storage_e2e_test.cc (outdated, resolved)

andrwng previously approved these changes Aug 12, 2023

dotnwat added 7 commits August 14, 2023 12:09

cluster: remove unused constants from adl time

cfd6325

Seems to have been partially cleaned up in redpanda-data#9121. Signed-off-by: Noah Watkins <noahwatkins@gmail.com>

cluster: add non_log_disk_size_bytes to report

3981467

The additional point where data is updated seems to have been overlooked in redpanda-data@26fb548#diff-708b13fb33ad235d5c420e38131e7188daeee60ff6a4e1054ed480a36142ccb2 Signed-off-by: Noah Watkins <noahwatkins@gmail.com>

storage: expose lazy latest reclaimable size

dd8fbc8

When reclaimable space is calculated the size above local retention is cached so that it is available to the health report subsystem. Signed-off-by: Noah Watkins <noahwatkins@gmail.com>

cluster: add reclaimable size to partition health report

fb8657b

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>

test: add a cloud storage fixture test for reclaimable space

ac6a307

Tests that the reclaimable local size from cloud storage topic is reflected back through health report. Signed-off-by: Noah Watkins <noahwatkins@gmail.com>

storage: add helper to compute default gc config

9f6ada4

This happens in several places so factoring it out into a reusable method. Signed-off-by: Noah Watkins <noahwatkins@gmail.com>

storage: refresh cached usage/reclaim stats after truncation

fa9fb6d

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
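
The commit "storage: expose lazy latest reclaimable size" above says the size above local retention is cached when reclaimable space is calculated, so that the health report can read it later. A minimal sketch of that caching pattern follows. The class `log_sketch`, its methods, and the simplified retention policy (count the oldest whole segments while the log is still above the target) are assumptions for illustration, not the code in `disk_log_impl.cc`.

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a log made of segments, ordered oldest first.
class log_sketch {
public:
    explicit log_sketch(std::vector<uint64_t> segment_sizes)
      : _segments(std::move(segment_sizes)) {}

    // Computes the bytes above the local retention target and caches the
    // result. Assumed policy: count oldest segments while the remaining
    // size is still larger than the target.
    uint64_t compute_reclaimable(uint64_t local_retention_bytes) {
        uint64_t remaining = 0;
        for (auto sz : _segments) {
            remaining += sz;
        }
        uint64_t reclaimable = 0;
        for (auto sz : _segments) {
            if (remaining <= local_retention_bytes) {
                break;
            }
            remaining -= sz;
            reclaimable += sz;
        }
        _cached_reclaimable = reclaimable;
        return reclaimable;
    }

    // Cheap accessor for reporting: returns the last computed value, or
    // nullopt if no calculation has run yet. It never recomputes.
    std::optional<uint64_t> cached_reclaimable() const {
        return _cached_reclaimable;
    }

private:
    std::vector<uint64_t> _segments;
    std::optional<uint64_t> _cached_reclaimable;
};
```

The reader side stays cheap because it only returns the last value. The trade-off is that the value can be stale relative to the current log size, which is the inconsistency discussed in the review thread.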

dotnwat dismissed andrwng’s stale review via fa9fb6d August 14, 2023 19:11

dotnwat force-pushed the health-report-reclaimable-space branch from 2888c16 to fa9fb6d on August 14, 2023 19:11

andrwng approved these changes Aug 14, 2023

dotnwat merged commit 1a7605f into redpanda-data:dev Aug 14, 2023

vbotbuildovich mentioned this pull request Sep 26, 2023

[v23.2.x] [storage]: add local retention reclaimable to partition health report #13682

Closed

ztlpn mentioned this pull request Sep 28, 2023

[v23.2.x] Backport of #12544 #12596 #12644 #12586 #12737 #12368 #13480 #13784

Merged

ztlpn added a commit that referenced this pull request Oct 3, 2023

Merge pull request #13784 from ztlpn/v23.2.x-bp

080ed02

[v23.2.x] Backport of #12544 #12596 #12644 #12586 #12737 #12368 #13480
