
archival: consistent log size probes across replicas #24257

Conversation

@nvartolomei (Contributor) commented Nov 22, 2024

We called update_probe only from leaders, and only after exiting the upload loop, which led to inconsistent and stale metrics.

Fix this by introducing a subscription mechanism to the STM, which is the source of truth for the manifest state and must be consistent across all replicas.

Another option is to report metrics only from the leader :think:
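
Roughly, the subscription approach looks like the sketch below. The subscribe_to_state_change signature is the one quoted later in the review thread; the subscriber list, the notify hook, and the probe-side lambda are illustrative assumptions, not the actual Redpanda implementation.

#include <seastar/util/noncopyable_function.hh>

#include <utility>
#include <vector>

namespace ss = seastar;

// Sketch only: the STM keeps a list of noexcept callbacks and invokes them
// after every applied command, so every replica recomputes the probe from
// the same replicated manifest state.
class archival_metadata_stm_sketch {
public:
    // The callback must be noexcept so a throwing subscriber cannot
    // break command application.
    void subscribe_to_state_change(ss::noncopyable_function<void() noexcept> f) {
        _subscribers.push_back(std::move(f));
    }

protected:
    // Would be called at the end of apply() (illustrative).
    void notify_state_change() noexcept {
        for (auto& f : _subscribers) {
            f();
        }
    }

private:
    std::vector<ss::noncopyable_function<void() noexcept>> _subscribers;
};

// Probe side (illustrative): recompute gauges whenever the STM applies a
// command, on every replica, not just the leader.
// stm.subscribe_to_state_change([this]() noexcept { update_probe(); });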

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Bug Fixes

  • Make the redpanda_cloud_storage_cloud_log_size metric consistent across all replicas. Previously it was updated only infrequently, and only from the leader replica, which led to inconsistent or stale values.

nvartolomei force-pushed the nv/CORE-8326-cloud-log-size-drift branch from 734375b to 6e467d4 on November 22, 2024 at 15:39

nvartolomei force-pushed the nv/CORE-8326-cloud-log-size-drift branch from 6e467d4 to 74eb0dd on November 22, 2024 at 16:04
@vbotbuildovich (Collaborator) commented:

The below tests from https://buildkite.com/redpanda/redpanda/builds/58575#0193549f-2f41-4bc4-af2a-f496586d336e have failed and will be retried:

cloud_storage_rpfixture

@dotnwat (Member) left a comment


update_probe doesn't block, so I'm wondering why the metrics aren't computed on demand when requested. Is it very expensive? Other than that, I wonder if the subscription is overkill compared to a few more invocations of update_probe? I couldn't follow the code flow exactly, so maybe that wouldn't make sense...

// Ensure we are exception safe and won't leave the probe without
// a watcher in case of exceptions. Also ensures we won't crash calling
// the callback.
static_assert(noexcept(update_probe()));
A reviewer (Member) commented on the static_assert above:

nit: you could add this requirement to the parameter of subscribe_to_state_change?

@nvartolomei (Contributor, Author) replied:

It is there: subscribe_to_state_change(ss::noncopyable_function<void() noexcept> f), unless I got your comment wrong.

@nvartolomei (Contributor, Author) commented:

@dotnwat that's a great idea. I'm in favor of computing on demand. Metric scraping should also be less frequent than applying STM commands, so even better.
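
For reference, the on-demand variant would register the gauge with a callback that reads the manifest at scrape time rather than being pushed from update_probe. A minimal sketch using Seastar metrics; the cloud_log_size_source type and its accessor are made-up stand-ins for the STM/manifest:

#include <seastar/core/metrics.hh>
#include <seastar/core/metrics_registration.hh>

#include <cstdint>

namespace sm = seastar::metrics;

// Stand-in for a read-only view over the STM's manifest (assumption).
struct cloud_log_size_source {
    uint64_t cloud_log_size() const { return _size; }
    uint64_t _size{0};
};

// Sketch only: the gauge value is computed lazily when the metrics endpoint
// is scraped, so it always reflects the current replicated manifest state
// and cannot go stale between updates.
class archival_probe_sketch {
public:
    explicit archival_probe_sketch(const cloud_log_size_source& stm)
      : _stm(stm) {
        _metrics.add_group(
          "cloud_storage",
          {sm::make_gauge(
            "cloud_log_size",
            [this] { return _stm.cloud_log_size(); },
            sm::description(
              "Total size of the partition log in cloud storage, "
              "computed at scrape time"))});
    }

private:
    const cloud_log_size_source& _stm;
    sm::metric_groups _metrics;
};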

@Lazin (Contributor) commented Nov 26, 2024

@nvartolomei in order to do this you need to add the manifest to the probe as a dependency. I'm not against it, but object lifetimes might become tricky.

nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request Nov 27, 2024
We called update probe only from leaders and after exiting the upload
loop which led to inconsistent and stale metrics.

Fix this by introducing a subscription mechanism to the STM which
is the source of truth for the manifest state and must be consistent
across all replicas.

The first attempt was in
redpanda-data#24257 but the feedback
suggested that the approach in this commit is better.
@nvartolomei (Contributor, Author) commented:

@Lazin @dotnwat alternative in #24342. I don't see a lifetime issue: the probe is owned by the archival service, and the archival service can't exist without an STM, so everything should work nicely. 🤞
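
The lifetime argument, spelled out with toy types (names are made up, only the ownership ordering matters): members are constructed in declaration order and destroyed in reverse, so if the service declares the STM before the probe, the reference the probe holds can never dangle.

#include <cstdint>

// Toy types, not the real Redpanda classes.
struct manifest_stm_sketch {
    uint64_t cloud_log_size() const { return _size; }
    uint64_t _size{0};
};

struct log_size_probe_sketch {
    explicit log_size_probe_sketch(const manifest_stm_sketch& stm)
      : _stm(stm) {}
    uint64_t current() const { return _stm.cloud_log_size(); }
    const manifest_stm_sketch& _stm;
};

// The archival service owns both: _stm is constructed first and destroyed
// last, so the reference held by _probe is always valid.
struct archival_service_sketch {
    manifest_stm_sketch _stm;
    log_size_probe_sketch _probe{_stm};
};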

@dotnwat (Member) commented Nov 27, 2024

Took a look at the other PR too

vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this pull request Nov 28, 2024
(cherry picked from commit ab1dd53)
nvartolomei deleted the nv/CORE-8326-cloud-log-size-drift branch on November 28, 2024 at 16:28
nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request Nov 28, 2024
nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request Nov 28, 2024