-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
archival: consistent log size probes across replicas #24257
archival: consistent log size probes across replicas #24257
Conversation
734375b
to
6e467d4
Compare
We called update probe only from leaders and after exiting the upload loop which led to inconsistent and stale metrics. Fix this by introducing a subscription mechanism to the STM which is the source of truth for the manifest state and must be consistent across all replicas. Another option is to report metrics only from the leader :think:
6e467d4
to
74eb0dd
Compare
the below tests from https://buildkite.com/redpanda/redpanda/builds/58575#0193549f-2f41-4bc4-af2a-f496586d336e have failed and will be retried
|
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58575#019354e4-a507-4c8c-b133-8647ca1bde2a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update_probe doesn't block, so i'm wondering why the metrics aren't computed on demand when requested. is it very expensive? other than that i wonder if the subscription is overkill compared to a few more invocations of update_probe? i couldn't follow the code flow exactly, so maybe that wouldn't make sense...
// Ensure we are exception safe and won't leave the probe without | ||
// a watcher in case of exceptions. Also ensures we won't crash calling | ||
// the callback. | ||
static_assert(noexcept(update_probe())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you could add this requirement to parameter to subscribe_to_state_change
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is there subscribe_to_state_change(ss::noncopyable_function<void() noexcept> f)
unless I got your comment wrong
@dotnwat that's a great idea. I'm in favor of computing on demand. Metric scraping should be less frequent than applying stm commands too so even better. |
@nvartolomei in order to do this you need to add manifest to the probe as a dependency. I'm not against it but object lifetimes might become tricky. |
We called update probe only from leaders and after exiting the upload loop which led to inconsistent and stale metrics. Fix this by introducing a subscription mechanism to the STM which is the source of truth for the manifest state and must be consistent across all replicas. The first attempt was in redpanda-data#24257 but the feedback suggested that the approach in this commit is better.
Took a look at the other PR too |
We called update probe only from leaders and after exiting the upload loop which led to inconsistent and stale metrics. Fix this by introducing a subscription mechanism to the STM which is the source of truth for the manifest state and must be consistent across all replicas. The first attempt was in redpanda-data#24257 but the feedback suggested that the approach in this commit is better. (cherry picked from commit ab1dd53)
We called update probe only from leaders and after exiting the upload loop which led to inconsistent and stale metrics. Fix this by introducing a subscription mechanism to the STM which is the source of truth for the manifest state and must be consistent across all replicas. The first attempt was in redpanda-data#24257 but the feedback suggested that the approach in this commit is better. (cherry picked from commit ab1dd53)
We called update probe only from leaders and after exiting the upload loop which led to inconsistent and stale metrics. Fix this by introducing a subscription mechanism to the STM which is the source of truth for the manifest state and must be consistent across all replicas. The first attempt was in redpanda-data#24257 but the feedback suggested that the approach in this commit is better. (cherry picked from commit ab1dd53)
We called update probe only from leaders and after exiting the upload loop which led to inconsistent and stale metrics.
Fix this by introducing a subscription mechanism to the STM which is the source of truth for the manifest state and must be consistent across all replicas.
Another option is to report metrics only from the leader :think:
Backports Required
Release Notes
Bug Fixes
redpanda_cloud_storage_cloud_log_size
metric consistent across all replicas. We used to update it seldomly from the leader replica only which lead to inconsistent/stale values.