
archival: consistent log size probes across replicas #24257

Conversation

@nvartolomei (Contributor) commented Nov 22, 2024

We called update_probe only from leaders, and only after exiting the upload loop, which led to inconsistent and stale metrics.

Fix this by introducing a subscription mechanism to the STM, which is the source of truth for the manifest state and must be consistent across all replicas.

Another option is to report metrics only from the leader :think:
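
Roughly, the subscription approach looks like the sketch below. The subscribe_to_state_change signature is the one quoted later in the review thread; the subscriber list, the notify hook, and the probe-side lambda are illustrative assumptions, not the actual Redpanda implementation.

#include <seastar/util/noncopyable_function.hh>

#include <utility>
#include <vector>

namespace ss = seastar;

// Sketch only: the STM keeps a list of noexcept callbacks and invokes them
// after every applied command, so every replica recomputes the probe from
// the same replicated manifest state.
class archival_metadata_stm_sketch {
public:
    // The callback must be noexcept so a throwing subscriber cannot
    // break command application.
    void subscribe_to_state_change(ss::noncopyable_function<void() noexcept> f) {
        _subscribers.push_back(std::move(f));
    }

protected:
    // Would be called at the end of apply() (illustrative).
    void notify_state_change() noexcept {
        for (auto& f : _subscribers) {
            f();
        }
    }

private:
    std::vector<ss::noncopyable_function<void() noexcept>> _subscribers;
};

// Probe side (illustrative): recompute gauges whenever the STM applies a
// command, on every replica, not just the leader.
// stm.subscribe_to_state_change([this]() noexcept { update_probe(); });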

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Bug Fixes

  • Make the redpanda_cloud_storage_cloud_log_size metric consistent across all replicas. Previously it was updated only infrequently, and only from the leader replica, which led to inconsistent or stale values.

nvartolomei force-pushed the nv/CORE-8326-cloud-log-size-drift branch from 734375b to 6e467d4 on November 22, 2024 at 15:39

nvartolomei force-pushed the nv/CORE-8326-cloud-log-size-drift branch from 6e467d4 to 74eb0dd on November 22, 2024 at 16:04
@vbotbuildovich (Collaborator) commented:

The below tests from https://buildkite.com/redpanda/redpanda/builds/58575#0193549f-2f41-4bc4-af2a-f496586d336e have failed and will be retried:

cloud_storage_rpfixture

@dotnwat (Member) left a comment


update_probe doesn't block, so I'm wondering why the metrics aren't computed on demand when requested. Is it very expensive? Other than that, I wonder if the subscription is overkill compared to a few more invocations of update_probe? I couldn't follow the code flow exactly, so maybe that wouldn't make sense...

// Ensure we are exception safe and won't leave the probe without
// a watcher in case of exceptions. Also ensures we won't crash calling
// the callback.
static_assert(noexcept(update_probe()));
A reviewer (Member) commented on the static_assert above:

nit: you could add this requirement to the parameter of subscribe_to_state_change?

@nvartolomei (Contributor, Author) replied:

It is there: subscribe_to_state_change(ss::noncopyable_function<void() noexcept> f), unless I got your comment wrong.

@nvartolomei (Contributor, Author) commented:

@dotnwat that's a great idea. I'm in favor of computing on demand. Metric scraping should also be less frequent than applying STM commands, so even better.
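
For reference, the on-demand variant would register the gauge with a callback that reads the manifest at scrape time rather than being pushed from update_probe. A minimal sketch using Seastar metrics; the cloud_log_size_source type and its accessor are made-up stand-ins for the STM/manifest:

#include <seastar/core/metrics.hh>
#include <seastar/core/metrics_registration.hh>

#include <cstdint>

namespace sm = seastar::metrics;

// Stand-in for a read-only view over the STM's manifest (assumption).
struct cloud_log_size_source {
    uint64_t cloud_log_size() const { return _size; }
    uint64_t _size{0};
};

// Sketch only: the gauge value is computed lazily when the metrics endpoint
// is scraped, so it always reflects the current replicated manifest state
// and cannot go stale between updates.
class archival_probe_sketch {
public:
    explicit archival_probe_sketch(const cloud_log_size_source& stm)
      : _stm(stm) {
        _metrics.add_group(
          "cloud_storage",
          {sm::make_gauge(
            "cloud_log_size",
            [this] { return _stm.cloud_log_size(); },
            sm::description(
              "Total size of the partition log in cloud storage, "
              "computed at scrape time"))});
    }

private:
    const cloud_log_size_source& _stm;
    sm::metric_groups _metrics;
};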

@Lazin (Contributor) commented Nov 26, 2024

@nvartolomei in order to do this you need to add the manifest to the probe as a dependency. I'm not against it, but object lifetimes might become tricky.

nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request Nov 27, 2024
We called update probe only from leaders and after exiting the upload
loop which led to inconsistent and stale metrics.

Fix this by introducing a subscription mechanism to the STM which
is the source of truth for the manifest state and must be consistent
across all replicas.

The first attempt was in
redpanda-data#24257 but the feedback
suggested that the approach in this commit is better.
@nvartolomei (Contributor, Author) commented:

@Lazin @dotnwat alternative in #24342. I don't see a lifetime issue: the probe is owned by the archival service, and the archival service can't exist without an STM, so everything should work nicely. 🤞
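
The lifetime argument, spelled out with toy types (names are made up, only the ownership ordering matters): members are constructed in declaration order and destroyed in reverse, so if the service declares the STM before the probe, the reference the probe holds can never dangle.

#include <cstdint>

// Toy types, not the real Redpanda classes.
struct manifest_stm_sketch {
    uint64_t cloud_log_size() const { return _size; }
    uint64_t _size{0};
};

struct log_size_probe_sketch {
    explicit log_size_probe_sketch(const manifest_stm_sketch& stm)
      : _stm(stm) {}
    uint64_t current() const { return _stm.cloud_log_size(); }
    const manifest_stm_sketch& _stm;
};

// The archival service owns both: _stm is constructed first and destroyed
// last, so the reference held by _probe is always valid.
struct archival_service_sketch {
    manifest_stm_sketch _stm;
    log_size_probe_sketch _probe{_stm};
};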

@dotnwat (Member) commented Nov 27, 2024

Took a look at the other PR too

vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this pull request Nov 28, 2024
(cherry picked from commit ab1dd53)
nvartolomei deleted the nv/CORE-8326-cloud-log-size-drift branch on November 28, 2024 at 16:28
nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request Nov 28, 2024
nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request Nov 28, 2024