storcon: add metric for AZ scheduling violations #9949

Merged: 2 commits merged on Dec 2, 2024

Conversation

jcsp
Collaborator

@jcsp jcsp commented Nov 29, 2024

Problem

We can't easily tell how far the current placement of shards deviates from their AZ preferences. Running outside the preferred AZ can be a cause of performance issues, so for diagnosability it's important that we can easily tell when a significant number of shards aren't running in their preferred AZ.

Related: https://github.com/neondatabase/cloud/issues/15413

Summary of changes

  • In reconcile_all, count the shards that have an AZ preference but are scheduled into a different AZ, and publish the count as a Prometheus gauge.
  • Also calculate a statistic for how many shards wanted to reconcile but couldn't.

This is clearly a lazy calculation: reconcile_all only runs periodically. But that's okay: a shard being in the wrong AZ only matters if it stays that way for some period of time.
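
For readers who want a concrete picture of the counting, here is a minimal sketch. It is not the code in this PR: `ShardState`, `Gauge`, `ReconcileStats`, `collect_stats` and `publish` are hypothetical names, the gauge is a plain atomic standing in for a real Prometheus gauge, and the choice to skip shards that are not attached anywhere is an assumption.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Hypothetical, simplified view of one shard's scheduling state.
struct ShardState {
    /// AZ the shard would prefer to run in, if it has a preference.
    preferred_az: Option<String>,
    /// AZ of the node the shard is currently attached to, if any.
    attached_az: Option<String>,
    /// True if the shard wanted to reconcile on this pass but could not.
    reconcile_blocked: bool,
}

/// Stand-in for a Prometheus gauge (assumption: the real metric type differs).
struct Gauge(AtomicI64);

impl Gauge {
    const fn new() -> Self {
        Gauge(AtomicI64::new(0))
    }
    fn set(&self, value: i64) {
        self.0.store(value, Ordering::Relaxed);
    }
    fn get(&self) -> i64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Totals gathered during one periodic pass over all shards.
struct ReconcileStats {
    az_violations: i64,
    blocked_reconciles: i64,
}

/// Walk every shard once and tally both statistics.
fn collect_stats(shards: &[ShardState]) -> ReconcileStats {
    let mut stats = ReconcileStats {
        az_violations: 0,
        blocked_reconciles: 0,
    };
    for shard in shards {
        // Only shards with a preference *and* a current location can be in
        // violation (assumption: unattached shards are not counted).
        if let (Some(preferred), Some(actual)) = (&shard.preferred_az, &shard.attached_az) {
            if preferred != actual {
                stats.az_violations += 1;
            }
        }
        if shard.reconcile_blocked {
            stats.blocked_reconciles += 1;
        }
    }
    stats
}

/// Overwrite the gauges with the totals from the latest pass.
fn publish(stats: &ReconcileStats, az_gauge: &Gauge, blocked_gauge: &Gauge) {
    az_gauge.set(stats.az_violations);
    blocked_gauge.set(stats.blocked_reconciles);
}
```

Because the gauges are overwritten on every pass rather than incremented, they always reflect the most recent periodic scan, which matches the "lazy calculation" trade-off described above.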

@jcsp jcsp marked this pull request as ready for review November 29, 2024 18:23
@jcsp jcsp requested a review from a team as a code owner November 29, 2024 18:23
@jcsp jcsp requested a review from erikgrinaker November 29, 2024 18:23
@jcsp jcsp enabled auto-merge December 2, 2024 11:04

github-actions bot commented Dec 2, 2024

7018 tests run: 6710 passed, 0 failed, 308 skipped (full report)


Flaky tests (8), grouped by Postgres version: 17, 16, 15, 14

Code coverage* (full report)

  • functions: 30.4% (8270 of 27226 functions)
  • lines: 47.8% (65219 of 136541 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
2bebfe2 at 2024-12-02T11:36:50.612Z :recycle:

@jcsp jcsp added this pull request to the merge queue Dec 2, 2024
Merged via the queue into main with commit bd09369 Dec 2, 2024
80 checks passed
@jcsp jcsp deleted the jcsp/az-metrics branch December 2, 2024 11:51
awarus pushed a commit that referenced this pull request Dec 5, 2024