pageserver: throttling: per-tenant metrics + more metrics to help understand throttle queue depth #9077

Merged: 3 commits merged into main on Sep 20, 2024

@problame problame commented Sep 20, 2024

There is an ongoing incident where having deeper insights into the queue depth at the throttle might be helpful.

This PR

  • adds per-tenant counterparts to the existing two metrics
  • adds two new global & per-tenant metrics to track queue depth
  • adds queue depth to the periodic logging
  • moves the periodic logging to the ingest housekeeping loop, which runs at the same frequency as the compaction loop but isn't slowed down by a slow compaction

On a typical pageserver (PS), a per-timeline counter costs 4 MiB in /metrics.
So the toll we take here is quite manageable.

There's probably some CPU overhead from the additional global atomics; I haven't measured it. It won't be devastating, though, and we have the headroom.
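
To make the queue-depth idea concrete, here is a minimal sketch of how a global and a per-tenant depth counter could be kept in sync with atomics. It is not the code from this PR: `QueueDepthMetrics`, `QueueGuard`, `enter_queue`, and the string tenant IDs are hypothetical names chosen for illustration, and it uses only the Rust standard library rather than the pageserver's actual metrics types.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical queue-depth gauges: one global counter plus one per tenant.
#[derive(Default)]
pub struct QueueDepthMetrics {
    global: AtomicU64,
    per_tenant: Mutex<HashMap<String, Arc<AtomicU64>>>,
}

impl QueueDepthMetrics {
    /// Look up (or lazily create) the counter for one tenant.
    fn tenant_counter(&self, tenant_id: &str) -> Arc<AtomicU64> {
        let mut map = self.per_tenant.lock().unwrap();
        map.entry(tenant_id.to_string()).or_default().clone()
    }

    /// Call when a request starts waiting at the throttle.
    /// Dropping the returned guard marks the request as no longer queued.
    pub fn enter_queue<'a>(&'a self, tenant_id: &str) -> QueueGuard<'a> {
        let tenant = self.tenant_counter(tenant_id);
        self.global.fetch_add(1, Ordering::Relaxed);
        tenant.fetch_add(1, Ordering::Relaxed);
        QueueGuard { global: &self.global, tenant }
    }

    /// Number of requests currently queued across all tenants.
    pub fn global_depth(&self) -> u64 {
        self.global.load(Ordering::Relaxed)
    }

    /// Number of requests currently queued for one tenant (0 if unknown).
    pub fn tenant_depth(&self, tenant_id: &str) -> u64 {
        let map = self.per_tenant.lock().unwrap();
        map.get(tenant_id).map_or(0, |c| c.load(Ordering::Relaxed))
    }
}

/// RAII guard: decrements both counters when the request leaves the queue.
pub struct QueueGuard<'a> {
    global: &'a AtomicU64,
    tenant: Arc<AtomicU64>,
}

impl Drop for QueueGuard<'_> {
    fn drop(&mut self) {
        self.global.fetch_sub(1, Ordering::Relaxed);
        self.tenant.fetch_sub(1, Ordering::Relaxed);
    }
}
```

The guard-on-drop shape is a design choice of this sketch: it keeps the counters balanced even if a waiting request is cancelled. A periodic logger could then read `global_depth()` and `tenant_depth(..)` from a housekeeping loop without blocking the request path on anything beyond the short map lock.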

@problame problame requested a review from koivunej September 20, 2024 13:18
@problame problame requested a review from a team as a code owner September 20, 2024 13:18

github-actions bot commented Sep 20, 2024

5072 tests run: 4907 passed, 1 failed, 164 skipped (full report)


Failures on Postgres 17

  • test_storage_controller_heartbeats[failure4]: debug-x86-64
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_storage_controller_heartbeats[debug-pg17-failure4]"

Flaky tests (15): Postgres 17, Postgres 15, Postgres 14

Code coverage* (full report)

  • functions: 31.9% (7429 of 23312 functions)
  • lines: 49.9% (59857 of 120037 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
67b280f at 2024-09-20T17:00:33.896Z

@problame problame enabled auto-merge (squash) September 20, 2024 16:47
@problame problame merged commit ec5dce0 into main Sep 20, 2024
80 checks passed
@problame problame deleted the problame/throttling-improve-observability--part2 branch September 20, 2024 16:48
davidgomes pushed a commit that referenced this pull request Sep 21, 2024
problame added a commit that referenced this pull request Sep 24, 2024