pageserver: expose prometheus metrics for startup time #4893

Merged: 14 commits, Aug 8, 2023

jcsp
Collaborator

@jcsp jcsp commented Aug 3, 2023

Problem

Currently, finding out how long pageserver startup took requires inspecting logs.

Summary of changes

A pageserver_startup_duration_ms metric is added, with a phase label that distinguishes the different phases of startup.

These are broken down by phase, where the phases correspond to the existing wait points in the code:

  • Start of doing I/O
  • When tenant load is done
  • When initial size calculation is done
  • When background jobs start
  • Then "complete" when everything is done.
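For illustration only, here is a minimal std-only sketch of how a per-phase startup metric could be recorded and rendered in the Prometheus text exposition format. It is not the code from this PR: the `StartupDurations` type, its methods, the gauge type, and any phase label values other than "complete" are assumptions made for the example.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Hypothetical recorder (not the PR's type): stores, for each startup
/// phase, how long after process start that phase was reached.
pub struct StartupDurations {
    started_at: Instant,
    // BTreeMap keeps the rendered output in a deterministic order.
    phases: BTreeMap<&'static str, Duration>,
}

impl StartupDurations {
    pub fn new(started_at: Instant) -> Self {
        Self {
            started_at,
            phases: BTreeMap::new(),
        }
    }

    /// Record that `phase` was reached right now.
    pub fn record(&mut self, phase: &'static str) {
        let elapsed = self.started_at.elapsed();
        self.record_at(phase, elapsed);
    }

    /// Record `phase` with an explicit elapsed time (handy in tests).
    pub fn record_at(&mut self, phase: &'static str, elapsed: Duration) {
        self.phases.insert(phase, elapsed);
    }

    /// Render one sample per phase in Prometheus text exposition format.
    /// Treating the metric as a gauge is an assumption of this sketch.
    pub fn render(&self) -> String {
        let mut out = String::from("# TYPE pageserver_startup_duration_ms gauge\n");
        for (phase, elapsed) in &self.phases {
            out.push_str(&format!(
                "pageserver_startup_duration_ms{{phase=\"{}\"}} {}\n",
                phase,
                elapsed.as_millis()
            ));
        }
        out
    }
}
```

In this sketch, each wait point would call `record` with its own phase name once it is reached, and a scrape handler would return the output of `render`. A phase that has not been reached yet simply has no sample.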

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added the c/storage/pageserver (Component: storage: pageserver) and a/tech_debt (Area: related to tech debt) labels Aug 3, 2023
@jcsp jcsp force-pushed the jcsp/startup_metrics branch from 0941b21 to 6707d58 Compare August 3, 2023 16:34
@github-actions

github-actions bot commented Aug 3, 2023

1264 tests run: 1214 passed, 0 failed, 50 skipped (full report)


@koivunej
Member

koivunej commented Aug 4, 2023

We should add these metrics to ... list of global metrics somewhere in the Python regress suite. Those tests are quite a mess: we assert metrics in two different files and it is easy to hit failures, but a global metric is not bad. I will have to do the same for #4892.

Feel free to look into #4813 as well.

I'll link this to some observability epic.

@koivunej
Member

koivunej commented Aug 4, 2023

Adding a time-to-activate metric, as requested in #4083, might be a good next step in a separate PR.

@jcsp jcsp force-pushed the jcsp/startup_metrics branch from f83db0a to 31c0731 Compare August 4, 2023 16:58
libs/metrics/src/lib.rs (outdated, resolved)
Member

@koivunej koivunej left a comment


Follow-ups include more work on #4813; otherwise this is looking good now. We don't have cancellation safety for these metrics, but I cannot imagine a situation in which we'd need it, because we would then be restarting faster than the scraping delay.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
@jcsp jcsp marked this pull request as ready for review August 8, 2023 08:58
@jcsp jcsp requested a review from a team as a code owner August 8, 2023 08:58
@jcsp jcsp requested review from conradludgate and problame and removed request for a team August 8, 2023 08:58
@jcsp jcsp enabled auto-merge (squash) August 8, 2023 08:59
@jcsp jcsp merged commit 4dc6446 into main Aug 8, 2023
@jcsp jcsp deleted the jcsp/startup_metrics branch August 8, 2023 09:41
Labels
a/tech_debt (Area: related to tech debt)
c/storage/pageserver (Component: storage: pageserver)
2 participants