Metrics for tsdb startup time #5471

Closed
damnever opened this issue Jul 24, 2023 · 5 comments · Fixed by #5477

Comments

@damnever
Contributor

Is your feature request related to a problem? Please describe.
Add metrics that expose the TSDB startup time.

Describe the solution you'd like
We can reuse the TSDB-level metrics, such as prometheus_tsdb_data_replay_duration_seconds and prometheus_tsdb_snapshot_replay_error_total. Alternatively, we could introduce new Cortex-level metrics that are specific to each tenant.

@yeya24
Contributor

yeya24 commented Jul 24, 2023

I like this. I guess it is fine to make it per-tenant, since we already have other per-tenant metrics in the ingester anyway.

@damnever
Contributor Author

However, there is already a metric called cortex_ingester_tsdb_wal_replay_duration_seconds:

walReplayTime: promauto.With(registerer).NewHistogram(prometheus.HistogramOpts{
	Name:    "cortex_ingester_tsdb_wal_replay_duration_seconds",
	Help:    "The total time it takes to open and replay a TSDB WAL.",
	Buckets: prometheus.DefBuckets,
}),

@yeya24
Contributor

yeya24 commented Aug 14, 2023

The name cortex_ingester_tsdb_wal_replay_duration_seconds seems a little odd, since the metric measures the total time it takes to open a TSDB, which includes WAL replay plus other work.
Does it make sense to rename the metric?

@damnever
Contributor Author

damnever commented Aug 15, 2023

Perhaps we should deprecate cortex_ingester_tsdb_wal_replay_duration_seconds and replace it with cortex_ingester_tsdb_data_replay_duration_seconds. I personally do not find the percentile metric useful for identifying slow users, particularly when taking into account related context such as the number of series a user has.

@yeya24
Contributor

yeya24 commented Aug 17, 2023

I think I am OK with aligning with the TSDB metrics. Thoughts? @friedrichg @alanprot @alvinlin123
