Feature Request: change vtbackup_duration_by_phase to binary vtbackup_duration #12972
Labels
Component: Backup and Restore
Type: Enhancement
Logical improvement (somewhere between a bug and feature)
Type: Feature
Feature Description
After using vtbackup_duration_by_phase for a few weeks in production, I can confidently say that it is pretty awkward to use.

I recommend changing this metric to vtbackup_phase, a binary-valued gauge similar to K8s metrics like kube_pod_status_phase. Here's an example of what these metrics could look like:

At any given moment, only one phase would be active. In order to calculate how long a phase has been active, you could do something like this:
Where <interval> is the number of seconds between data points.

Use Case(s)
Some issues that would be resolved by the proposed change.
- vtbackup currently doesn't report that a phase is active. It only reports the phase duration once that phase completes. This means that there's no way to tell what phase vtbackup is currently in, unless you know enough about the internals of the program to infer the current state from other metrics and logs.
- If vtbackup exits before completing a phase, it won't report the time it spent in that phase.
- When vtbackup completes its final phase (TakeNewBackup), it exits pretty much right away. This means that there might only be a few seconds between vtbackup reporting that phase for the first time and vtbackup exiting, which might not be enough time for the metric collector (e.g. Prometheus) to collect that metric. This necessitates using something awkward like --keep-alive-timeout to keep vtbackup alive long enough for the collector to do at least one scrape.
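The timing problem in the last point can be illustrated with a small Go simulation. The timeline (phase starts at 40s, completes at 100s, process exits at 103s) and the 15s scrape interval are made-up numbers, and scrapeTimes and visibleScrapes are hypothetical helpers. The sketch only counts how many periodic scrapes land inside the window during which each kind of metric is exposed.

```go
package main

import "fmt"

// scrapeTimes returns the instants (in seconds) at which a collector polling
// every interval seconds, starting at offset, scrapes before end.
func scrapeTimes(offset, interval, end int) []int {
	var ts []int
	for t := offset; t < end; t += interval {
		ts = append(ts, t)
	}
	return ts
}

// visibleScrapes counts scrapes that land in [from, until), i.e. while a
// metric is exposed and before the process exits.
func visibleScrapes(scrapes []int, from, until int) int {
	n := 0
	for _, t := range scrapes {
		if t >= from && t < until {
			n++
		}
	}
	return n
}

func main() {
	// HYPOTHETICAL timeline: the final phase starts at t=40s, completes at
	// t=100s, and vtbackup exits at t=103s. Scrapes happen every 15s from t=0.
	const start, finish, exit = 40, 100, 103
	scrapes := scrapeTimes(0, 15, exit)

	// Duration-style metric: only exposed once the phase completes.
	fmt.Println("duration metric scrapes:", visibleScrapes(scrapes, finish, exit))
	// Binary gauge: reads 1 from the moment the phase starts.
	fmt.Println("binary gauge scrapes:", visibleScrapes(scrapes, start, exit))
}
```

With these numbers the scrapes fall at 0, 15, 30, 45, 60, 75 and 90 seconds, so none lands in the three-second window between completion and exit and the duration value is never collected, while the binary gauge is observed four times while the phase is running.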