Add proposal for Prometheus metrics coverage #77
Signed-off-by: terrytangyuan <terrytangyuan@gmail.com>
Thanks everyone for the comments! I've converted the lists to tables which include the metric name, type, and description. I also added a few additional metrics as suggested. Hopefully it's much clearer now. Please take another look.
| up | Gauge | Keep-alive check (maintained by Prometheus on its own through its `up` metric, detailed in the documentation [here](https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series)) |

Note that some of the above metrics are derived from the [cAdvisor](https://github.com/google/cadvisor) kubelet
integration, which reports to Prometheus through our prometheus-operator installation.
I want to make sure of the scope. This is outside the operator. By default, cAdvisor exposes these metrics and users can consume them on their own.
Yes, but I think it's good to document this here so we know that we don't need to report these metrics ourselves.
docs/prometheus-metrics.md
Outdated
| Metric Name | Metric Type | Description |
| ----------- | ----------- | ----------- |
| from_created_to_completed_job_duration_seconds_total | Counter | The duration between job created and job completed in seconds |
Minor: I am wondering whether we should change this to `job_duration_from_created_to_completed_seconds_total`. Another thought: it seems it would be good to use `complete` and `deleted` as labels, but a duration requires two of them, which would make it a little hard to query. I think adding labels to the metrics to distinguish them makes sense.
I am following the naming practice outlined here: https://prometheus.io/docs/practices/naming/. I prefer the current naming without label as it's more intuitive but we can certainly revisit/revise later.
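As a rough illustration of what `from_created_to_completed_job_duration_seconds_total` would accumulate, here is a minimal sketch in plain Python. It does not use the Prometheus client library; `DurationCounter`, `observe_job`, and the timestamps are hypothetical names and values chosen for illustration, not part of the proposal.

```python
from datetime import datetime, timezone


class DurationCounter:
    """Toy stand-in for a Prometheus Counter that accumulates seconds."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0.0

    def observe_job(self, created: datetime, completed: datetime) -> float:
        """Add one job's created-to-completed duration, in seconds."""
        seconds = (completed - created).total_seconds()
        if seconds < 0:
            # A counter must never decrease, so reject inverted timestamps.
            raise ValueError("completed must not be earlier than created")
        self.value += seconds
        return seconds


# Hypothetical job timestamps, 2 minutes 30 seconds apart.
counter = DurationCounter("from_created_to_completed_job_duration_seconds_total")
created = datetime(2020, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
completed = datetime(2020, 5, 1, 12, 2, 30, tzinfo=timezone.utc)
counter.observe_job(created, completed)  # adds 150.0 seconds to the total
```

Because the value is a running sum over all jobs, an average duration would have to be derived by dividing it by a job count such as `completed_jobs_total`.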
Besides the minor comments above, it looks good to me. Let's wait and see whether anyone else has feedback.
/assign @gaocegege @johnugeorge
@yeya24 Thanks! Great suggestions. I have updated the metric types in the doc.
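For readers less familiar with the two metric types used in this proposal, the difference can be sketched in plain Python. This is a toy model, not the Prometheus client library: the `Counter` and `Gauge` classes below are hypothetical and only mirror the semantics that a counter may only increase while a gauge may go up and down.

```python
class Counter:
    """Cumulative value that only ever increases."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Current value that can rise and fall over time."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


failed = Counter("failed_jobs_total")
pending = Gauge("pending_jobs_total")

pending.inc()  # first job is queued
pending.inc()  # second job is queued
pending.dec()  # one job leaves the pending state...
failed.inc()   # ...and fails, which is recorded cumulatively
```

After these calls the gauge reports one job currently pending, while the counter reports one failure in total and would keep that history even after the failed job is cleaned up.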
| completed_jobs_total | Counter | The total number of completed jobs |
| restarted_jobs_total | Counter | The total number of restarted jobs |
| pending_jobs_total | Gauge | The total number of pending jobs |
| failed_jobs_total | Counter | The total number of failed jobs |
@terrytangyuan Forgot to mention this one. Do you think it is more appropriate to make this a `Gauge` as well? Do you want to represent the history of failures or the currently failed jobs?
Can we list the metric labels in this doc as well? They are important and useful, too. For example, we could combine `pending jobs`, `running jobs`, and `failed jobs` into one metric, `job_status{status="pending/failed/running"}`. WDYT?
Let's keep it as it is for now so that the metrics stay consistent between those named in the past tense and those named in the present continuous tense. There are no labels yet, since it's hard to differentiate metrics with two different tenses and to choose different metric types for them.
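For reference, the single labeled metric suggested in this thread would look roughly as follows when scraped. This is a minimal sketch in plain Python that renders the Prometheus text exposition format by hand; `render_gauge`, the help text, and the sample values are hypothetical, and the sketch assumes label values that need no escaping.

```python
from typing import Mapping


def render_gauge(name: str, help_text: str, label: str,
                 values: Mapping[str, float]) -> str:
    """Render one gauge with a single label in the text exposition format.

    Assumes label values contain no quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for label_value, value in sorted(values.items()):
        lines.append(f'{name}{{{label}="{label_value}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical job counts per status.
output = render_gauge(
    "job_status",
    "Number of jobs in each status",
    "status",
    {"pending": 3, "running": 5, "failed": 1},
)
print(output)
# # HELP job_status Number of jobs in each status
# # TYPE job_status gauge
# job_status{status="failed"} 1
# job_status{status="pending"} 3
# job_status{status="running"} 5
```

A single metric name carries a single type, so in this form the failed count is a gauge of currently failed jobs rather than a cumulative counter, which is the trade-off raised in the reply above.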
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: terrytangyuan
This provides a detailed outline of the Prometheus metrics we plan to cover in the common operator. Related issue: #22.
Signed-off-by: terrytangyuan terrytangyuan@gmail.com