This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

Add proposal for Prometheus metrics coverage #77

Merged on May 1, 2020 (5 commits). This view shows the changes from 4 commits.
81 changes: 81 additions & 0 deletions docs/prometheus-metrics.md
@@ -0,0 +1,81 @@
# Prometheus Metrics Coverage

We plan to collect a rich set of metrics in kubeflow/common's `JobController` using [Prometheus](https://prometheus.io/).
The goal is to report generic metrics (e.g. metrics related to pods/jobs/services) during the lifecycle of `JobController` so that:

* Other operators built on top of it will automatically report Prometheus metrics without additional effort;
* It is easier for users of Kubeflow distributed training operators to monitor operator performance and behavior using a consistent set of metrics across the different distributed training operators.

This document outlines the list of Prometheus metrics we plan to cover in `JobController`. We follow the metric naming convention
outlined [here](https://prometheus.io/docs/practices/naming/) and the metric types supported by Prometheus [here](https://prometheus.io/docs/concepts/metric_types/).
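
As a minimal sketch of how the naming convention could be checked mechanically, the helper below validates a metric name against the character set Prometheus allows and the `_total` suffix convention for counters. The function name `check_metric_name` is hypothetical and not part of kubeflow/common; the check covers only these two rules, not the whole naming guide.

```python
import re

# Character set Prometheus allows in metric names (per its data model docs).
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def check_metric_name(name, metric_type):
    """Return a list of naming problems; an empty list means none were found.

    Hypothetical helper for illustration only.
    """
    problems = []
    if not METRIC_NAME_RE.fullmatch(name):
        problems.append("name contains characters Prometheus does not allow")
    # Naming-guide convention: accumulating counts end in `_total`.
    if metric_type == "Counter" and not name.endswith("_total"):
        problems.append("counter name should end with _total")
    return problems
```

For example, `check_metric_name("created_pods_total", "Counter")` returns an empty list, while a name containing a hyphen is flagged.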

## Pod Metrics

The following metrics related to the lifecycle of pods will be reported:

| Metric Name | Metric Type | Description |
| ----------- | ------------| ----------- |
| created_pods_total | Counter | The total number of created pods |
| restarted_pods_total | Counter | The total number of restarted pods |
| deleted_pods_total | Counter | The total number of deleted pods |
| failed_pods_total | Counter | The total number of failed pods |
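
The pod counters above could be sketched as follows. This is a library-free illustration, assuming a hypothetical `on_pod_event` hook that the controller would call; it is not the `JobController` implementation. The `expose` method renders the Prometheus text exposition format.

```python
class Counter:
    """A monotonically increasing value, mirroring the Prometheus Counter type."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

    def expose(self):
        """Render the counter in the Prometheus text exposition format."""
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {self.value}\n"
        )


# Names and descriptions are taken from the table above.
POD_COUNTERS = {
    "created": Counter("created_pods_total", "The total number of created pods"),
    "restarted": Counter("restarted_pods_total", "The total number of restarted pods"),
    "deleted": Counter("deleted_pods_total", "The total number of deleted pods"),
    "failed": Counter("failed_pods_total", "The total number of failed pods"),
}


def on_pod_event(event):
    """Hypothetical hook: bump the counter matching a pod lifecycle event."""
    POD_COUNTERS[event].inc()
```

A real operator would more likely register these through a Prometheus client library; the sketch only shows the shape of the data.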

The following metrics will be reported on each pod:

| Metric Name | Metric Type | Description |
| ----------- | ------------| ----------- |
| container_cpu_usage_seconds_total | Counter | Cumulative CPU time consumed in seconds |
| container_accelerator_memory_used_bytes | Gauge | Total accelerator memory allocated |
| container_memory_usage_bytes | Gauge | Current memory usage in bytes, including all memory regardless of when it was accessed |
| container_network_transmit_bytes_total | Counter | Cumulative count of bytes transmitted |
| container_fs_usage_bytes | Gauge | Number of bytes that are consumed by the container on this filesystem |
| up | Gauge | Keep-alive check (maintained by Prometheus itself through its `up` metric, documented [here](https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series)) |

Note that some of the above metrics are derived from [cAdvisor](https://github.com/google/cadvisor) kubelet
integration which reports to Prometheus through our prometheus-operator installation.

Comment (Member): I want to confirm the scope. This is outside the operator: by default, cAdvisor exposes these metrics and users can consume them on their own.

Comment (Member Author): Yes, but I think it's good to document this here so we know that we don't need to report these metrics ourselves.
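
Cumulative counters such as `container_cpu_usage_seconds_total` are usually consumed as a rate rather than as a raw value. The sketch below is a simplified, hypothetical version of that calculation over `(timestamp_seconds, value)` samples; it treats a drop in value as a counter reset, but it does not reproduce the extrapolation that PromQL's `rate()` performs.

```python
def per_second_rate(samples):
    """Average per-second increase over (timestamp_seconds, value) samples.

    A drop in value is treated as a counter reset (for example, after a
    container restart), so the post-reset value counts as new increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return increase / elapsed
```

With CPU-seconds samples, the result is the average number of cores in use over the window.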


## Job Metrics

The following metrics related to the lifecycle of jobs will be reported:

| Metric Name | Metric Type | Description |
| ----------- | ------------| ----------- |
| created_jobs_total | Counter | The total number of created jobs |
| deleted_jobs_total | Counter | The total number of deleted jobs |
| completed_jobs_total | Counter | The total number of completed jobs |
| restarted_jobs_total | Counter | The total number of restarted jobs |
| pending_jobs_total | Counter | The total number of pending jobs |
| failed_jobs_total | Counter | The total number of failed jobs |

Comment: @terrytangyuan I forgot to mention this one. Do you think it would be more appropriate to make this a Gauge as well? Do you want to represent historical failures or the currently failed jobs?

Can we list the metric labels in this doc as well? They are important and useful, too. For example, we could combine pending, running, and failed jobs into one metric, job_status{status="pending/failed/running"}. WDYT?

Comment (Member Author): Let's keep it as is for now so that the metrics stay consistent between those named in the past tense and those named in the present continuous tense. There are no labels yet because it's hard to differentiate metrics with two different tenses and to choose different metric types for them.

| running_jobs_total | Counter | The total number of running jobs |
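
The review thread above turns on the difference between Counter and Gauge semantics. The sketch below contrasts the two for jobs that start running and then fail. The `JobStats` class and the `running_jobs_current` field are hypothetical names used only for illustration; the proposal itself keeps every job metric as a Counter.

```python
class JobStats:
    """Hypothetical tracker contrasting Counter and Gauge semantics."""

    def __init__(self):
        # Counter semantics: only ever increases, so it records history.
        self.running_jobs_total = 0
        self.failed_jobs_total = 0
        # Gauge semantics: moves up and down, so it records the current state.
        self.running_jobs_current = 0

    def job_started_running(self):
        self.running_jobs_total += 1
        self.running_jobs_current += 1

    def job_failed(self):
        """Record that a job which was running has failed."""
        self.failed_jobs_total += 1
        self.running_jobs_current -= 1
```

After two jobs start and one fails, the counter reports two jobs that have ever run, while the gauge reports one job running now.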

The following metrics related to the duration between various job phases will be reported:

| Metric Name | Metric Type | Description |
| ----------- | ------------| ----------- |
| from_created_to_completed_job_duration_seconds_total | Counter | The duration between job created and job completed in seconds |

Comment (Member): Minor: I wonder whether we should rename this to `job_duration_from_created_to_completed_seconds_total`. Also, it seems it would be good to use phases such as completed and deleted as labels, but a duration involves two phases, which would make it a little hard to query. I think adding labels to the metrics to distinguish them makes sense.

Comment (Member Author): I am following the naming practice outlined here: https://prometheus.io/docs/practices/naming/. I prefer the current naming without labels because it's more intuitive, but we can certainly revisit and revise it later.

| from_completed_to_deleted_job_duration_seconds_total | Counter | The duration between job completed and job deleted in seconds |
| from_failed_to_restarted_job_duration_seconds_total | Counter | The duration between job failed and job restarted in seconds |
| from_pending_to_running_job_duration_seconds_total | Counter | The duration between job pending and job running in seconds |
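
A duration counter of this kind accumulates the seconds elapsed between two phase timestamps. The sketch below assumes the controller can supply both timestamps as `datetime` values; the `DurationCounter` class is hypothetical and stands in for whatever the implementation ends up using.

```python
from datetime import datetime, timedelta


class DurationCounter:
    """Accumulates seconds spent between two job phases (Counter semantics)."""

    def __init__(self, name):
        self.name = name
        self.seconds_total = 0.0

    def observe(self, start, end):
        """Add the time between two phase timestamps to the running total."""
        seconds = (end - start).total_seconds()
        if seconds < 0:
            raise ValueError("end must not precede start")
        self.seconds_total += seconds


# Metric name taken from the table above.
created_to_completed = DurationCounter(
    "from_created_to_completed_job_duration_seconds_total"
)
```

Because the value is a running sum, dividing it by `completed_jobs_total` gives a mean created-to-completed duration.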

## Service Metrics

The following metrics related to the lifecycle of services will be reported:

| Metric Name | Metric Type | Description |
| ----------- | ------------| ----------- |
| succeeded_service_creations_total | Counter | The total number of succeeded service creations |
| failed_service_creations_total | Counter | The total number of failed service creations |
| restarted_service_creations_total | Counter | The total number of restarted service creations |
| service_patches_total | Counter | The total number of service patches |
| deleted_services_total | Counter | The total number of deleted services |
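
The succeeded and failed creation counters can be maintained by wrapping the call that creates a service. The sketch below assumes a hypothetical `create_fn` callable that raises on failure; it tallies outcomes in a plain dictionary-like counter rather than a Prometheus client object.

```python
from collections import Counter as Tally

# Outcome tallies keyed by the metric names from the table above.
service_counts = Tally()


def create_service_instrumented(create_fn, *args, **kwargs):
    """Call create_fn and record the outcome under the matching metric name.

    `create_fn` is a stand-in for the real service-creation call.
    """
    try:
        result = create_fn(*args, **kwargs)
    except Exception:
        service_counts["failed_service_creations_total"] += 1
        raise  # Propagate the error after counting it.
    service_counts["succeeded_service_creations_total"] += 1
    return result
```

Re-raising keeps the wrapper transparent to callers, so existing error handling continues to work unchanged.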

## Scheduling Metrics

The following metrics related to scheduling will be reported:

| Metric Name | Metric Type | Description |
| ----------- | ------------| ----------- |
| created_pod_disruption_policies_total | Counter | The total number of created pod disruption policies |
| deleted_pod_disruption_policies_total | Counter | The total number of deleted pod disruption policies |
| created_pod_groups_total | Counter | The total number of created pod groups |
| deleted_pod_groups_total | Counter | The total number of deleted pod groups |