From cf4c0b1a23223259fa4dea0441287519971a5432 Mon Sep 17 00:00:00 2001
From: terrytangyuan
Date: Thu, 23 Apr 2020 16:18:31 -0400
Subject: [PATCH 1/5] Add proposal for Prometheus metrics coverage

Signed-off-by: terrytangyuan
---
 docs/prometheus-metrics.md | 58 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)
 create mode 100644 docs/prometheus-metrics.md

diff --git a/docs/prometheus-metrics.md b/docs/prometheus-metrics.md
new file mode 100644
index 00000000..f4c0abc1
--- /dev/null
+++ b/docs/prometheus-metrics.md
@@ -0,0 +1,58 @@
+# Prometheus Metrics Coverage
+
+We plan to collect a rich set of metrics in kubeflow/common's `JobController` using [Prometheus](https://prometheus.io/).
+The goal is to report generic metrics (e.g. metrics related to pods/jobs/services) during the lifecycle of `JobController` so that:
+
+* Other operators built on top of it will automatically report Prometheus metrics without additional efforts;
+* It is easier for users of Kubeflow distributed training operators to monitor operator performance and behaviors using consistent set of metrics for different distributed training operators.
+
+This document outlines the list of Prometheus metrics we plan to cover in `JobController`.
+
+## Pod Metrics
+
+The following metrics related to the lifecycle of pods will be reported:
+
+* The total number of created pods
+* The total number of restarted pods
+* The total number of deleted pods
+* The total number of failed pods
+
+The following metrics will be reported on each pod:
+
+* CPU usage
+* GPU usage
+* Memory usage
+* Network usage
+* I/O usage
+* Keep-Alive check
+* Is-leader check
+
+## Job Metrics
+
+The following metrics related to the lifecycle of jobs will be reported:
+
+* The total number of created jobs
+* The total number of deleted jobs
+* The total number of completed jobs
+* The total number of restarted jobs
+* The total number of pending jobs
+* The total number of failed jobs
+
+## Service Metrics
+
+The following metrics related to the lifecycle of services will be reported:
+
+* The total number of succeeded service creations
+* The total number of failed service creations
+* The total number of restarted service creations
+* The total number of service patches
+* The total number of deleted services
+
+## Scheduling Metrics
+
+The following metrics related to scheduling will be reported:
+
+* The total number of created pod disruption policies
+* The total number of deleted pod disruption policies
+* The total number of created pod groups
+* The total number of deleted pod groups

From 3b2daaa05fe5631716f471576a9eec4ac555605b Mon Sep 17 00:00:00 2001
From: terrytangyuan
Date: Mon, 27 Apr 2020 14:23:21 -0400
Subject: [PATCH 2/5] Convert to table and add metric names

Signed-off-by: terrytangyuan
---
 docs/prometheus-metrics.md | 68 +++++++++++++++++++++++---------------
 1 file changed, 41 insertions(+), 27 deletions(-)

diff --git a/docs/prometheus-metrics.md b/docs/prometheus-metrics.md
index f4c0abc1..cba135ab 100644
--- a/docs/prometheus-metrics.md
+++ b/docs/prometheus-metrics.md
@@ -6,53 +6,67 @@ The goal is to report generic metrics (e.g. metrics related to pods/jobs/service
 * Other operators built on top of it will automatically report Prometheus metrics without additional efforts;
 * It is easier for users of Kubeflow distributed training operators to monitor operator performance and behaviors using consistent set of metrics for different distributed training operators.
 
-This document outlines the list of Prometheus metrics we plan to cover in `JobController`.
+This document outlines the list of Prometheus metrics we plan to cover in `JobController`. We follow the metric naming convention
+outlined [here](https://prometheus.io/docs/practices/naming/).
 
 ## Pod Metrics
 
 The following metrics related to the lifecycle of pods will be reported:
 
-* The total number of created pods
-* The total number of restarted pods
-* The total number of deleted pods
-* The total number of failed pods
+| Metric Name | Description |
+| ------------ | ------- |
+| created_pods_total | The total number of created pods |
+| restarted_pods_total | The total number of restarted pods |
+| deleted_pods_total | The total number of deleted pods |
+| failed_pods_total | The total number of failed pods |
 
 The following metrics will be reported on each pod:
 
-* CPU usage
-* GPU usage
-* Memory usage
-* Network usage
-* I/O usage
-* Keep-Alive check
-* Is-leader check
+| Metric Name | Description |
+| ------------ | ------- |
+| container_cpu_usage_seconds_total | CPU usage |
+| container_accelerator_memory_used_bytes | GPU usage |
+| container_memory_usage_bytes | Memory usage |
+| container_network_transmit_bytes_total | Network usage |
+| container_fs_write_seconds_total | I/O usage |
+| up | Keep-Alive check (maintained by Prometheus on its own with its `up` metric detailed in the documentation [here](https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series))) |
+| common_operator_is_leader | Whether this client is the leader of this common operator client set |
+
+Note that some of the above metrics are derived from [cAdvisor](https://github.com/google/cadvisor) kubelet
+integration which reports to Prometheus through our prometheus-operator installation.
 
 ## Job Metrics
 
 The following metrics related to the lifecycle of jobs will be reported:
 
-* The total number of created jobs
-* The total number of deleted jobs
-* The total number of completed jobs
-* The total number of restarted jobs
-* The total number of pending jobs
-* The total number of failed jobs
+| Metric Name | Description |
+| ------------ | ------- |
+| created_jobs_total | The total number of created jobs |
+| deleted_jobs_total | The total number of deleted jobs |
+| completed_jobs_total | The total number of completed jobs |
+| restarted_jobs_total | The total number of restarted jobs |
+| pending_jobs_total | The total number of pending jobs |
+| failed_jobs_total | The total number of failed jobs |
 
 ## Service Metrics
 
 The following metrics related to the lifecycle of services will be reported:
 
-* The total number of succeeded service creations
-* The total number of failed service creations
-* The total number of restarted service creations
-* The total number of service patches
-* The total number of deleted services
+| Metric Name | Description |
+| ------------ | ------- |
+| succeeded_service_creations_total | The total number of succeeded service creations |
+| failed_service_creations_total | The total number of failed service creations |
+| restarted_service_creations_total | The total number of restarted service creations |
+| service_patches_total | The total number of service patches |
+| deleted_services_total | The total number of deleted services |
 
 ## Scheduling Metrics
 
 The following metrics related to scheduling will be reported:
 
-* The total number of created pod disruption policies
-* The total number of deleted pod disruption policies
-* The total number of created pod groups
-* The total number of deleted pod groups
+| Metric Name | Description |
+| ------------ | ------- |
+| created_pod_disruption_policies_total | The total number of created pod disruption policies |
+| deleted_pod_disruption_policies_total | The total number of deleted pod disruption policies |
+| created_pod_groups_total | The total number of created pod groups |
+| deleted_pod_groups_total | The total number of deleted pod groups |

From 4893e7aaff43d9cca66897f7f893fb21bfce2819 Mon Sep 17 00:00:00 2001
From: terrytangyuan
Date: Mon, 27 Apr 2020 14:42:23 -0400
Subject: [PATCH 3/5] Add metric types

Signed-off-by: terrytangyuan
---
 docs/prometheus-metrics.md | 84 +++++++++++++++++++++-----------------
 1 file changed, 47 insertions(+), 37 deletions(-)

diff --git a/docs/prometheus-metrics.md b/docs/prometheus-metrics.md
index cba135ab..4392d602 100644
--- a/docs/prometheus-metrics.md
+++ b/docs/prometheus-metrics.md
@@ -7,30 +7,30 @@ The goal is to report generic metrics (e.g. metrics related to pods/jobs/service
 * It is easier for users of Kubeflow distributed training operators to monitor operator performance and behaviors using consistent set of metrics for different distributed training operators.
 
 This document outlines the list of Prometheus metrics we plan to cover in `JobController`. We follow the metric naming convention
-outlined [here](https://prometheus.io/docs/practices/naming/).
+outlined [here](https://prometheus.io/docs/practices/naming/) and the metric types supported by Prometheus [here](https://prometheus.io/docs/concepts/metric_types/).
 
 ## Pod Metrics
 
 The following metrics related to the lifecycle of pods will be reported:
 
-| Metric Name | Description |
-| ------------ | ------- |
-| created_pods_total | The total number of created pods |
-| restarted_pods_total | The total number of restarted pods |
-| deleted_pods_total | The total number of deleted pods |
-| failed_pods_total | The total number of failed pods |
+| Metric Name | Metric Type | Description |
+| ----------- | ------------| ----------- |
+| created_pods_total | Counter | The total number of created pods |
+| restarted_pods_total | Counter | The total number of restarted pods |
+| deleted_pods_total | Counter | The total number of deleted pods |
+| failed_pods_total | Counter | The total number of failed pods |
 
 The following metrics will be reported on each pod:
 
-| Metric Name | Description |
-| ------------ | ------- |
-| container_cpu_usage_seconds_total | CPU usage |
-| container_accelerator_memory_used_bytes | GPU usage |
-| container_memory_usage_bytes | Memory usage |
-| container_network_transmit_bytes_total | Network usage |
-| container_fs_write_seconds_total | I/O usage |
-| up | Keep-Alive check (maintained by Prometheus on its own with its `up` metric detailed in the documentation [here](https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series))) |
-| common_operator_is_leader | Whether this client is the leader of this common operator client set |
+| Metric Name | Metric Type | Description |
+| ----------- | ------------| ----------- |
+| container_cpu_usage_seconds_total | Counter | Cumulative cpu time consumed in seconds |
+| container_accelerator_memory_used_bytes | Gauge | Total accelerator memory allocated |
+| container_memory_usage_bytes | Gauge | Current memory usage in bytes, including all memory regardless of when it was accessed |
+| container_network_transmit_bytes_total | Counter | Cumulative count of bytes transmitted |
+| container_fs_usage_bytes | Gauge | Number of bytes that are consumed by the container on this filesystem |
+| up | Gauge | Keep-Alive check (maintained by Prometheus on its own with its `up` metric detailed in the documentation [here](https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series))) |
+| common_operator_is_leader | Gauge | Whether this client is the leader of this common operator client set |
 
 Note that some of the above metrics are derived from [cAdvisor](https://github.com/google/cadvisor) kubelet
 integration which reports to Prometheus through our prometheus-operator installation.
@@ -39,34 +39,44 @@ integration which reports to Prometheus through our prometheus-operator installa
 
 The following metrics related to the lifecycle of jobs will be reported:
 
-| Metric Name | Description |
-| ------------ | ------- |
-| created_jobs_total | The total number of created jobs |
-| deleted_jobs_total | The total number of deleted jobs |
-| completed_jobs_total | The total number of completed jobs |
-| restarted_jobs_total | The total number of restarted jobs |
-| pending_jobs_total | The total number of pending jobs |
-| failed_jobs_total | The total number of failed jobs |
+| Metric Name | Metric Type | Description |
+| ----------- | ------------| ----------- |
+| created_jobs_total | Counter | The total number of created jobs |
+| deleted_jobs_total | Counter | The total number of deleted jobs |
+| completed_jobs_total | Counter | The total number of completed jobs |
+| restarted_jobs_total | Counter | The total number of restarted jobs |
+| pending_jobs_total | Counter | The total number of pending jobs |
+| failed_jobs_total | Counter | The total number of failed jobs |
+| running_jobs_total | Counter | The total number of running jobs |
+
+The following metrics related to the duration among various job phases will be reported:
+
+| Metric Name | Metric Type | Description |
+| ----------- | ------------| ----------- |
+| from_created_to_completed_job_duration_seconds_total | Counter | The duration between job created and job completed in seconds |
+| from_completed_to_deleted_job_duration_seconds_total | Counter | The duration between job completed and job deleted in seconds |
+| from_failed_to_restarted_job_duration_seconds_total | Counter | The duration between job failed and job restarted in seconds |
+| from_pending_to_running_job_duration_seconds_total | Counter | The duration between job pending and job running in seconds |
 
 ## Service Metrics
 
 The following metrics related to the lifecycle of services will be reported:
 
-| Metric Name | Description |
-| ------------ | ------- |
-| succeeded_service_creations_total | The total number of succeeded service creations |
-| failed_service_creations_total | The total number of failed service creations |
-| restarted_service_creations_total | The total number of restarted service creations |
-| service_patches_total | The total number of service patches |
-| deleted_services_total | The total number of deleted services |
+| Metric Name | Metric Type | Description |
+| ----------- | ------------| ----------- |
+| succeeded_service_creations_total | Counter | The total number of succeeded service creations |
+| failed_service_creations_total | Counter | The total number of failed service creations |
+| restarted_service_creations_total | Counter | The total number of restarted service creations |
+| service_patches_total | Counter | The total number of service patches |
+| deleted_services_total | Counter | The total number of deleted services |
 
 ## Scheduling Metrics
 
 The following metrics related to scheduling will be reported:
 
-| Metric Name | Description |
-| ------------ | ------- |
-| created_pod_disruption_policies_total | The total number of created pod disruption policies |
-| deleted_pod_disruption_policies_total | The total number of deleted pod disruption policies |
-| created_pod_groups_total | The total number of created pod groups |
-| deleted_pod_groups_total | The total number of deleted pod groups |
+| Metric Name | Metric Type | Description |
+| ----------- | ------------| ----------- |
+| created_pod_disruption_policies_total | Counter | The total number of created pod disruption policies |
+| deleted_pod_disruption_policies_total | Counter | The total number of deleted pod disruption policies |
+| created_pod_groups_total | Counter | The total number of created pod groups |
+| deleted_pod_groups_total | Counter | The total number of deleted pod groups |

From 49a0e8aa5719b1507231859b16ddf607027fe8b1 Mon Sep 17 00:00:00 2001
From: Yuan Tang
Date: Wed, 29 Apr 2020 08:25:37 -0400
Subject: [PATCH 4/5] Remove common_operator_is_leader

---
 docs/prometheus-metrics.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/prometheus-metrics.md b/docs/prometheus-metrics.md
index 4392d602..67f9be53 100644
--- a/docs/prometheus-metrics.md
+++ b/docs/prometheus-metrics.md
@@ -30,7 +30,6 @@ The following metrics will be reported on each pod:
 | container_network_transmit_bytes_total | Counter | Cumulative count of bytes transmitted |
 | container_fs_usage_bytes | Gauge | Number of bytes that are consumed by the container on this filesystem |
 | up | Gauge | Keep-Alive check (maintained by Prometheus on its own with its `up` metric detailed in the documentation [here](https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series))) |
-| common_operator_is_leader | Gauge | Whether this client is the leader of this common operator client set |
 
 Note that some of the above metrics are derived from [cAdvisor](https://github.com/google/cadvisor) kubelet
 integration which reports to Prometheus through our prometheus-operator installation.

From 1aa73e7566d5f5fb997d8acd5718f1a88264ac5b Mon Sep 17 00:00:00 2001
From: terrytangyuan
Date: Thu, 30 Apr 2020 15:58:52 -0400
Subject: [PATCH 5/5] Address comments

Signed-off-by: terrytangyuan
---
 docs/prometheus-metrics.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/prometheus-metrics.md b/docs/prometheus-metrics.md
index 67f9be53..2f3614b3 100644
--- a/docs/prometheus-metrics.md
+++ b/docs/prometheus-metrics.md
@@ -44,18 +44,18 @@ The following metrics related to the lifecycle of jobs will be reported:
 | deleted_jobs_total | Counter | The total number of deleted jobs |
 | completed_jobs_total | Counter | The total number of completed jobs |
 | restarted_jobs_total | Counter | The total number of restarted jobs |
-| pending_jobs_total | Counter | The total number of pending jobs |
+| pending_jobs_total | Gauge | The total number of pending jobs |
 | failed_jobs_total | Counter | The total number of failed jobs |
-| running_jobs_total | Counter | The total number of running jobs |
+| running_jobs_total | Gauge | The total number of running jobs |
 
 The following metrics related to the duration among various job phases will be reported:
 
 | Metric Name | Metric Type | Description |
 | ----------- | ------------| ----------- |
-| from_created_to_completed_job_duration_seconds_total | Counter | The duration between job created and job completed in seconds |
-| from_completed_to_deleted_job_duration_seconds_total | Counter | The duration between job completed and job deleted in seconds |
-| from_failed_to_restarted_job_duration_seconds_total | Counter | The duration between job failed and job restarted in seconds |
-| from_pending_to_running_job_duration_seconds_total | Counter | The duration between job pending and job running in seconds |
+| from_created_to_completed_job_duration_seconds_total | Histogram | The duration between job created and job completed in seconds |
+| from_completed_to_deleted_job_duration_seconds_total | Histogram | The duration between job completed and job deleted in seconds |
+| from_failed_to_restarted_job_duration_seconds_total | Histogram | The duration between job failed and job restarted in seconds |
+| from_pending_to_running_job_duration_seconds_total | Histogram | The duration between job pending and job running in seconds |
 
 ## Service Metrics
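
The proposal states that metric names follow the Prometheus naming convention. As a minimal sketch of how that could be checked mechanically, the Go program below applies two simplified rules to a few names from the tables: the classic metric-name character set, and the `_total` suffix being used for counters. `MetricType`, `CheckMetric`, and the rule set are hypothetical names introduced for illustration; they are not part of kubeflow/common or of the Prometheus client library, and the two rules are only an approximation of the full convention.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// MetricType mirrors the "Metric Type" column used in the proposal's tables.
// It is a hypothetical helper type for this sketch only.
type MetricType string

const (
	Counter   MetricType = "Counter"
	Gauge     MetricType = "Gauge"
	Histogram MetricType = "Histogram"
)

// validName is the classic Prometheus metric-name character set.
var validName = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)

// CheckMetric returns the naming problems found for one metric under two
// simplified rules (an assumption of this sketch, not the full convention).
func CheckMetric(name string, typ MetricType) []string {
	var problems []string
	if !validName.MatchString(name) {
		problems = append(problems, "name has characters outside [a-zA-Z0-9_:]")
	}
	hasTotal := strings.HasSuffix(name, "_total")
	if typ == Counter && !hasTotal {
		problems = append(problems, "counter name should end in _total")
	}
	if typ != Counter && hasTotal {
		problems = append(problems, "_total suffix is conventionally used for counters")
	}
	return problems
}

func main() {
	// A few rows taken from the tables as they stand after PATCH 5/5.
	metrics := []struct {
		name string
		typ  MetricType
	}{
		{"created_pods_total", Counter},
		{"container_memory_usage_bytes", Gauge},
		{"pending_jobs_total", Gauge},
		{"from_pending_to_running_job_duration_seconds_total", Histogram},
	}
	for _, m := range metrics {
		fmt.Printf("%s (%s): %v\n", m.name, m.typ, CheckMetric(m.name, m.typ))
	}
}
```

Under these two rules the counter names in the tables pass, while gauge and histogram names that end in `_total` (for example `pending_jobs_total` after PATCH 5/5) are reported. Whether such names should change is a decision for the proposal's authors.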