From 49f42651af58c4051d102aec5961d23d97774b60 Mon Sep 17 00:00:00 2001 From: Varsha Prasad Narsing Date: Tue, 2 Jul 2024 06:10:08 -0700 Subject: [PATCH 1/2] [Feature] Enable prometheus metrics for local queues This PR introduces an enhancement to enable collection of prometheus metrics for local queues. Addresses issue: https://github.com/kubernetes-sigs/kueue/issues/1833 Signed-off-by: Varsha Prasad Narsing --- keps/1833-metrics-for-local-queue/README.md | 165 ++++++++++++++++++++ keps/1833-metrics-for-local-queue/kep.yaml | 21 +++ 2 files changed, 186 insertions(+) create mode 100644 keps/1833-metrics-for-local-queue/README.md create mode 100644 keps/1833-metrics-for-local-queue/kep.yaml diff --git a/keps/1833-metrics-for-local-queue/README.md b/keps/1833-metrics-for-local-queue/README.md new file mode 100644 index 0000000000..d10fa5e8d7 --- /dev/null +++ b/keps/1833-metrics-for-local-queue/README.md @@ -0,0 +1,165 @@ +# KEP-1833: Enable Prometheus Metrics for Local Queues + + +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit Tests](#unit-tests) + - [Integration tests](#integration-tests) + - [Graduation Criteria](#graduation-criteria) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + + +## Summary + +The enhancement aims to introduce the exposure of local queue metrics to users, providing detailed insights into workload +processing specific to individual namespaces / tenants. 
+ +## Motivation + +Metrics related to local queues are invaluable for batch users (usually with namespace-scoped permissions), as they provide +essential visibility and historical trends about their workloads. Currently, metrics are available for ClusterQueues but +not for namespace-scoped batch users. Cluster queue metrics are often ineffective for batch users and namespace admins +because they are global and cannot be filtered by namespaces. Furthermore, accessing cluster-scoped metrics +in secured Prometheus instances is generally restricted to cluster admin users with cluster level permissions across all namespaces and +tenants. This restriction makes it challenging for batch users to obtain the specific metrics they need for effective workload +management and to gain insights into their workloads within their limited scope of access. + +### Goals + +1. Introduce the API changes required to enable Local Queue metrics. +1. List the Prometheus metrics that would be exposed for Local Queues. + +### Non-Goals + +1. Discuss the implementation details on where these metrics need to be collected in codebase. + +## Proposal + +The proposal extends to enable collection of metrics for local queues that would be useful +for batch users and cluster administrators. + +### User Stories (Optional) + +#### Story 1 + +As a batch user of Kueue, I want to access metrics for local queues running workloads restricted to my namespace so that +I can monitor and analyze the performance and trends of my workloads. + +#### Story 2 + +As an administrator of Kueue, I want to enable batch users specific to a namespace to collect metrics for their workloads +within their namespace so that they can have visibility and insights into their own workload metrics. 
+ +#### Story 3 + +As an administrator of Kueue, I want to filter and gain insights on fine-grained metrics relevant to a local queue by +namespace for specific tenants so that I can effectively manage and optimize resource usage and performance for different tenants. + +## Design Details + +### API changes: + +The [Configuration API](https://github.com/kubernetes-sigs/kueue/blob/7ec127b05c8a0c8268e623de61914472dc5bff29/apis/config/v1beta1/configuration_types.go#L30) +currently provides the ability to enable collection of metrics for cluster queues. This API will be extended to include options for enabling metrics +collection for local queues. + +The `ControllerMetrics` that contain the option to configure metrics, will be extended as follows: + +```go +type ControllerManager struct { + ... + + // Metrics contains the controller metrics configuration + // +optional + Metrics ControllerMetrics `json:"metrics,omitempty"` + ... +} + +// ControllerMetrics defines the metrics configs. +type ControllerMetrics struct { + ... + + // EnableLocalQueueResources, if true the local queue resource usage and quotas + // metrics will be reported. + // +optional + EnableLocalQueueResources bool `json:"enableLocalQueueResources,omitempty"` + + // LocalQueueMetricOptions specifies the configuration options for local queue metrics. + LocalQueueMetricOptions *metricsOptions `json:"localQueueMetricOptions,omitempty"` +} + +// metricsOptions defines the configuration options for local queue metrics. +// If left empty, then metrics will expose for all local queues across namespaces. +type metricsOptions struct { + // NamespaceSelector can be used to select namespaces in which the local queues should + // report metrics. + NamespaceSelector *metav1.LabelSelector `json:"namespaceSelector,omitempty"` + + // QueueSelector can be used to choose the local queues that need metrics to be collected. 
+ QueueSelector *metav1.LabelSelector `json:"queueSelector,omitempty"` +} +``` + +To reduce cardinality, and enable selection of metrics for local queues, the following +knobs will be available: + +1. If `EnableLocalQueueResources` is false, then metrics will not be exposed. +1. If `EnableLocalQueueResources` is true, and `LocalQueueMetricOptions` is **not** provided - metrics will be exposed for all local queues. +1. If `EnableLocalQueueResources` is true and `LocalQueueMetricOptions` is provided - metrics will be collected for local queues that match specified label selectors. + +### List of metrics for Local Queues: + +In the first iteration, following are the list of metrics that would contain information on Local Queue statuses: + +| Metrics Name | Prometheus Type | Description | +|--------------------------------|-----------------|-------------------------------------------------------------| +| local_queue_pending_workloads | Gauge | The number of pending workloads | +| local_queue_reserved_workloads | Counter | Total number of workloads in the LocalQueue reserving quota | +| local_queue_admitted_workloads | Counter | Total number of admitted workloads | +| local_queue_resource_usage | Gauge | Total quantity of used quota per resource for a Local Queue | + +Each of these metrics will be augmented with relevant Prometheus labels, indicating the local queue name, namespace, +and any other unique identifiers as required during implementation. + +### Test Plan + +[X] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +#### Unit Tests + +There are existing unit tests for prometheus metrics: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/metrics/metrics_test.go. +However, unit tests to ensure coverage for any additional local queue metrics will be added. 
+ +- ``: `` - `` + +#### Integration tests + +The integration will address the following scenarios: + +1. Metrics for local queues are accurately reported throughout the lifecycle of workloads in local queues. +1. Metrics are removed when a local queue is deleted from the cache. + +### Graduation Criteria + +## Implementation History + +## Drawbacks + +If not implemented correctly, in certain scenarios enabling local queue metrics for all namespaces across all local queues can lead to issues +with cardinality and system overload. To mitigate this, configuration options are provided to selectively enable metrics +reporting for specific local queues. diff --git a/keps/1833-metrics-for-local-queue/kep.yaml b/keps/1833-metrics-for-local-queue/kep.yaml new file mode 100644 index 0000000000..132374fb12 --- /dev/null +++ b/keps/1833-metrics-for-local-queue/kep.yaml @@ -0,0 +1,21 @@ +title: +kep-number: 1833 +authors: + - "@varshaprasad96" +status: provisional +creation-date: 2024-07-02 +reviewers: + - "@astefanutti" + - "@alculquicondor" + - "@tenzen-y" + +approvers: + - TBD + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The milestone at which this feature was, or is targeted to be, at each stage. +# TODO: Leaving the milestone TBD, would be helpful to get inputs on intended release timeline. +milestone: + alpha: "TBD" From be1601a1bdc966bb857e6e26da9a0ff5bfda379c Mon Sep 17 00:00:00 2001 From: Varsha Prasad Narsing Date: Wed, 10 Jul 2024 00:11:26 -0700 Subject: [PATCH 2/2] Address reviews This commit addresses reviews by adding additional metrics for local queue. 
Signed-off-by: Varsha Prasad Narsing --- keps/1833-metrics-for-local-queue/README.md | 78 +++++++++++---------- keps/1833-metrics-for-local-queue/kep.yaml | 7 +- 2 files changed, 46 insertions(+), 39 deletions(-) diff --git a/keps/1833-metrics-for-local-queue/README.md b/keps/1833-metrics-for-local-queue/README.md index d10fa5e8d7..9076508121 100644 --- a/keps/1833-metrics-for-local-queue/README.md +++ b/keps/1833-metrics-for-local-queue/README.md @@ -9,17 +9,16 @@ - [User Stories (Optional)](#user-stories-optional) - [Story 1](#story-1) - [Story 2](#story-2) - - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - - [Risks and Mitigations](#risks-and-mitigations) + - [Story 3](#story-3) - [Design Details](#design-details) + - [API changes:](#api-changes) + - [List of metrics for Local Queues:](#list-of-metrics-for-local-queues) - [Test Plan](#test-plan) - - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit Tests](#unit-tests) - [Integration tests](#integration-tests) - [Graduation Criteria](#graduation-criteria) - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) -- [Alternatives](#alternatives) ## Summary @@ -29,22 +28,19 @@ processing specific to individual namespaces / tenants. ## Motivation -Metrics related to local queues are invaluable for batch users (usually with namespace-scoped permissions), as they provide -essential visibility and historical trends about their workloads. Currently, metrics are available for ClusterQueues but -not for namespace-scoped batch users. Cluster queue metrics are often ineffective for batch users and namespace admins -because they are global and cannot be filtered by namespaces. Furthermore, accessing cluster-scoped metrics -in secured Prometheus instances is generally restricted to cluster admin users with cluster level permissions across all namespaces and -tenants. 
This restriction makes it challenging for batch users to obtain the specific metrics they need for effective workload -management and to gain insights into their workloads within their limited scope of access. +Metrics related to local queues are invaluable for batch users, providing essential visibility and historical trends +about their workloads. Currently, metrics are available only for ClusterQueues, which do not give batch users +the necessary insights into their specific workloads. ### Goals 1. Introduce the API changes required to enable Local Queue metrics. -1. List the Prometheus metrics that would be exposed for Local Queues. +2. List the Prometheus metrics that would be exposed for Local Queues. ### Non-Goals 1. Discuss the implementation details on where these metrics need to be collected in codebase. +2. Discuss the metric visibility and RBAC required to enable the metrics securely for namespace admins. ## Proposal @@ -92,47 +88,57 @@ type ControllerManager struct { type ControllerMetrics struct { ... - // EnableLocalQueueResources, if true the local queue resource usage and quotas - // metrics will be reported. + // LocalQueueMetrics enables LocalQueue metrics and configures their options. // +optional - EnableLocalQueueResources bool `json:"enableLocalQueueResources,omitempty"` - - // LocalQueueMetricOptions specifies the configuration options for local queue metrics. - LocalQueueMetricOptions *metricsOptions `json:"localQueueMetricOptions,omitempty"` + LocalQueueMetrics *LocalQueueMetrics `json:"localQueueMetrics,omitempty"` } -// metricsOptions defines the configuration options for local queue metrics. +// LocalQueueMetrics defines the configuration options for local queue metrics. // If left empty, then metrics will expose for all local queues across namespaces. -type metricsOptions struct { +type LocalQueueMetrics struct { + // Enable is a knob to allow metrics to be exposed for local queues. Defaults to false.
+ Enable bool `json:"enable,omitempty"` + // NamespaceSelector can be used to select namespaces in which the local queues should // report metrics. NamespaceSelector *metav1.LabelSelector `json:"namespaceSelector,omitempty"` - // QueueSelector can be used to choose the local queues that need metrics to be collected. - QueueSelector *metav1.LabelSelector `json:"queueSelector,omitempty"` + // LocalQueueSelector can be used to choose the local queues that need metrics to be collected. + LocalQueueSelector *metav1.LabelSelector `json:"localQueueSelector,omitempty"` } ``` To reduce cardinality, and enable selection of metrics for local queues, the following -knobs will be available: +knobs will be available for `LocalQueueMetrics`: -1. If `EnableLocalQueueResources` is false, then metrics will not be exposed. -1. If `EnableLocalQueueResources` is true, and `LocalQueueMetricOptions` is **not** provided - metrics will be exposed for all local queues. -1. If `EnableLocalQueueResources` is true and `LocalQueueMetricOptions` is provided - metrics will be collected for local queues that match specified label selectors. +| `Enable` | `NamespaceSelector` | `LocalQueueSelector` | Description | +|----------|---------------------|----------------------|--------------------------------------------------------------------------------------------------------------------| +| False | - | - | Metrics will not be exposed. | +| True | - | - | Metrics for all local queues will be exposed. | +| True | Specified | - | All LocalQueues in the specific namespaces that match the selector have metrics enabled. | +| True | - | Specified | All LocalQueues matching the label selector have metrics enabled. | +| True | Specified | Specified | Both the selectors are applied to local queues (logical AND) to filter the ones whose metrics have to be enabled. | +| False | Specified | Specified | The selectors are disregarded, metrics will not be exposed.
| ### List of metrics for Local Queues: In the first iteration, following are the list of metrics that would contain information on Local Queue statuses: -| Metrics Name | Prometheus Type | Description | -|--------------------------------|-----------------|-------------------------------------------------------------| -| local_queue_pending_workloads | Gauge | The number of pending workloads | -| local_queue_reserved_workloads | Counter | Total number of workloads in the LocalQueue reserving quota | -| local_queue_admitted_workloads | Counter | Total number of admitted workloads | -| local_queue_resource_usage | Gauge | Total quantity of used quota per resource for a Local Queue | +| Metrics Name | Prometheus Type | Description | +|------------------------------------------------|-----------------|-----------------------------------------------------------------------------------------------------| +| local_queue_pending_workloads | Gauge | The number of pending workloads. | +| local_queue_reserved_workloads_total | Counter | Total number of workloads in the LocalQueue reserving quota. | +| local_queue_admitted_workloads_total | Counter | Total number of admitted workloads. | +| local_queue_resource_usage | Gauge | Total quantity of used quota per resource for a Local Queue. | +| local_queue_evicted_workloads_total | Counter | The total number of evicted workloads in Local Queue. | +| local_queue_reserved_wait_time_seconds | Histogram | The time from when a workload was created or re-queued until it got quota reservation in local queue. | +| local_queue_admission_checks_wait_time_seconds | Histogram | The time from when a workload got the quota reservation until admission in local queue. | +| local_queue_admission_wait_time_seconds | Histogram | The time from when a workload was created or re-queued until admission. | +| local_queue_status | Gauge | Reports the status of the LocalQueue.
| Each of these metrics will be augmented with relevant Prometheus labels, indicating the local queue name, namespace, -and any other unique identifiers as required during implementation. +and any other unique identifiers as required during implementation. They will be exported in the controller namespace, +alongside cluster queue metrics, at the same endpoint. ### Test Plan @@ -143,16 +149,16 @@ to implement this enhancement. #### Unit Tests There are existing unit tests for prometheus metrics: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/metrics/metrics_test.go. -However, unit tests to ensure coverage for any additional local queue metrics will be added. +However, unit tests to ensure coverage for any additional local queue metrics will be added. -- ``: `` - `` +- `pkg/metrics/`: `2024-07-19` - `48.2%` #### Integration tests The integration will address the following scenarios: 1. Metrics for local queues are accurately reported throughout the lifecycle of workloads in local queues. -1. Metrics are removed when a local queue is deleted from the cache. +2. Metrics are removed when a local queue is deleted from the cache. ### Graduation Criteria diff --git a/keps/1833-metrics-for-local-queue/kep.yaml b/keps/1833-metrics-for-local-queue/kep.yaml index 132374fb12..18596b00e2 100644 --- a/keps/1833-metrics-for-local-queue/kep.yaml +++ b/keps/1833-metrics-for-local-queue/kep.yaml @@ -5,17 +5,18 @@ authors: status: provisional creation-date: 2024-07-02 reviewers: + - "@PBundyra" - "@astefanutti" - "@alculquicondor" - "@tenzen-y" approvers: - - TBD + - "@alculquicondor" + - "@tenzen-y" # The target maturity stage in the current dev cycle for this KEP. stage: alpha # The milestone at which this feature was, or is targeted to be, at each stage. -# TODO: Leaving the milestone TBD, would be helpful to get inputs on intended release timeline. milestone: - alpha: "TBD" + alpha: v0.9
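
The `Enable` / `NamespaceSelector` / `LocalQueueSelector` table in the README could be evaluated roughly as sketched here. This is only an illustration under stated assumptions, not the implementation proposed by the KEP: `labelSelector` is a simplified, hypothetical stand-in for `metav1.LabelSelector` that supports exact-match labels only, `shouldReport` is a hypothetical helper name, and an unspecified (nil) selector is treated as "match everything", as the table describes.

```go
package main

import "fmt"

// labelSelector is a simplified, hypothetical stand-in for metav1.LabelSelector.
// It only supports exact-match labels; match expressions are out of scope here.
type labelSelector struct {
	MatchLabels map[string]string
}

// matches reports whether every key/value pair of the selector is present in labels.
// A nil selector means "not specified" and matches everything, following the KEP table.
func (s *labelSelector) matches(labels map[string]string) bool {
	if s == nil {
		return true
	}
	for k, v := range s.MatchLabels {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// localQueueMetrics mirrors the shape of the proposed LocalQueueMetrics config,
// using the simplified selector type above.
type localQueueMetrics struct {
	Enable             bool
	NamespaceSelector  *labelSelector
	LocalQueueSelector *labelSelector
}

// shouldReport (hypothetical helper) decides whether metrics are reported for a
// local queue, given the labels of its namespace and of the queue itself.
// When Enable is false the selectors are disregarded; otherwise both selectors
// must match (logical AND).
func shouldReport(cfg *localQueueMetrics, nsLabels, lqLabels map[string]string) bool {
	if cfg == nil || !cfg.Enable {
		return false
	}
	return cfg.NamespaceSelector.matches(nsLabels) && cfg.LocalQueueSelector.matches(lqLabels)
}

func main() {
	cfg := &localQueueMetrics{
		Enable:            true,
		NamespaceSelector: &labelSelector{MatchLabels: map[string]string{"team": "ml"}},
	}
	// Only the namespace selector is specified, so queue labels are irrelevant.
	fmt.Println(shouldReport(cfg, map[string]string{"team": "ml"}, nil))
	fmt.Println(shouldReport(cfg, map[string]string{"team": "web"}, nil))
}
```

Treating a nil selector as match-all keeps the "enabled with no selectors" row of the table a single code path; a real implementation built on `metav1.LabelSelector` would need to make that choice explicitly.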