generated from kubernetes/kubernetes-template-project
-
Notifications
You must be signed in to change notification settings - Fork 228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Enable prometheus metrics for local queues #2516
Merged
k8s-ci-robot
merged 2 commits into
kubernetes-sigs:main
from
varshaprasad96:kep-local-queue-metrics
Jul 29, 2024
Merged
Changes from 1 commit
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,165 @@ | ||
# KEP-1833: Enable Prometheus Metrics for Local Queues | ||
|
||
<!-- toc --> | ||
- [Summary](#summary) | ||
- [Motivation](#motivation) | ||
- [Goals](#goals) | ||
- [Non-Goals](#non-goals) | ||
- [Proposal](#proposal) | ||
- [User Stories (Optional)](#user-stories-optional) | ||
- [Story 1](#story-1) | ||
- [Story 2](#story-2) | ||
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) | ||
- [Risks and Mitigations](#risks-and-mitigations) | ||
- [Design Details](#design-details) | ||
- [Test Plan](#test-plan) | ||
- [Prerequisite testing updates](#prerequisite-testing-updates) | ||
- [Unit Tests](#unit-tests) | ||
- [Integration tests](#integration-tests) | ||
- [Graduation Criteria](#graduation-criteria) | ||
- [Implementation History](#implementation-history) | ||
- [Drawbacks](#drawbacks) | ||
- [Alternatives](#alternatives) | ||
<!-- /toc --> | ||
|
||
## Summary | ||
|
||
The enhancement aims to introduce the exposure of local queue metrics to users, providing detailed insights into workload | ||
processing specific to individual namespaces / tenants. | ||
|
||
## Motivation | ||
|
||
Metrics related to local queues are invaluable for batch users (usually with namespace-scoped permissions), as they provide | ||
essential visibility and historical trends about their workloads. Currently, metrics are available for ClusterQueues but | ||
not for namespace-scoped batch users. Cluster queue metrics are often ineffective for batch users and namespace admins | ||
because they are global and cannot be filtered by namespaces. Furthermore, accessing cluster-scoped metrics | ||
in secured Prometheus instances is generally restricted to cluster admin users with cluster level permissions across all namespaces and | ||
tenants. This restriction makes it challenging for batch users to obtain the specific metrics they need for effective workload | ||
management and to gain insights into their workloads within their limited scope of access. | ||
|
||
### Goals | ||
|
||
1. Introduce the API changes required to enable Local Queue metrics. | ||
1. List the Prometheus metrics that would be exposed for Local Queues. | ||
varshaprasad96 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
### Non-Goals | ||
|
||
1. Discuss the implementation details on where these metrics need to be collected in codebase. | ||
|
||
varshaprasad96 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
## Proposal | ||
|
||
The proposal extends to enable collection of metrics for local queues that would be useful | ||
for batch users and cluster administrators. | ||
|
||
### User Stories (Optional) | ||
|
||
#### Story 1 | ||
|
||
As a batch user of Kueue, I want to access metrics for local queues running workloads restricted to my namespace so that | ||
I can monitor and analyze the performance and trends of my workloads. | ||
|
||
#### Story 2 | ||
|
||
As an administrator of Kueue, I want to enable batch users specific to a namespace to collect metrics for their workloads | ||
within their namespace so that they can have visibility and insights into their own workload metrics. | ||
|
||
#### Story 3 | ||
|
||
As an administrator of Kueue, I want to filter and gain insights on fine-grained metrics relevant to a local queue by | ||
namespace for specific tenants so that I can effectively manage and optimize resource usage and performance for different tenants. | ||
|
||
## Design Details | ||
|
||
### API changes: | ||
|
||
The [Configuration API](https://github.com/kubernetes-sigs/kueue/blob/7ec127b05c8a0c8268e623de61914472dc5bff29/apis/config/v1beta1/configuration_types.go#L30) | ||
currently provides the ability to enable collection of metrics for cluster queues. This API will be extended to include options for enabling metrics | ||
collection for local queues. | ||
|
||
The `ControllerMetrics` that contain the option to configure metrics, will be extended as follows: | ||
|
||
```go | ||
type ControllerManager struct { | ||
... | ||
|
||
// Metrics contains the controller metrics configuration | ||
// +optional | ||
Metrics ControllerMetrics `json:"metrics,omitempty"` | ||
... | ||
} | ||
|
||
// ControllerMetrics defines the metrics configs. | ||
type ControllerMetrics struct { | ||
... | ||
|
||
// EnableLocalQueueResources, if true the local queue resource usage and quotas | ||
// metrics will be reported. | ||
// +optional | ||
EnableLocalQueueResources bool `json:"enableLocalQueueResources,omitempty"` | ||
|
||
// LocalQueueMetricOptions specifies the configuration options for local queue metrics. | ||
LocalQueueMetricOptions *metricsOptions `json:"localQueueMetricOptions,omitempty"` | ||
varshaprasad96 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
|
||
// metricsOptions defines the configuration options for local queue metrics. | ||
// If left empty, then metrics will expose for all local queues across namespaces. | ||
type metricsOptions struct { | ||
varshaprasad96 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
// NamespaceSelector can be used to select namespaces in which the local queues should | ||
// report metrics. | ||
NamespaceSelector *metav1.LabelSelector `json:"namespaceSelector,omitempty"` | ||
|
||
// QueueSelector can be used to choose the local queues that need metrics to be collected. | ||
QueueSelector *metav1.LabelSelector `json:"queueSelector,omitempty"` | ||
varshaprasad96 marked this conversation as resolved.
Show resolved
Hide resolved
varshaprasad96 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
``` | ||
|
||
To reduce cardinality, and enable selection of metrics for local queues, the following | ||
knobs will be available: | ||
|
||
1. If `EnableLocalQueueResources` is false, then metrics will not be exposed. | ||
1. If `EnableLocalQueueResources` is true, and `LocalQueueMetricOptions` is **not** provided - metrics will be exposed for all local queues. | ||
1. If `EnableLocalQueueResources` is true and `LocalQueueMetricOptions` is provided - metrics will be collected for local queues that match specified label selectors. | ||
|
||
### List of metrics for Local Queues: | ||
|
||
In the first iteration, following are the list of metrics that would contain information on Local Queue statuses: | ||
|
||
| Metrics Name | Prometheus Type | Description | | ||
|--------------------------------|-----------------|-------------------------------------------------------------| | ||
| local_queue_pending_workloads | Gauge | The number of pending workloads | | ||
| local_queue_reserved_workloads | Counter | Total number of workloads in the LocalQueue reserving quota | | ||
| local_queue_admitted_workloads | Counter | Total number of admitted workloads | | ||
| local_queue_resource_usage | Gauge | Total quantity of used quota per resource for a Local Queue | | ||
|
||
varshaprasad96 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
Each of these metrics will be augmented with relevant Prometheus labels, indicating the local queue name, namespace, | ||
and any other unique identifiers as required during implementation. | ||
|
||
### Test Plan | ||
|
||
[X] I/we understand the owners of the involved components may require updates to | ||
existing tests to make this code solid enough prior to committing the changes necessary | ||
to implement this enhancement. | ||
|
||
#### Unit Tests | ||
|
||
There are existing unit tests for prometheus metrics: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/metrics/metrics_test.go. | ||
However, unit tests to ensure coverage for any additional local queue metrics will be added. | ||
|
||
- `<package>`: `<date>` - `<test coverage>` | ||
varshaprasad96 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
#### Integration tests | ||
|
||
The integration will address the following scenarios: | ||
|
||
1. Metrics for local queues are accurately reported throughout the lifecycle of workloads in local queues. | ||
1. Metrics are removed when a local queue is deleted from the cache. | ||
|
||
### Graduation Criteria | ||
|
||
## Implementation History | ||
|
||
## Drawbacks | ||
|
||
If not implemented correctly, in certain scenarios enabling local queue metrics for all namespaces across all local queues can lead to issues | ||
with cardinality and system overload. To mitigate this, configuration options are provided to selectively enable metrics | ||
reporting for specific local queues. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
title: | ||
kep-number: 1833 | ||
authors: | ||
- "@varshaprasad96" | ||
status: provisional | ||
creation-date: 2024-07-02 | ||
reviewers: | ||
- "@astefanutti" | ||
- "@alculquicondor" | ||
- "@tenzen-y" | ||
|
||
varshaprasad96 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
approvers: | ||
- TBD | ||
varshaprasad96 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
# The target maturity stage in the current dev cycle for this KEP. | ||
stage: alpha | ||
|
||
# The milestone at which this feature was, or is targeted to be, at each stage. | ||
# TODO: Leaving the milestone TBD, would be helpful to get inputs on intended release timeline. | ||
milestone: | ||
alpha: "TBD" | ||
varshaprasad96 marked this conversation as resolved.
Show resolved
Hide resolved
|
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this KEP also specify how we will prevent users with insufficient permission accessing metrics they shouldn't be able to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this proposal is not suggesting how to prevent access from namespaces that shouldn't have the permission, then I would remove this phrase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, maybe we could narrow the scope of this KEP @varshaprasad96 ? At first glance managing permissions seems to be challenging, or it would require external mechanism
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late reply! I've been considering various options, and it's clear that publishing local queue metrics in each namespace could lead to complications. This approach would require multiple endpoints with respective Service Monitors or, if using a single central Service Monitor, we would need the correct RBAC setup to allow client access. The centralised approach could still poses some issues, such as namespace admins potentially being able to view metrics from other namespaces if not provided with right SA.
One potential solution for cluster admins is to scaffold out a Service Monitor (SM) with metrics labeling for specific namespaces from the same service endpoint. This would enable a common service endpoint for all local queues. Admins could then provide a service account with right RBAC to specific batch user, restricting their access to that particular service monitor.
However, this solution is difficult to implement in Kueue right away by figuring out the right set of scaffolds, and seems to be the responsibility of the cluster admin. This could probably be a customisation which can be documented for now.
That being said, given the complexity, this topic may probably warrant a further brainstorming and deserves a separate KEP. For now, I'll remove this reference and update it to indicate that for this iteration metrics will be exported in the controller namespace, alongside cluster queue metrics, at the same endpoint. Does that sound reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it does.
You can remove the per-namespace topic from the motivation and add a note in non-goals.