[Feature] Enable prometheus metrics for local queues #2516

varshaprasad96 · 2024-07-02T13:17:39Z

What type of PR is this?

/kind feature
/kind documentation

What this PR does / why we need it:

This PR introduces an enhancement to enable collection of prometheus metrics for local queues.

Addresses issue: #1833

Which issue(s) this PR fixes:

Partially fixes #1833

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NA

k8s-ci-robot · 2024-07-02T13:17:44Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: varshaprasad96
Once this PR has been reviewed and has the lgtm label, please assign mimowo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

netlify · 2024-07-02T13:17:55Z

✅ Deploy Preview for kubernetes-sigs-kueue canceled.

| Name | Link |
|------|------|
| 🔨 Latest commit | `15ef983` |
| 🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66963e842fe97000086a426f |

varshaprasad96 · 2024-07-02T13:19:08Z

@astefanutti @alculquicondor @tenzen-y Could you please take a look at the proposal and share your input? Thank you!

alculquicondor · 2024-07-02T16:11:45Z

/assign @PBundyra

PBundyra · 2024-07-08T10:03:37Z

/lgtm

k8s-ci-robot · 2024-07-08T10:03:43Z

LGTM label has been added.

Git tree hash: b3460149244f8070ae015034e8bba83771759ecc

keps/1833-metrics-for-local-queue/README.md

PBundyra · 2024-07-08T10:11:13Z

/hold

alculquicondor · 2024-07-08T19:12:38Z

is this ready for a pass from approvers?

PBundyra · 2024-07-09T07:46:31Z

> is this ready for a pass from approvers?

Yes

k8s-ci-robot · 2024-07-10T07:12:29Z

New changes are detected. LGTM label has been removed.

PBundyra · 2024-07-10T10:10:22Z

keps/1833-metrics-for-local-queue/README.md

+because they are global and cannot be filtered by namespaces. Furthermore, accessing cluster-scoped metrics 
+in secured Prometheus instances is generally restricted to cluster admin users with cluster level permissions across all namespaces and 
+tenants. This restriction makes it challenging for batch users to obtain the specific metrics they need for effective workload 
+management and to gain insights into their workloads within their limited scope of access.


Could this KEP also specify how we will prevent users with insufficient permissions from accessing metrics they shouldn't be able to see?

If this proposal is not suggesting how to prevent access from namespaces that shouldn't have the permission, then I would remove this phrase.

I agree, maybe we could narrow the scope of this KEP, @varshaprasad96? At first glance, managing permissions seems challenging, or it would require an external mechanism.

Sorry for the late reply! I've been considering various options, and it's clear that publishing local queue metrics in each namespace could lead to complications. This approach would require multiple endpoints, each with its own Service Monitor, or, if using a single central Service Monitor, the correct RBAC setup to allow client access. The centralised approach could still pose some issues, such as namespace admins potentially being able to view metrics from other namespaces if they are not given the right service account (SA).

One potential solution for cluster admins is to scaffold a Service Monitor (SM) that labels metrics for specific namespaces from the same service endpoint. This would enable a common service endpoint for all local queues. Admins could then give a specific batch user a service account with the right RBAC, restricting their access to that particular Service Monitor.

However, this solution is difficult to implement in Kueue right away, since it requires figuring out the right set of scaffolds, and it seems to be the responsibility of the cluster admin. For now, it could be documented as a customisation.

That being said, given the complexity, this topic probably warrants further brainstorming and deserves a separate KEP. For now, I'll remove this reference and update the KEP to indicate that, for this iteration, metrics will be exported in the controller namespace, alongside cluster queue metrics, at the same endpoint. Does that sound reasonable?

Yes it does.

You can remove the per-namespace topic from the motivation and add a note in non-goals.
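To make the single-endpoint idea from this thread concrete, here is a minimal Go sketch of series that carry a namespace label and a consumer that keeps only one namespace's series. It only illustrates the labelling idea; it is not how Prometheus or a Service Monitor performs filtering, and it does nothing about access control. The `sample` type, the `namespace` and `name` label keys, and `filterByNamespace` are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// sample is a hypothetical, simplified metric sample: a metric name,
// its labels, and a value. It is not a Prometheus client type.
type sample struct {
	Metric string
	Labels map[string]string
	Value  float64
}

// filterByNamespace keeps only the samples whose "namespace" label
// equals ns. The label key is an assumption made for this sketch.
func filterByNamespace(all []sample, ns string) []sample {
	var out []sample
	for _, s := range all {
		if s.Labels["namespace"] == ns {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// All LocalQueue series come from the same (controller) endpoint.
	all := []sample{
		{Metric: "local_queue_status", Labels: map[string]string{"namespace": "team-a", "name": "batch"}, Value: 1},
		{Metric: "local_queue_status", Labels: map[string]string{"namespace": "team-b", "name": "batch"}, Value: 0},
	}
	for _, s := range filterByNamespace(all, "team-a") {
		fmt.Println(s.Metric, s.Labels["namespace"], s.Labels["name"], s.Value)
	}
}
```

Because every series is still served from one endpoint, any client that can scrape it sees all namespaces. That is the permission problem the thread defers to a separate KEP.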

keps/1833-metrics-for-local-queue/kep.yaml

keps/1833-metrics-for-local-queue/README.md

tenzen-y · 2024-07-10T20:05:54Z

keps/1833-metrics-for-local-queue/README.md

+	NamespaceSelector *metav1.LabelSelector `json:"namespaceSelector,omitempty"`
+
+	// QueueSelector can be used to choose the local queues that need metrics to be collected. 
+	QueueSelector *metav1.LabelSelector `json:"queueSelector,omitempty"`


What happens in the situations below?

- only namespaceSelector is specified
- only localQueueSelector is specified
- both namespaceSelector and localQueueSelector are specified

I had a predicate-based filtering approach in mind for this case.

Only namespace selector: All LocalQueues in the matching namespaces are selected. If it's nil, all LocalQueues across all namespaces are considered.

Only LocalQueue selector: LocalQueues matching the labels are selected across all namespaces. If it's nil, all LocalQueues across all namespaces are considered.

If both are specified: Both selectors are applied to LocalQueues (logical AND).

Had a Venn diagram in mind :)) Please let me know if I'm missing anything.

Have added a configuration table for this knob in the KEP.
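The predicate semantics described in this thread can be sketched roughly as follows. This is a simplified illustration, not the KEP's implementation. The proposal uses `*metav1.LabelSelector`, whereas the `Selector` type here is a hypothetical stand-in that supports equality matches only. `LocalQueue`, `matches`, and `metricsEnabled` are likewise names invented for this sketch. Treating a nil selector as "match everything" follows the reply above.

```go
package main

import "fmt"

// LocalQueue is a hypothetical, simplified stand-in for a Kueue LocalQueue.
// Only the fields needed to illustrate selection are modelled.
type LocalQueue struct {
	Namespace string
	Name      string
	Labels    map[string]string // labels set on the LocalQueue itself
}

// Selector is a simplified label selector (equality matches only).
// A nil Selector means "not specified" and matches everything.
type Selector map[string]string

// matches reports whether every key/value pair in s is present in labels.
// Ranging over a nil map does nothing, so a nil Selector returns true.
func (s Selector) matches(labels map[string]string) bool {
	for k, v := range s {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// metricsEnabled applies both selectors as a logical AND.
// nsLabels holds the labels of the namespace that lq lives in.
func metricsEnabled(lq LocalQueue, nsLabels map[string]string, nsSel, queueSel Selector) bool {
	return nsSel.matches(nsLabels) && queueSel.matches(lq.Labels)
}

func main() {
	lq := LocalQueue{Namespace: "team-a", Name: "batch", Labels: map[string]string{"tier": "gold"}}
	nsLabels := map[string]string{"metrics": "on"}

	// Neither selector specified: every LocalQueue is considered.
	fmt.Println(metricsEnabled(lq, nsLabels, nil, nil))
	// Only the namespace selector specified.
	fmt.Println(metricsEnabled(lq, nsLabels, Selector{"metrics": "on"}, nil))
	// Only the queue selector specified, and it does not match.
	fmt.Println(metricsEnabled(lq, nsLabels, nil, Selector{"tier": "silver"}))
	// Both specified: both must match.
	fmt.Println(metricsEnabled(lq, nsLabels, Selector{"metrics": "on"}, Selector{"tier": "gold"}))
}
```

Modelling "not specified" as nil keeps the three cases from the question in one code path, because an unspecified selector never rejects a LocalQueue.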

tenzen-y · 2024-07-10T20:13:51Z

keps/1833-metrics-for-local-queue/README.md

+
+In the first iteration, following are the list of metrics that would contain information on Local Queue statuses:
+
+| Metrics Name                             | Prometheus Type | Description                                                                                         | 


Only part of the clusterqueue metrics is carried over here, for example local_queue_status. Is there a reason you dropped some of the clusterqueue metrics?

I've updated the list to include all the metrics relevant to local queues. One additional metric is local_queue_wait_time_seconds, which is the sum of the admission and reserved wait times.

This commit addresses reviews by adding additional metrics for local queue. Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>

varshaprasad96 · 2024-07-16T09:56:10Z

@PBundyra @alculquicondor @tenzen-y I've updated the proposal based on the reviews. Please take a look when you get a chance. Thank you!

PBundyra · 2024-07-16T13:08:47Z

keps/1833-metrics-for-local-queue/README.md

+| True     | -                   | Specified            | All LocalQueues matching the label selector have metrics enabled.                                                  |
+| True     | -                   | Specified            | All LocalQueues matching the label selector have metrics enabled.                                                  |


PBundyra · 2024-07-16T13:16:11Z

keps/1833-metrics-for-local-queue/README.md

+| local_queue_admission_wait_time_seconds  | Histogram       | The time from when a workload got the quota reservation until admission in local queue.             |
+| local_queue_reserved_wait_time_seconds   | Histogram       | The time between a workload was created or re-queued until it got quota reservation in local queue. |
+| local_queue_wait_time_seconds            | Histogram       | Time taken to accept the resources from cluster queue.                                              |
+| local_admission_checks_wait_time_seconds | Histogram       | The time from when a workload got the quota reservation until admission.                            |


Could you please clarify the differences between local_queue_admission_wait_time_seconds and local_admission_checks_wait_time_seconds? Maybe local_queue_admission_wait_time_seconds could describe time from creating (or requeueing) a Workload until admission - similarly to the clusterqueue metrics. In that case it would cover the local_queue_wait_time_seconds metric usage.

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. kind/documentation Categorizes issue or PR as related to documentation. labels Jul 2, 2024

k8s-ci-robot requested review from alculquicondor and mimowo July 2, 2024 13:17

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 2, 2024

[Feature] Enable prometheus metrics for local queues

49f4265

This PR introduces an enhancement to enable collection of prometheus metrics for local queues. Addresses issue: kubernetes-sigs#1833 Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 2, 2024

varshaprasad96 force-pushed the kep-local-queue-metrics branch from b73eee4 to 49f4265 Compare July 2, 2024 13:17

k8s-ci-robot assigned PBundyra Jul 2, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 8, 2024

PBundyra reviewed Jul 8, 2024

keps/1833-metrics-for-local-queue/README.md

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 8, 2024

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 10, 2024

k8s-ci-robot requested a review from PBundyra July 10, 2024 07:12

PBundyra reviewed Jul 10, 2024

tenzen-y reviewed Jul 10, 2024

varshaprasad96 force-pushed the kep-local-queue-metrics branch from e86c944 to bc697cc Compare July 16, 2024 09:32

Address reviews

15ef983

This commit addresses reviews by adding additional metrics for local queue. Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>

varshaprasad96 force-pushed the kep-local-queue-metrics branch from bc697cc to 15ef983 Compare July 16, 2024 09:33

PBundyra reviewed Jul 16, 2024
