Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Local queues prometheus metrics #1833

Open
1 of 3 tasks
astefanutti opened this issue Mar 13, 2024 · 14 comments
Open
1 of 3 tasks

Local queues prometheus metrics #1833

astefanutti opened this issue Mar 13, 2024 · 14 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@astefanutti
Copy link
Member

astefanutti commented Mar 13, 2024

What would you like to be added:

Expose Prometheus metrics for local queues, equivalent to the existing cluster queue metrics, but filtered and labeled by local queues.

Similarly to the visibility API, that serves information about pending workloads in local queues, it would be possible to get metrics like like pending workloads, admitted active workloads, resource usage, etc, for local queues.

If cardinality is a concern, those metrics could be exposed behind a feature flag.

Why is this needed:

Metrics about local queues can be useful for the batch users persona, so they can have visibility and historical trends about their workloads.

While some metrics are already available for cluster queues, exposing them to the batch users persona presents the following challenges / limitations:

  • Cluster queues metrics are global and cannot be filtered by namespaces / tenants
  • Querying "cluster-scoped" metrics in secured Prometheus instances is generally only authorised for cluster admin users that have access to all namespaces / tenants

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@astefanutti astefanutti added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 13, 2024
@astefanutti
Copy link
Member Author

@alculquicondor @tenzen-y Do you think that'd be a useful / possible enhancement?

@alculquicondor
Copy link
Contributor

alculquicondor commented Mar 13, 2024

If cardinality is a concern, those metrics could be exposed behind a feature flag.

Yes, that's the primary concern. I wouldn't make it a feature flag, but a long-term configuration field.

@tenzen-y
Copy link
Member

@alculquicondor @tenzen-y Do you think that'd be a useful / possible enhancement?

I understand that this feature is so useful, but I have the same concern with @alculquicondor.
IIRC, previously, we had similar discussions when we designed Visibility/Visibility On-demand.
So, if we introduce this feature, configurable this feature by Config API would be better.

Anyway, I guess that having a small KEP would be better since we may extend the existing Config API.

@astefanutti
Copy link
Member Author

That makes sense. There could be one or two options added to the Config API, similar to the existing .metrics.enableClusterQueueResources, like .metrics.enableLocalQueues and .metrics.enableLocalQueueResources.

I can work on a small KEP if you guys give the green light.

@alculquicondor
Copy link
Contributor

It seems simple enough.

@tenzen-y
Copy link
Member

SGTM

@astefanutti
Copy link
Member Author

Thanks for your quick feedback! I'll work on it asap.

/assign

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2024
@alculquicondor
Copy link
Contributor

@astefanutti are you still looking into this?

@astefanutti
Copy link
Member Author

@alculquicondor I haven't, but hopefully we'll get back to it soon.

@astefanutti
Copy link
Member Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 25, 2024
@astefanutti
Copy link
Member Author

/unassign

varshaprasad96 added a commit to varshaprasad96/kueue that referenced this issue Jul 2, 2024
This PR introduces an enhancement to enable collection of
prometheus metrics for local queues.

Addresses issue: kubernetes-sigs#1833

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
varshaprasad96 added a commit to varshaprasad96/kueue that referenced this issue Jul 2, 2024
This PR introduces an enhancement to enable collection of
prometheus metrics for local queues.

Addresses issue: kubernetes-sigs#1833

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
@alculquicondor
Copy link
Contributor

@varshaprasad96, please write /assign in a comment to claim this issue. It's important to communicate that you are working on an issue so that other contributors don't try to work on the same thing.

@varshaprasad96
Copy link
Member

/assign

k8s-ci-robot pushed a commit that referenced this issue Jul 29, 2024
* [Feature] Enable prometheus metrics for local queues

This PR introduces an enhancement to enable collection of
prometheus metrics for local queues.

Addresses issue: #1833

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>

* Address reviews

This commit addresses reviews by adding additional metrics
for local queue.

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>

---------

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

6 participants