Add a metric that tracks the number of preemptions issued by a ClusterQueue #2491

Open
3 tasks
alculquicondor opened this issue Jun 27, 2024 · 6 comments · May be fixed by #2538

@alculquicondor
Contributor

What would you like to be added:

A metric that counts how many preemptions a ClusterQueue has issued, broken down by whether the preemption was internal to the ClusterQueue, a reclamation, due to fair sharing, or due to a priority threshold.

This is somewhat the opposite direction of evicted_workloads_total, but focused on Preemption.

Why is this needed:

Improve observability.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@alculquicondor alculquicondor added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 27, 2024
@alculquicondor
Contributor Author

/assign @vladikkuzn

@alculquicondor
Contributor Author

To clarify, this counter should increment for every workload that is preempted.

@trasc
Contributor

trasc commented Jul 3, 2024

In this case we can just extend

EvictedWorkloadsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Subsystem: constants.KueueName,
		Name:      "evicted_workloads_total",
		Help: `The number of evicted workloads per 'cluster_queue',
The label 'reason' can have the following values:
- "Preempted" means that the workload was evicted in order to free resources for a workload with a higher priority or reclamation of nominal quota.
- "PodsReadyTimeout" means that the eviction took place due to a PodsReady timeout.
- "AdmissionCheck" means that the workload was evicted because at least one admission check transitioned to False.
- "ClusterQueueStopped" means that the workload was evicted because the ClusterQueue is stopped.
- "InactiveWorkload" means that the workload was evicted because spec.active is set to false`,
	}, []string{"cluster_queue", "reason"},
)

and add an additional label for the preemption scope.

@alculquicondor
Contributor Author

Yes, indeed, that would be useful.

But this counter is from the point-of-view of the preemptee CQ.

The request is from the point-of-view of the preemptor CQ.
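
To make the preemptor-side view concrete, here is a minimal sketch of a counter keyed by the ClusterQueue that issued the preemption and by the reason, incremented once per preempted workload. It uses a plain map as a stand-in for a Prometheus CounterVec so that it runs with the standard library alone. The type, method, and label names, and the reason strings in `main`, are hypothetical and are not taken from Kueue.

```go
package main

import (
	"fmt"
	"sync"
)

// preemptionKey identifies one counter series. Both field names are
// hypothetical; they only mirror the idea of labelling by the preemptor's
// ClusterQueue and by the preemption reason.
type preemptionKey struct {
	PreemptingClusterQueue string
	Reason                 string
}

// PreemptionCounter is a standard-library stand-in for a Prometheus
// CounterVec with two labels.
type PreemptionCounter struct {
	mu     sync.Mutex
	counts map[preemptionKey]int
}

// NewPreemptionCounter returns an empty counter.
func NewPreemptionCounter() *PreemptionCounter {
	return &PreemptionCounter{counts: make(map[preemptionKey]int)}
}

// ReportPreemptions adds one count per preempted workload to the series of
// the ClusterQueue that issued the preemption.
func (c *PreemptionCounter) ReportPreemptions(preemptorCQ, reason string, preemptedWorkloads int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := preemptionKey{PreemptingClusterQueue: preemptorCQ, Reason: reason}
	c.counts[key] += preemptedWorkloads
}

// Value returns the current count for one (preemptor, reason) pair.
func (c *PreemptionCounter) Value(preemptorCQ, reason string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[preemptionKey{PreemptingClusterQueue: preemptorCQ, Reason: reason}]
}

func main() {
	c := NewPreemptionCounter()
	// Hypothetical reason strings, one per category named in the request.
	c.ReportPreemptions("team-a", "InClusterQueue", 2)
	c.ReportPreemptions("team-a", "InCohortReclamation", 1)
	c.ReportPreemptions("team-a", "InClusterQueue", 1)
	fmt.Println(c.Value("team-a", "InClusterQueue"))
}
```

The counter is attributed to the preemptor only, so the preemptee's ClusterQueue never appears as a label here; that side remains covered by evicted_workloads_total.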

@trasc
Contributor

trasc commented Jul 3, 2024

... that is a bit different, so count the preemptees but group by the preemptor's CQ name. We could add yet another metric label "preemptor_cluster_queue", but we can end up creating too many metric data points.

@alculquicondor
Contributor Author

Preemption is one of the few actions that involves two entities.
We could also have one metric that has both ClusterQueues as labels, but that could cause a cardinality explosion.
Having one for each side sounds like a reasonable compromise.
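
As a rough illustration of the cardinality concern, the sketch below compares worst-case series counts for the two designs: a single metric labelled with both ClusterQueues, versus one metric per side, each labelled with a single ClusterQueue. The function names are hypothetical, the figure of four reasons is an assumption based on the four categories in the request, and the numbers are upper bounds that assume every label combination actually occurs.

```go
package main

import "fmt"

// seriesWithBothLabels is the worst-case series count for a single metric
// labelled with preemptor CQ, preemptee CQ, and reason.
func seriesWithBothLabels(clusterQueues, reasons int) int {
	return clusterQueues * clusterQueues * reasons
}

// seriesWithOnePerSide is the worst-case series count for two metrics, each
// labelled with one CQ and reason.
func seriesWithOnePerSide(clusterQueues, reasons int) int {
	return 2 * clusterQueues * reasons
}

func main() {
	const reasons = 4 // assumed: one per category listed in the request
	for _, n := range []int{10, 100, 1000} {
		fmt.Printf("CQs=%d both-labels=%d one-per-side=%d\n",
			n, seriesWithBothLabels(n, reasons), seriesWithOnePerSide(n, reasons))
	}
}
```

The two-label design grows quadratically with the number of ClusterQueues, while the per-side design grows linearly, which is the trade-off behind the compromise above.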

3 participants