Add a metric that tracks the number of preemptions issued by a ClusterQueue #2491

alculquicondor · 2024-06-27T14:31:31Z

What would you like to be added:

A metric that counts how many preemptions a ClusterQueue has issued, broken down by cause: preemption within the ClusterQueue, reclamation, fair sharing, or priority threshold.

This is somewhat the opposite direction of evicted_workloads_total, but focused on Preemption.

Why is this needed:

Improve observability.

Completion requirements:

This enhancement requires the following artifacts:

Design doc
API change
Docs update

The artifacts should be linked in subsequent comments.


alculquicondor · 2024-06-27T14:31:38Z

/assign @vladikkuzn

alculquicondor · 2024-07-02T15:19:22Z

To clarify, this counter should increment for every workload that is preempted.
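A minimal sketch of that counting rule, assuming a plain map as a stand-in for a labeled Prometheus counter. The type names, the `recordPreemptions` helper, and the reason strings are all hypothetical and are not taken from the Kueue code base:

```go
package main

import "fmt"

// preemptionKey identifies one counter series: the ClusterQueue that issued
// the preemption and the reason for it. Both field names are hypothetical.
type preemptionKey struct {
	PreemptingClusterQueue string
	Reason                 string
}

// preemptionCounter is a stand-in for a labeled counter metric.
type preemptionCounter map[preemptionKey]int

// recordPreemptions increments the series once per preempted workload,
// not once per preemption decision.
func (c preemptionCounter) recordPreemptions(preemptingCQ, reason string, preempted []string) {
	c[preemptionKey{preemptingCQ, reason}] += len(preempted)
}

func main() {
	c := preemptionCounter{}
	// One admission that preempts two workloads counts as two.
	c.recordPreemptions("team-a", "InClusterQueue", []string{"wl-1", "wl-2"})
	c.recordPreemptions("team-a", "Reclamation", []string{"wl-3"})

	fmt.Println(c[preemptionKey{"team-a", "InClusterQueue"}])
	fmt.Println(c[preemptionKey{"team-a", "Reclamation"}])
}
```

The key carries only the preempting ClusterQueue, so the series count grows with the number of ClusterQueues times the number of reasons.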

trasc · 2024-07-03T05:59:33Z

In this case we can just extend

kueue/pkg/metrics/metrics.go

Lines 137 to 149 in 1d849aa

    
    EvictedWorkloadsTotal = prometheus.NewCounterVec(
    	prometheus.CounterOpts{
    		Subsystem: constants.KueueName,
    		Name:      "evicted_workloads_total",
    		Help: `The number of evicted workloads per 'cluster_queue',
    The label 'reason' can have the following values:
    - "Preempted" means that the workload was evicted in order to free resources for a workload with a higher priority or reclamation of nominal quota.
    - "PodsReadyTimeout" means that the eviction took place due to a PodsReady timeout.
    - "AdmissionCheck" means that the workload was evicted because at least one admission check transitioned to False.
    - "ClusterQueueStopped" means that the workload was evicted because the ClusterQueue is stopped.
    - "InactiveWorkload" means that the workload was evicted because spec.active is set to false`,
    	}, []string{"cluster_queue", "reason"},
    )

and add an additional label for the preemption scope.

alculquicondor · 2024-07-03T11:50:02Z

Yes, indeed, that would be useful.

But this counter is from the point-of-view of the preemptee CQ.

The request is from the point-of-view of the preemptor CQ.

trasc · 2024-07-03T12:29:31Z

... that is a bit different, so count the preemptees but group by the preemptor's CQ name. We could add yet another metric label, "preemptor_cluster_queue", but we could end up creating too many metric data points.

alculquicondor · 2024-07-03T14:17:50Z

Preemption is one of the few actions that involves two entities.
We could also have one metric that has both clusterqueues as labels, but that could cause explosion of cardinality.
Having one for each side sounds like a reasonable compromise.
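To put rough numbers on the cardinality concern, here is a small sketch that compares the worst-case series count of one metric labeled with both ClusterQueues against two metrics labeled with one ClusterQueue each. The function names and the figures (100 ClusterQueues, 4 reason values, the same reason set on both sides) are illustrative assumptions, not measurements from a real cluster:

```go
package main

import "fmt"

// combinedSeries is the worst-case series count for a single metric labeled
// with preemptor CQ, preemptee CQ, and reason.
func combinedSeries(clusterQueues, reasons int) int {
	return clusterQueues * clusterQueues * reasons
}

// splitSeries is the worst-case series count for two metrics, one per side,
// each labeled with a single CQ and a reason.
func splitSeries(clusterQueues, reasons int) int {
	return 2 * clusterQueues * reasons
}

func main() {
	const cqs, reasons = 100, 4 // assumed values for illustration
	fmt.Println(combinedSeries(cqs, reasons)) // 40000
	fmt.Println(splitSeries(cqs, reasons))    // 800
}
```

These are upper bounds: a series only exists once a given label combination has been observed, so real numbers depend on which ClusterQueues actually preempt each other.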

alculquicondor added the kind/feature label Jun 27, 2024

k8s-ci-robot assigned vladikkuzn Jun 27, 2024

vladikkuzn linked a pull request Jul 5, 2024 that will close this issue

Add a metric that tracks the number of preemptions issued by a ClusterQueue #2538

Open
