[metrics] Add quota_reserved_wait_time_seconds #1977

Merged

Conversation

mbobrovskyi
Contributor

@mbobrovskyi mbobrovskyi commented Apr 12, 2024

What type of PR is this?

/kind bug
/kind feature

What this PR does / why we need it:

Fixed metrics admitted_workloads_total, admission_wait_time_seconds.
Added metrics quota_reserved_workloads_total, quota_reserved_wait_time_seconds.

Which issue(s) this PR fixes:

Fixes #1961

Does this PR introduce a user-facing change?

Improve metrics related to workload's quota reservation and admission:
- fix admission_wait_time_seconds - to measure the time to "Admitted" condition since creation time or last requeue (as opposed to the "QuotaReserved" condition as before)
- add quota_reserved_wait_time_seconds - measures time to "QuotaReserved" condition since creation time, or last eviction time
- add quota_reserved_workloads_total - counts the number of workloads that got quota reserved
- admission_checks_wait_time_seconds - measures the time to admit a workload with admission checks since quota reservation
- use longer buckets (up to 10240s) for histogram metrics: admission_wait_time_seconds, quota_reserved_wait_time_seconds, admission_checks_wait_time_seconds
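
For illustration, a bucket layout that reaches 10240s could be generated as in the sketch below. The helper name `exponentialBuckets`, the leading 1s bucket, and the 2.5s start with a doubling factor are assumptions picked so that the last bound lands on 10240s; the release note only states the upper limit.

```go
package main

import "fmt"

// exponentialBuckets is a hypothetical helper (not taken from the PR).
// It returns `count` upper bounds in seconds: a leading 1s bucket followed
// by buckets that start at 2.5s and double at each step.
func exponentialBuckets(count int) []float64 {
	if count < 1 {
		return nil
	}
	buckets := make([]float64, 0, count)
	buckets = append(buckets, 1)
	next := 2.5
	for len(buckets) < count {
		buckets = append(buckets, next)
		next *= 2
	}
	return buckets
}

func main() {
	b := exponentialBuckets(14)
	// With 14 buckets the largest bound is 2.5 * 2^12 seconds.
	fmt.Println(len(b), b[len(b)-1])
}
```

A slice like this could be passed as the `Buckets` field of a Prometheus histogram definition; how the PR actually wires it up is not shown in this conversation.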

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/feature Categorizes issue or PR as related to a new feature. labels Apr 12, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 12, 2024
@k8s-ci-robot
Contributor

Hi @mbobrovskyi. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 12, 2024

netlify bot commented Apr 12, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit a33f33e
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66265ea2242094000847962a

@trasc
Contributor

trasc commented Apr 12, 2024

/assign
/ok-to-test
/test all

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 12, 2024
@k8s-ci-robot
Contributor

@trasc: /release-note-edit must be used with a release note block.

In response to this:

/release-note-edit

Improve metrics relating to workload's quota reservation and admission

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

@trasc: /release-note-edit must be used with a single release note block.

In response to this:

/release-note-edit

Improve metrics relating to workload's quota reservation and admission

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 15, 2024
@mbobrovskyi mbobrovskyi marked this pull request as ready for review April 15, 2024 14:19
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 15, 2024
@k8s-ci-robot k8s-ci-robot requested a review from trasc April 15, 2024 14:19
@trasc
Contributor

trasc commented Apr 15, 2024

/lgtm
/assign @mimowo

Contributor

@trasc trasc left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 22, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: c9c1f4ba99a1d94dc4b267733da06e284f80195a

Contributor

@mimowo mimowo left a comment

Almost lgtm

Comment on lines 181 to 185
queuedTime := wl.CreationTimestamp.Time
if c := apimeta.FindStatusCondition(wl.Status.Conditions, kueue.WorkloadRequeued); c != nil {
	queuedTime = c.LastTransitionTime.Time
}
queuedWaitTime := time.Since(queuedTime)
Contributor

nit: extract this as a helper in workload.go so the analogous code in scheduler.go can share it. Maybe QueuedWaitTime(workload)

condition := metav1.Condition{
Type: kueue.WorkloadRequeued,
Status: metav1.ConditionTrue,
LastTransitionTime: metav1.Now(),
Contributor

nit: LastTransitionTime is already set inside apimeta.SetStatusCondition()

Comment on lines 306 to 308
// WorkloadRequeued means that the Workload was requeued.
// The possible reasons for this condition are:
// - On setting QuotaReserved=False
Contributor

Suggested change
- // WorkloadRequeued means that the Workload was requeued.
- // The possible reasons for this condition are:
- // - On setting QuotaReserved=False
+ // WorkloadRequeued means that the Workload was requeued due to eviction.

Let's make the comment meaningful for the users who may observe the conditions.
Kueue will set QuotaReserved=False for evicted workloads.

Request(corev1.ResourceCPU, "2").
Obj()

ginkgo.By("checking the first workload gets created and admitted", func() {
Contributor

Suggested change
- ginkgo.By("checking the first workload gets created and admitted", func() {
+ ginkgo.By("checking the first workload gets created and gets quota reserved", func() {

@@ -302,6 +302,11 @@ const (
// more detailed information. The more detailed reasons should be prefixed
// by one of the "base" reasons.
WorkloadPreempted = "Preempted"

// WorkloadRequeued means that the Workload was requeued.
Contributor

Please make sure we have an integration test which checks that:

  • Requeued=True when QuotaReserved=True and Admitted=False
  • Requeued=False when the workload is evicted (IIUC this may still need implementation)

Contributor Author

Requeued=False when the workload is evicted (IIUC this may still need implementation)

Could you please explain this case? Is it possible to set Requeued to false?

Contributor

Sure, we could set it to False when we set the Admitted=True condition. We do something similar for the other conditions when setting QuotaReserved=True:

if evictedCond := apimeta.FindStatusCondition(w.Status.Conditions, kueue.WorkloadEvicted); evictedCond != nil {
	evictedCond.Status = metav1.ConditionFalse
	evictedCond.Reason = "QuotaReserved"
	evictedCond.Message = "Previously: " + evictedCond.Message
	evictedCond.LastTransitionTime = metav1.Now()
}
// reset Preempted condition if present.
if preemptedCond := apimeta.FindStatusCondition(w.Status.Conditions, kueue.WorkloadPreempted); preemptedCond != nil {
	preemptedCond.Status = metav1.ConditionFalse
	preemptedCond.Reason = "QuotaReserved"
	preemptedCond.Message = "Previously: " + preemptedCond.Message
	preemptedCond.LastTransitionTime = metav1.Now()
}

Contributor

OTOH, setting it to False is not required to compute the metric, so maybe it does not need to be in this PR. WDYT @alculquicondor ?

Contributor

As discussed offline, we don't want to change the condition to False on admission.

Contributor

Could you just add an assertion that it remains True when QuotaReserved=True and Admitted=True?

Contributor

The condition should transition to False in the following scenarios:

  • When workload.spec.active=false. It should return to True when it's reactivated.
  • When evicted due to WaitForPodsReady, the workload will temporarily be in backoff. During this time, Requeued should be false.

Although I think we can leave these for a follow up.
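
As a rough sketch of the two transitions listed above (deferred to a follow-up, so not part of this PR): the `workloadState` type, its field names, and `requeuedStatus` are hypothetical and only model the rule as described in this comment, for a workload that already carries the Requeued condition.

```go
package main

import "fmt"

// workloadState holds only the inputs named in the review comment. It is a
// hypothetical stand-in, not a Kueue type.
type workloadState struct {
	Active    bool // mirrors workload.spec.active
	InBackoff bool // evicted due to WaitForPodsReady and still backing off
}

// requeuedStatus returns the proposed status of the Requeued condition:
// False while the workload is deactivated or in backoff, True otherwise.
func requeuedStatus(s workloadState) string {
	if !s.Active || s.InBackoff {
		return "False"
	}
	return "True"
}

func main() {
	fmt.Println(requeuedStatus(workloadState{Active: false}))                 // deactivated
	fmt.Println(requeuedStatus(workloadState{Active: true, InBackoff: true})) // in backoff
	fmt.Println(requeuedStatus(workloadState{Active: true}))                  // reactivated or backoff finished
}
```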

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 22, 2024
Contributor

@alculquicondor alculquicondor left a comment

/approve
Leaving lgtm to @mimowo

@@ -633,6 +633,12 @@ func TestReconciler(t *testing.T) {
Reason: "Pending",
Message: "The workload is deactivated",
}).
Condition(metav1.Condition{
Type: kueue.WorkloadRequeued,
Status: metav1.ConditionTrue,
Contributor

This is not correct. If the workload is deactivated, Requeued should be False.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mbobrovskyi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 22, 2024
@mimowo
Contributor

mimowo commented Apr 23, 2024

/lgtm
Let's address the remaining comments (test and transitions) in follow up(s).

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 23, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: d813f5fde2705e8a3177eeb00b5269ec4a993645

@mimowo
Contributor

mimowo commented Apr 23, 2024

I opened the issue to track the remaining work for transitioning: #2038.

@k8s-ci-robot k8s-ci-robot merged commit 92baacd into kubernetes-sigs:main Apr 23, 2024
15 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.7 milestone Apr 23, 2024
@mimowo
Contributor

mimowo commented Apr 23, 2024

/release-note-edit

Improve metrics related to workload's quota reservation and admission:
- fix admission_wait_time_seconds - to measure the time to "Admitted" condition since creation time or last requeue (as opposed to the "QuotaReserved" condition as before)
- add quota_reserved_wait_time_seconds - measures time to "QuotaReserved" condition since creation time, or last eviction time
- add quota_reserved_workloads_total - counts the number of workloads that got quota reserved
- admission_checks_wait_time_seconds - measures the time to admit a workload with admission checks since quota reservation
- use longer buckets (up to 10240s) for histogram metrics: admission_wait_time_seconds, quota_reserved_wait_time_seconds, admission_checks_wait_time_seconds

@mbobrovskyi mbobrovskyi deleted the fix/admission_wait_time_seconds branch April 23, 2024 18:38
@tenzen-y
Member

@mimowo As far as I can see, this PR reworks the semantics of the existing metrics (admitted_workloads_total and admission_wait_time_seconds) in a breaking way. So shouldn't we add ACTION REQUIRED to the release note?

@mimowo
Contributor

mimowo commented Apr 24, 2024

@mimowo As far as I can see, this PR reworks the semantics of the existing metrics (admitted_workloads_total and admission_wait_time_seconds) in a breaking way. So shouldn't we add ACTION REQUIRED to the release note?

I don't think admitted_workloads_total changed, we just added quota_reserved_workloads_total.
Regarding admission_wait_time_seconds I would say it was fixed for workloads with admission checks.
It does not seem to me user action is needed.

@tenzen-y
Member

I don't think admitted_workloads_total changed, we just added quota_reserved_workloads_total.

Oh, I see. Thank you for the clarifications. I was thinking that the metric had been changed.

Regarding admission_wait_time_seconds I would say it was fixed for workloads with admission checks.
It does not seem to me user action is needed.

That makes sense.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

[metrics] The admission_wait_time_seconds metric is misleading when AdmissionChecks are used
6 participants