Track usage in capacity status #57

Merged 1 commit on Feb 28, 2022

Conversation

Contributor

@alculquicondor alculquicondor commented Feb 23, 2022

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

  • Add new status fields
  • Reconcile when there are QueuedWorkload updates

This is important for the observability of kueue operations.

Which issue(s) this PR fixes:

Fixes #7

Special notes for your reviewer:

I'm putting a minimal check in the integration test. I plan to add a more comprehensive test as part of #68.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Feb 23, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 23, 2022
@alculquicondor alculquicondor force-pushed the capacity_status branch 2 times, most recently from 05ef630 to fe181b6 Compare February 23, 2022 22:39
Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Testing it locally; so far it looks good. For now, just one nit.

config/manager/kustomization.yaml (outdated, resolved)
@alculquicondor alculquicondor changed the title WIP Track usage in capacity status Track usage in capacity status Feb 24, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 24, 2022
@alculquicondor
Contributor Author

/hold
(because I don't want a single LGTM to cause a merge of the PR)

/assign @ahg-g @ArangoGutierrez

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 24, 2022
@alculquicondor alculquicondor force-pushed the capacity_status branch 2 times, most recently from dbefd6f to 27c3569 Compare February 24, 2022 22:43
Member

@denkensk denkensk left a comment

some comments.


// runningWorkloads is the number of workloads currently assigned to this
// capacity.
RunningWorkloads int32 `json:"runningWorkloads,omitempty"`
Member

Running is a little confusing. Actually, the workload is just assigned.

RunningWorkloads --> AssignedWorkloads

Contributor Author

yes... not sure what I was thinking.
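
As a rough illustration of the rename suggested above, the sketch below shows how such a counter might serialize. The `capacityStatus` type and `marshalStatus` helper are hypothetical stand-ins, not the project's actual API type; the field name `AssignedWorkloads` follows the reviewer's suggestion and is an assumption about the final name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// capacityStatus is a hypothetical, trimmed-down status type used only for
// illustration. The field name follows the rename suggested in the review.
type capacityStatus struct {
	// assignedWorkloads is the number of workloads currently assigned to
	// this capacity.
	AssignedWorkloads int32 `json:"assignedWorkloads,omitempty"`
}

// marshalStatus returns the JSON encoding of s, or "" on error.
func marshalStatus(s capacityStatus) string {
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(marshalStatus(capacityStatus{AssignedWorkloads: 1}))
	// With omitempty, a zero count is dropped from the output entirely.
	fmt.Println(marshalStatus(capacityStatus{}))
}
```

One thing to keep in mind with `omitempty` on an integer: a capacity with zero assigned workloads serializes without the field, so readers of the status cannot tell "zero" from "not reported".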

// Usage reports the used resources and number of workloads assigned to the
// capacity.
func (c *Cache) Usage(capObj *kueue.Capacity) (map[corev1.ResourceName]map[string]kueue.Usage, int, error) {
	c.Lock()
Member

Could we use an RWLock here?

Contributor Author

Yes, it makes sense now that there is more than one reader.
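
A minimal sketch of the read/write lock idea from this thread, assuming a simplified cache: `usageCache`, `AssignWorkload`, and `AssignedWorkloads` are hypothetical names for illustration and do not mirror the real `Cache` type. Only `sync.RWMutex` and its `Lock`/`RLock` methods come from the standard library.

```go
package main

import (
	"fmt"
	"sync"
)

// usageCache is a hypothetical stand-in for the capacity cache.
type usageCache struct {
	mu        sync.RWMutex
	workloads map[string]int // capacity name -> number of assigned workloads
}

func newUsageCache() *usageCache {
	return &usageCache{workloads: make(map[string]int)}
}

// AssignWorkload mutates the cache, so it takes the exclusive write lock.
func (c *usageCache) AssignWorkload(capacity string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.workloads[capacity]++
}

// AssignedWorkloads only reads, so it takes the shared read lock; several
// readers can hold it at once without blocking each other.
func (c *usageCache) AssignedWorkloads(capacity string) (int, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	n, ok := c.workloads[capacity]
	if !ok {
		return 0, fmt.Errorf("capacity %q not found in cache", capacity)
	}
	return n, nil
}

func main() {
	c := newUsageCache()
	c.AssignWorkload("prod-capacity")
	n, err := c.AssignedWorkloads("prod-capacity")
	fmt.Println(n, err)
}
```

The read lock only pays off when reads are frequent relative to writes; with a single reader, a plain mutex behaves about the same.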

if err := k8sClient.Get(ctx, types.NamespacedName{Name: "prod-capacity"}, &capObj); err != nil {
	return false
}
return capObj.Status.RunningWorkloads == 1
Member

Also check the usage?

Contributor Author

Will do as part of #68

	q.Add(requestForWorkloadCapacity(oldW))
}
newW := e.ObjectNew.(*kueue.QueuedWorkload)
if newW.Spec.AssignedCapacity != "" && newW.Spec.AssignedCapacity != oldW.Spec.AssignedCapacity {
Member

I'm curious about the circumstances under which this Spec.AssignedCapacity would change.

Contributor Author

It shouldn't happen... but I tried to make the code resilient to that.

I guess one case could be reassignment after preemption, if the user decides to change the queue for the workload, or the queue changes capacity 🤷
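
The update handling discussed in this thread can be sketched without any controller machinery. `capacitiesToReconcile` is a hypothetical helper, and the guard on the old assignment is an assumption, since the snippet under review does not show the condition that precedes the first `q.Add` call.

```go
package main

import "fmt"

// capacitiesToReconcile returns the capacity names that should be enqueued
// when a workload's assigned capacity goes from oldAssigned to newAssigned.
// Assumption: the old capacity is enqueued whenever it is non-empty.
func capacitiesToReconcile(oldAssigned, newAssigned string) []string {
	var keys []string
	if oldAssigned != "" {
		keys = append(keys, oldAssigned)
	}
	// Enqueue the new capacity only when it is set and differs from the old
	// one, so an unchanged assignment is not enqueued twice.
	if newAssigned != "" && newAssigned != oldAssigned {
		keys = append(keys, newAssigned)
	}
	return keys
}

func main() {
	fmt.Println(capacitiesToReconcile("", "prod"))     // newly assigned
	fmt.Println(capacitiesToReconcile("prod", "prod")) // unchanged assignment
	fmt.Println(capacitiesToReconcile("prod", "dev"))  // reassigned
}
```

Handling the reassigned case costs one extra comparison and keeps both capacities' statuses consistent if the assignment ever does change.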

Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Tested locally and it works; just a few comments.

}

// SetupWithManager sets up the controller with the Manager.
func (r *CapacityReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&kueue.Capacity{}).
		Watches(&source.Kind{Type: &kueue.QueuedWorkload{}}, &assignedWorkloadHandler{}).
Contributor

Suggested change
- Watches(&source.Kind{Type: &kueue.QueuedWorkload{}}, &assignedWorkloadHandler{}).
+ Watches(&source.Kind{Type: &kueue.QueuedWorkload{}}, &assignedWorkloadHandler{}).
+ WithLogger(r.log).

Contributor Author

I tried that before, but the manager adds extra names... so it looks like this:

capacity-reconciler.controller.capacity

Better to just have controller.capacity, to distinguish it from the event handler logs.

Maybe I should rename the handler log to capacity-handler?

log.Error(err, "Failed getting usage from capacity cache")
// This is likely because the capacity was recently removed and we didn't
// process that event yet.
return ctrl.Result{}, err
Contributor

Maybe in this scenario we want to requeue?

	// Requeue tells the Controller to requeue the reconcile key.  Defaults to false.
	Requeue bool

to trigger a new try to reconcile the state of the capacity?

Contributor Author

No, because we should receive the event, which will put the Capacity in the workqueue.

Although it doesn't really matter, there is nothing to do if the capacity is deleted.

api/v1alpha1/capacity_types.go (outdated, resolved)
api/v1alpha1/capacity_types.go (resolved)
pkg/capacity/capacity_test.go (resolved)
pkg/controller/core/capacity_controller.go (resolved)
pkg/controller/core/capacity_controller.go (resolved)
@ahg-g
Contributor

ahg-g commented Feb 25, 2022

Looks good to me, if you want to squash. I can lgtm if there are no pending comments from others.

Contributor

@ArangoGutierrez ArangoGutierrez left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 25, 2022
- Add new status fields
- Reconcile when there are QueuedWorkload updates

Change-Id: I44b96a286f9871a76e5657f1dbb1b4b31738c2f4
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2022
@alculquicondor
Contributor Author

rebased

@ahg-g
Contributor

ahg-g commented Feb 28, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2022
@ahg-g
Contributor

ahg-g commented Feb 28, 2022

/retest

why do we keep getting this failure?

@alculquicondor
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 28, 2022
@k8s-ci-robot k8s-ci-robot merged commit d74ccc6 into kubernetes-sigs:main Feb 28, 2022
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Add info to ClusterQueue status
5 participants