Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry provisioning request #1351

Merged

Conversation

mimowo
Copy link
Contributor

@mimowo mimowo commented Nov 20, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Part of #1353

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Support for retry of provisioning request.

When `ProvisioningACC` is enabled, and there are existing ProvisioningRequests, they are going to be recreated.
This may cause job evictions for some long-running jobs which were using the ProvisioningRequests.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 20, 2023
Copy link

netlify bot commented Nov 20, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit dcde2d8
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/655cf43b5fe005000878677f

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Nov 20, 2023
@mimowo mimowo force-pushed the retry-provisioning-request branch 8 times, most recently from 70b345a to 90dae37 Compare November 20, 2023 14:34
@mimowo
Copy link
Contributor Author

mimowo commented Nov 20, 2023

/kind feature
/kind api-change

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Nov 20, 2023
@mimowo
Copy link
Contributor Author

mimowo commented Nov 20, 2023

/hold
For KEP update

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 20, 2023
@mimowo mimowo force-pushed the retry-provisioning-request branch 2 times, most recently from c602a0a to b3a3eff Compare November 21, 2023 12:12
@mimowo mimowo changed the title WIP: Retry provisioning request Retry provisioning request Nov 21, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 21, 2023
@mimowo
Copy link
Contributor Author

mimowo commented Nov 21, 2023

/hold cancel
We have the KEP update PR: #1355

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 21, 2023
@mimowo mimowo force-pushed the retry-provisioning-request branch 4 times, most recently from 8c033e5 to d25d221 Compare November 21, 2023 16:26
if matches(req, wl.Name, checkName) {
prc, err := c.helper.ProvReqConfigForAdmissionCheck(ctx, checkName)
if err == nil && c.reqIsNeeded(ctx, wl, prc) && requestHasParamaters(req, prc) {
if currPr, exists := checkNameToPr[checkName]; !exists || getAttempt(ctx, currPr, wl.Name, checkName) < getAttempt(ctx, req, wl.Name, checkName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should only be one ProvReq that is not Failed at a time. Furthermore, the most recent one is the one that we should care about. There shouldn't be a need to parse the number of attempts from the name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OTOH, the counter as part of the name has a nice implication: you can't create two ProvReq with the same name, so we don't have to worry about having more than one active ProvReq when the informer cache is slightly out of date.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I would add two arguments:

  • we need to keep the number of attempts somewhere, we could use annotation, but this is API change and the intention is to backport to 0.5
  • it is possible that the request failed and the cloud provider needs some time to clean up nodes / resources corresponding to the old one. Starting new avoids the problem

Copy link
Contributor

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

return reconcile.Result{}, nil
}

func (c *Controller) getCheckNameToPr(ctx context.Context, wl *kueue.Workload, relevantChecks []string, ownedPrs []autoscaling.ProvisioningRequest) map[string]*autoscaling.ProvisioningRequest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (c *Controller) getCheckNameToPr(ctx context.Context, wl *kueue.Workload, relevantChecks []string, ownedPrs []autoscaling.ProvisioningRequest) map[string]*autoscaling.ProvisioningRequest {
func (c *Controller) activeOrLastPRNameForChecks(ctx context.Context, wl *kueue.Workload, relevantChecks []string, ownedPrs []autoscaling.ProvisioningRequest) map[string]*autoscaling.ProvisioningRequest {

Copy link
Contributor Author

@mimowo mimowo Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is more specific, rename done, but I use activeOrLastPRForChecks, to avoid a Get inside syncCheckStates

pkg/controller/admissionchecks/provisioning/controller.go Outdated Show resolved Hide resolved
@@ -40,7 +42,7 @@ const (
customResourceTwo = "example.org/res2"
)

var _ = ginkgo.Describe("Provisioning", func() {
var _ = ginkgo.Describe("Provisioning", ginkgo.Ordered, ginkgo.ContinueOnFailure, func() {
Copy link
Contributor

@alculquicondor alculquicondor Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modify the vars, and different tests use different value, so I don't want the tests to run in parallel, but maybe this is not needed?

}, util.Timeout, util.Interval).Should(gomega.Succeed())
})

ginkgo.By("Checking the admission check is pending", func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to check this. We only need to check that a new ProvReq appears and the check is still Pending by then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well. I have added the check, because this was a source of a bug in my initial implementation. I would transition to Rejected, and then back to Failed when the replacement provisioning request is added. So, I wanted to test that there is no flip to Rejected

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 21, 2023
@alculquicondor
Copy link
Contributor

/remove-kind api-change

@k8s-ci-robot k8s-ci-robot removed the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Nov 21, 2023
Copy link
Contributor

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 21, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: eaa1ae065806872292fd04bc9ba2ee7cbf9ce652

@alculquicondor
Copy link
Contributor

/hold
actually, please squash

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 21, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 21, 2023
@alculquicondor
Copy link
Contributor

/hold cancel
/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Nov 21, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: ef4a44ddb8782acf7cf00a21007f3391bc91e011

@k8s-ci-robot k8s-ci-robot merged commit 92079c0 into kubernetes-sigs:main Nov 21, 2023
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.6 milestone Nov 21, 2023
Comment on lines +710 to 713
func GetProvisioningRequestName(workloadName, checkName string, attempt int32) string {
fullName := fmt.Sprintf("%s-%s-%d", workloadName, checkName, int(attempt))
return limitObjectName(fullName)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mimowo @alculquicondor What happens if admins upgrade kueue v0.4 to the latest version?
I guess that ProvisioningRequests (PR) using the old name form will be recreated, and I'm wondering if this name form change will affect the PRs if CA is provisioning nodes triggered by the PRs.

Copy link
Member

@tenzen-y tenzen-y Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding PRs with CapacityAvailable, Provisioned, or Failed, I don't think that the above concerns exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be considered attempt 1.

Copy link
Member

@tenzen-y tenzen-y Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be considered attempt 1.

You mean that PRs with old name forms wouldn't recreated, right? If so, that makes sense.
Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the old PRs need to be recreated as they would not pass the matches function.

It is easy to match old PRs (without the suffix) as attempt "1", but one thing to consider is that provisioning request wl-ac-1, would be matching to both ac and ac-1.

I think this is ok to leave as is, because there is no support for provisioning requests in Kueue 0.4, it starts in 0.5.0 and the ProvisioningACC is Alpha. See feature-stages.

@mwielgus @alculquicondor @trasc WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a release note.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is always good to have additional safety and not loosing any in-flight requests, but additional work to perfect upgrade story for just released non-default, alpha-like feature seem to have relatively little impact and return on investment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is always good to have additional safety and not loosing any in-flight requests, but additional work to perfect upgrade story for just released non-default, alpha-like feature seem to have relatively little impact and return on investment.

That makes sense. I'm ok with leaving on recreation for ProvisioningRequest since the ProvisioningACC is the Alpha stage.

I believe that users don't use the featureGate yet in production.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mimowo Thanks for updating the release note. The notification would be helpful :)

@mimowo mimowo deleted the retry-provisioning-request branch November 29, 2023 14:57
k8s-ci-robot pushed a commit that referenced this pull request Jan 17, 2024
…ts (#1595)

* Support retry for provisioning requests

* Fix creation of the second provisioning request
@alculquicondor
Copy link
Contributor

/release-note-edit

Support for retry of provisioning request.

When `ProvisioningACC` is enabled, and there are existing ProvisioningRequests, they are going to be recreated.
This may cause job evictions for some long-running jobs which were using the ProvisioningRequests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants