Limit the number of goroutines used when deleting jobs #123

Merged

@tenzen-y tenzen-y commented May 7, 2023

I limited the number of goroutines used when deleting Jobs.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 7, 2023
// where the jobs are recreated.
backgroundPolicy := metav1.DeletePropagationBackground
if err := r.Delete(ctx, job, &client.DeleteOptions{PropagationPolicy: &backgroundPolicy}); client.IgnoreNotFound(err) != nil {
finalErr = err
Member Author

Potential issue: a race condition may occur here.
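
A minimal sketch of one way to avoid this kind of race, assuming the deletions run in separate goroutines that all write to one shared error variable: guard the variable with a mutex. The names `deleteAll`, `errDeleteFailed`, and the `del` callback are hypothetical and only stand in for the controller's real delete call; this is not the PR's code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDeleteFailed is a hypothetical sentinel error used only in this sketch.
var errDeleteFailed = errors.New("delete failed")

// deleteAll calls del for every name concurrently and returns the first
// error that was recorded, or nil. The mutex guards firstErr because
// several goroutines may fail at the same time.
func deleteAll(names []string, del func(string) error) error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for _, name := range names {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			if err := del(n); err != nil {
				mu.Lock()
				defer mu.Unlock()
				if firstErr == nil {
					firstErr = err
				}
			}
		}(name)
	}
	wg.Wait()
	return firstErr
}

func main() {
	err := deleteAll([]string{"job-0", "job-1", "job-2"}, func(name string) error {
		if name == "job-1" {
			return errDeleteFailed
		}
		return nil
	})
	fmt.Println(err)
}
```

Running a variant without the mutex under `go test -race` or `go run -race` is one way to confirm whether the unsynchronized write is actually reported.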

@tenzen-y tenzen-y force-pushed the use-workqueue-when-deleting-jobs branch 2 times, most recently from db70f73 to 0276be0 Compare May 7, 2023 17:38
@@ -364,31 +369,23 @@ func (r *JobSetReconciler) restartPolicyRecreateAll(ctx context.Context, js *job
return nil
}

func (r *JobSetReconciler) deleteJobs(ctx context.Context, js *jobset.JobSet, jobsForDeletion []*batchv1.Job) error {
Member Author

This function doesn't use the JobSet.

@tenzen-y tenzen-y changed the title Limit the number of goroutines used when deleting jobs WIP: Limit the number of goroutines used when deleting jobs May 7, 2023
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 7, 2023
if err := r.Delete(ctx, targetJob, &client.DeleteOptions{PropagationPolicy: &backgroundPolicy}); client.IgnoreNotFound(err) != nil {
lock.Lock()
defer lock.Unlock()
finalErr = errors.Join(finalErr, err)
Member Author

Oh, errors.Join() was introduced in Go 1.20, but this project uses Go 1.19.
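
For reference, a small sketch of what `errors.Join` does, which is why it needs Go 1.20 or later: it combines several errors into one, drops nil entries, and returns nil when every entry is nil. The helper `joinAll` and the two sentinel errors are hypothetical names for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors used only in this sketch.
var (
	errNotFound = errors.New("not found")
	errTimeout  = errors.New("timeout")
)

// joinAll combines errs into a single error with errors.Join (Go 1.20+).
// Nil entries are dropped, and the result is nil if every entry is nil.
func joinAll(errs ...error) error {
	return errors.Join(errs...)
}

func main() {
	err := joinAll(errNotFound, nil, errTimeout)
	// The joined error still matches each underlying cause.
	fmt.Println(errors.Is(err, errNotFound))
	fmt.Println(errors.Is(err, errTimeout))
	fmt.Println(joinAll(nil, nil) == nil)
}
```

On Go 1.19 this does not compile, so the options are to bump the Go version or to aggregate errors another way.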

Member Author

@ahg-g @danielvegamyhre I'd like to update the Go version to 1.20 in another PR before merging this one into main.

Does bumping the Go version to v1.20 sound good?

Contributor

Sounds good to me, though let's wait for Abdullah to confirm as well.

Member Author

Sure :)

Contributor

I will say that I remember upgrading the e2e tests in test infra to Go 1.20.

Contributor

I meant that I think we are already using Go 1.20 for the e2e tests.

Contributor

I'll open a PR bumping that and we can discuss.

Member Author

Thanks!

@tenzen-y tenzen-y force-pushed the use-workqueue-when-deleting-jobs branch 2 times, most recently from 72ba788 to f49cbc6 Compare May 8, 2023 16:30
@tenzen-y tenzen-y changed the title WIP: Limit the number of goroutines used when deleting jobs Limit the number of goroutines used when deleting jobs May 8, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 8, 2023
tenzen-y commented May 8, 2023

Squashed into one and rebased.

@danielvegamyhre
Contributor

/lgtm

Leaving approval for Abdullah's review

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 8, 2023
const RestartsKey string = "jobset.sigs.k8s.io/restart-attempt"
const (
RestartsKey string = "jobset.sigs.k8s.io/restart-attempt"
parallelDeletions int = 8
Contributor

Isn't this too low? Any idea what we set this number to in other places, such as the Job API?

Member Author

@tenzen-y tenzen-y May 8, 2023

> Isn't this too low?

That makes sense, but I'm not sure what an appropriate value would be. How about 100 as the default? Or do you have a better idea?

> Any idea what we set this number to in other places, such as the Job API?

How about setting it via config: #55?

Contributor

I like the concept of a "slow start" in the linked code from the job controller. Maybe it would be simpler to start with some constant value like 100 then implement slow start separately?
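
A rough sketch of the general "slow start" idea mentioned above, assuming it means running calls in batches that start small and double in size for as long as every call in a batch succeeds. This is an illustration of the concept only, not the job controller's actual code; `slowStartBatch`, `minInt`, and `errStep` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errStep is a hypothetical sentinel error used only in this sketch.
var errStep = errors.New("step failed")

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// slowStartBatch calls fn(0..count-1) in batches. The first batch has
// initialBatchSize calls; each later batch doubles in size while every call
// in the previous batch succeeded. It stops after the first batch that
// contains a failure and returns the number of successful calls plus one of
// the errors from that batch.
func slowStartBatch(count, initialBatchSize int, fn func(i int) error) (int, error) {
	if initialBatchSize < 1 {
		initialBatchSize = 1
	}
	remaining, successes, next := count, 0, 0
	for batch := minInt(remaining, initialBatchSize); batch > 0; batch = minInt(2*batch, remaining) {
		errCh := make(chan error, batch) // buffered so failing calls never block
		var wg sync.WaitGroup
		wg.Add(batch)
		for i := 0; i < batch; i++ {
			go func(idx int) {
				defer wg.Done()
				if err := fn(idx); err != nil {
					errCh <- err
				}
			}(next + i)
		}
		wg.Wait()
		next += batch
		successes += batch - len(errCh)
		if len(errCh) > 0 {
			return successes, <-errCh
		}
		remaining -= batch
	}
	return successes, nil
}

func main() {
	ok, err := slowStartBatch(10, 1, func(i int) error {
		if i == 3 {
			return errStep
		}
		return nil
	})
	fmt.Println(ok, err)
}
```

The appeal of this shape is that a systematic failure (for example, a quota or webhook rejection) is detected after a handful of calls instead of after a full-width burst.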

Contributor

I guess for deletions we can continue to use a constant (I would set it to 50); but adopt a slow start approach for parallel creation.
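
A minimal sketch of deletions capped by a constant, using only the standard library: a buffered channel acts as a semaphore so that at most `limit` calls are in flight. The branch name suggests the PR itself relies on a workqueue helper, so treat this as an equivalent illustration rather than the PR's code. `deleteBounded` and the `del` callback are hypothetical; `parallelDeletions` reuses the constant name from the diff, with the value 50 proposed in this thread.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelDeletions uses the value suggested in review (assumption: 50).
const parallelDeletions = 50

// deleteBounded calls del for every item while keeping at most limit calls
// in flight, and returns all errors that occurred.
func deleteBounded(items []string, limit int, del func(string) error) []error {
	if limit < 1 {
		limit = 1 // a zero-capacity semaphore would block forever
	}
	sem := make(chan struct{}, limit)
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for _, item := range items {
		sem <- struct{}{} // blocks while limit deletions are already running
		wg.Add(1)
		go func(it string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := del(it); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(item)
	}
	wg.Wait()
	return errs
}

func main() {
	jobs := make([]string, 200)
	for i := range jobs {
		jobs[i] = fmt.Sprintf("job-%d", i)
	}
	errs := deleteBounded(jobs, parallelDeletions, func(string) error { return nil })
	fmt.Println(len(errs))
}
```

A fixed cap keeps the load on the API server predictable, which fits deletions where a failure in one call says little about the others.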

Member Author

> I guess for deletions we can continue to use a constant (I would set it to 50); but adopt a slow start approach for parallel creation.

Great suggestions! I agree.

Contributor

@ahg-g ahg-g left a comment

Thanks, Yuki. I think we need to parallelize job creation too; I created an issue: #130

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y tenzen-y force-pushed the use-workqueue-when-deleting-jobs branch from f49cbc6 to b425f26 Compare May 10, 2023 16:20
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 10, 2023
@tenzen-y
Member Author

@ahg-g @danielvegamyhre I have updated the PR.

@tenzen-y
Member Author

/test pull-jobset-test-e2e-main-1-24

@danielvegamyhre
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 10, 2023
ahg-g commented May 10, 2023

/test pull-jobset-test-e2e-main-1-24

any idea why this failed?

ahg-g commented May 10, 2023

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, tenzen-y

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 10, 2023
@k8s-ci-robot k8s-ci-robot merged commit 749a77c into kubernetes-sigs:main May 10, 2023
@tenzen-y tenzen-y deleted the use-workqueue-when-deleting-jobs branch May 11, 2023 06:09
@tenzen-y
Member Author

> /test pull-jobset-test-e2e-main-1-24
>
> any idea why this failed?

Actually, I created #135.
