fix the suspend=true added to the job by the default job webhook has … #758

Merged
merged 1 commit into from
May 11, 2023

Conversation

fjding
Copy link
Contributor

@fjding fjding commented May 10, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fix a bug where the suspend=true set on the job by the default job webhook did not take effect.

Which issue(s) this PR fixes:

Fixes #757

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixed a bug where the suspend=true set on the job/mpijob by the default webhook did not take effect.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 10, 2023
@netlify
Copy link

netlify bot commented May 10, 2023

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit a3e923c
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/645c664c8c53e10009f2f602
😎 Deploy Preview https://deploy-preview-758--kubernetes-sigs-kueue.netlify.app

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 10, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @fjding. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 10, 2023
@kannon92
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 10, 2023
@tenzen-y
Copy link
Member

@fjding Thanks for your contribution. Could you fix the MPIJob as well?

@fjding
Copy link
Contributor Author

fjding commented May 10, 2023

@fjding Thanks for your contribution. Could you fix the MPIJob as well?

Sorry, I missed it. I will update the PR.

 }

 func (j *Job) Object() client.Object {
-	return &j.Job
+	return j.Job
Member

I think we should return a pointer here.

Contributor

this is a pointer already

Member

@tenzen-y tenzen-y May 10, 2023

Ah, I found the receiver is a pointer. Thanks!

@@ -121,11 +121,11 @@ func (h *parentWorkloadHandler) queueReconcileForChildJob(object client.Object,
 }

 type Job struct {
-	batchv1.Job
+	*batchv1.Job
Member

Uhm... Instead of making this a pointer, adding a new function, SetObject(obj client.Object) might be better.
@alculquicondor @mimowo @kerthcet WDYT?

Contributor

pointer sounds better

Member

My understanding of func Object() was incorrect. So, a pointer sounds better.
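
For readers following the review thread, here is a minimal sketch of why embedding by value versus by pointer matters here. It assumes the bug came from the wrapper copying the incoming object, so the defaulter mutated a copy rather than the object the API server sent. The types JobSpec, BatchJob, ValueJob, and PointerJob and both helper functions are hypothetical stand-ins, not the real batchv1 or Kueue types.

```go
package main

import "fmt"

// JobSpec and BatchJob are hypothetical stand-ins for the batchv1 types.
type JobSpec struct {
	Suspend *bool
}

type BatchJob struct {
	Spec JobSpec
}

// ValueJob embeds the job by value, like the wrapper before this change.
type ValueJob struct {
	BatchJob
}

// Suspend sets suspend=true on the embedded copy.
func (j *ValueJob) Suspend() {
	t := true
	j.Spec.Suspend = &t
}

// PointerJob embeds a pointer, like the wrapper after this change.
type PointerJob struct {
	*BatchJob
}

// Suspend sets suspend=true on the object the pointer refers to.
func (j *PointerJob) Suspend() {
	t := true
	j.Spec.Suspend = &t
}

// defaultWithValueWrapper mimics a defaulter that copies the incoming object.
func defaultWithValueWrapper(obj *BatchJob) {
	wrapped := &ValueJob{BatchJob: *obj} // dereferencing copies the struct
	wrapped.Suspend()                    // only the copy is mutated
}

// defaultWithPointerWrapper mimics a defaulter that shares the incoming object.
func defaultWithPointerWrapper(obj *BatchJob) {
	wrapped := &PointerJob{BatchJob: obj}
	wrapped.Suspend() // the caller's object is mutated
}

func isSuspended(obj *BatchJob) bool {
	return obj.Spec.Suspend != nil && *obj.Spec.Suspend
}

func main() {
	a := &BatchJob{}
	defaultWithValueWrapper(a)
	fmt.Println("value wrapper:", isSuspended(a)) // prints "value wrapper: false"

	b := &BatchJob{}
	defaultWithPointerWrapper(b)
	fmt.Println("pointer wrapper:", isSuspended(b)) // prints "pointer wrapper: true"
}
```

With the pointer embedded, returning j.Job from Object() already yields a pointer, which matches the reviewers' conclusion above.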

@alculquicondor
Copy link
Contributor

/kind bug
can you add a release note?

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 10, 2023
@alculquicondor
Copy link
Contributor

tests are failing, could you investigate?

…not taken effect

Signed-off-by: fjding <dingfangjie@bytedance.com>
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 11, 2023
@fjding
Copy link
Contributor Author

fjding commented May 11, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 11, 2023
@fjding
Copy link
Contributor Author

fjding commented May 11, 2023

tests are failing, could you investigate?

done

@tenzen-y
Copy link
Member

@fjding This PR is related to a user-facing change. So instead of adding None to the release note, adding appropriate comments would be great.

Member

@tenzen-y tenzen-y left a comment

@fjding Thank you!
/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels May 11, 2023
@k8s-ci-robot k8s-ci-robot removed the release-note-none Denotes a PR that doesn't merit a release note. label May 11, 2023
@mimowo
Copy link
Contributor

mimowo commented May 11, 2023

/lgtm thanks for fixing.

Given the importance of the scenario I suggest we add an e2e test. Either in this or a follow up PR. WDYT @alculquicondor ?

@tenzen-y
Copy link
Member

/lgtm thanks for fixing.

Given the importance of the scenario I suggest we add an e2e test. Either in this or a follow up PR. WDYT @alculquicondor ?

+1

@alculquicondor
Copy link
Contributor

-1 on e2e, but we should have an integration test for this.

@alculquicondor
Copy link
Contributor

@fjding could you add one integration test for job in this PR?

/lgtm
/approve

You can do it in a follow up, but ideally we cherry-pick that one too.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, fjding

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2023
@alculquicondor
Copy link
Contributor

/cherry-pick release-0.3

@k8s-infra-cherrypick-robot
Copy link
Contributor

@alculquicondor: once the present PR merges, I will cherry-pick it on top of release-0.3 in a new PR and assign it to you.

In response to this:

/cherry-pick release-0.3


@k8s-infra-cherrypick-robot
Copy link
Contributor

@alculquicondor: new pull request created: #765


@alculquicondor alculquicondor mentioned this pull request May 11, 2023
14 tasks
@tenzen-y
Copy link
Member

we should have an integration test for this.

Through more investigation, I found that it is hard to test this scenario.

The job reconciler sets suspend=true even if the webhook doesn't set suspend=true.

// stopJob will suspend the job, and also restore node affinity, reset job status if needed.
func (r *JobReconciler) stopJob(ctx context.Context, job GenericJob, object client.Object, wl *kueue.Workload, eventMsg string) error {
	log := ctrl.LoggerFrom(ctx)
	// Suspend the job at first then we're able to update the scheduling directives.
	job.Suspend()

So, we should run the integration test only for webhook. But I'm not sure how to start the webhook server without a manager using the envtest.

@alculquicondor Any thoughts?

@tenzen-y
Copy link
Member

Maybe we need to start a simple HTTP server to run the webhook server without a manager.

testEnv := &envtest.Environment{}

hookServer := &webhook.Server{
  Port: testEnv.WebhookInstallOptions.LocalServingPort,
  Host: testEnv.WebhookInstallOptions.LocalServingHost,
}

httpServer := &http.Server{
  Addr:    hookServer.Address(),
  Handler: hookServer.ServeMux,
}
...

@alculquicondor
Copy link
Contributor

I guess we would face the same problem with e2e tests. Maybe it's not worth doing this?

@tenzen-y
Copy link
Member

I guess we would face the same problem with e2e tests.

That's right.

Maybe it's not worth doing this?

As another approach, we can add a unit test for the following function:

func ApplyDefaultForSuspend(job GenericJob, manageJobsWithoutQueueName bool) {

WDYT? @alculquicondor

@alculquicondor
Copy link
Contributor

That wouldn't help, because we were passing the wrong object to that function.

The unit test needs to be for the Job webhook function (and MPIJob), if there isn't one already.

@tenzen-y
Copy link
Member

tenzen-y commented May 12, 2023

That wouldn't help, because we were passing the wrong object to that function.

That makes sense.

The unit test needs to be for the Job webhook function (and MPIJob), if there isn't one already.

Yes, we don't have any unit tests for the Defaulter.

@mimowo
Copy link
Contributor

mimowo commented May 12, 2023

I guess we would face the same problem with e2e tests.

I suggested e2e because I thought we had webhooks running there. This comment suggests that:

// To verify that webhooks are ready, let's create a simple resourceflavor

@tenzen-y
Copy link
Member

Uhm, adding a unit test for the Defaulter might be better.

@alculquicondor
Copy link
Contributor

Yes, but the e2e test would have the same problem as this #758 (comment)

@tenzen-y
Copy link
Member

Yes, but the e2e test would have the same problem as this #758 (comment)

Yes, it would be hard to verify the Defaulter in E2E :(

@tenzen-y
Copy link
Member

In the integration test, as a hacky approach, we might be able to verify the Defaulter as follows:

  1. Manually create a Workload with Admitted=true.
  2. Create a Job with Active=1.

Although, I'm not sure whether the hacky test is worth it.

@tenzen-y
Copy link
Member

For e2e, I don't have any good ideas.

@alculquicondor
Copy link
Contributor

Not worth it, IMO. Unit tests at the webhook level should be enough.

@tenzen-y
Copy link
Member

Not worth it, IMO. Unit tests at the webhook level should be enough.

Agree. Let's go ahead :)

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

The suspend=true added to the job by the default job webhook has not taken effect.
7 participants