
Suspend Test errors #116

Closed
kannon92 opened this issue May 5, 2023 · 15 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

@kannon92 (Contributor) commented May 5, 2023

Sometimes we are getting the following logs in our suspend integration tests:

JobSet controller jobset is created and its jobs go through a series of updates suspend a running jobset
/home/ec2-user/Work/GIT/jobset-kevin/test/integration/controller/jobset_controller_test.go:409
  STEP: creating jobset @ 05/05/23 19:59:03.915
  STEP: checking that jobset creation succeeds @ 05/05/23 19:59:03.915
  STEP: checking all jobs were created successfully @ 05/05/23 19:59:03.924
  STEP: checking all jobs are not suspended @ 05/05/23 19:59:04.18
  STEP: checking all jobs are suspended @ 05/05/23 19:59:04.191
  2023-05-05T19:59:04Z  DEBUG   events  SuspendedJobs   {"type": "Normal", "object": {"kind":"JobSet","namespace":"test-ns-76npt","name":"test-js","uid":"ab8ce85b-71af-49f1-83b6-d0373f0d1fca","apiVersion":"jobset.x-k8s.io/v1alpha1","resourceVersion":"373"}, "reason": "Suspended"}
  2023-05-05T19:59:04Z  ERROR   updating jobset status  {"controller": "jobset", "controllerGroup": "jobset.x-k8s.io", "controllerKind": "JobSet", "JobSet": {"name":"test-js","namespace":"test-ns-76npt"}, "namespace": "test-ns-76npt", "name": "test-js", "reconcileID": "848294ee-190f-4a24-a524-47ece0c1a036", "jobset": {"name":"test-js","namespace":"test-ns-76npt"}, "error": "Operation cannot be fulfilled on jobsets.jobset.x-k8s.io \"test-js\": the object has been modified; please apply your changes to the latest version and try again"}
  sigs.k8s.io/jobset/pkg/controllers.(*JobSetReconciler).Reconcile
        /home/ec2-user/Work/GIT/jobset-kevin/pkg/controllers/jobset_controller.go:152
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /home/ec2-user/go/1.20.2/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /home/ec2-user/go/1.20.2/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /home/ec2-user/go/1.20.2/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /home/ec2-user/go/1.20.2/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
  STEP: checking jobset status is: Suspended @ 05/05/23 19:59:04.451

I looked into this originally thinking it was just a test code problem. This error means that you are modifying an object that is stale; the usual solution is to get the latest object from the client and then update it.

I think these errors come from when we update the job spec, so I wonder if we need to get the list of jobs before submitting a patch request.
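
For reference, a minimal sketch of that get-then-update pattern with a conflict retry, using a plain batch/v1 Job since that is what we end up mutating (the `suspendJob` helper is made up for illustration, not the actual controller code):

```go
package controllers

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// suspendJob is a hypothetical helper: it re-fetches the latest object on
// every attempt so the update is never issued against a stale cached copy,
// and retries if the apiserver reports a resourceVersion conflict.
func suspendJob(ctx context.Context, c client.Client, key client.ObjectKey) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var job batchv1.Job
		if err := c.Get(ctx, key, &job); err != nil {
			return err
		}
		suspend := true
		job.Spec.Suspend = &suspend
		return c.Update(ctx, &job)
	})
}
```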

@kannon92 (Contributor, Author) commented May 5, 2023

/kind bug

k8s-ci-robot added the kind/bug label on May 5, 2023
@ahg-g (Contributor) commented May 5, 2023

We should return an error here: https://github.com/kubernetes-sigs/jobset/blob/main/pkg/controllers/jobset_controller.go#L137 to trigger another reconciliation.

Actually, we need to return an error from Reconcile whenever one occurs throughout the function, which surprisingly we don't do :(
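
Roughly the pattern I mean, as a stripped-down sketch rather than the actual jobset_controller.go (`suspendJobs` here is just a placeholder):

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

// JobSetReconciler is a stripped-down stand-in for the real reconciler.
type JobSetReconciler struct {
	client.Client
}

// suspendJobs is a placeholder for the helper that marks the child Jobs as
// suspended; only its error handling matters for this sketch.
func (r *JobSetReconciler) suspendJobs(ctx context.Context, req ctrl.Request) error {
	return nil
}

// Reconcile bubbles errors up instead of only logging them, so
// controller-runtime requeues the object with backoff and the next attempt
// runs against fresher cached state.
func (r *JobSetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := r.suspendJobs(ctx, req); err != nil {
		log.FromContext(ctx).Error(err, "suspending jobs")
		return ctrl.Result{}, err // triggers another reconciliation
	}
	return ctrl.Result{}, nil
}
```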

@ahg-g (Contributor) commented May 5, 2023

@danielvegamyhre we need to backport this fix

@ahg-g (Contributor) commented May 5, 2023

/assign

@ahg-g (Contributor) commented May 5, 2023

We can use server-side apply to avoid those conflicts, like we do when applying the admission status in Kueue: https://github.com/kubernetes-sigs/kueue/blob/3b2f370586902723d4800d3b8e30eed9d89fbdae/pkg/workload/workload.go#L271

@alculquicondor any reason we don't do SSA in the jobframework for suspend/unsuspend? (e.g., https://github.com/kubernetes-sigs/kueue/blob/3b2f370586902723d4800d3b8e30eed9d89fbdae/pkg/controller/jobframework/reconciler.go#L353)
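
As a rough illustration of what SSA for suspend could look like with the controller-runtime client (an unstructured apply patch that only owns spec.suspend; the helper and field-owner name are made up, and this is not the Kueue helper linked above):

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// setSuspendSSA is a hypothetical helper: the apply patch contains only the
// fields this controller wants to own (spec.suspend), so writes by other
// controllers to other fields cannot produce resourceVersion conflicts.
func setSuspendSSA(ctx context.Context, c client.Client, namespace, name string, suspend bool) error {
	patch := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "batch/v1",
		"kind":       "Job",
		"metadata": map[string]interface{}{
			"name":      name,
			"namespace": namespace,
		},
		"spec": map[string]interface{}{
			"suspend": suspend,
		},
	}}
	return c.Patch(ctx, patch, client.Apply,
		client.FieldOwner("jobset-controller"), client.ForceOwnership)
}
```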

@alculquicondor commented:

Sometimes you actually want to fail if there was a change in the Job. For example, you might not want to suspend if the Job finished right at the time you tried to suspend it.

SSA requires some thinking overall.

@ahg-g (Contributor) commented May 9, 2023

OK, I guess with #118 merged there isn't much left to do on this one.

ahg-g closed this as completed on May 9, 2023
@kannon92 (Contributor, Author) commented May 9, 2023

I think I still see those cache-modified errors in the logs, though. Should we change that update to a patch?
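
For context, a minimal sketch of the merge-patch alternative I have in mind (the `patchObjectStatus` helper and its mutate callback are made up); note that a plain merge patch drops the optimistic-concurrency check, so it avoids the conflict error but can also silently overwrite a concurrent status change:

```go
package controllers

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchObjectStatus captures the original object, applies the caller's
// mutation, and sends a merge patch for the status subresource. Because the
// patch does not carry a resourceVersion, it cannot fail with "the object
// has been modified", but it also skips the optimistic-concurrency check.
func patchObjectStatus(ctx context.Context, c client.Client, obj client.Object, mutate func()) error {
	orig := obj.DeepCopyObject().(client.Object)
	mutate()
	return c.Status().Patch(ctx, obj, client.MergeFrom(orig))
}
```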

@ahg-g (Contributor) commented May 9, 2023

Actually, there is indeed something not right.

The error message you are observing is this:

  2023-05-05T19:59:04Z  ERROR   updating jobset status  {"controller": "jobset", "controllerGroup": "jobset.x-k8s.io", "controllerKind": "JobSet", "JobSet": {"name":"test-js","namespace":"test-ns-76npt"}, "namespace": "test-ns-76npt", "name": "test-js", "reconcileID": "848294ee-190f-4a24-a524-47ece0c1a036", "jobset": {"name":"test-js","namespace":"test-ns-76npt"}, "error": "Operation cannot be fulfilled on jobsets.jobset.x-k8s.io \"test-js\": the object has been modified; please apply your changes to the latest version and try again"}

"updating jobset status" is the message we log when failing to update the jobset status to completed:

log.Error(err, "updating jobset status")

The test where this is failing is "suspend a running jobset", and this test is not supposed to cause the jobset to succeed. There is a bug here!

ahg-g reopened this on May 9, 2023
@kannon92 (Contributor, Author) commented May 9, 2023

> Actually, there is indeed something not right.
>
> The error message you are observing is this:
>
>   2023-05-05T19:59:04Z  ERROR   updating jobset status  {"controller": "jobset", "controllerGroup": "jobset.x-k8s.io", "controllerKind": "JobSet", "JobSet": {"name":"test-js","namespace":"test-ns-76npt"}, "namespace": "test-ns-76npt", "name": "test-js", "reconcileID": "848294ee-190f-4a24-a524-47ece0c1a036", "jobset": {"name":"test-js","namespace":"test-ns-76npt"}, "error": "Operation cannot be fulfilled on jobsets.jobset.x-k8s.io \"test-js\": the object has been modified; please apply your changes to the latest version and try again"}
>
> "updating jobset status" is the message we log when failing to update the jobset status to completed:
>
> log.Error(err, "updating jobset status")
>
> The test where this is failing is "suspend a running jobset", and this test is not supposed to cause the jobset to succeed. There is a bug here!

Not sure where you saw that.

I get these logs:

  2023-05-09T14:47:31Z  ERROR   suspending jobset       {"controller": "jobset", "controllerGroup": "jobset.x-k8s.io", "controllerKind": "JobSet", "JobSet": {"name":"test-js","namespace":"jobset-ns-5gb9r"}, "namespace": "jobset-ns-5gb9r", "name": "test-js", "reconcileID": "0620bd2d-9d6b-459d-a5b2-e523b75c0e9b", "jobset": {"name":"test-js","namespace":"jobset-ns-5gb9r"}, "error": "Operation cannot be fulfilled on jobsets.jobset.x-k8s.io \"test-js\": the object has been modified; please apply your changes to the latest version and try again"}
  sigs.k8s.io/jobset/pkg/controllers.(*JobSetReconciler).Reconcile

I think this comes from this log: https://github.com/kubernetes-sigs/jobset/blob/main/pkg/controllers/jobset_controller.go#L218

My hunch is that when we get the list of jobs and classify them into different sections, we then update those. I wonder if the cache is out of date then and we should maybe do a list and filter before applying the status.

I could imagine this being slow though, so WDYT?
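
Something like this is what I mean by list-and-filter right before writing status (the label key and helpers are assumptions for illustration, not the real JobSet constants):

```go
package controllers

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// classifyChildJobs re-lists the child Jobs right before the status update so
// the buckets reflect the freshest cache state available.
func classifyChildJobs(ctx context.Context, c client.Client, jsNamespace, jsName string) ([]*batchv1.Job, []*batchv1.Job, error) {
	var jobs batchv1.JobList
	if err := c.List(ctx, &jobs,
		client.InNamespace(jsNamespace),
		// Hypothetical label key; the real controller selects its children differently.
		client.MatchingLabels{"jobset.x-k8s.io/jobset-name": jsName},
	); err != nil {
		return nil, nil, err
	}
	var active, finished []*batchv1.Job
	for i := range jobs.Items {
		job := &jobs.Items[i]
		if jobFinished(job) {
			finished = append(finished, job)
		} else {
			active = append(active, job)
		}
	}
	return active, finished, nil
}

// jobFinished reports whether a Job has a terminal Complete or Failed condition.
func jobFinished(job *batchv1.Job) bool {
	for _, cond := range job.Status.Conditions {
		if (cond.Type == batchv1.JobComplete || cond.Type == batchv1.JobFailed) && cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```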

@ahg-g (Contributor) commented May 11, 2023

> Not sure where you saw that.

That is from your original post #116 (comment).

> I wonder if the cache is out of date then and we should maybe do a list and filter before applying the status.

Yes, but who is updating those jobs? The job controller is not running in integration tests.

See this run: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_jobset/123/pull-jobset-test-integration-main/1656333510689951744

> I could imagine this being slow though, so WDYT?

Yes, but it is inevitable; we need to reconcile again to ensure we are working on the latest cluster state.

@kannon92 (Contributor, Author) commented:

So, reading through Kueue's integration tests, I notice that we only sometimes use DeleteNamespaces, and other times we actually tear down the entire test framework (a sketch of the two styles follows the links below).

I am still dealing with some namespace termination errors, so I am curious why @alculquicondor and team chose to terminate the entire envtest, especially for the job/mpi controllers.

Ex: https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/mpijob/mpijob_controller_test.go#L70
And https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/job/job_controller_test.go#L71
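
For comparison, a rough Ginkgo sketch of the two cleanup styles. This assumes the usual suite-level variables `ctx`, `cancel`, `k8sClient`, and `testEnv` from a typical envtest suite_test.go; it is not our actual test code:

```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"
)

var _ = ginkgo.Describe("jobset controller", func() {
	var ns *corev1.Namespace

	ginkgo.BeforeEach(func() {
		ns = &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{GenerateName: "test-ns-"}}
		gomega.Expect(k8sClient.Create(ctx, ns)).To(gomega.Succeed())
	})

	// Style 1: delete the per-test namespace and keep the shared envtest running.
	// Note that envtest has no namespace controller, so namespaces can linger in Terminating.
	ginkgo.AfterEach(func() {
		gomega.Expect(k8sClient.Delete(ctx, ns)).To(gomega.Succeed())
	})
})

// Style 2: tear the whole environment down once, when the suite ends.
var _ = ginkgo.AfterSuite(func() {
	cancel()                                           // stop the manager
	gomega.Expect(testEnv.Stop()).To(gomega.Succeed()) // stop the envtest apiserver/etcd
})
```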

@alculquicondor commented:

> chose to terminate the entire envtest, especially for the job/mpi controllers.

Because we needed to reset some base configuration in the kueue manager. We could keep the testenv, but it would have needed some refactoring. Let me open a tracking issue for that :)

@kannon92 (Contributor, Author) commented:

> chose to terminate the entire envtest, especially for the job/mpi controllers.
>
> Because we needed to reset some base configuration in the kueue manager. We could keep the testenv, but it would have needed some refactoring. Let me open a tracking issue for that :)

Honestly, I have been running into a lot of problems enforcing a clean environment before each test case, so I was curious whether tearing down the testenv may be what we need to do. That was why I brought it up; I am confused about why our test cases aren't cleaning up correctly.

@kannon92 (Contributor, Author) commented:

With some recent work by @ahg-g, I think we can close this.
