Suspend Test errors #116
/kind bug
we should return an error here: https://github.com/kubernetes-sigs/jobset/blob/main/pkg/controllers/jobset_controller.go#L137 to trigger another reconciliation. Actually, we need to return an error from Reconcile whenever one occurs anywhere throughout the function, which surprisingly we don't do :(
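Not the actual jobset code, just a minimal sketch of the pattern with controller-runtime; the `Reconciler` type and `ensureJobs` hook are illustrative stand-ins:

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Reconciler is a trimmed stand-in for the JobSet reconciler; the concrete
// API types are elided to keep the sketch self-contained.
type Reconciler struct {
	client.Client
	// ensureJobs stands in for the per-phase reconcile logic (hypothetical).
	ensureJobs func(ctx context.Context) error
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := r.ensureJobs(ctx); err != nil {
		// Returning the error (instead of swallowing it and returning nil)
		// makes controller-runtime requeue the request with backoff, so a
		// failed update is retried on a later reconciliation.
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```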
@danielvegamyhre we need to backport this fix
/assign
we can use server-side apply to avoid those conflicts, like we do when applying the admission status in kueue: https://github.com/kubernetes-sigs/kueue/blob/3b2f370586902723d4800d3b8e30eed9d89fbdae/pkg/workload/workload.go#L271 @alculquicondor any reason we don't do SSA in the jobframework for suspend/unsuspend (e.g., https://github.com/kubernetes-sigs/kueue/blob/3b2f370586902723d4800d3b8e30eed9d89fbdae/pkg/controller/jobframework/reconciler.go#L353)?
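For reference, a minimal sketch of what SSA for suspend could look like with controller-runtime, building the apply payload from an unstructured object so it contains only the field we own; the function and field-owner names are illustrative:

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// setSuspended server-side-applies only spec.suspend on a Job, so concurrent
// writers touching other fields don't produce "the object has been modified"
// conflicts.
func setSuspended(ctx context.Context, c client.Client, name, namespace string, suspend bool) error {
	u := &unstructured.Unstructured{}
	u.SetAPIVersion("batch/v1")
	u.SetKind("Job")
	u.SetName(name)
	u.SetNamespace(namespace)
	if err := unstructured.SetNestedField(u.Object, suspend, "spec", "suspend"); err != nil {
		return err
	}
	return c.Patch(ctx, u, client.Apply,
		client.FieldOwner("jobset-controller"), // illustrative owner name
		client.ForceOwnership)
}
```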
Sometimes you want the write to actually fail if there was a change in the Job. For example, you might not want to suspend a Job that finished right at the moment you were trying to suspend it. SSA requires some thinking overall.
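If we do want the write to fail when the Job changed underneath us, controller-runtime can keep Update's optimistic-lock semantics on a patch; a sketch, with an illustrative function name:

```go
package controllers

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// suspendOrConflict patches spec.suspend but includes the resourceVersion in
// the patch, so the API server rejects it with a conflict if the Job changed
// (e.g. finished) since we read it.
func suspendOrConflict(ctx context.Context, c client.Client, job *batchv1.Job) error {
	base := job.DeepCopy()
	suspend := true
	job.Spec.Suspend = &suspend
	patch := client.MergeFromWithOptions(base, client.MergeFromWithOptimisticLock{})
	return c.Patch(ctx, job, patch)
}
```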
ok, I guess with #118 merged, there is not much left to do with this one.
I think I still see those cache-modified errors in the logs, though. Should we change that update to a patch?
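To spell out the update-vs-patch difference as I understand it: a plain merge patch computed with `client.MergeFrom` carries no resourceVersion, so unrelated concurrent changes don't trigger the conflict error. A sketch, with an illustrative function name:

```go
package controllers

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchSuspend sends only the diff from base, without a resourceVersion, so
// it does not fail with "the object has been modified" when unrelated fields
// changed concurrently.
func patchSuspend(ctx context.Context, c client.Client, job *batchv1.Job, suspend bool) error {
	base := job.DeepCopy()
	job.Spec.Suspend = &suspend
	return c.Patch(ctx, job, client.MergeFrom(base))
}
```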
Actually, there is indeed something not right. The error message you are observing is this:
"updating jobset status" is the message we log when failing to update the jobset status to completed: jobset/pkg/controllers/jobset_controller.go Line 122 in 7500282
The test where this is failing is "suspend a running jobset", and that test is not supposed to cause the jobset to succeed. There is a bug here!
Not sure where you saw that. I get these logs:
I think this comes from this log: https://github.com/kubernetes-sigs/jobset/blob/main/pkg/controllers/jobset_controller.go#L218 My hunch is that when we get the list of jobs and classify them into different sections, we then update those. I wonder if the cache is out of date by then, and whether we should do a list and filter before applying status. I could imagine this being slow, though. WDYT?
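A sketch of the "list and filter before applying status" idea, using the manager's uncached APIReader so we classify against live state rather than a possibly stale informer cache; the function name and label key are illustrative:

```go
package controllers

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listChildJobs re-lists the child Jobs of a JobSet directly from the API
// server (pass mgr.GetAPIReader() as reader) instead of the informer cache.
func listChildJobs(ctx context.Context, reader client.Reader, jobSetName, ns string) (*batchv1.JobList, error) {
	var jobs batchv1.JobList
	err := reader.List(ctx, &jobs,
		client.InNamespace(ns),
		// Illustrative label key; the real controller's ownership label may differ.
		client.MatchingLabels{"jobset.sigs.k8s.io/jobset-name": jobSetName},
	)
	return &jobs, err
}
```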
That is from your original post: #116 (comment)
Yes, but who is updating those jobs? The job controller is not running in integration tests.
Yes, but it is inevitable; we need to reconcile again to ensure we are working on the latest cluster state.
So, reading through Kueue's integration tests, I notice that we only sometimes tear the test environment down. I am still dealing with some namespace termination errors, so I am curious why @alculquicondor and team chose to terminate the entire envtest, especially for the job/mpi controllers. Ex: https://github.com/kubernetes-sigs/kueue/blob/main/test/integration/controller/mpijob/mpijob_controller_test.go#L70
Because we needed to reset some base configuration in the kueue manager. We could keep the testenv, but it would have needed some refactoring. Let me open a tracking issue for that :) |
Honestly, I have been running into a lot of problems enforcing a clean environment before each test case, so I was curious whether tearing down the testenv may be what we need to do. That is why I brought it up: I am confused about why our test cases aren't cleaning up correctly.
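For what it's worth, a sketch of the per-suite teardown pattern I mean, with Ginkgo + envtest; the suite name and CRD path are illustrative:

```go
package integration

import (
	"path/filepath"
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

var (
	testEnv *envtest.Environment
	cfg     *rest.Config
)

func TestAPIs(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Suspend integration suite")
}

var _ = BeforeSuite(func() {
	testEnv = &envtest.Environment{
		// Illustrative path to the project's CRD manifests.
		CRDDirectoryPaths: []string{filepath.Join("..", "..", "config", "crd", "bases")},
	}
	var err error
	cfg, err = testEnv.Start()
	Expect(err).NotTo(HaveOccurred())
	Expect(cfg).NotTo(BeNil())
})

var _ = AfterSuite(func() {
	// Stopping the whole envtest guarantees a clean slate for the next
	// suite, at the cost of restarting the control plane each time.
	Expect(testEnv.Stop()).To(Succeed())
})
```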
With some recent work by @ahg-g, I think we can close this.
Sometimes we are getting the following logs in our suspend integration tests:
I looked into this originally thinking it was just a test-code problem. This error means that you are modifying an object that is stale. The solution is usually to get the object from the client again and then update; see the sketch below.
I think these errors are coming from when we update the job spec, so I wonder if we need to get the list of jobs again before submitting a patch request.
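The usual shape of that get-then-update loop, sketched with client-go's retry helper; the function name and `mutate` hook are illustrative:

```go
package controllers

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateJobWithRetry re-reads the latest Job on each attempt so the update is
// never based on a stale copy, retrying only on conflict errors.
func updateJobWithRetry(ctx context.Context, c client.Client, key client.ObjectKey, mutate func(*batchv1.Job)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var job batchv1.Job
		if err := c.Get(ctx, key, &job); err != nil {
			return err
		}
		mutate(&job)
		return c.Update(ctx, &job)
	})
}
```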