Operator stops working in interval after error #102
Comments
@francoran: thanks for reporting this issue. Do you mind sharing the script you are using for running the test? I'll run those in my cluster and take a closer look.
Thanks @senthilrch. So even though I deleted all hanging jobs yesterday, the operator didn't go back to work. Do you know if there's a way to release that processing state without reinstalling the entire chart?
Another update: I tried to delete the imagecache and re-apply it. I guess when it can't find a job to delete, it gets stuck in some limbo state.
@francoran : thanks for these details. kube-fledged has a separate image_manager routine that is responsible for creating the Jobs for pulling/deleting the images. The image manager maintains a jobs-tracker to keep track of the Jobs it creates. After all jobs have completed, or after at least one job hasn't completed within --image-pull-deadline-duration, whichever occurs earlier, the image manager cleans up the jobs: it goes through its jobs-tracker and calls the k8s API to delete the jobs one by one. If it encounters a situation where a job-id present in its jobs-tracker is not found in k8s, it quits further processing. See below code. The status of the ImageCache then gets stuck in Processing, with no processing actually happening. I'm wondering why a job-id in its jobs-tracker was not found in k8s? Any thoughts?
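(The code excerpt referenced above wasn't captured in this thread. As a stand-in, here is a minimal, purely illustrative Go sketch of the cleanup loop as described in the comment; the names `cleanUpJobs`, `jobTracker`, and the use of client-go are assumptions, not the actual kube-fledged source. The point it shows: once a delete call fails because a Job is missing, the loop returns early, the remaining Jobs are never cleaned up, and no status update reaches the controller.)

```go
// Illustrative sketch only; not the actual kube-fledged image_manager code.
package imagemanager

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanUpJobs deletes every Job recorded in the jobs-tracker. If any delete
// fails (for example because the Job no longer exists in the cluster), it
// returns immediately: the remaining Jobs stay behind and the controller never
// receives a completion update, so the ImageCache stays in Processing.
func cleanUpJobs(client kubernetes.Interface, namespace string, jobTracker map[string]struct{}) error {
	for jobName := range jobTracker {
		err := client.BatchV1().Jobs(namespace).Delete(context.TODO(), jobName, metav1.DeleteOptions{})
		if err != nil {
			return fmt.Errorf("error deleting job %s: %v", jobName, err)
		}
	}
	return nil
}
```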
I honestly don't know, maybe there was a scale-down or rescheduling? That can cause the job to die, and the tracker might be unaware in that scenario.
Yes, could be... I'll have to refactor the code in the image manager to make it more resilient to situations like this, which will prevent it from getting stuck. When it encounters a situation where a job is not present in k8s, it should continue cleaning up the remaining jobs and send a status update back to the main controller routine. Right now it doesn't send any update back, so the controller doesn't update the status. I'll try to fix this in the next release once I get more time to work on this. Thanks again for spotting and reporting it.
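(A hedged sketch of what such a fix could look like, again with assumed names rather than the actual patch: treat a NotFound error as "already cleaned up", keep going with the remaining Jobs, and always send a status update back to the controller routine.)

```go
// Illustrative sketch of the described fix; not the actual kube-fledged patch.
package imagemanager

import (
	"context"
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanUpJobsResilient tolerates Jobs that have already disappeared from the
// cluster (e.g. after a node scale-down), continues cleaning up the remaining
// Jobs, and always reports back so the controller can move the ImageCache out
// of Processing.
func cleanUpJobsResilient(client kubernetes.Interface, namespace string,
	jobTracker map[string]struct{}, statusCh chan<- string) {
	for jobName := range jobTracker {
		err := client.BatchV1().Jobs(namespace).Delete(context.TODO(), jobName, metav1.DeleteOptions{})
		if err != nil && !apierrors.IsNotFound(err) {
			// A real failure: log it, but keep processing the remaining Jobs.
			log.Printf("error deleting job %s: %v", jobName, err)
			continue
		}
		// Job deleted, or it was already gone: either way, move on.
	}
	// Always send the completion update back to the main controller routine.
	statusCh <- "job cleanup complete"
}
```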
@francoran : I've updated the code and built a new image.
Hey,
I ran a stress test against kube-fledged to see how it behaves at scale.
My running interval is every 10m,
For the first 3 runs it worked great, then this error was thrown:
image_manager.go:309] Error deleting job ml-imagecache-wrxzx: jobs.batch "ml-imagecache-wrxzx" not found
and since then it never ran.
I see the completed jobs hanging around for 1hr+ in Completed state, and no new jobs are starting.
Any thoughts?
Thanks,