
Operator stops working in interval after error #102

Closed
francoran opened this issue Aug 23, 2021 · 7 comments · Fixed by #110

francoran commented Aug 23, 2021

Hey,

I ran a stress test against kube-fledged to see how it behaves at scale.
My refresh interval is 10m.
For the first 3 runs it worked great, then this error was thrown:

image_manager.go:309] Error deleting job ml-imagecache-wrxzx: jobs.batch "ml-imagecache-wrxzx" not found

and since then it has never run again.
I see the completed jobs hanging around in a Completed state for 1hr+ and no new jobs starting.

Any thoughts?

Thanks,

@senthilrch (Owner)

@francoran: thanks for reporting this issue. Do you mind sharing the script you are using to run the test? I'll run it in my cluster and take a closer look.

senthilrch added the bug label Aug 24, 2021
@francoran (Author)

Thanks @senthilrch.
I just scaled from 12 nodes to ~40 and it happened somewhere along the way. We use relatively large images (~8GB),
and we ask to pull 4 of these to every machine.

So even though I deleted all the hanging jobs yesterday, the operator didn't go back to work;
the imagecache status is still stuck on pending.

Do you know if there's a way to release that processing state without reinstalling the entire chart?

francoran (Author) commented Aug 24, 2021

Another update: I tried deleting the imagecache and re-applying it.
The issue still persists: it ran for the first time and never stopped. I assume something is still locking the controller.
I do see in the controller logs that it threw that error again, for a different job this time:
E0824 07:21:39.895451 1 image_manager.go:309] Error deleting job ml-imagecache-hkpb8: jobs.batch "ml-imagecache-hkpb8" not found

I guess when it can't find a job to delete, it gets stuck in some limbo state.

@senthilrch (Owner)

@francoran : thanks for these details.

kube-fledged has a separate image_manager routine that is responsible for creating the Jobs that pull/delete the images. The image manager maintains a jobs-tracker to keep track of the Jobs it creates. After all jobs have completed, or after at least one of the jobs hasn't completed even after --image-pull-deadline-duration, whichever occurs earlier, the image manager cleans up the jobs: it walks its jobs-tracker and calls the k8s API to delete the jobs one by one.

If it encounters a situation where a job-id present in its jobs-tracker is not found in k8s, it quits further processing. (The screenshot of the relevant deletion code in image_manager.go is omitted here.)
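For readers without the screenshot, the clean-up path described above follows roughly this pattern (a minimal sketch for illustration only, not the actual kube-fledged source; the jobTracker map, the namespace argument and the client wiring are assumptions):

```go
package imagemanager

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteImagePullJobs mimics the behaviour described above: walk the
// jobs-tracker and delete each Job via the k8s API, but bail out on the
// first error.
func deleteImagePullJobs(client kubernetes.Interface, namespace string, jobTracker map[string]struct{}) error {
	for jobName := range jobTracker {
		err := client.BatchV1().Jobs(namespace).Delete(context.TODO(), jobName, metav1.DeleteOptions{})
		if err != nil {
			// Any error -- including `jobs.batch "..." not found` -- stops the
			// loop here, so the remaining jobs are never cleaned up and no
			// status update is sent back to the controller; the ImageCache
			// stays in Processing.
			log.Printf("Error deleting job %s: %v", jobName, err)
			return err
		}
		delete(jobTracker, jobName)
	}
	return nil
}
```

The early return on the first error is what leaves the remaining jobs, and the ImageCache status, untouched.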

And the status of the ImageCache gets stuck in Processing, with no actual processing happening. I'm wondering why a job-id in its jobs-tracker was not found in k8s. Any thoughts?

@francoran (Author)

I honestly don't know; maybe there was a scale-down or rescheduling? That could kill the job without the tracker being aware of it.
What do you think?

@senthilrch (Owner)

Yes, could be... I'll have to refactor the code in the image manager to make it more resilient to situations like this, so that it doesn't get stuck. When it encounters a job that is not present in k8s, it should continue cleaning up the remaining jobs and send a status update back to the main controller routine. Right now it doesn't send any update back, so the controller doesn't update the status...

I'll try to fix this in the next release once I get more time to work on this... Thanks again for spotting and reporting it.
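
A more resilient clean-up loop could look roughly like the sketch below (illustration only, not the actual change that went into #110; the cleanupResult type and the status channel are assumptions):

```go
package imagemanager

import (
	"context"
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupResult is reported back to the controller so it can move the
// ImageCache status out of Processing even if some deletions fail.
type cleanupResult struct {
	Failed []string // jobs that could not be deleted for reasons other than NotFound
}

// deleteImagePullJobsResilient keeps going past missing jobs and always
// reports a result back to the controller.
func deleteImagePullJobsResilient(client kubernetes.Interface, namespace string,
	jobTracker map[string]struct{}, statusCh chan<- cleanupResult) {

	var result cleanupResult
	for jobName := range jobTracker {
		err := client.BatchV1().Jobs(namespace).Delete(context.TODO(), jobName, metav1.DeleteOptions{})
		switch {
		case err == nil || apierrors.IsNotFound(err):
			// A job that has already vanished (e.g. after a scale-down or
			// rescheduling) is treated as cleaned up; keep going instead of
			// bailing out.
			delete(jobTracker, jobName)
		default:
			log.Printf("Error deleting job %s: %v", jobName, err)
			result.Failed = append(result.Failed, jobName)
		}
	}
	// Always report back so the controller can update the ImageCache status.
	statusCh <- result
}
```

Treating NotFound as already-deleted and always sending a result back is what lets the controller move the ImageCache out of Processing even when individual jobs have vanished.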

senthilrch self-assigned this Aug 24, 2021
senthilrch added this to the v0.8.2 milestone Aug 24, 2021
@senthilrch (Owner)

@francoran: I've updated the code and built a new image, senthilrch/kubefledged-controller:v0.8.2-beta.1. Could you re-install kube-fledged and check how it works?
