
Operator stops working in interval after error #102

Closed
francoran opened this issue Aug 23, 2021 · 7 comments · Fixed by #110

francoran commented Aug 23, 2021

Hey,

I ran a stress test against kube-fledged to see how it behaves at scale.
My refresh interval is 10m.
For the first 3 runs it worked great, then this error was thrown:

image_manager.go:309] Error deleting job ml-imagecache-wrxzx: jobs.batch "ml-imagecache-wrxzx" not found

and since then it has never run again.
I see the completed jobs hanging around in a Completed state for 1hr+ and no new jobs starting.

Any thoughts?

Thanks,

@senthilrch (Owner)

@francoran: thanks for reporting this issue. Do you mind sharing the script you are using to run the test? I'll run it in my cluster and take a closer look.

senthilrch added the bug label Aug 24, 2021
@francoran (Author)

Thanks @senthilrch.
I just scaled from 12 nodes to ~40 and it happened somewhere along the way. We use relatively large images (~8GB),
and we ask to pull 4 of these to every machine.

So even though I deleted all the hanging jobs yesterday, the operator didn't go back to work;
the imagecache status is still stuck on pending.

Do you know if there's a way to release that processing state without reinstalling the entire chart?

francoran (Author) commented Aug 24, 2021

Another update: I tried deleting the imagecache and re-applying it.
The issue still persists: it ran for the first time and never stopped. I assume something is still locking the controller.
I do see in the controller logs that it threw that error again, for a different job this time:
E0824 07:21:39.895451 1 image_manager.go:309] Error deleting job ml-imagecache-hkpb8: jobs.batch "ml-imagecache-hkpb8" not found

I guess when it can't find a job to delete, it gets stuck in some limbo state.

@senthilrch (Owner)

@francoran : thanks for these details.

kube-fledged has a separate image_manager routine that is responsible for creating the Jobs that pull/delete the images. The image manager maintains a jobs-tracker to keep track of the Jobs it creates. After all jobs have completed, or after at least one of the jobs hasn't completed even after --image-pull-deadline-duration, whichever occurs earlier, the image manager cleans up the jobs: it walks its jobs-tracker and calls the k8s API to delete the jobs one by one.

If it encounters a situation where a job-id present in its jobs-tracker is not found in k8s, it quits further processing. (The screenshot of the relevant deletion code in image_manager.go is omitted here.)
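For readers without the screenshot, the clean-up path described above follows roughly this pattern (a minimal sketch for illustration only, not the actual kube-fledged source; the jobTracker map, the namespace argument and the client wiring are assumptions):

```go
package imagemanager

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteImagePullJobs mimics the behaviour described above: walk the
// jobs-tracker and delete each Job via the k8s API, but bail out on the
// first error.
func deleteImagePullJobs(client kubernetes.Interface, namespace string, jobTracker map[string]struct{}) error {
	for jobName := range jobTracker {
		err := client.BatchV1().Jobs(namespace).Delete(context.TODO(), jobName, metav1.DeleteOptions{})
		if err != nil {
			// Any error -- including `jobs.batch "..." not found` -- stops the
			// loop here, so the remaining jobs are never cleaned up and no
			// status update is sent back to the controller; the ImageCache
			// stays in Processing.
			log.Printf("Error deleting job %s: %v", jobName, err)
			return err
		}
		delete(jobTracker, jobName)
	}
	return nil
}
```

The early return on the first error is what leaves the remaining jobs, and the ImageCache status, untouched.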

And the status of the ImageCache gets stuck in Processing, with no actual processing happening. I'm wondering why a job-id in its jobs-tracker was not found in k8s. Any thoughts?

@francoran (Author)

I honestly don't know; maybe there was a scale-down or rescheduling? That could kill the job without the tracker being aware of it.
What do you think?

@senthilrch (Owner)

Yes, could be... I'll have to refactor the code in the image manager to make it more resilient to situations like this, so that it doesn't get stuck. When it encounters a job that is not present in k8s, it should continue cleaning up the remaining jobs and send a status update back to the main controller routine. Right now it doesn't send any update back, so the controller doesn't update the status...

I'll try to fix this in the next release once I get more time to work on this... Thanks again for spotting and reporting it.
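
A more resilient clean-up loop could look roughly like the sketch below (illustration only, not the actual change that went into #110; the cleanupResult type and the status channel are assumptions):

```go
package imagemanager

import (
	"context"
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupResult is reported back to the controller so it can move the
// ImageCache status out of Processing even if some deletions fail.
type cleanupResult struct {
	Failed []string // jobs that could not be deleted for reasons other than NotFound
}

// deleteImagePullJobsResilient keeps going past missing jobs and always
// reports a result back to the controller.
func deleteImagePullJobsResilient(client kubernetes.Interface, namespace string,
	jobTracker map[string]struct{}, statusCh chan<- cleanupResult) {

	var result cleanupResult
	for jobName := range jobTracker {
		err := client.BatchV1().Jobs(namespace).Delete(context.TODO(), jobName, metav1.DeleteOptions{})
		switch {
		case err == nil || apierrors.IsNotFound(err):
			// A job that has already vanished (e.g. after a scale-down or
			// rescheduling) is treated as cleaned up; keep going instead of
			// bailing out.
			delete(jobTracker, jobName)
		default:
			log.Printf("Error deleting job %s: %v", jobName, err)
			result.Failed = append(result.Failed, jobName)
		}
	}
	// Always report back so the controller can update the ImageCache status.
	statusCh <- result
}
```

Treating NotFound as already-deleted and always sending a result back is what lets the controller move the ImageCache out of Processing even when individual jobs have vanished.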

senthilrch self-assigned this Aug 24, 2021
senthilrch added this to the v0.8.2 milestone Aug 24, 2021
@senthilrch (Owner)

@francoran: I've updated the code and built a new image, senthilrch/kubefledged-controller:v0.8.2-beta.1. Could you re-install kube-fledged and check how it works?
