More unstable cluster #329
Hmm, while looking into causes I see your comment here :^) kubernetes/kubernetes#66689 (comment)
Yes. :-) I have such a wait in place, but it does not really resolve it. I thought it was just a race condition, but in fact it seems the namespace is simply not created properly. I am guessing some core service dies or something.
So I watch all events and print them out. During a run of my CI I am noticing events like these, unrelated to what I am doing in our tests:
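For reference, a minimal sketch of what such an event watch might look like with the Python Kubernetes client; this is an illustrative assumption, not the reporter's actual script:

```python
# Hypothetical sketch: stream and print all cluster events with the
# official Python Kubernetes client.
from kubernetes import client, config, watch

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(v1.list_event_for_all_namespaces):
    obj = event["object"]
    print(f"{obj.last_timestamp} {obj.involved_object.kind}/"
          f"{obj.involved_object.name}: {obj.reason} - {obj.message}")
```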
Not sure why a new node would be registered in the middle of a CI run? Maybe because it died before and was now recreated? Also, I have an example situation which showcases this issue, and I would just like to make sure it is not something I am doing wrong. This is example log output from the CI script I wrote:
So after the systems are up in their pods, I start tests against them. This is done by creating a job. After I create the job, I use a watch to wait for its pod to appear. I am assuming some core service has issues running on the kind cluster, and this is why no pod appears and why my CI script complains. The question is which service has issues and why.
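A rough sketch of that pattern with the Python client; the job spec, names, and polling details are assumptions for illustration, not the reporter's actual code:

```python
# Hypothetical sketch: create a Job in the test namespace and poll for its
# pod to appear. If no pod ever shows up, the loop at the bottom is where a
# CI script would hang or time out.
import time
from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()
core = client.CoreV1Api()

namespace = "ci-tests"  # assumed namespace name
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="run-tests"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(name="tests", image="tests:latest")],
            )
        )
    ),
)
batch.create_namespaced_job(namespace, job)

# The job controller labels its pods with "job-name=<job name>".
deadline = time.time() + 300
while time.time() < deadline:
    pods = core.list_namespaced_pod(namespace, label_selector="job-name=run-tests")
    if pods.items:
        break
    time.sleep(5)
else:
    raise RuntimeError("no pod created for job 'run-tests' within 5 minutes")
```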
@mitar are you reusing the same cluster for all tests, or does each test get its own cluster? If things start hanging, the first thing to look at is the API server. You should be able to jump into the control plane container and start looking around at the different containers. Also, it would be useful to track your machine's resources if you deploy kind for a long period of time.
So there is a series of tests inside one CI run, and the cluster is created only once per CI run. I cannot really jump into the "CI run" because it is running on GitLab workers. So maybe there is some resource starvation or something, but it would be useful if that were somehow reported. Or make it so that core systems are never removed. I would understand, and could debug more easily, if my test pods got killed with "out of resources" by the cluster. But not core pods.
Definitely agree on the core systems never being removed; unfortunately there are some Kubernetes limitations there where core workloads can be rescheduled. Kubernetes is designed to have reserved overhead, but we can't fully do that in kind right now. I spent some time looking at another user's logs with some similar issues but we haven't pinned it down yet. We still need to debug and solve this as much as possible, though, and the tooling for that needs improvement.
This may be related: #303
Rescheduled to where? If this is a one-node cluster, where would it go? :-)
Re-created on the same node. Some of these (like daemonsets) are expected to change, I think... RE: #303, I think these mounts may fix some of the issues with repeated nesting. The other user seemed to have tracked it down to the host systemd killing things in their environment. I can't replicate this yet (not enough details, and it doesn't happen in my environments so far...).
Hm, would there be anything in the log if the host kills a container?
You may see a signal handler log in the pod / container logs IIRC; I haven't seen this occur first-hand yet. It would not be normal on a "real" cluster. Another user with issues: #136 (comment). I will spend some time later this week looking at how to improve debug-ability and bring this up in our subproject meeting tomorrow. If it's related to mounting the host cgroups it may be particularly tricky to identify though... 😬 kind has been remarkably stable in the environments I have access to ... which is of course wildly unhelpful for identifying causes of instability 😞
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
I think we can close this for now. It looks relatively stable recently. |
Aside: I've disabled that bot in this repo, and I do hope we've made progress on ensuring stability in more configurations. |
So now I am running kind on our CI (GitLab) and I am noticing more instability than I would hope for. What we do is create a kind cluster and then create a namespace, create a few pods in the namespace, run some jobs inside the namespace (one job generally takes around an hour or so), clean up the namespace, and repeat with another namespace. We do this a few times. I use the Python Kubernetes client.
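A condensed sketch of that cycle with the Python client might look roughly like this; the namespace naming, loop count, and deletion wait are assumptions, not the reporter's actual CI code:

```python
# Hypothetical sketch of the per-round cycle: create a namespace, run a test
# round inside it, then delete the namespace and wait until it is gone.
import time
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
core = client.CoreV1Api()

def run_round(index):
    ns = f"ci-round-{index}"  # assumed naming scheme
    core.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=ns)))
    try:
        pass  # create pods, run the job, wait for results (omitted)
    finally:
        core.delete_namespace(ns)
        # Namespace deletion is asynchronous; wait until it really disappears
        # before starting the next round.
        while True:
            try:
                core.read_namespace(ns)
            except ApiException as exc:
                if exc.status == 404:
                    break
                raise
            time.sleep(5)

for i in range(3):  # "We do this a few times."
    run_round(i)
```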
What happens sometimes is that at some point commands do not seem to get through. A typical example is that after creating a namespace, commands start failing with a "default service account does not exist" error message. I added a check to wait for it to be created, but it seems it never is. And this happens only occasionally.
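A minimal version of such a check with the Python client might look like this; the timeout and error handling are assumptions:

```python
# Hypothetical sketch: wait for the "default" ServiceAccount to appear in a
# freshly created namespace before creating pods that reference it.
import time
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
core = client.CoreV1Api()

def wait_for_default_service_account(namespace, timeout=120):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            core.read_namespaced_service_account("default", namespace)
            return
        except ApiException as exc:
            if exc.status != 404:
                raise
        time.sleep(2)
    raise RuntimeError(
        f"default service account never appeared in namespace {namespace!r}"
    )
```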
Another example is that I have a watch observing and waiting for some condition (like all pods being ready), and it just dies on me and the connection gets closed.
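For what it's worth, one way such a watch is often made resilient is to wrap it in a reconnect loop; a hedged sketch, where the readiness check and timeouts are assumptions rather than the reporter's code:

```python
# Hypothetical sketch: re-establish the watch when the connection drops,
# instead of letting a single closed connection fail the whole CI step.
import time
import urllib3
from kubernetes import client, config, watch

config.load_kube_config()
core = client.CoreV1Api()

def wait_for_all_pods_ready(namespace, timeout=600):
    deadline = time.time() + timeout
    while time.time() < deadline:
        w = watch.Watch()
        try:
            # timeout_seconds bounds each watch on the server side, so the
            # loop periodically re-establishes the watch even without errors.
            for _ in w.stream(core.list_namespaced_pod,
                              namespace=namespace, timeout_seconds=60):
                pods = core.list_namespaced_pod(namespace).items
                if pods and all(
                    (p.status.container_statuses or [])
                    and all(cs.ready for cs in p.status.container_statuses)
                    for p in pods
                ):
                    return
        except (urllib3.exceptions.ProtocolError,
                urllib3.exceptions.ReadTimeoutError):
            time.sleep(2)  # connection dropped; retry the watch
        finally:
            w.stop()
    raise RuntimeError(f"pods in {namespace!r} not ready within {timeout}s")
```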
I am attaching kind logs for one such failed CI session.
results-43316.zip
I see errors like: