
More unstable cluster #329

Closed · mitar opened this issue Feb 22, 2019 · 14 comments

Labels: kind/bug Categorizes issue or PR as related to a bug. kind/support Categorizes issue or PR as a support question.

Comments

@mitar (Contributor) commented Feb 22, 2019

So now I am running kind on our CI (GitLab) and I am noticing more instability than I would hope for. What we do is create a kind cluster and then create a namespace, start a few pods in that namespace, run some jobs inside the namespace (one job generally takes around an hour or so), clean up the namespace, and repeat with another namespace. We do this a few times. I use the Python Kubernetes client.

What happens sometimes is that at some point commands do not seem to get through. A typical example is that after creating a namespace, commands start failing with a "default service account does not exist" error message. I added a check to wait for it to be created, but it seems it never is. And this happens only occasionally.
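A minimal sketch of such a wait, assuming the official Python kubernetes client (the function name and the 60-second timeout are illustrative, not taken from the report):

```python
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException


def wait_for_default_service_account(namespace, timeout=60):
    """Poll until the 'default' ServiceAccount exists in the namespace."""
    config.load_kube_config()
    core = client.CoreV1Api()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            core.read_namespaced_service_account("default", namespace)
            return True
        except ApiException as e:
            if e.status != 404:
                raise  # anything other than "not created yet" is a real error
        time.sleep(1)
    return False
```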

Another example is that I have a watch observing and waiting for some condition (like all pods being ready), and that just dies on me and the connection gets closed.
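A sketch of a watch loop that tolerates the connection being closed, again assuming the Python client (the namespace, selector, and readiness check are placeholders):

```python
import time

import urllib3
from kubernetes import client, config, watch


def wait_until_pods_ready(namespace, label_selector, timeout=300):
    """Watch pods and return True once all matching pods report Ready.

    The API server can close the watch connection at any time, so the
    stream is restarted until the overall timeout expires.
    """
    config.load_kube_config()
    core = client.CoreV1Api()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        w = watch.Watch()
        try:
            for _ in w.stream(core.list_namespaced_pod, namespace=namespace,
                              label_selector=label_selector, timeout_seconds=30):
                # on every event, re-check whether all matching pods are Ready
                pods = core.list_namespaced_pod(
                    namespace, label_selector=label_selector).items
                if pods and all(
                        any(c.type == "Ready" and c.status == "True"
                            for c in (p.status.conditions or []))
                        for p in pods):
                    return True
        except urllib3.exceptions.ProtocolError:
            continue  # connection dropped; restart the watch
        finally:
            w.stop()
    return False
```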

I am attaching kind logs for one such failed CI session.

results-43316.zip

I see errors like:

Error syncing pod b734fcc86501dde5579ce80285c0bf0c ("kube-scheduler-kind-control-plane_kube-system(b734fcc86501dde5579ce80285c0bf0c)"), skipping: failed to "StartContainer" for "kube-scheduler" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=kube-scheduler pod=kube-scheduler-kind-control-plane_kube-system(b734fcc86501dde5579ce80285c0bf0c)"
@BenTheElder (Member)

Hmm, while looking into causes I see your comment here :^) kubernetes/kubernetes#66689 (comment)

@mitar (Contributor, Author) commented Feb 22, 2019

Yes. :-) I have such a wait in place, but it does not really resolve it. I thought it was just a race condition, but in fact it seems the namespace simply does not get created properly. I am guessing some core service dies or something.

@mitar (Contributor, Author) commented Feb 22, 2019

So I watch all events and print them out. During a CI run I am noticing events like this, unrelated to what our tests are doing:

namespace=default, reason=RegisteredNode, message=Node kind-control-plane event: Registered Node kind-control-plane in Controller, for={kind=Node, name=kind-control-plane}

Not sure why a new node would be registered in the middle of a CI run? Maybe because it died before and has now been recreated?

Also, I have an example situation which showcases this issue, and I would just like to make sure it is not something I am doing wrong. This is example log output from the CI script I wrote:

[2019-02-22 06:02:34,187] [cmu/simple-ta3] Running tests.
[2019-02-22 06:02:36,425] podpreset.settings.k8s.io/tests-configuration created
[2019-02-22 06:02:36,717] job.batch/simple-ta3-tests created
[2019-02-22 06:02:36,727] [cmu/simple-ta3] Waiting for all pods matching a selector 'controller-uid in (73ff307f-3667-11e9-9ef3-024280e8c710)' to be ready.
[2019-02-22 06:03:36,987] >>> ERROR [cmu/simple-ta3] Exception raised: Waiting timeout: No pods appeared in 60 seconds.

So after the systems are up in their pods, I start tests against them. This is done by creating a job. After I create the job, I use list_namespaced_job to obtain the job description, from which I store (in Python) job.spec.selector.match_labels['controller-uid']. Then I wait for pods with the selector controller-uid in ({job_selectors}), where job_selectors is what I stored above. That should match any pods created to satisfy that job, no? The issue is that sometimes no such pod appears within 60 seconds after the job was created, and this is why my CI script then complains. I would assume pods should appear within 60 seconds, of course not yet in a ready state, but at least visible by watching list_namespaced_pod with that selector.
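A minimal sketch of that flow, assuming the official Python kubernetes client (read_namespaced_job stands in for the list_namespaced_job lookup described above; the function name and polling interval are illustrative):

```python
import time

from kubernetes import client, config


def wait_for_job_pods(job_name, namespace, timeout=60):
    """Return the pods created for a job, waiting up to `timeout` seconds."""
    config.load_kube_config()
    core = client.CoreV1Api()
    batch = client.BatchV1Api()

    # extract the controller-uid label the job controller puts on its pods
    job = batch.read_namespaced_job(job_name, namespace)
    uid = job.spec.selector.match_labels["controller-uid"]
    selector = f"controller-uid in ({uid})"

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        pods = core.list_namespaced_pod(namespace, label_selector=selector).items
        if pods:
            return pods
        time.sleep(2)
    raise TimeoutError(f"No pods appeared in {timeout} seconds for selector {selector!r}")
```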

I am assuming some core service has issues running on the kind cluster, and this is why no pod appears and why my CI script complains. The question is which service has issues, and why.

@0xmichalis

@mitar are you reusing the same cluster for all tests, or does each test get its own cluster? If things start hanging, the first thing to look at is the API server. You should be able to jump into the control plane container and start looking around at the different containers. It would also be useful to track your machine's resources if you run kind for a long period of time.
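For example, with the default cluster name kind, the control-plane node runs in a container named kind-control-plane, and it can be inspected from the host (a sketch using the usual kind defaults; container IDs and exact commands are not taken from this report):

```sh
# open a shell in the control-plane node container
docker exec -it kind-control-plane bash

# inside the node: list all containers, including crashed/exited ones
crictl ps -a

# logs of a crashing container (e.g. kube-scheduler)
crictl logs <container-id>

# kubelet logs on the node
journalctl -u kubelet --no-pager | tail -n 100
```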

@mitar (Contributor, Author) commented Feb 23, 2019

So there is a series of tests inside one CI run, and the cluster is created only once per CI run.

I cannot really jump into the CI run because it is running on GitLab workers. So maybe there is some resource starvation, but it would be useful if that were somehow reported, or if core systems were never removed. I could understand, and more easily debug, my test pods getting killed with "out of resources" by the cluster, but not core pods.

@BenTheElder BenTheElder added kind/bug Categorizes issue or PR as related to a bug. kind/support Categorizes issue or PR as a support question. labels Feb 23, 2019
@BenTheElder (Member)

Definitely agree on the core systems never being removed. Unfortunately there are some Kubernetes limitations there where core workloads can be rescheduled; k8s is designed to have reserved overhead, but we can't fully do that in kind right now.

I spent some time looking at another user's logs with some similar issues but we haven't pinned it down yet.

We need to debug and solve this as much as we can, though, and the tooling for that needs improvement.

@BenTheElder (Member)

This may be related: #303

@mitar (Contributor, Author) commented Feb 24, 2019

there are some Kubernetes limitations there where core workloads can be rescheduled

Rescheduled to where? If this is a one-node cluster, where would it go? :-)

@BenTheElder (Member)

Rescheduled to where? If this is a one-node cluster, where would it go? :-)

Re-created on the same node.

Some of these (like daemonsets) are expected to change, I think...

RE: #303, I think these mounts may fix some of the issues with repeated nesting. The other user seemed to have tracked it down to the host systemd killing things in their environment. I can't replicate this yet (not enough details, and it doesn't happen in my environments so far...).

@mitar (Contributor, Author) commented Feb 24, 2019

Hm, would there be anything in the log if the host kills a container?

@BenTheElder (Member)

You may see a signal handler log in the pod / container logs IIRC; I haven't seen this occur first-hand yet. It would not be normal on a "real" cluster.

Another user with issues: #136 (comment)

I will spend some time later this week looking at how to improve debuggability and bring this up in our subproject meeting tomorrow. If it's related to mounting the host cgroups it may be particularly tricky to identify though... 😬

kind has been remarkably stable in the environments I have access to ... which is of course wildly unhelpful for identifying causes of instability 😞

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 26, 2019
@mitar (Contributor, Author) commented May 26, 2019

I think we can close this for now. It looks relatively stable recently.

@mitar mitar closed this as completed May 26, 2019
@BenTheElder BenTheElder removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 23, 2021
@BenTheElder (Member)

Aside: I've disabled that bot in this repo, and I do hope we've made progress on ensuring stability in more configurations.

stg-0 pushed a commit to stg-0/kind that referenced this issue Oct 27, 2023