
Infrequent failed to init node with kubeadm: exit status 1 on create cluster #928

Closed
howardjohn opened this issue Oct 7, 2019 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor.

Comments

@howardjohn (Contributor)

(Maybe similar to #921; this is basically a copy-paste of that issue with a few words changed. We can consolidate if it turns out to have the same root cause.)

What happened:

We occasionally see issues setting up kind in CI. From a rough grep I think this is impacting roughly 1.5% of our PRs. Note that each PR runs ~20 tests and may be rerun many times due to test failures, new commits, etc., so the actual failure rate is probably more like 0.5%?

What you expected to happen:

Ideally, kind create cluster would be more robust and wouldn't hit these errors. If that is not feasible, it would be nice to have better logging/error messages, and possibly retries? I'm not too sure, as I don't yet understand the root cause.

From the other issue

Logging is substantially more powerful in HEAD, -v 1 or greater will result in a stack trace being logged on failure, along with the command output if the failure's cause is executing a command.

So I should probably try this at HEAD, but I haven't had a chance yet.
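As a rough sketch (not from the thread): trying this at HEAD with the verbosity turned up might look like the following. The go get ... @master install step is an assumption about how a HEAD build is obtained; adjust to however you build kind from source.

```bash
# Sketch: install kind from HEAD, then create a cluster with verbose logging.
# The @master install step is an assumption; build from a source checkout if preferred.
GO111MODULE=on go get sigs.k8s.io/kind@master

# Per the quote above, -v 1 or greater logs a stack trace on failure,
# plus the command output when the failure came from executing a command.
kind create cluster -v 1
```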

How to reproduce it (as minimally and precisely as possible):

I cannot reproduce it but can point to a collection of failures:

We do everything with loglevel=debug and dump the kind logs in Artifacts, so hopefully we have everything there. I didn't really look through the logs much, as I don't know what to look for, but I'm happy to look deeper if I'm pointed in the right direction.
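For reference, a minimal sketch of how that log collection could look in a CI script; the ARTIFACTS variable is an assumption (Prow-style jobs export it), and --loglevel debug matches the flag used by kind 0.5.x:

```bash
# Sketch: create the cluster with debug logging and always export the node logs,
# so a failed run still leaves something to inspect in the CI artifacts.
# ARTIFACTS is assumed to be provided by the CI system; substitute any directory.
kind create cluster --loglevel debug || failed=1
kind export logs "${ARTIFACTS:-./artifacts}/kind"  # collects logs from the kind node(s)
exit "${failed:-0}"
```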

Anything else we need to know?:

I am planning to dig through the logs soon to see if I can root-cause this, but I figured I would open an issue in the meantime.

I've seen some similar issues, but they seemed to be about using btrfs or the wrong node image, neither of which should apply here, I think.

Environment:

kind version: (use kind version): 0.5.1
Kubernetes version: (use kubectl version): Running Kind on GKE 1.13. I think all of these are spinning up 1.15 clusters
Docker version: (use docker info): 18.06.1
OS (e.g. from /etc/os-release): COS (Container-Optimized OS)

howardjohn added the kind/bug label on Oct 7, 2019
@BenTheElder (Member)

I've been running around chasing other issues (also oncall for prow.k8s.io and a bunch of other infra currently...) but I have been looking into this intermittently as a possible issue in Kubernetes CI as well.

I currently suspect that docker exec is flaky, which is ... problematic. We have some possible options but none of them are super great.

A quick (and rather unfortunate) mitigation might be kind create cluster || kind delete cluster && kind create cluster (i.e. retry creation once).
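Since || and && have equal precedence and chain left to right, the literal one-liner above would run the trailing kind create cluster even after a successful first attempt; a grouped or explicit form, sketched below, retries only on failure:

```bash
# Sketch: retry cluster creation once, only if the first attempt fails.
# Grouped one-liner equivalent: kind create cluster || { kind delete cluster && kind create cluster; }
if ! kind create cluster; then
  kind delete cluster
  kind create cluster
fi
```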

@BenTheElder (Member)

Regarding the cgroups errors in https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17400/e2e-bookInfoTests-trustdomain_istio/1420: I think we've stopped encountering these on prow.k8s.io since some upgrades (to the cluster, to docker in the image, tuning the inotify watches, etc.), but I'm not sure which change fixed it. 😞
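As an aside, a hedged sketch of what "tuning the inotify watches" typically means on the CI host; the specific values are illustrative assumptions, not the ones used on prow.k8s.io:

```bash
# Sketch: raise inotify limits on the host (values are illustrative only).
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512

# To persist the settings across reboots:
echo "fs.inotify.max_user_watches=524288" | sudo tee -a /etc/sysctl.conf
echo "fs.inotify.max_user_instances=512" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```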

@aojea (Contributor) commented Oct 14, 2019

This could be related to the issue opened by the release people. I was checking the logs in the jobs that fail to create the cluster, and all of them fail with:

W1013 23:43:03.998] ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --ignore-preflight-errors=all --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: signal: broken pipe

Can this broken pipe be a symptom of:

I currently suspect that docker exec is flaky, which is ... problematic. We have some possible options but none of them are super great.

@BenTheElder (Member)

Yes, as referenced by:

I have been looking into this intermittently as a possible issue in Kubernetes CI as well.

I am debugging this as we speak (OK I took a sec to check messages...)

@BenTheElder (Member)

Most of these seem to be related to etcd timing out rather than the docker exec issue we're seeing in Kubernetes CI.

These failures seem to be in kubeadm / etcd / ... rather than in kind. etcd can definitely hit slow-read issues when the host is out of IOPS, for example.

@BenTheElder (Member)

See also #845, which is an interesting hack. In Kubernetes CI it didn't seem to matter, but we're possibly running bigger SSDs or less-loaded disks?

@BenTheElder (Member)

@aojea after looking through a lot of logs, I'm pretty sure istio is actually not seeing the broken pipe issue in any of these examples (instead, etcd timeouts / kubeadm timeouts seem to be the main issue). Filed #949 to track the pipe issue.

@howardjohn preview of more detailed failure output (expand the logs) when -v is > 0
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/83914/pull-kubernetes-e2e-kind/1183813161560576002

BenTheElder added the lifecycle/active label on Oct 15, 2019
@BenTheElder (Member)

This is really unfortunate, but looking through the logs again, I'm pretty sure this one is etcd timeouts, which we can't do much more about. See the discussion in #845 for trying memory-backed storage.

We're tracking work to ensure this step no longer fails in Kubernetes CI in #949.

@BenTheElder (Member)

Let's revisit this with the enhanced debug tooling in v0.6. So far all signs point to "not a kind bug"; we don't have many options regarding etcd performance here other than the trick in #845, which is already possible. We can maybe work upstream to try to reduce etcd costs, though.
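For completeness, a rough sketch of the #845-style trick of backing etcd's data directory with memory instead of disk; the extraMounts field and paths below are assumptions about the kind node config schema (it varies by kind version), so treat this as illustrative only:

```bash
# Sketch: mount a tmpfs-backed host directory over the control-plane's /var/lib/etcd,
# so etcd I/O no longer competes for host disk IOPS.
# The config schema (apiVersion, extraMounts) is an assumption; check your kind version's docs.
mkdir -p /dev/shm/kind-etcd

cat <<EOF > kind-tmpfs-etcd.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha3
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /dev/shm/kind-etcd
    containerPath: /var/lib/etcd
EOF

kind create cluster --config kind-tmpfs-etcd.yaml
```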
