
Infrequent failed to init node with kubeadm: exit status 1 on create cluster #928

Closed
howardjohn opened this issue Oct 7, 2019 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor.

Comments

@howardjohn (Contributor)

(Maybe similar to #921; this is basically a copy-paste of that issue with a few words changed. We can consolidate if it turns out to have the same root cause.)

What happened:

We occasionally see issues setting up kind in CI. From a rough grep I think this is impacting roughly 1.5% of our PRs. Note that each PR runs ~20 tests and may be rerun many times due to test failures, new commits, etc., so the actual failure rate is probably more like 0.5%?

What you expected to happen:

Ideally, kind create cluster would be more robust and wouldn't hit these errors. If that is not feasible, it would be nice to have better logging/error messages, and possibly retries? I'm not too sure, as I don't yet understand the root cause.

From the other issue

Logging is substantially more powerful in HEAD, -v 1 or greater will result in a stack trace being logged on failure, along with the command output if the failure's cause is executing a command.

So I should probably try this at HEAD, but I haven't had a chance yet.
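As a rough sketch (not from the thread): trying this at HEAD with the verbosity turned up might look like the following. The go get ... @master install step is an assumption about how a HEAD build is obtained; adjust to however you build kind from source.

```bash
# Sketch: install kind from HEAD, then create a cluster with verbose logging.
# The @master install step is an assumption; build from a source checkout if preferred.
GO111MODULE=on go get sigs.k8s.io/kind@master

# Per the quote above, -v 1 or greater logs a stack trace on failure,
# plus the command output when the failure came from executing a command.
kind create cluster -v 1
```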

How to reproduce it (as minimally and precisely as possible):

I cannot reproduce it but can point to a collection of failures:

We do everything with loglevel=debug and dump the kind logs in Artifacts, so hopefully we have everything there. I didn't really look through the logs much, as I don't know what to look for, but I'm happy to look deeper if I'm pointed in the right direction.
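For reference, a minimal sketch of how that log collection could look in a CI script; the ARTIFACTS variable is an assumption (Prow-style jobs export it), and --loglevel debug matches the flag used by kind 0.5.x:

```bash
# Sketch: create the cluster with debug logging and always export the node logs,
# so a failed run still leaves something to inspect in the CI artifacts.
# ARTIFACTS is assumed to be provided by the CI system; substitute any directory.
kind create cluster --loglevel debug || failed=1
kind export logs "${ARTIFACTS:-./artifacts}/kind"  # collects logs from the kind node(s)
exit "${failed:-0}"
```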

Anything else we need to know?:

I am planning to dig through the logs soon to see if I can root-cause this, but I figured I would open an issue in the meantime.

I've seen some similar issues, but they seemed to be about using btrfs or the wrong node image, neither of which should apply here, I think.

Environment:

kind version: (use kind version): 0.5.1
Kubernetes version: (use kubectl version): Running Kind on GKE 1.13. I think all of these are spinning up 1.15 clusters
Docker version: (use docker info): 18.06.1
OS (e.g. from /etc/os-release): COS (Container-Optimized OS)

howardjohn added the kind/bug label on Oct 7, 2019
@BenTheElder (Member)

I've been running around chasing other issues (also oncall for prow.k8s.io and a bunch of other infra currently...) but I have been looking into this intermittently as a possible issue in Kubernetes CI as well.

I currently suspect that docker exec is flaky, which is ... problematic. We have some possible options but none of them are super great.

A quick (and rather unfortunate) mitigation might be kind create cluster || kind delete cluster && kind create cluster (i.e. retry creation once).
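Since || and && have equal precedence and chain left to right, the literal one-liner above would run the trailing kind create cluster even after a successful first attempt; a grouped or explicit form, sketched below, retries only on failure:

```bash
# Sketch: retry cluster creation once, only if the first attempt fails.
# Grouped one-liner equivalent: kind create cluster || { kind delete cluster && kind create cluster; }
if ! kind create cluster; then
  kind delete cluster
  kind create cluster
fi
```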

@BenTheElder (Member)

Regarding the cgroups errors in https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17400/e2e-bookInfoTests-trustdomain_istio/1420: I think we've stopped encountering these on prow.k8s.io since some upgrades (to the cluster, to docker in the image, tuning the inotify watches, etc.), but I'm not sure which change fixed it. 😞
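As an aside, a hedged sketch of what "tuning the inotify watches" typically means on the CI host; the specific values are illustrative assumptions, not the ones used on prow.k8s.io:

```bash
# Sketch: raise inotify limits on the host (values are illustrative only).
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512

# To persist the settings across reboots:
echo "fs.inotify.max_user_watches=524288" | sudo tee -a /etc/sysctl.conf
echo "fs.inotify.max_user_instances=512" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```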

@aojea (Contributor) commented Oct 14, 2019

This could be related to the issue opened by the release people. I was checking the logs in the jobs that fail to create the cluster, and all of them fail with:

W1013 23:43:03.998] ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --ignore-preflight-errors=all --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: signal: broken pipe

Can this broken pipe be a symptom of:

I currently suspect that docker exec is flaky, which is ... problematic. We have some possible options but none of them are super great.

@BenTheElder (Member)

Yes, as referenced by:

I have been looking into this intermittently as a possible issue in Kubernetes CI as well.

I am debugging this as we speak (OK I took a sec to check messages...)

@BenTheElder (Member)

Most of these seem to be related to etcd timing out rather than the docker exec issue we're seeing in Kubernetes CI.

These failures seem to be in kubeadm / etcd / ... rather than in kind. etcd can definitely hit slow-read issues when the host is out of IOPS, for example.

@BenTheElder (Member)

See also #845, which is an interesting hack. In Kubernetes CI it didn't seem to matter, but we're possibly running bigger SSDs or less-loaded disks?

@BenTheElder (Member)

@aojea after looking through a lot of logs, I'm pretty sure istio is actually not seeing the broken pipe issue in any of these examples (instead, etcd timeouts / kubeadm timeouts seem to be the main issue). Filed #949 to track the pipe issue.

@howardjohn preview of more detailed failure output (expand the logs) when -v is > 0
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/83914/pull-kubernetes-e2e-kind/1183813161560576002

BenTheElder added the lifecycle/active label on Oct 15, 2019
@BenTheElder (Member)

This is really unfortunate, but looking through the logs again, I'm pretty sure this one is etcd timeouts, which we can't do much more about. See the discussion in #845 for trying memory-backed storage.

We're tracking work to ensure this step no longer fails in Kubernetes CI in #949.

@BenTheElder (Member)

Let's revisit this with the enhanced debug tooling in v0.6. So far all signs point to "not a kind bug"; we don't have many options regarding etcd performance here other than the trick in #845, which is already possible. We can maybe work upstream to try to reduce etcd costs, though.
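For completeness, a rough sketch of the #845-style trick of backing etcd's data directory with memory instead of disk; the extraMounts field and paths below are assumptions about the kind node config schema (it varies by kind version), so treat this as illustrative only:

```bash
# Sketch: mount a tmpfs-backed host directory over the control-plane's /var/lib/etcd,
# so etcd I/O no longer competes for host disk IOPS.
# The config schema (apiVersion, extraMounts) is an assumption; check your kind version's docs.
mkdir -p /dev/shm/kind-etcd

cat <<EOF > kind-tmpfs-etcd.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha3
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /dev/shm/kind-etcd
    containerPath: /var/lib/etcd
EOF

kind create cluster --config kind-tmpfs-etcd.yaml
```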
