Infrequent "failed to init node with kubeadm: exit status 1" on create cluster (#928)
Comments
I've been running around chasing other issues (also on call for prow.k8s.io and a bunch of other infra currently...), but I have been looking into this intermittently as a possible issue in Kubernetes CI as well. I currently suspect that ... A quick (and rather unfortunate) mitigation might be ...
Regarding the cgroups errors with https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17400/e2e-bookInfoTests-trustdomain_istio/1420: I think we've stopped encountering these on prow.k8s.io since some upgrades (to the cluster, to docker in the image, tuning the inotify watches, etc.), though I'm not sure which change fixed it. 😞
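For reference, the inotify tuning mentioned above usually amounts to a couple of sysctls on the host running the kind nodes; a minimal sketch, with values that are only illustrative:

```bash
# Raise inotify limits on the CI host (standard Linux sysctls, not kind-specific).
# 524288 / 512 are commonly used values, not tuned recommendations.
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512

# Persisting them across reboots (the drop-in path is an assumption about the host setup).
echo 'fs.inotify.max_user_watches=524288' | sudo tee /etc/sysctl.d/99-inotify.conf
echo 'fs.inotify.max_user_instances=512' | sudo tee -a /etc/sysctl.d/99-inotify.conf
```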
This can be related to the issue opened by the release people; I was checking the logs in the jobs that fail to create the cluster, and all of them fail in ...
Can this broken pipe be the symptom of ...
Yes, as referenced by ...
I am debugging this as we speak (OK, I took a sec to check messages...)
Most of these seem to be related to etcd timing out rather than the docker exec issue we're seeing in Kubernetes CI. These failures seem to be in kubeadm / etcd / ... rather than in kind. etcd can definitely get slow read issues when the host is out of IOPS, for example.
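A couple of hedged ways to check the "out of IOPS" theory on a CI node (iostat comes from the sysstat package, and the exported-log directory name below is just an example):

```bash
# Watch disk utilization while clusters are being created;
# sustained %util near 100 suggests the disk is the bottleneck.
iostat -x 5

# etcd logs slow-request warnings containing "took too long";
# grep whatever directory the kind logs were exported to.
grep -ri "took too long" ./kind-logs/ | head
```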
See also #845, which is an interesting hack. In Kubernetes CI it didn't seem to matter, but we're possibly running bigger SSDs or less loaded disks?
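For anyone who wants to experiment with that hack, here is a rough sketch of one way to back etcd's data dir with RAM via extraMounts, which I understand to be the spirit of #845 (the /dev/shm host path and the config apiVersion are assumptions; this trades durability for speed and is not a recommended default):

```bash
# Sketch only: mount a RAM-backed host path over /var/lib/etcd in the control-plane node.
# Assumes /dev/shm is tmpfs on the host and a kind v0.5.x-era config schema.
mkdir -p /dev/shm/kind-etcd

cat > kind-etcd-tmpfs.yaml <<'EOF'
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /dev/shm/kind-etcd
    containerPath: /var/lib/etcd
EOF

kind create cluster --config kind-etcd-tmpfs.yaml
```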
@aojea After looking through a lot of logs, I'm pretty sure Istio is actually not seeing the broken pipe issue in any of these examples (instead, etcd timeouts / kubeadm timeouts seem to be the main issue). Filed #949 to track the pipe issue. @howardjohn: preview of more detailed failure output (expand the logs) when ...
Let's revisit this with the enhanced debug tooling in v0.6. So far all signs point to "not a kind bug"; we don't have many options regarding etcd performance here other than the trick in #845, which is already possible. We can maybe work upstream to try to reduce etcd costs, though.
(Maybe similar to #921 - basically copy-pasting that issue and changing a few words. We can consolidate if this is likely the same root cause.)
What happened:
We occasionally see issues setting up kind in CI. From a rough grep, I think this is impacting roughly 1.5% of our PRs. Note that each PR runs ~20 tests and may be rerun many times due to test failures, new commits, etc., so the actual per-run failure rate is probably more like 0.5%?
What you expected to happen:
Ideally, kind create cluster would be more robust and not hit these errors. If that is not feasible, it would be nice to have some better logging/error messages, and possibly retries? I'm not too sure, as I don't yet understand the root cause.

From the other issue: "so I should probably try this at HEAD, but haven't had a chance yet"
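Since retries came up above, a hedged sketch of what a CI-side retry wrapper could look like until the root cause is understood (the attempt count and sleep are arbitrary; this is a workaround, not a fix):

```bash
#!/usr/bin/env bash
# Retry kind cluster creation a few times before failing the job.
set -euo pipefail

for attempt in 1 2 3; do
  if kind create cluster --wait 5m; then
    echo "cluster created on attempt ${attempt}"
    exit 0
  fi
  echo "kind create cluster failed (attempt ${attempt}); cleaning up and retrying..." >&2
  kind delete cluster || true
  sleep 30
done

echo "failed to create cluster after 3 attempts" >&2
exit 1
```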
How to reproduce it (as minimally and precisely as possible):
I cannot reproduce it but can point to a collection of failures:
We run everything with loglevel=debug and dump the kind logs in Artifacts, so hopefully we have everything there. I didn't really look through the logs much as I don't know what to look for, but I'm happy to look deeper if I am pointed in the right direction.
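For anyone trying to reproduce the same setup, this is roughly what it looks like with kind v0.5.x (--loglevel was the flag name in that release and has since been replaced by -v/--verbosity; $ARTIFACTS is an assumption about the CI environment):

```bash
# Create the cluster with debug logging (kind v0.5.x flag; newer releases use -v).
kind create cluster --loglevel debug

# ... run the tests ...

# Export node, kubelet, and pod logs so they land in the CI artifacts.
kind export logs "${ARTIFACTS:-./kind-logs}"
```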
Anything else we need to know?:
I am planning to dig through the logs soon to see if I can root-cause this, but I figured I would open an issue in the meantime.
I've seen some similar issues, but they seemed to be about using btrfs or the wrong node image, neither of which should apply here, I think.
Environment:
kind version (use kind version): 0.5.1
Kubernetes version (use kubectl version): running kind on GKE 1.13; I think all of these are spinning up 1.15 clusters
Docker version (use docker info): 18.06.1
OS (e.g. from /etc/os-release): COS