
[cAdvisor] cluster creation with v0.8.x and Kubernetes built from source fails on some hosts #1569

Closed
matte21 opened this issue May 6, 2020 · 36 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug), kind/external (upstream bugs), lifecycle/active (Indicates that an issue or PR is actively being worked on by a contributor), priority/critical-urgent (Highest priority. Must be actively worked on as someone's top priority right now)

Comments

@matte21

matte21 commented May 6, 2020

What happened:
I have cloned the Kubernetes repo on my dev machine (Mac) at $(go env GOPATH)/src/k8s.io/kubernetes.
I successfully ran kind build node-image, which picked up the latest Kubernetes master branch commit (0a6c826d3e92dae8f20d6199d0ac7deeca9eed71).
Then I ran kind create cluster --image kindest/node:latest, and got:

Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:latest) 🖼
✓ Preparing nodes 📦
✓ Writing configuration 📜
✗ Starting control-plane 🕹️
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --ignore-preflight-errors=all --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
Command Output: I0506 16:54:03.054571 166 initconfiguration.go:200] loading configuration from "/kind/kubeadm.conf"
[config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta2, Kind=JoinConfiguration
I0506 16:54:03.061537 166 interface.go:400] Looking for default routes with IPv4 addresses
I0506 16:54:03.061671 166 interface.go:405] Default route transits interface "eth0"
I0506 16:54:03.061894 166 interface.go:208] Interface eth0 is up
I0506 16:54:03.062666 166 interface.go:256] Interface "eth0" has 3 addresses :[172.19.0.2/16 fc00:f853:ccd:e793::2/64 fe80::42:acff:fe13:2/64].
I0506 16:54:03.063309 166 interface.go:223] Checking addr 172.19.0.2/16.
I0506 16:54:03.063412 166 interface.go:230] IP found 172.19.0.2
I0506 16:54:03.063484 166 interface.go:262] Found valid IPv4 address 172.19.0.2 for interface "eth0".
I0506 16:54:03.063579 166 interface.go:411] Found active IP 172.19.0.2
W0506 16:54:03.071914 166 configset.go:202] WARNING: kubeadm cannot validate component configs for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[init] Using Kubernetes version: v1.19.0-alpha.3.33+0a6c826d3e92da
[preflight] Running pre-flight checks
I0506 16:54:03.072688 166 checks.go:577] validating Kubernetes and kubeadm version
I0506 16:54:03.072964 166 checks.go:166] validating if the firewall is enabled and active
I0506 16:54:03.082840 166 checks.go:201] validating availability of port 6443
I0506 16:54:03.083497 166 checks.go:201] validating availability of port 10259
I0506 16:54:03.083621 166 checks.go:201] validating availability of port 10257
I0506 16:54:03.083786 166 checks.go:286] validating the existence of file /etc/kubernetes/manifests/kube-apiserver.yaml
I0506 16:54:03.084065 166 checks.go:286] validating the existence of file /etc/kubernetes/manifests/kube-controller-manager.yaml
I0506 16:54:03.084377 166 checks.go:286] validating the existence of file /etc/kubernetes/manifests/kube-scheduler.yaml
I0506 16:54:03.084626 166 checks.go:286] validating the existence of file /etc/kubernetes/manifests/etcd.yaml
I0506 16:54:03.084766 166 checks.go:432] validating if the connectivity type is via proxy or direct
I0506 16:54:03.085139 166 checks.go:471] validating http connectivity to first IP address in the CIDR
I0506 16:54:03.085433 166 checks.go:471] validating http connectivity to first IP address in the CIDR
I0506 16:54:03.085569 166 checks.go:102] validating the container runtime
I0506 16:54:03.087021 166 checks.go:376] validating the presence of executable crictl
I0506 16:54:03.087156 166 checks.go:335] validating the contents of file /proc/sys/net/bridge/bridge-nf-call-iptables
[WARNING FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
I0506 16:54:03.087865 166 checks.go:335] validating the contents of file /proc/sys/net/ipv4/ip_forward
I0506 16:54:03.088065 166 checks.go:649] validating whether swap is enabled or not
[WARNING Swap]: running with swap on is not supported. Please disable swap
I0506 16:54:03.088756 166 checks.go:376] validating the presence of executable conntrack
I0506 16:54:03.089310 166 checks.go:376] validating the presence of executable ip
I0506 16:54:03.089447 166 checks.go:376] validating the presence of executable iptables
I0506 16:54:03.089925 166 checks.go:376] validating the presence of executable mount
I0506 16:54:03.090039 166 checks.go:376] validating the presence of executable nsenter
I0506 16:54:03.090240 166 checks.go:376] validating the presence of executable ebtables
I0506 16:54:03.090429 166 checks.go:376] validating the presence of executable ethtool
I0506 16:54:03.090726 166 checks.go:376] validating the presence of executable socat
I0506 16:54:03.090832 166 checks.go:376] validating the presence of executable tc
I0506 16:54:03.091171 166 checks.go:376] validating the presence of executable touch
I0506 16:54:03.091303 166 checks.go:520] running all checks
I0506 16:54:03.099470 166 checks.go:406] checking whether the given node name is reachable using net.LookupHost
I0506 16:54:03.103053 166 checks.go:618] validating kubelet version
I0506 16:54:03.180399 166 checks.go:128] validating if the "kubelet" service is enabled and active
I0506 16:54:03.191708 166 checks.go:201] validating availability of port 10250
I0506 16:54:03.191805 166 checks.go:201] validating availability of port 2379
I0506 16:54:03.191844 166 checks.go:201] validating availability of port 2380
I0506 16:54:03.191909 166 checks.go:249] validating the existence and emptiness of directory /var/lib/etcd
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
I0506 16:54:03.205613 166 checks.go:839] image exists: k8s.gcr.io/kube-apiserver:v1.19.0-alpha.3.33_0a6c826d3e92da
I0506 16:54:03.216818 166 checks.go:839] image exists: k8s.gcr.io/kube-controller-manager:v1.19.0-alpha.3.33_0a6c826d3e92da
I0506 16:54:03.226920 166 checks.go:839] image exists: k8s.gcr.io/kube-scheduler:v1.19.0-alpha.3.33_0a6c826d3e92da
I0506 16:54:03.236217 166 checks.go:839] image exists: k8s.gcr.io/kube-proxy:v1.19.0-alpha.3.33_0a6c826d3e92da
I0506 16:54:03.246675 166 checks.go:839] image exists: k8s.gcr.io/pause:3.2
I0506 16:54:03.256707 166 checks.go:839] image exists: k8s.gcr.io/etcd:3.4.7-0
I0506 16:54:03.266186 166 checks.go:839] image exists: k8s.gcr.io/coredns:1.6.7
I0506 16:54:03.266250 166 kubelet.go:64] Stopping the kubelet
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[certs] Using certificateDir folder "/etc/kubernetes/pki"
I0506 16:54:03.350199 166 certs.go:103] creating a new certificate authority for ca
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kind-control-plane kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local kind-control-plane localhost] and IPs [10.96.0.1 172.19.0.2 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key
I0506 16:54:04.533981 166 certs.go:103] creating a new certificate authority for front-proxy-ca
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
I0506 16:54:04.814400 166 certs.go:103] creating a new certificate authority for etcd-ca
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [kind-control-plane localhost] and IPs [172.19.0.2 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [kind-control-plane localhost] and IPs [172.19.0.2 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
I0506 16:54:05.662998 166 certs.go:69] creating new public/private key files for signing service account users
[certs] Generating "sa" key and public key
I0506 16:54:06.241602 166 kubeconfig.go:79] creating kubeconfig file for admin.conf
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
I0506 16:54:06.834313 166 kubeconfig.go:79] creating kubeconfig file for kubelet.conf
[kubeconfig] Writing "kubelet.conf" kubeconfig file
I0506 16:54:06.984831 166 kubeconfig.go:79] creating kubeconfig file for controller-manager.conf
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
I0506 16:54:07.340111 166 kubeconfig.go:79] creating kubeconfig file for scheduler.conf
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
I0506 16:54:07.427816 166 manifests.go:91] [control-plane] getting StaticPodSpecs
I0506 16:54:07.428480 166 manifests.go:104] [control-plane] adding volume "ca-certs" for component "kube-apiserver"
I0506 16:54:07.428525 166 manifests.go:104] [control-plane] adding volume "etc-ca-certificates" for component "kube-apiserver"
I0506 16:54:07.428546 166 manifests.go:104] [control-plane] adding volume "k8s-certs" for component "kube-apiserver"
I0506 16:54:07.428562 166 manifests.go:104] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-apiserver"
I0506 16:54:07.428583 166 manifests.go:104] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-apiserver"
I0506 16:54:07.435072 166 manifests.go:121] [control-plane] wrote static Pod manifest for component "kube-apiserver" to "/etc/kubernetes/manifests/kube-apiserver.yaml"
I0506 16:54:07.435127 166 manifests.go:91] [control-plane] getting StaticPodSpecs
[control-plane] Creating static Pod manifest for "kube-controller-manager"
I0506 16:54:07.435495 166 manifests.go:104] [control-plane] adding volume "ca-certs" for component "kube-controller-manager"
I0506 16:54:07.435547 166 manifests.go:104] [control-plane] adding volume "etc-ca-certificates" for component "kube-controller-manager"
I0506 16:54:07.435567 166 manifests.go:104] [control-plane] adding volume "flexvolume-dir" for component "kube-controller-manager"
I0506 16:54:07.435589 166 manifests.go:104] [control-plane] adding volume "k8s-certs" for component "kube-controller-manager"
I0506 16:54:07.435748 166 manifests.go:104] [control-plane] adding volume "kubeconfig" for component "kube-controller-manager"
I0506 16:54:07.435764 166 manifests.go:104] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-controller-manager"
I0506 16:54:07.435788 166 manifests.go:104] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
I0506 16:54:07.436691 166 manifests.go:121] [control-plane] wrote static Pod manifest for component "kube-controller-manager" to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
I0506 16:54:07.436792 166 manifests.go:91] [control-plane] getting StaticPodSpecs
I0506 16:54:07.437037 166 manifests.go:104] [control-plane] adding volume "kubeconfig" for component "kube-scheduler"
I0506 16:54:07.437718 166 manifests.go:121] [control-plane] wrote static Pod manifest for component "kube-scheduler" to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I0506 16:54:07.439832 166 local.go:72] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"
I0506 16:54:07.439886 166 waitcontrolplane.go:87] [wait-control-plane] Waiting for the API server to be healthy
I0506 16:54:07.441481 166 loader.go:375] Config loaded from file: /etc/kubernetes/admin.conf
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I0506 16:54:07.448725 166 round_trippers.go:443] GET https://kind-control-plane:6443/healthz?timeout=10s in 2 milliseconds
... (GET to /healthz many times)
[kubelet-check] Initial timeout of 40s passed.
... (GET to /healthz many times)
I0506 16:58:07.178131 166 round_trippers.go:443] GET https://kind-control-plane:6443/healthz?timeout=10s in 3 milliseconds
couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:114
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:422
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.NewCmdInit.func1
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/init.go:147
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:826
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:914
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:864
k8s.io/kubernetes/cmd/kubeadm/app.Run
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:50
main.main
_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
/usr/local/go/src/runtime/proc.go:203
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357
error execution phase wait-control-plane
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:235
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:422
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.NewCmdInit.func1
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/init.go:147
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:826
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:914
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:864
k8s.io/kubernetes/cmd/kubeadm/app.Run
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:50
main.main
_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
/usr/local/go/src/runtime/proc.go:203
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1357

Unfortunately, an error has occurred:
timed out waiting for the condition

This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.

Here is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:
- 'crictl --runtime-endpoint /run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'crictl --runtime-endpoint /run/containerd/containerd.sock logs CONTAINERID'

So apparently the API server never responds to the GET https://kind-control-plane:6443/healthz?timeout=10s requests.

What you expected to happen:
I expected the cluster to boot successfully.

How to reproduce it (as minimally and precisely as possible):
As explained above.
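In short, the steps are roughly as follows (a sketch; this assumes Go, Docker, and kind v0.8.x are installed, and the paths mentioned above):

mkdir -p "$(go env GOPATH)/src/k8s.io"
cd "$(go env GOPATH)/src/k8s.io"
git clone https://github.com/kubernetes/kubernetes.git
cd kubernetes
git checkout 0a6c826d3e92dae8f20d6199d0ac7deeca9eed71   # the master commit mentioned above

# build a node image from that checkout, then try to create a cluster from it
kind build node-image
kind create cluster --image kindest/node:latest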

Anything else we need to know?:
If I simply run kind create cluster a Kubernetes v1.18.2 cluster gets created successfully.

Following are the logs from the node container when running kind create cluster --image kindest/node:latest. Notice that there are some "Failed to..." messages; they show up even in the successful kind create cluster case.

INFO: ensuring we can execute /bin/mount even with userns-remap
INFO: remounting /sys read-only
INFO: making mounts shared
INFO: fix cgroup mounts for all subsystems
INFO: clearing and regenerating /etc/machine-id
Initializing machine ID from random generator.
INFO: faking /sys/class/dmi/id/product_name to be "kind"
INFO: faking /sys/class/dmi/id/product_uuid to be random
INFO: faking /sys/devices/virtual/dmi/id/product_uuid as well
INFO: setting iptables to detected mode: legacy
INFO: Detected IPv4 address: 172.19.0.2
INFO: Detected IPv6 address: fc00:f853:ccd:e793::2
Failed to find module 'autofs4'
systemd 242 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.
Failed to create symlink /sys/fs/cgroup/net_cls: File exists
Failed to create symlink /sys/fs/cgroup/net_prio: File exists
Failed to create symlink /sys/fs/cgroup/cpuacct: File exists
Failed to create symlink /sys/fs/cgroup/cpu: File exists

Welcome to Ubuntu 19.10!

Set hostname to .
Failed to bump fs.file-max, ignoring: Invalid argument
Configuration file /kind/systemd/kubelet.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Configuration file /etc/systemd/system/kubelet.service.d/10-kubeadm.conf is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
[UNSUPP] Starting of Arbitrary Exec…Automount Point not supported.
[ OK ] Listening on Journal Socket.
[ OK ] Listening on Journal Socket (/dev/log).
Starting Create list of re…odes for the current kernel...
[ OK ] Reached target Slices.
Mounting Huge Pages File System...
[ OK ] Started Dispatch Password …ts to Console Directory Watch.
[ OK ] Reached target Paths.
[ OK ] Reached target Local Encrypted Volumes.
[ OK ] Listening on Journal Audit Socket.
Starting Journal Service...
[ OK ] Reached target Sockets.
Mounting FUSE Control File System...
Mounting Kernel Debug File System...
[ OK ] Reached target Swap.
Starting Remount Root and Kernel File Systems...
Starting Apply Kernel Variables...
[ OK ] Started Create list of req… nodes for the current kernel.
[ OK ] Mounted Huge Pages File System.
[ OK ] Mounted FUSE Control File System.
[ OK ] Mounted Kernel Debug File System.
[ OK ] Started Remount Root and Kernel File Systems.
Starting Create System Users...
Starting Update UTMP about System Boot/Shutdown...
[ OK ] Started Apply Kernel Variables.
[ OK ] Started Update UTMP about System Boot/Shutdown.
[ OK ] Started Create System Users.
Starting Create Static Device Nodes in /dev...
[ OK ] Started Create Static Device Nodes in /dev.
[ OK ] Reached target Local File Systems (Pre).
[ OK ] Reached target Local File Systems.
[ OK ] Started Journal Service.
[ OK ] Reached target System Initialization.
[ OK ] Reached target Basic System.
Starting containerd container runtime...
[ OK ] Started kubelet: The Kubernetes Node Agent.
[ OK ] Started Daily Cleanup of Temporary Directories.
[ OK ] Reached target Timers.
Starting Flush Journal to Persistent Storage...
[ OK ] Started containerd container runtime.
[ OK ] Reached target Multi-User System.
[ OK ] Reached target Graphical Interface.
Starting Update UTMP about System Runlevel Changes...
[ OK ] Started Flush Journal to Persistent Storage.
[ OK ] Started Update UTMP about System Runlevel Changes.

Environment:

  • kind version: (use kind version): both v0.8.0 and v0.8.1
  • Kubernetes version: (use kubectl version): kubectl is v1.18.0; Kubernetes is at commit 0a6c826d3e92dae8f20d6199d0ac7deeca9eed71 from master (latest commit at the time of this writing)
  • Docker version: (use docker info): 19.03.8
  • OS (e.g. from /etc/os-release): Mac OS X 10.14.6
@matte21 matte21 added the kind/bug Categorizes issue or PR as related to a bug. label May 6, 2020
@BenTheElder
Member

/assign
/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label May 6, 2020
@BenTheElder
Member

Setting GOFLAGS= (empty string) may fix this; I suspect it's the upstream providerless build regressing.
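Roughly what I mean (just a guess at a workaround, not verified; it assumes the empty GOFLAGS env var overrides whatever would otherwise be passed to the Kubernetes build):

cd "$(go env GOPATH)/src/k8s.io/kubernetes"
GOFLAGS="" kind build node-image          # rebuild the node image with no extra build tags
kind create cluster --image kindest/node:latest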

@BenTheElder
Member

/lifecycle active
investigating

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label May 6, 2020
@BenTheElder
Member

I have this reproduced at least; still determining the root cause. A side effect of the root cause is clearly that the API server never comes up, which breaks everything else.

@BenTheElder
Member

does not appear to be the GOFLAGS / providerless build.

@BenTheElder
Member

Seems to work at v1.19.0-alpha.0 but not v1.19.0-alpha.1, at least on this Mac.
I forget when we tagged alpha.1 / alpha.0, but at least we know it's fairly early in 1.19's development.

@BenTheElder
Member

started a proper bisect.
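Roughly this (a sketch of the kind of bisect I mean, run from the k/k checkout; the bounds and timeout are approximate):

cd "$(go env GOPATH)/src/k8s.io/kubernetes"
git bisect start v1.19.0-alpha.1 v1.19.0-alpha.0    # bad, then good
git bisect run sh -c '
  kind build node-image || exit 125                 # 125 = skip commits that fail to build
  kind delete cluster >/dev/null 2>&1 || true
  kind create cluster --image kindest/node:latest --wait 300s
  status=$?
  kind delete cluster >/dev/null 2>&1 || true
  exit $status
'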

@BenTheElder
Member

on a linux environment that builds faster but has similar issues:
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[4274ea2c89dee24e4c188a71e8164b2a40d1e181] Update cadvisor and containerd

@liggitt
Contributor

liggitt commented May 6, 2020

any hints in the kubelet logs?

@BenTheElder
Member

grabbing them now

@BenTheElder
Member

BenTheElder commented May 6, 2020

on gLinux workstation:
kubelet.log
logs.zip
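(For anyone following along, these were gathered roughly like so; --retain keeps the node container around after the failed create so the logs can still be exported:)

kind create cluster --image kindest/node:latest --retain
kind export logs ./logs     # per-node directories include kubelet.log, containerd.log, etc.
kind delete cluster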

@BenTheElder
Member

ahhh I see a familiar log line, this is probably https://kubernetes.slack.com/archives/CEKK1KTN2/p1586971733013400?thread_ts=1586856163.478400&cid=CEKK1KTN2

... was overloaded and forgot about this :(

Perhaps we're now detecting available CPU differently, which would make sense given a cAdvisor upgrade.

... great

@BenTheElder
Member

@dims helped me hunt down kubernetes/kubernetes#89859

@BenTheElder BenTheElder added the kind/external upstream bugs label May 7, 2020
@BenTheElder
Member

we're going to need a cAdvisor upgrade in k/k, which is blocked on klog => klog v2 working out between the various repos ...

I'm not sure if there's a good workaround in kind; possibly specifying the resources manually.

@matte21
Author

matte21 commented May 7, 2020

we're going to need a cAdvisor upgrade in k/k

What's the ETA?

which is blocked on klog => klog v2 working out between the various repos ...

I thought Go modules allowed having multiple versions of the same dependency/module in the same build.
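(For instance, I'd expect a rough check like this, run in the k/k checkout, to show which klog module paths end up in the build; k8s.io/klog and k8s.io/klog/v2 are distinct module paths, so in principle both can coexist, while two different versions of the same path can't:)

cd "$(go env GOPATH)/src/k8s.io/kubernetes"
grep 'k8s.io/klog' go.mod               # direct requirements
grep 'k8s.io/klog' vendor/modules.txt   # everything that actually ends up vendored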

@BenTheElder
Member

There's no ETA currently, but I'll be pushing to get this sorted before 1.19 is released, at least.
I've started discussion with @dims, and I own some of this (k/k deps).

I thought go modules allowed having multiple versions of the same dependency/module in the same module.

I'm not actually clear on what went wrong here yet, just on what I've been told about the state of things upstream at a high level.

@BenTheElder
Member

This bug isn't kind specific and definitely needs to get fixed, but it's difficult to give an ETA currently.

I think the problem is that cAdvisor is pulled into multiple downstream deps that are NOT module enabled, so we may have to upgrade them too; I think it also switched to the v2 API, which they may not be prepared for, or something like that (btw, Kubernetes builds with modules disabled for ... reasons).

@zanetworker

Not the first time I've seen cAdvisor causing issues: https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/

@matte21 matte21 changed the title cluster creation with v0.8.x and Kubernetes built from source fails on MAC OS X cluster creation with v0.8.x and Kubernetes built from source fails May 8, 2020
@BenTheElder BenTheElder changed the title cluster creation with v0.8.x and Kubernetes built from source fails [cAdvisor] cluster creation with v0.8.x and Kubernetes built from source fails on some hosts May 8, 2020
@BenTheElder
Member

update: there are some PRs in flight regarding klog.

Rollback doesn't seem to be an option; it's in too many repos and they will want to roll forward.

I'm going to try to devote some more time to helping get these in soon.

@BenTheElder
Member

xref: kubernetes/kubernetes#90183

@BenTheElder
Member

Kubernetes is on klog v2 now, haven't checked if we managed to get cAdvisor updated enough yet though.

@dims
Member

dims commented May 18, 2020

@BenTheElder yes it did ! google/cadvisor@8af10c6...6a8d614

@BenTheElder
Member

thanks dims. validating the fix today 👍

@matte21
Author

matte21 commented May 20, 2020

I have just successfully created a cluster from the latest Kubernetes master branch. Thanks!

@matte21 matte21 closed this as completed May 20, 2020
@BenTheElder
Member

thanks for confirming! FYI @howardjohn

@BenTheElder
Member

I still see this:

Jun 01 16:56:11 kind-control-plane kubelet[309]: W0601 16:56:11.089995 309 predicate.go:111] Failed to admit pod kube-apiserver-kind-control-plane_kube-system(d37105821a9f5fbb1d0c5457e4ceea7c) - Unexpected error while attempting to recover from admission failure: preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources: [(res: cpu, q: 250), ]

this is with kind v0.8.1 and kubernetes v1.19.0-beta.0.320+3fc7831cd8a704
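(For reference, that line shows up in the kubelet journal on the node; roughly:)

docker exec kind-control-plane journalctl -u kubelet --no-pager | grep -i 'Failed to admit pod'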

@BenTheElder BenTheElder reopened this Jun 1, 2020
@dghubble

dghubble commented Jun 5, 2020

Same issue in a different context, if it helps narrow this down: kubernetes/kubernetes#91795

@BenTheElder
Member

BenTheElder commented Jun 5, 2020

kubernetes/kubernetes#89859 is tracking the 0 CPUs being reported, though it seems there's more than one bug in the new NUMA detection in cAdvisor.

There's a patch out right now that I need to vet in our environment; perhaps it also fixes it in your context?

I've also been tracking down a containerd regression ..

@BenTheElder
Member

patch: google/cadvisor#2567

@BenTheElder
Member

Progress on the cAdvisor patch, but it's now counting disabled cores (HT disabled) in num_cores (which Kubernetes uses), so .. not quite there yet.

Once that's done it still has to get pulled into Kubernetes.
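A rough way to sanity-check this once there's a build to test (assumes a running kind cluster named "kind" and kubectl pointed at it): compare the cores the host kernel sees with what the node advertises.

nproc --all                                                                       # cores the host kernel sees
kubectl get node kind-control-plane -o jsonpath='{.status.capacity.cpu}{"\n"}'    # what the kubelet reports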

@dims dims removed their assignment Jun 8, 2020
@BenTheElder
Member

I've validated a working fix in cAdvisor in this follow-up: google/cadvisor#2579

It's not merged yet, and then we'll still need to pull it into Kubernetes.

@BenTheElder
Member

Discussed the situation we're in managing this dependency with the k8s code-organization project today.

I'm not sure we have a clear solution yet, but we at least have a pair of not-so-great options:

  • rollback cAdvisor to the last known good version and shim klog v1 => v2
  • roll forward and figure out how to deal with the mid-release runc and libcontainer changes, in addition to requiring a grpc + etcd upgrade ..

@BenTheElder
Member

pending kubernetes/kubernetes#91366, after further discussion with SIG node recently.

@BenTheElder
Member

kubernetes/kubernetes#91366 is poised to merge with the fix, likely within the next day (it's in the queue).

@BenTheElder
Member

kubernetes/kubernetes#91366 merged 6 hours ago.

@BenTheElder
Member

confirmed that this is fixed
