Cannot create cluster due to 'docker exec --privileged kind-control-plane cat /kind/version' failing #2156

Closed
nuno-silva opened this issue Mar 24, 2021 · 10 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
triage/needs-information: Indicates an issue needs more information in order to work on it.

Comments

@nuno-silva

nuno-silva commented Mar 24, 2021

What happened:

kind create cluster fails to run:

$ kind create cluster --retain
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.20.2) 🖼
 ✓ Preparing nodes 📦
 ✗ Writing configuration 📜
ERROR: failed to create cluster: failed to generate kubeadm config content: failed to get kubernetes version from node: failed to get file: command "docker exec --privileged kind-control-plane cat /kind/version" failed with error: exit status 1
Command Output: Error response from daemon: Container 9dec1f8399d60f2d41456f9f711484a156d1a0fa8af731a7673a729169344e4e is not running
$ kind export logs
ERROR: [command "docker exec --privileged kind-control-plane sh -c 'tar --hard-dereference -C /var/log/ -chf - . || (r=$?; [ $r -eq 1 ] || exit $r)'" failed with error: exit status 1, [command "docker exec --privileged kind-control-plane journalctl --no-pager" failed with error: exit status 1, command "docker exec --privileged kind-control-plane journalctl --no-pager -u kubelet.service" failed with error: exit status 1, command "docker exec --privileged kind-control-plane journalctl --no-pager -u containerd.service" failed with error: exit status 1, command "docker exec --privileged kind-control-plane cat /kind/version" failed with error: exit status 1]]

What you expected to happen:

kind create cluster should create a cluster.

How to reproduce it (as minimally and precisely as possible):

  1. Install kind: go get sigs.k8s.io/kind
  2. Add kind to PATH: export PATH=$PATH:$(go env GOPATH)/bin
  3. Try to create a cluster: kind create cluster
  4. See error
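Condensed into one shell session (this assumes Go and Docker are already installed and working):

$ go get sigs.k8s.io/kind                  # installs the kind binary into $(go env GOPATH)/bin
$ export PATH=$PATH:$(go env GOPATH)/bin
$ kind create cluster                      # fails with the error shown above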

Anything else we need to know?:

Same symptoms as #1288 (?)

Environment:

  • kind version: (use kind version): kind v0.10.0 go1.16.2 linux/amd64
  • Kubernetes version: (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"archive", BuildDate:"2021-03-05T18:26:23Z", GoVersion:"go1.15.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.12-eks-31566f", GitCommit:"31566f851673e809d4d667b7235ed87587d37722", GitTreeState:"clean", BuildDate:"2020-10-20T23:25:14Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version: (use docker info):
Client:
 Debug Mode: false

Server:
 Containers: 29
  Running: 0
  Paused: 0
  Stopped: 29
 Images: 6
 Server Version: 19.03.15
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ea765ab
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683b971d9c3ef73f284f176672c44b448662
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.10.20
 Operating System: Gentoo/Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 15.3GiB
 Name: t14
 ID: TJGI:VYKL:BSJM:3BAY:3EA4:NJH5:3FZH:V2GV:PDCO:2ZGB:A4RZ:TG6P
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 25
  Goroutines: 41
  System Time: 2021-03-24T12:05:46.565719027Z
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
  • OS (e.g. from /etc/os-release):
NAME=Gentoo
ID=gentoo
PRETTY_NAME="Gentoo/Linux"
ANSI_COLOR="1;32"
HOME_URL="https://www.gentoo.org/"
SUPPORT_URL="https://www.gentoo.org/support/"
BUG_REPORT_URL="https://bugs.gentoo.org/"
  • Gentoo-specific stuff
$ emerge --info docker | grep -E "USE=|built" | tail -n2
app-emulation/docker-19.03.15::gentoo was built with the following:
USE="aufs container-init overlay seccomp -apparmor -btrfs -device-mapper -hardened (-selinux)" ABI_X86="(64)"

dockerd is running as root. I'm running kind as a normal user who is in the docker group.
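(A quick sanity check of that group membership, nothing kind-specific:)

$ id -nG | tr ' ' '\n' | grep -x docker    # prints "docker" only if the current user is in the group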

nuno-silva added the kind/bug label Mar 24, 2021
@BenTheElder
Member

BenTheElder commented Mar 24, 2021

We need docker logs kind-control-plane and docker inspect kind-control-plane.
EDIT: since we didn't get them from kind export logs

Also yes, it is the same symptom, but the symptom is just "the first time we try to interact with the node container, we find out it isn't running", which means the container crashed due to something on the host. It's not related to that issue.
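Something like this should capture both to files you can attach (with --retain, the container should still exist after the failure):

$ docker logs kind-control-plane > kind-control-plane.log 2>&1
$ docker inspect kind-control-plane > kind-control-plane-inspect.json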

BenTheElder added the triage/needs-information label Mar 25, 2021
@BenTheElder
Member

This reminds me of #2112 (comment), but there we only know it's related to the kernel; we don't know the root cause yet. Here we also need more information (I will update the issue template for future issues).

@deorder

deorder commented Mar 28, 2021

Same here. Also Gentoo. Could it be because we are not running systemd, but OpenRC?

Output from docker logs kind-control-plane:

INFO: ensuring we can execute mount/umount even with userns-remap
INFO: remounting /sys read-only
INFO: making mounts shared
INFO: detected cgroup v1
INFO: fix cgroup mounts for all subsystems
INFO: ensuring we can execute mount/umount even with userns-remap
INFO: remounting /sys read-only
INFO: making mounts shared
INFO: detected cgroup v1
INFO: fix cgroup mounts for all subsystems

My docker info:

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 20
  Running: 0
  Paused: 0
  Stopped: 20
 Images: 15
 Server Version: 20.10.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: fec3683b971d9c3ef73f284f176672c44b448662 (expected: de40ad007797e0dcd8b7126f27bb87401d224240)
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.11.8-gentoo-x86_64
 Operating System: Gentoo/Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 15.53GiB
 Name: test-laptop
 ID: YIJJ:27XV:TKIL:AJ3P:UOZ6:GL6V:6JBY:WEX7:EKSO:2FT2:WIZM:QYWQ
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Output from docker inspect kind-control-plane:
inspect.txt

@BenTheElder
Member

Please note this part:

Also yes, it is the same symptom, but the symptom is just "the first time we try to interact with the node container, we find out it isn't running", which means the container crashed due to something on the host. It's not related to that issue.

When you see this step fail (docker exec --privileged kind-control-plane cat /kind/version), it can be any number of issues; it is not a single bug. It means the node container exited early, because this is the first action we take against the container after starting it.
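A quick way to confirm that state is to query the container directly (a sketch using docker inspect's Go-template output, run after a failed create with --retain):

$ docker inspect -f 'status={{.State.Status}} exitCode={{.State.ExitCode}}' kind-control-plane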

Same here. Also Gentoo. Could it be because we are not running systemd, but OpenRC?

Yes, please try kind @ HEAD, particularly after: d777456.

systemd is basically required in the container space now (particularly for cgroups v2), but we're doing our best to only require it within the nodes. Nobody is running Kubernetes / containerd / ... CI that isn't systemd-based, though.

It should work without systemd after that patch; I've had one user report success on a different distro.

To test:

$ git clone https://github.com/kubernetes-sigs/kind
$ cd kind
$ make build
$ bin/kind ...

(bin/kind will contain a freshly built binary.)

@BenTheElder
Member

And thanks for the tip! I couldn't tell from the existing report that this was not on systemd. In that case it is most likely already fixed; we have a release coming in the next few weeks.

@BenTheElder BenTheElder self-assigned this Mar 28, 2021
@deorder

deorder commented Mar 28, 2021

And thanks for the tip! I couldn't tell from the existing report that this was not on systemd. In that case it is most likely already fixed; we have a release coming in the next few weeks.

Thank you. That worked! The timing of the fix could not have been better ;)

I will consider moving all my Gentoo machines to systemd, but I am sad to hear that it is becoming the only option. All my machines at home use OpenRC with either Gentoo or Alpine. Thank you for still supporting installations with non-systemd init.

@BenTheElder
Member

Thank you. That worked! The timing of the fix could not have been better ;)

Excellent, glad to hear it. We'll be shipping a tagged release soon, tentatively in line with Kubernetes 1.21: https://www.kubernetes.dev/resources/release/

I will consider moving all my Gentoo machines to systemd, but I am sad to hear that it is becoming the only option. All my machines at home use OpenRC with either Gentoo or Alpine. Thank you for still supporting installations with non-systemd init.

You really shouldn't have to move your development machines, but on the other hand I wouldn't be too surprised to discover more places depending on this sort of thing unintentionally in the future.

It's possible it will continue to receive support from other tools, but see e.g. the discussion here:
kubernetes/kubeadm#2376
#1726 (comment)
https://github.com/opencontainers/runc/blob/master/docs/cgroup-v2.md#systemd

For the most part we should be able to run systemd inside docker on a host with or without systemd, but I can't speak for the other projects; my observation is that the ecosystem has been trending in the direction of that last link for some time now.
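(As an aside, checking which cgroup version a host is on doesn't require systemd; for example:)

$ stat -fc %T /sys/fs/cgroup/    # "cgroup2fs" means cgroup v2, "tmpfs" means cgroup v1

Newer docker also reports this directly as "Cgroup Version" in docker info, as in the 20.10.5 output above.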

@BenTheElder
Member

Could it be because we are not running systemd, but OpenRC?

When you say "we", does that also refer to the original report by @nuno-silva?

@deorder

deorder commented Mar 28, 2021

When you say "we", does that also refer to the original report by @nuno-silva?

I cannot see whether @nuno-silva is using systemd or OpenRC. The USE flags do not show it (some packages have USE flags to turn systemd support on or off, but docker does not). You may have to wait for a response from @nuno-silva. I can tell you that the issue I ran into is nearly identical: same errors and log output, except for the kernel version, the machine name, and the things that usually differ per installation.
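(For what it's worth, a distro-agnostic way to check which init a machine runs is to look at PID 1:)

$ ps -p 1 -o comm=    # prints "systemd" on systemd hosts; typically "init" under OpenRC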

@nuno-silva
Author

Hey everyone! Thanks for looking into this and sorry for the delay.
I'm indeed using OpenRC.

Here are the missing logs (they seem to be just like @deorder's except for SecurityOpt, CgroupnsMode and Capabilities, so you can skip them... see below):

kind create cluster --retain:

$ kind create cluster --retain
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.20.2) 🖼
 ✓ Preparing nodes 📦
 ✗ Writing configuration 📜
ERROR: failed to create cluster: failed to generate kubeadm config content: failed to get kubernetes version from node: failed to get file: command "docker exec --privileged kind-control-plane cat /kind/version" failed with error: exit status 1
Command Output: Error response from daemon: Container 19bb0e3e1f33a874d9c10689f891c1a42f6d0bf9c8124b30083d20789a6d165d is not running

docker logs kind-control-plane:

$ docker logs kind-control-plane
INFO: ensuring we can execute mount/umount even with userns-remap
INFO: remounting /sys read-only
INFO: making mounts shared
INFO: detected cgroup v1
INFO: fix cgroup mounts for all subsystems
INFO: ensuring we can execute mount/umount even with userns-remap
INFO: remounting /sys read-only
INFO: making mounts shared
INFO: detected cgroup v1
INFO: fix cgroup mounts for all subsystems

Note that since the original report I've upgraded sys-kernel/gentoo-kernel-bin from 5.10.20 to 5.10.26. It still does the same thing.


As per @BenTheElder's #2156 (comment), I tried kind as of 8fe8b96 and it works!

$ git reflog | cat
8fe8b96 HEAD@{0}: clone: from https://github.com/kubernetes-sigs/kind

$ bin/kind create cluster --retain
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.20.2) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Have a nice day! 👋

So it seems to be related to not using systemd, and it is now fixed. Thank you both for the help, and a special thanks to the kind team for still supporting installations with non-systemd init!
