
kubeadm on AWS cloud provider deadlocks because of insufficient permissions #330

Closed
namliz opened this issue Jul 1, 2017 · 14 comments

Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Milestone
v1.7

Comments


namliz commented Jul 1, 2017

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version: &version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T22:55:19Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T23:15:59Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: aws
  • OS (e.g. from /etc/os-release): CentOS 7
  • Kernel (e.g. uname -a): Linux ip-172-20-0-134.us-west-2.compute.internal 3.10.0-514.21.2.el7.x86_64 #1 SMP Tue Jun 20 12:24:47 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

What happened?

[root@ip-172-20-0-134 centos]# kubeadm init --config=/etc/kubernetes/kubeadm.conf
[kubeadm] WARNING: kubeadm is in beta, please do not use it for production clusters.
[init] Using Kubernetes version: v1.7.0
[init] Using Authorization modes: [Node RBAC]
[init] WARNING: For cloudprovider integrations to work --cloud-provider must be set for all kubelets in the cluster.
	(/etc/systemd/system/kubelet.service.d/10-kubeadm.conf should be edited for this purpose)
[preflight] Running pre-flight checks
[certificates] Generated CA certificate and key.
[certificates] Generated API server certificate and key.
[certificates] API Server serving cert is signed for DNS names [ip-172-20-0-134 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 172.20.0.134]
[certificates] Generated API server kubelet client certificate and key.
[certificates] Generated service account token signing key and public key.
[certificates] Generated front-proxy CA certificate and key.
[certificates] Generated front-proxy client certificate and key.
[certificates] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/admin.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/controller-manager.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/scheduler.conf"
[apiclient] Created API client, waiting for the control plane to become ready
[apiclient] All control plane components are healthy after 78.500853 seconds

...hangs forever
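As an aside, the [init] WARNING above points at the kubelet systemd drop-in. One way to set --cloud-provider there might look like the following sketch; that 10-kubeadm.conf passes $KUBELET_EXTRA_ARGS through to the kubelet is an assumption about how the package lays things out:

# Hypothetical drop-in appending --cloud-provider=aws via KUBELET_EXTRA_ARGS.
cat <<'EOF' | sudo tee /etc/systemd/system/kubelet.service.d/20-cloud-provider.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--cloud-provider=aws"
EOF
sudo systemctl daemon-reload && sudo systemctl restart kubelet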

What you expected to happen?

It's supposed to continue with:

[apiclient] Waiting for at least one node to register
[apiclient] First node has registered after 3.002484 seconds
[token] Using token: cncfci.geneisbatman4242
[apiconfig] Created RBAC rules
[addons] Created essential addon: kube-proxy
[addons] Created essential addon: kube-dns

Your Kubernetes master has initialized successfully!

To start using your cluster, you need to run (as a regular user):

  sudo cp /etc/kubernetes/admin.conf $HOME/
  sudo chown $(id -u):$(id -g) $HOME/admin.conf
  export KUBECONFIG=$HOME/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  http://kubernetes.io/docs/admin/addons/

You can now join any number of machines by running the following on each node
as root:

  kubeadm join --token cncfci.geneisbatman4242 172.20.0.134:6443

How to reproduce it (as minimally and precisely as possible)?

  • stop kubelet
  • kubeadm reset
  • yum remove kubeadm
  • yum install kubeadm-1.6.6
  • kubeadm init --config=/etc/kubernetes/kubeadm.conf

Downgrading kubeadm to 1.6.6 is the only change, and this time the init succeeds (that run is where the expected output above comes from).

Anything else we need to know?

This seems to be a regression/repeat of kubernetes/kubernetes#43815, which is a bummer.

Kubernetes v1.7.0 was released two days ago; does kubeadm v1.7.0 get built and released in tandem automatically or something? Does nobody test it first?

People who didn't pin to kubeadm v1.6.6 will suddenly get a bunch of broken clusters this week.

The main error to look out for is once again:
Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

This reproducibly happens only on v1.7.0.

I really hope we do better and catch such things in time for v1.8.0. I am working on an automated test suite for this; if anybody is interested in that, please ping me.

@namliz namliz changed the title to "kubeadm/kubernetes v1.7 regression - kubeadm init hangs forever" Jul 1, 2017

scholzj commented Jul 3, 2017

I have the same problem, also with CentOS 7 on AWS, and in my case 1.6.6 also works fine.

In my case the kubelet does not seem to be allowed to register itself with the credentials created by kubeadm (/etc/kubernetes/kubelet.conf):

Jul 03 22:07:41 ip-10-0-0-33 kubelet[12427]: E0703 22:07:41.154368   12427 kubelet_node_status.go:106] Unable to register node "ip-10-0-0-33.eu-central-1.compute.internal" with API server: nodes "ip-10-0-0-33.eu-central-1.compute.internal" is forbidden: node ip-10-0-0-33 cannot modify node ip-10-0-0-33.eu-central-1.compute.internal

However, even when I reconfigure the kubelet so it can register (for example, by giving it the /etc/kubernetes/admin.conf credentials), or even get the node to the Ready state by installing the network plugin, kubeadm still seems to be stuck at the same point and doesn't move forward. So I'm not sure what kubeadm is actually waiting for.

I have no clue which of these is the actual cause and which is the consequence.


luxas commented Jul 4, 2017

This seems to be a regression/repeat of kubernetes/kubernetes#43815, which is a bummer.

No, it's absolutely not that issue.

Kubernetes v1.7.0 got released two days ago and, well, does kubeadm v1.7.0 get built and released in tandem automatically or something? Nobody tests it first?

We absolutely test things a lot. We have automated CI e2e tests that are green: https://k8s-testgrid.appspot.com/sig-cluster-lifecycle#kubeadm-gce-1.7
However, as with everything, it's hard to test in exactly your environment.

I think the cloud provider is automatically detected by the kubelet (see the --cloud-provider flag description for the kubelet). That means the kubelet uses custom AWS logic when creating the Node object, and it sets the node name based on calls to the AWS API; hence Node Name == ip-10-0-0-33.eu-central-1.compute.internal. See the sketch just below.
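To see which name the AWS side will report, you can query the EC2 instance metadata service directly. A minimal sketch; that the provider derives the node name from the local-hostname field is my assumption:

# Ask the EC2 metadata service for the instance's private DNS name.
curl -s http://169.254.169.254/latest/meta-data/local-hostname
# prints e.g. ip-10-0-0-33.eu-central-1.compute.internal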

I want to remind you that cloud provider integrations are experimental: kubeadm will only fully support the out-of-tree cloud providers. The current in-tree providers may well work fine, but we can't promise that they will.

This is such a case. The kubelet talks to the AWS API and gets a node name that differs from the hostname (ip-10-0-0-33.eu-central-1.compute.internal vs ip-10-0-0-33). The kubelet has CN=system:node:ip-10-0-0-33, but needs CN=system:node:ip-10-0-0-33.eu-central-1.compute.internal in order to be able to modify itself. The Node Authorizer, the latest security feature enabled in kubeadm, therefore sees the Node API object the cloud provider has created and the kubelet actually running on that node as two different identities.
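You can verify the identity mismatch by inspecting the kubelet's client certificate. A sketch, assuming kubeadm embedded the certificate data in /etc/kubernetes/kubelet.conf (as v1.7 does):

# Extract the embedded client certificate and print its subject.
grep 'client-certificate-data' /etc/kubernetes/kubelet.conf \
  | awk '{print $2}' | base64 -d \
  | openssl x509 -noout -subject
# expected output resembles: subject= /O=system:nodes/CN=system:node:ip-10-0-0-33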

So what can we do? As you can see, the flow when bootstrapping a cluster with a cloud provider differs from the normal flow, and even varies a lot between providers (AWS is the only provider that does this, AFAIK).
I think fixing #64 will fix this issue, but you would still have to know the node name beforehand and pass it to kubeadm init.

Do you want to contribute a fix for #64?

I really hope we do better and catch such things in time for v1.8.0 and am working on an automated test suite for this, if anybody is interested in that please ping me.

I have no AWS credits, so if you want to contribute AWS test results, that would be great. Thanks 👍

Thanks @Zilman and @scholzj for the bug report, we appreciate it.
Please bear in mind that this is a very specific edge case that only affects AWS, due to how the cloud provider code there works. It has nothing to do with what happened with v1.6.
And it is much better to secure things more rather than less, so I stand by enabling the Node Authorizer.

@luxas luxas added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jul 5, 2017
@luxas luxas added this to the v1.7 milestone Jul 5, 2017
@luxas luxas changed the title from "kubeadm/kubernetes v1.7 regression - kubeadm init hangs forever" to "kubeadm on AWS cloud provider deadlocks because of insufficient permissions" Jul 5, 2017

luxas commented Jul 6, 2017

@GheRivero volunteered to work on this 🎉 (i.e. on #64, which will solve this problem).
I can't assign him, so I'm commenting instead...


luxas commented Jul 6, 2017

Status report:

First of two PRs is up: kubernetes/kubernetes#48538

I expect this to be fixed in v1.7.1, so that you can specify the node name yourself with --node-name.
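Once that lands, the invocation might look like the following. This is a sketch; that the right value is the metadata service's local-hostname is an assumption based on the mismatch described above:

# Feed kubeadm the node name the AWS cloud provider will use.
NODE_NAME="$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)"
kubeadm init --node-name "${NODE_NAME}"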

@GheRivero

I can confirm the same behavior with Ubuntu 16.04:
"Failed creating a mirror pod for "kube-apiserver-ip-10-0-0-51.ec2.internal_kube-system(fe921e27127eb782227d38f55946771e)": pods "kube-apiserver-ip-10-0-0-51.ec2.internal" is forbidden: node ip-10-0-0-51 can only create pods with spec.nodeName set to itself"

The first patch solves that situation for kubeadm join, but something similar should be done for kubeadm init.

@luxas luxas added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jul 6, 2017
@rushabh268

Seeing the same issue as @scholzj on CentOS 7.3 with 1.7.0. Falling back to 1.6.6 for now.

@GheRivero

Second part of the fix: kubernetes/kubernetes#48594
It adds the --node-name flag to kubeadm init.


trung commented Jul 7, 2017

Workaround for EC2 Ubuntu:
#64 (comment)


luxas commented Jul 14, 2017

Fixed with v1.7.1


gtaylor commented Aug 18, 2017

I'm still experiencing this with 1.7.3 during kubeadm init.


luxas commented Aug 18, 2017

@gtaylor You must set --node-name specifically to the name the node will have later (you should query the AWS API for it; see the sketch above).


gtaylor commented Aug 18, 2017

After reading a handful of these issue threads, I think my issue is that I'm setting a non-default hostname, which leads to the indefinite deadlock, even when the kubelet's --hostname-override matches --node-name.

If I am understanding correctly, you can't use the Node Authorizer with a hostname that doesn't match the one returned by the AWS API, so you're always going to have to stick with the default AWS ip-<x>-<x>-<x>-<x> hostnames. You can, however, pass a --node-name if you have kept the AWS-provided hostname.
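A quick way to check whether a box is in the safe configuration; a hypothetical sanity check, assuming the provider compares against the metadata service's local-hostname:

# The node authorizer is only satisfied when these two names agree.
[ "$(hostname -f)" = "$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)" ] \
  && echo "hostname matches EC2 metadata" \
  || echo "mismatch: expect the deadlock described in this issue"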

I hope I am misunderstanding something, because we strongly prefer to change our master and node hostnames!


neolit123 commented Sep 6, 2017

I'm getting this hang:
[apiclient] All control plane components are healthy after XX seconds

with Kubernetes v1.7.5 on Ubuntu 17.04.

Multiple online tutorials, including the official one here, do not work out of the box:
https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/

So I found this comment on GitHub:
kubernetes/kubernetes#33544 (comment)

After running that script, kubeadm init started working.


petergardfjall commented Nov 21, 2017

I ran into this issue as well and got it to work by updating the hostname of my EC2 instance to match the one reported by the EC2 metadata service. On Ubuntu I added the following steps to my master boot script:

sudo apt-get update && sudo apt-get install -y curl
EC2_HOSTNAME="$(curl -s 169.254.169.254/latest/meta-data/hostname)"
echo "127.0.0.1 ${EC2_HOSTNAME}" | sudo tee -a /etc/hosts
echo "${EC2_HOSTNAME}" | sudo tee /etc/hostname
sudo hostname "${EC2_HOSTNAME}"

(adding to /etc/hosts and /etc/hostname should ensure that the hostname change survives a reboot)
