
kubeadm on AWS cloud provider deadlocks because of insufficient permissions #330

Closed
namliz opened this issue Jul 1, 2017 · 14 comments

Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Milestone
v1.7

Comments


namliz commented Jul 1, 2017

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version: &version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T22:55:19Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.0", GitCommit:"d3ada0119e776222f11ec7945e6d860061339aad", GitTreeState:"clean", BuildDate:"2017-06-29T23:15:59Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: aws
  • OS (e.g. from /etc/os-release): CentOS 7
  • Kernel (e.g. uname -a): Linux ip-172-20-0-134.us-west-2.compute.internal 3.10.0-514.21.2.el7.x86_64 #1 SMP Tue Jun 20 12:24:47 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

What happened?

[root@ip-172-20-0-134 centos]# kubeadm init --config=/etc/kubernetes/kubeadm.conf
[kubeadm] WARNING: kubeadm is in beta, please do not use it for production clusters.
[init] Using Kubernetes version: v1.7.0
[init] Using Authorization modes: [Node RBAC]
[init] WARNING: For cloudprovider integrations to work --cloud-provider must be set for all kubelets in the cluster.
	(/etc/systemd/system/kubelet.service.d/10-kubeadm.conf should be edited for this purpose)
[preflight] Running pre-flight checks
[certificates] Generated CA certificate and key.
[certificates] Generated API server certificate and key.
[certificates] API Server serving cert is signed for DNS names [ip-172-20-0-134 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 172.20.0.134]
[certificates] Generated API server kubelet client certificate and key.
[certificates] Generated service account token signing key and public key.
[certificates] Generated front-proxy CA certificate and key.
[certificates] Generated front-proxy client certificate and key.
[certificates] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/admin.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/controller-manager.conf"
[kubeconfig] Wrote KubeConfig file to disk: "/etc/kubernetes/scheduler.conf"
[apiclient] Created API client, waiting for the control plane to become ready
[apiclient] All control plane components are healthy after 78.500853 seconds

...hangs forever
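As an aside, the [init] WARNING above points at the kubelet systemd drop-in. One way to set --cloud-provider there might look like the following sketch; that 10-kubeadm.conf passes $KUBELET_EXTRA_ARGS through to the kubelet is an assumption about how the package lays things out:

# Hypothetical drop-in appending --cloud-provider=aws via KUBELET_EXTRA_ARGS.
cat <<'EOF' | sudo tee /etc/systemd/system/kubelet.service.d/20-cloud-provider.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--cloud-provider=aws"
EOF
sudo systemctl daemon-reload && sudo systemctl restart kubelet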

What you expected to happen?

It's supposed to continue with:

[apiclient] Waiting for at least one node to register
[apiclient] First node has registered after 3.002484 seconds
[token] Using token: cncfci.geneisbatman4242
[apiconfig] Created RBAC rules
[addons] Created essential addon: kube-proxy
[addons] Created essential addon: kube-dns

Your Kubernetes master has initialized successfully!

To start using your cluster, you need to run (as a regular user):

  sudo cp /etc/kubernetes/admin.conf $HOME/
  sudo chown $(id -u):$(id -g) $HOME/admin.conf
  export KUBECONFIG=$HOME/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  http://kubernetes.io/docs/admin/addons/

You can now join any number of machines by running the following on each node
as root:

  kubeadm join --token cncfci.geneisbatman4242 172.20.0.134:6443

How to reproduce it (as minimally and precisely as possible)?

  • stop kubelet
  • kubeadm reset
  • yum remove kubeadm
  • yum install kubeadm-1.6.6
  • kubeadm init --config=/etc/kubernetes/kubeadm.conf

Downgrading kubeadm to 1.6.6 is the only change, and this time the init succeeds (that run is where the expected output above comes from).

Anything else we need to know?

This seems to be a regression/repeat of kubernetes/kubernetes#43815, which is a bummer.

Kubernetes v1.7.0 was released two days ago; does kubeadm v1.7.0 get built and released in tandem automatically or something? Does nobody test it first?

People who didn't pin to kubeadm v1.6.6 will suddenly get a bunch of broken clusters this week.

The main error to look out for is once again:
Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

This reproducibly happens only on v1.7.0.

I really hope we do better and catch such things in time for v1.8.0. I am working on an automated test suite for this; if anybody is interested in that, please ping me.

@namliz namliz changed the title to "kubeadm/kubernetes v1.7 regression - kubeadm init hangs forever" Jul 1, 2017

scholzj commented Jul 3, 2017

I have the same problem, also with CentOS 7 on AWS, and in my case 1.6.6 also works fine.

In my case the kubelet does not seem to be allowed to register itself with the credentials created by kubeadm (/etc/kubernetes/kubelet.conf):

Jul 03 22:07:41 ip-10-0-0-33 kubelet[12427]: E0703 22:07:41.154368   12427 kubelet_node_status.go:106] Unable to register node "ip-10-0-0-33.eu-central-1.compute.internal" with API server: nodes "ip-10-0-0-33.eu-central-1.compute.internal" is forbidden: node ip-10-0-0-33 cannot modify node ip-10-0-0-33.eu-central-1.compute.internal

However, even when I reconfigure the kubelet so it can register (for example, by giving it the /etc/kubernetes/admin.conf credentials), or even get the node to the Ready state by installing the network plugin, kubeadm still seems to be stuck at the same point and doesn't move forward. So I'm not sure what kubeadm is actually waiting for.

I have no clue which of these is the actual cause and which is the consequence.


luxas commented Jul 4, 2017

This seems to be a regression/repeat of kubernetes/kubernetes#43815, which is a bummer.

No, it's absolutely not that issue.

Kubernetes v1.7.0 got released two days ago and, well, does kubeadm v1.7.0 get built and released in tandem automatically or something? Nobody tests it first?

We absolutely test things a lot. We have automated CI e2e tests that are green: https://k8s-testgrid.appspot.com/sig-cluster-lifecycle#kubeadm-gce-1.7
However, as with everything, it's hard to test in exactly your environment.

I think the cloud provider is automatically detected by the kubelet (see the --cloud-provider flag description for the kubelet). That means the kubelet uses custom AWS logic when creating the Node object, and it sets the node name based on calls to the AWS API; hence Node Name == ip-10-0-0-33.eu-central-1.compute.internal. See the sketch just below.
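To see which name the AWS side will report, you can query the EC2 instance metadata service directly. A minimal sketch; that the provider derives the node name from the local-hostname field is my assumption:

# Ask the EC2 metadata service for the instance's private DNS name.
curl -s http://169.254.169.254/latest/meta-data/local-hostname
# prints e.g. ip-10-0-0-33.eu-central-1.compute.internal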

I want to remind you that cloud provider integrations are experimental: kubeadm will only fully support the out-of-tree cloud providers. The current in-tree providers may well work fine, but we can't promise that they will.

This is such a case. The kubelet talks to the AWS API and gets a node name that differs from the hostname (ip-10-0-0-33.eu-central-1.compute.internal vs ip-10-0-0-33). The kubelet has CN=system:node:ip-10-0-0-33, but needs CN=system:node:ip-10-0-0-33.eu-central-1.compute.internal in order to be able to modify itself. The Node Authorizer, the latest security feature enabled in kubeadm, therefore sees the Node API object the cloud provider has created and the kubelet actually running on that node as two different identities.
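You can verify the identity mismatch by inspecting the kubelet's client certificate. A sketch, assuming kubeadm embedded the certificate data in /etc/kubernetes/kubelet.conf (as v1.7 does):

# Extract the embedded client certificate and print its subject.
grep 'client-certificate-data' /etc/kubernetes/kubelet.conf \
  | awk '{print $2}' | base64 -d \
  | openssl x509 -noout -subject
# expected output resembles: subject= /O=system:nodes/CN=system:node:ip-10-0-0-33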

So what can we do? As you can see, the flow when bootstrapping a cluster with a cloud provider differs from the normal flow, and even varies a lot between providers (AWS is the only provider that does this, AFAIK).
I think fixing #64 will fix this issue, but you would still have to know the node name beforehand and pass it to kubeadm init.

Do you want to contribute a fix for #64?

I really hope we do better and catch such things in time for v1.8.0 and am working on an automated test suite for this, if anybody is interested in that please ping me.

I have no AWS credits, so if you want to contribute AWS test results, that would be great. Thanks 👍

Thanks @Zilman and @scholzj for the bug report, we appreciate it.
Please bear in mind that this is a very specific edge case that only affects AWS, due to how the cloud provider code there works. It has nothing to do with what happened with v1.6.
And it is much better to secure things more rather than less, so I stand by enabling the Node Authorizer.

@luxas luxas added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jul 5, 2017
@luxas luxas added this to the v1.7 milestone Jul 5, 2017
@luxas luxas changed the title from "kubeadm/kubernetes v1.7 regression - kubeadm init hangs forever" to "kubeadm on AWS cloud provider deadlocks because of insufficient permissions" Jul 5, 2017

luxas commented Jul 6, 2017

@GheRivero volunteered to work on this 🎉 (i.e. on #64, which will solve this problem).
I can't assign him, so I'm commenting instead...


luxas commented Jul 6, 2017

Status report:

First of two PRs is up: kubernetes/kubernetes#48538

I expect this to be fixed in v1.7.1, so that you can specify the node name yourself with --node-name.
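Once that lands, the invocation might look like the following. This is a sketch; that the right value is the metadata service's local-hostname is an assumption based on the mismatch described above:

# Feed kubeadm the node name the AWS cloud provider will use.
NODE_NAME="$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)"
kubeadm init --node-name "${NODE_NAME}"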

@GheRivero

I can confirm the same behavior with Ubuntu 16.04:
"Failed creating a mirror pod for "kube-apiserver-ip-10-0-0-51.ec2.internal_kube-system(fe921e27127eb782227d38f55946771e)": pods "kube-apiserver-ip-10-0-0-51.ec2.internal" is forbidden: node ip-10-0-0-51 can only create pods with spec.nodeName set to itself"

The first patch solves that situation for kubeadm join, but something similar should be done for kubeadm init.

@luxas luxas added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jul 6, 2017
@rushabh268

Seeing the same issue as @scholzj on CentOS 7.3 with 1.7.0. Falling back to 1.6.6 for now.

@GheRivero

Second part of the fix: kubernetes/kubernetes#48594
It adds the --node-name flag to kubeadm init.


trung commented Jul 7, 2017

Workaround for EC2 Ubuntu:
#64 (comment)


luxas commented Jul 14, 2017

Fixed with v1.7.1


gtaylor commented Aug 18, 2017

I'm still experiencing this with 1.7.3 during kubeadm init.


luxas commented Aug 18, 2017

@gtaylor You must set --node-name specifically to the name the node will have later (you should query the AWS API for it; see the sketch above).


gtaylor commented Aug 18, 2017

After reading a handful of these issue threads, I think my issue is that I'm setting a non-default hostname, which leads to the indefinite deadlock, even when the kubelet's --hostname-override matches --node-name.

If I am understanding correctly, you can't use the Node Authorizer with a hostname that doesn't match the one returned by the AWS API, so you're always going to have to stick with the default AWS ip-<x>-<x>-<x>-<x> hostnames. You can, however, pass a --node-name if you have kept the AWS-provided hostname.
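A quick way to check whether a box is in the safe configuration; a hypothetical sanity check, assuming the provider compares against the metadata service's local-hostname:

# The node authorizer is only satisfied when these two names agree.
[ "$(hostname -f)" = "$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)" ] \
  && echo "hostname matches EC2 metadata" \
  || echo "mismatch: expect the deadlock described in this issue"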

I hope I am misunderstanding something, because we strongly prefer to change our master and node hostnames!


neolit123 commented Sep 6, 2017

I'm getting this hang:
[apiclient] All control plane components are healthy after XX seconds

with Kubernetes v1.7.5 on Ubuntu 17.04.

Multiple online tutorials, including the official one here, do not work out of the box:
https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/

So I found this comment on GitHub:
kubernetes/kubernetes#33544 (comment)

After running that script, kubeadm init started working.


petergardfjall commented Nov 21, 2017

I ran into this issue as well and got it to work by updating the hostname of my EC2 instance to match the one reported by the EC2 metadata service. On Ubuntu I added the following steps to my master boot script:

sudo apt-get update && sudo apt-get install -y curl
EC2_HOSTNAME="$(curl -s 169.254.169.254/latest/meta-data/hostname)"
echo "127.0.0.1 ${EC2_HOSTNAME}" | sudo tee -a /etc/hosts
echo "${EC2_HOSTNAME}" | sudo tee /etc/hostname
sudo hostname "${EC2_HOSTNAME}"

(adding to /etc/hosts and /etc/hostname should ensure that the hostname change survives a reboot)
