JWS token not being created in cluster-info ConfigMap #335
Comments
I have the same issue; I also had to roll back to 1.6.6.
Facing the same problem as above. Minions are not able to join the cluster and keep failing with:
Make sure we don't use Kubernetes 1.7 until [1] is fixed or we know a workaround for it. [1] kubernetes/kubeadm#335
Tested on a different Mac with the same Vagrant setup - this one bootstrapped successfully. Not sure what differences there could be, aside from the computer where it works being older and slower (which always leads to suspicions of some sort of race condition).
What do the logs of the controller-manager say in the faulty deployment? I'm having trouble reproducing this...
Just ran the bootstrap on the computer where it was failing - it failed again. Here is the controller-manager log from the failing deployment: https://gist.github.com/erhudy/65029423cfbe35983c32ff69d2eec0c8
By way of comparison, here are the controller-manager logs from a successful deployment, immediately after kubeadm joins the first worker to the master: https://gist.github.com/erhudy/102af7fe0394edcfae49c75c9192e187
No question about it:
What does
Strangely enough, while rebuilding the environment again on the computer where it's been consistently failing, it actually joined a worker successfully to the master, so I had to destroy the environment and rebuild it again to get a failure. There definitely seems to be something timing-related going on.
cc @kubernetes/sig-auth-bugs Seems like it sometimes takes a long time to create the auto-bootstrapped RBAC rules... @erhudy The API server is responsible for creating the RBAC rules specified here: https://github.com/kubernetes/kubernetes/tree/master/plugin/pkg/auth/authorizer/rbac It seems like the API server somehow doesn't do that for you (at least not fast enough), which results in a broken state where the BootstrapSigner can't sign the cluster-info ConfigMap, so joining fails. As a workaround, here is what the rule should look like:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
annotations:
rbac.authorization.kubernetes.io/autoupdate: "true"
labels:
kubernetes.io/bootstrapping: rbac-defaults
name: system:controller:bootstrap-signer
namespace: kube-public
rules:
- apiGroups:
- ""
resources:
- configmaps
verbs:
- get
- list
- watch
- apiGroups:
- ""
resourceNames:
- cluster-info
resources:
- configmaps
verbs:
- update
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
- update

Applying that to a faulty deployment should fix it...
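A minimal sketch of checking for and applying that workaround, assuming the Role manifest above is saved locally as bootstrap-signer-role.yaml (the filename is just an example) and kubectl is pointed at the affected cluster:

# Is the bootstrap-signer Role present in kube-public?
kubectl -n kube-public get role system:controller:bootstrap-signer

# If it is missing, apply the manifest from the comment above
kubectl apply -f bootstrap-signer-role.yaml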
If the signer only attempts once, it should wait until the server is healthy (via /healthz) before attempting. If it is done via a controller loop, it should requeue on failure.
The apiserver log would be helpful in that case, as well as the /healthz status.
@erhudy ^
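A hedged sketch of gathering both, assuming a default kubeadm layout where the API server runs as a static pod on the master and serves on port 6443 (adjust the address and pod name for the actual deployment):

# /healthz status of the API server
curl -k https://127.0.0.1:6443/healthz

# API server logs from the static pod (the pod name includes the node name)
kubectl -n kube-system get pods | grep kube-apiserver
kubectl -n kube-system logs kube-apiserver-<master-node-name>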
I'm also hitting this: kubeadm/k8s 1.7.0 on GCE/Ubuntu. I could work around it by applying the missing Role AND RoleBinding to the kube-public namespace.

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
annotations:
rbac.authorization.kubernetes.io/autoupdate: "true"
labels:
kubernetes.io/bootstrapping: rbac-defaults
name: system:controller:bootstrap-signer
namespace: kube-public
rules:
- apiGroups:
- ""
resources:
- configmaps
verbs:
- get
- list
- watch
- apiGroups:
- ""
resourceNames:
- cluster-info
resources:
- configmaps
verbs:
- update
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
- update
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
annotations:
rbac.authorization.kubernetes.io/autoupdate: "true"
labels:
kubernetes.io/bootstrapping: rbac-defaults
name: system:controller:bootstrap-signer
namespace: kube-public
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: system:controller:bootstrap-signer
subjects:
- kind: ServiceAccount
name: bootstrap-signer
namespace: kube-system
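Once both objects are applied, a rough way to verify the signer has done its work and retry the join (token, master address, and node names are placeholders, not values from this thread):

# The signed token shows up as a data key named jws-kubeconfig-<token-id>
kubectl -n kube-public get configmap cluster-info -o yaml | grep jws-kubeconfig-

# Then retry the join from the worker (1.7-era syntax)
kubeadm join --token <token> <master-ip>:6443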
healthz status while the join attempts from the worker are ongoing and failing:
Looks like there's something causing the kube-public namespace to not be created in time?
Namespace doesn't exist at reconcile time:
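A rough way to check that from the outside, assuming kubectl access to the master:

# Does kube-public exist yet, and when was it created?
kubectl get namespace kube-public -o yaml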
Fixed by kubernetes/kubernetes#48480
The kube-public namespace is created by the bootstrap controller, which can race with storage post-start hooks. |
I suppose you're talking about this code: https://github.com/kubernetes/kubernetes/blob/master/pkg/master/controller.go#L148 Yeah, very unlucky that our e2e CI didn't catch this race condition a single time :/ Thanks to @erhudy @alexpekurovsky @shekharupland and @Dirbaio we are now aware of it and could fix the race condition between the controller-manager and apiserver post-start hooks 👍!
…leased When nodes try to join a master, they can fail because cluster-info is not updated with the expected tokens. To work around that, add the required Role and RoleBinding to let the token signer do its work in time. See kubernetes/kubeadm#335 for details about the workaround. Signed-off-by: Roman Mohr <rmohr@redhat.com>
#335 (comment) worked for me too. Thanks for the fix and the workaround.
There is a race condition in k8s 1.7.0 that prevents it from working with kubicle. Sometimes the worker nodes are unable to join the cluster. The k8s bug is here: kubernetes/kubeadm#335 (comment) and will be fixed in k8s 1.7.1. In the meantime we fix the k8s version to 1.6.7 which is known to work well. Signed-off-by: Mark Ryan <mark.d.ryan@intel.com>
An update from my lab tests with three different scenarios (VirtualBox, Google Cloud, VMware on-premise): we face this problem only on Oracle VirtualBox, and only with a different value for "--apiserver-advertise-address". Is that the issue?
Hi, the OS is CentOS 7.
@praparn @dimitrijezivkovic If you think you've found a new issue with v1.8.1, please create a new issue with more details.
Same problem with 1.8. Any possible fix with:
@vglisin That is because the token has expired. We have already announced this policy in the kubeadm v1.7 CLI output and in the release notes: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.8.md#behavioral-changes
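A quick way to check for that situation and mint a fresh token (commands as in kubeadm v1.8; run on the master):

# An empty list, or an expired TTL, means the original bootstrap token is gone
kubeadm token list

# Create a new token to use with kubeadm join
kubeadm token create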
Note that the issue you're describing is vastly different from the topic of this issue. That's why I asked you to open new issues instead of commenting on old, resolved ones. Also, please keep in mind that this is open source: if you find things that are sub-optimal, no one is going to stop you from contributing a good change.
That was my problem @luxas, I missed that piece of information and was trying to join with an expired token. Thank you :)
Thanks @luxas.
@luxas same here. In case you are using kubespray, do the following to check whether the problem is exactly that: on the master node, run this command:
On worker node, edit Then, run: |
Even though kubeadm_token_ttl=0 is set, which means the kubeadm token never expires, it is not present in `kubeadm token list` after the cluster is provisioned (at least after it has been running for some time), and there is an issue regarding this (kubernetes/kubeadm#335), so we need to create a new temporary token during the cluster upgrade.
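For illustration, a temporary token with a bounded lifetime could be created like this (the 1h TTL is just an example value, not something specified in the thread):

kubeadm token create --ttl 1h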
gnature for token ID "w30hqq", will try again

kubeadm version: &version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c6
kubeadm init --token-ttl 0 has no effect.
I am facing this issue intermittently. While joining multiple control-plane nodes in a loop, one or two fail while the others succeed. I am on cri-containerd-cni 1.3.4.
There is a controller that is responsible for adding the bootstrap tokens in "cluster-info". kubeadm waits for that to happen for a while. If the token is never added, there must be a problem elsewhere - e.g. the controller in question or the controller-manager.
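If that happens, a hedged first check is the controller-manager's own logs (the pod name is an assumption based on the default static-pod naming):

# Look for token-signing or bootstrap-related errors in the controller-manager
kubectl -n kube-system logs kube-controller-manager-<node-name> | grep -i -e token -e bootstrap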
Also, it is worth checking whether there are any validating webhooks configured that may be unreachable at the time. This could prevent the update to 'cluster-info'.
If there are problems, they should show up as related errors in the API server logs.
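A hedged way to check for that (the API server pod name is an assumption based on the default static-pod naming):

# Any admission webhooks that could block ConfigMap updates?
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Related errors should appear in the API server logs
kubectl -n kube-system logs kube-apiserver-<node-name> | grep -i -e webhook -e cluster-info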
Versions

kubeadm version (use kubeadm version): 1.7.0, commit d3ada0119e776222f11ec7945e6d860061339aad

Environment:
- Kubernetes version (use kubectl version): 1.7.0, commit d3ada0119e776222f11ec7945e6d860061339aad
- Kernel (e.g. uname -a): 4.4.0-81-generic

What happened?
The current version of kubeadm does not appear to be inserting the JWS token into the cluster-info ConfigMap. I tried providing it a token that I want it to use (the mode used by the Vagrantfile referenced above), and when that failed, resetting kubeadm and re-running init while allowing it to generate the token itself. Both modes failed. The consequence of this is that joining nodes to the master is not possible unless the JWS token is manually created and inserted into the cluster-info ConfigMap.

Rolling back to 1.6.6 (in the Vagrantfile, modifying the package installation line to apt-get install -y docker.io kubelet=1.6.6-00 kubeadm=1.6.6-00 kubectl=1.6.6-00 kubernetes-cni) causes everything to function as expected.

When I compared the config maps generated by 1.6.6 versus 1.7.0, the JWS key is indeed missing from 1.7.0. In 1.6.6, under the top-level data key, there was a key beginning with jws-kubeconfig-, with its value being a JWS token. No such key exists when the cluster is bootstrapped by kubeadm 1.7.0.

What you expected to happen?
Joining workers to the master should be possible in 1.7.0 without manually editing the cluster-info ConfigMap.
How to reproduce it (as minimally and precisely as possible)?
Run the Vagrantfile from https://github.com/erhudy/kubeadm-vagrant with vagrant up. When it attempts to join the first worker, kubeadm will fail with the error message "there is no JWS signed token in the cluster-info ConfigMap".

Anything else we need to know?
No.
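For reference, a hedged way to inspect what the report describes, assuming kubectl access to the master: on a healthy cluster the ConfigMap's data section contains both a kubeconfig entry and a jws-kubeconfig-<token-id> entry, while on an affected 1.7.0 cluster the latter is missing.

kubectl -n kube-public get configmap cluster-info -o yaml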