
JWS token not being created in cluster-info ConfigMap #335

Closed
erhudy opened this issue Jul 3, 2017 · 38 comments

Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@erhudy

erhudy commented Jul 3, 2017

Versions

kubeadm version (use kubeadm version): 1.7.0, commit d3ada0119e776222f11ec7945e6d860061339aad

Environment:

  • Kubernetes version (use kubectl version): 1.7.0, commit d3ada0119e776222f11ec7945e6d860061339aad
  • Cloud provider or hardware configuration: Vagrant environment being configured by https://github.com/erhudy/kubeadm-vagrant
  • OS (e.g. from /etc/os-release): Xenial 16.04.2
  • Kernel (e.g. uname -a): 4.4.0-81-generic
  • Others: N/A

What happened?

The current version of kubeadm does not appear to be inserting the JWS token into the cluster-info ConfigMap. I first tried providing the token I wanted it to use (the mode used by the Vagrantfile referenced above); when that failed, I reset kubeadm and re-ran init, allowing it to generate the token itself. Both modes failed. The consequence is that joining nodes to the master is not possible unless the JWS token is manually created and inserted into the cluster-info ConfigMap.

Rolling back to 1.6.6 (in the Vagrantfile, modifying the package installation line to apt-get install -y docker.io kubelet=1.6.6-00 kubeadm=1.6.6-00 kubectl=1.6.6-00 kubernetes-cni) causes everything to function as expected.

When I compared the ConfigMaps generated by 1.6.6 versus 1.7.0, the JWS key is indeed missing from 1.7.0. In 1.6.6, under the top-level data key, there was a key beginning with jws-kubeconfig-, with its value being a JWS token. No such key exists when the cluster is bootstrapped by kubeadm 1.7.0.
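
For reference, a quick way to check for the signature (a minimal sketch; the exact token ID suffix depends on the token in use):

# Inspect the ConfigMap that kubeadm publishes for joining nodes.
kubectl -n kube-public get configmap cluster-info -o yaml

# On a working cluster, .data contains an entry of the form
#   jws-kubeconfig-<token-id>: <JWS signature>
# On the broken 1.7.0 clusters described here, that entry is missing.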

What you expected to happen?

Joining workers to the master should be possible in 1.7.0 without manually editing the cluster-info ConfigMap.

How to reproduce it (as minimally and precisely as possible)?

Run the Vagrantfile from https://github.com/erhudy/kubeadm-vagrant with vagrant up. When it attempts to join the first worker, kubeadm will fail with the error message "there is no JWS signed token in the cluster-info ConfigMap".

Anything else we need to know?

No.

@alexpekurovsky

I have the same issue and also had to roll back to 1.6.6.
However, kubeadm 1.6.6 with kubelet 1.7.0 and Kubernetes 1.7.0 also works as expected, so the problem is in kubeadm.
I'm using CentOS 7.2 instances on AWS.

@shekharoracle

Facing the same problem as above. Minions are not able to join the cluster and keep on failing with
Failed to connect to API Server "host:6443": there is no JWS signed token in the cluster-info ConfigMap. This token id "4e9c3a" is invalid for this cluster, can't connect

rmohr added a commit to rmohr/kubevirt that referenced this issue Jul 4, 2017
Make sure we don't use Kubernetes 1.7 until [1] is fixed or we know a
workaround for it.

[1] kubernetes/kubeadm#335
@erhudy
Author

erhudy commented Jul 4, 2017

Tested on a different Mac with the same Vagrant setup - this one bootstrapped successfully. Not sure what differences there could be, aside from the computer where it's functional being older and slower (which always leads to suspicions of some sort of race condition).

@luxas
Member

luxas commented Jul 4, 2017

What do the logs of the controller-manager say in the faulty deployment?
The problem seems to be in the controller-manager, since the cluster-info ConfigMap isn't updated.

I'm having trouble reproducing this...

@erhudy
Author

erhudy commented Jul 4, 2017

Just ran the bootstrap on the computer where it was failing - failed again. Here is the controller-manager log from the failing deployment: https://gist.github.com/erhudy/65029423cfbe35983c32ff69d2eec0c8

@erhudy
Author

erhudy commented Jul 4, 2017

By way of comparison, here are the controller-manager logs from a successful deployment, immediately after kubeadm joins the first worker to the master: https://gist.github.com/erhudy/102af7fe0394edcfae49c75c9192e187

@luxas
Member

luxas commented Jul 4, 2017

No question about it:

E0704 17:15:23.523852       1 reflector.go:201] k8s.io/kubernetes/pkg/controller/bootstrap/bootstrapsigner.go:151: Failed to list *v1.ConfigMap: User "system:serviceaccount:kube-system:bootstrap-signer" cannot list configmaps in the namespace "kube-public". (get configmaps)
E0704 17:15:24.527348       1 reflector.go:201] k8s.io/kubernetes/pkg/controller/bootstrap/bootstrapsigner.go:151: Failed to list *v1.ConfigMap: User "system:serviceaccount:kube-system:bootstrap-signer" cannot list configmaps in the namespace "kube-public". (get configmaps)
E0704 17:15:25.530671       1 reflector.go:201] k8s.io/kubernetes/pkg/controller/bootstrap/bootstrapsigner.go:151: Failed to list *v1.ConfigMap: User "system:serviceaccount:kube-system:bootstrap-signer" cannot list configmaps in the namespace "kube-public". (get configmaps)
E0704 17:15:26.532794       1 reflector.go:201] k8s.io/kubernetes/pkg/controller/bootstrap/bootstrapsigner.go:151: Failed to list *v1.ConfigMap: User "system:serviceaccount:kube-system:bootstrap-signer" cannot list configmaps in the namespace "kube-public". (get configmaps)
E0704 17:15:27.535508       1 reflector.go:201] k8s.io/kubernetes/pkg/controller/bootstrap/bootstrapsigner.go:151: Failed to list *v1.ConfigMap: User "system:serviceaccount:kube-system:bootstrap-signer" cannot list configmaps in the namespace "kube-public". (get configmaps)
E0704 17:15:28.537732       1 reflector.go:201] k8s.io/kubernetes/pkg/controller/bootstrap/bootstrapsigner.go:151: Failed to list *v1.ConfigMap: User "system:serviceaccount:kube-system:bootstrap-signer" cannot list configmaps in the namespace "kube-public". (get configmaps)
E0704 17:15:29.541843       1 reflector.go:201] k8s.io/kubernetes/pkg/controller/bootstrap/bootstrapsigner.go:151: Failed to list *v1.ConfigMap: User "system:serviceaccount:kube-system:bootstrap-signer" cannot list configmaps in the namespace "kube-public". (get configmaps)
E0704 17:15:30.543980       1 reflector.go:201] k8s.io/kubernetes/pkg/controller/bootstrap/bootstrapsigner.go:151: Failed to list *v1.ConfigMap: User "system:serviceaccount:kube-system:bootstrap-signer" cannot list configmaps in the namespace "kube-public". (get configmaps)

What does kubectl -n kube-public get role system:controller:bootstrap-signer -oyaml output?

@erhudy
Author

erhudy commented Jul 4, 2017

ubuntu@master:~$ kubectl -n kube-public get role system:controller:bootstrap-signer -oyaml
Error from server (NotFound): roles.rbac.authorization.k8s.io "system:controller:bootstrap-signer" not found

Strangely enough, while rebuilding the environment again on the computer where it's been consistently failing, it actually joined a worker successfully to the master, so I had to destroy the environment and rebuild it again to get a failure. There definitely seems to be something timing-related going on.

@luxas
Member

luxas commented Jul 4, 2017

cc @kubernetes/sig-auth-bugs

Seems like it takes a lot of time sometimes to create auto-bootstrapped RBAC rules...

@erhudy The API server is responsible for creating RBAC rules specified here: https://github.com/kubernetes/kubernetes/tree/master/plugin/pkg/auth/authorizer/rbac

It seems like the API server somehow doesn't do that for you (at least not fast enough), which results in a broken state where the BootstrapSigner can't sign the cluster-info ConfigMap so that kubeadm join can succeed.

As a workaround, here is what the Role should look like:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:controller:bootstrap-signer
  namespace: kube-public
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resourceNames:
  - cluster-info
  resources:
  - configmaps
  verbs:
  - update
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
  - update

Applying that to a faulty deployment should fix it...

@liggitt
Member

liggitt commented Jul 4, 2017

If the signer only attempts once, it should wait until the server is healthy (via /healthz) before attempting. If it is done via a controller loop, it should requeue on failure

@luxas
Member

luxas commented Jul 4, 2017

@liggitt I think the signer tries again and again and again (see the log), but the RBAC Role for it just isn't created, as @erhudy confirmed with the kubectl command.

@liggitt
Member

liggitt commented Jul 4, 2017

apiserver log would be helpful in that case, as well as the /healthz status

@luxas
Member

luxas commented Jul 4, 2017

@erhudy ^

@luxas luxas added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jul 4, 2017
@Dirbaio

Dirbaio commented Jul 4, 2017

I'm also hitting this: kubeadm/k8s 1.7.0 on GCE/Ubuntu.

I could work around it by applying the missing Role AND RoleBinding to the kube-public namespace.

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:controller:bootstrap-signer
  namespace: kube-public
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resourceNames:
  - cluster-info
  resources:
  - configmaps
  verbs:
  - update
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
  - update
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
  name: system:controller:bootstrap-signer
  namespace: kube-public
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: system:controller:bootstrap-signer
subjects:
- kind: ServiceAccount
  name: bootstrap-signer
  namespace: kube-system
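
A minimal way to apply this workaround (assuming both manifests above are saved to a file named bootstrap-signer-rbac.yaml; the filename is arbitrary):

# Create the missing Role and RoleBinding in kube-public.
kubectl apply -f bootstrap-signer-rbac.yaml

# Shortly afterwards the bootstrap-signer should sign cluster-info;
# verify that a jws-kubeconfig-<token-id> key appears under .data.
kubectl -n kube-public get configmap cluster-info -o yaml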

@erhudy
Author

erhudy commented Jul 4, 2017

@erhudy
Author

erhudy commented Jul 4, 2017

healthz status while the join attempts from the worker are ongoing and failing:

ubuntu@master:~$ curl -k https://10.96.0.1/healthz
ok

@erhudy
Author

erhudy commented Jul 4, 2017

Looks like there's something causing the kube-public namespace to not be created in time?

E0704 22:35:57.681740       1 storage_rbac.go:235] \
unable to reconcile role.rbac.authorization.k8s.io/system:controller:bootstrap-signer \
in kube-public: namespaces "kube-public" not found

@liggitt
Member

liggitt commented Jul 4, 2017

Namespace doesn't exist at reconcile time:

E0704 22:35:57.681740 1 storage_rbac.go:235] unable to reconcile role.rbac.authorization.k8s.io/system:controller:bootstrap-signer in kube-public: namespaces "kube-public" not found

@liggitt
Member

liggitt commented Jul 4, 2017

Fixed by kubernetes/kubernetes#48480

@liggitt
Member

liggitt commented Jul 4, 2017

The kube-public namespace is created by the bootstrap controller, which can race with storage post-start hooks.

@luxas
Member

luxas commented Jul 5, 2017

bootstrap controller

I suppose you're talking about this code: https://github.com/kubernetes/kubernetes/blob/master/pkg/master/controller.go#L148

Yeah, very unlucky that our e2e CI didn't catch this race condition a single time :/

Thanks to @erhudy @alexpekurovsky @shekharupland and @Dirbaio we are now aware of it and could fix the race condition between the controller-manager and apiserver post-start hooks 👍!

@luxas luxas modified the milestones: v1.8, v1.7 Jul 5, 2017
rmohr added a commit to rmohr/kubevirt that referenced this issue Jul 6, 2017
…leased

When nodes try to join a master, they can fail because cluster-info is
not updated with the expected tokens. To work around that, add the
required Role and RoleBinding to let the token signer do its work in
time. See kubernetes/kubeadm#335 for details
about the workaround.

Signed-off-by: Roman Mohr <rmohr@redhat.com>
@rmohr

rmohr commented Jul 6, 2017

#335 (comment) worked for me too. Thanks for the fix and the workaround.

rmohr added a commit to rmohr/kubevirt that referenced this issue Jul 6, 2017
…leased

When nodes try to join a master, they can fail because cluster-info is
not updated with the expected tokens. To work around that, add the
required Role and RoleBinding to let the token signer do its work in
time. See kubernetes/kubeadm#335 for details
about the workaround.

Signed-off-by: Roman Mohr <rmohr@redhat.com>
markdryan pushed a commit to markdryan/ciao that referenced this issue Jul 12, 2017
There is a race condition in k8s 1.7.0 that prevents it from working with
kubicle.  Sometimes the worker nodes are unable to join the cluster.

The k8s bug is here:

kubernetes/kubeadm#335 (comment)

and will be fixed in k8s 1.7.1.  In the meantime we fix the k8s version
to 1.6.7 which is known to work well.

Signed-off-by: Mark Ryan <mark.d.ryan@intel.com>
markdryan pushed a commit to markdryan/ciao that referenced this issue Jul 12, 2017
There is a race condition in k8s 1.7.0 that prevents it from working with
kubicle.  Sometimes the worker nodes are unable to join the cluster.

The k8s bug is here:

kubernetes/kubeadm#335 (comment)

and will be fixed in k8s 1.7.1.  In the meantime we fix the k8s version
to 1.6.7 which is known to work well.

Signed-off-by: Mark Ryan <mark.d.ryan@intel.com>
ctrlaltdel added a commit to infraly/k8s-on-openstack that referenced this issue Jul 12, 2017
@praparn

praparn commented Sep 30, 2017

Updating this issue from my lab tests with 3 different scenarios (VirtualBox, Google Cloud, VMware on-premise). We are facing this problem only on Oracle VirtualBox, where a different value is passed for the --apiserver-advertise-address parameter. Is that the issue?

@dimitrijezivkovic

Hi,
same as @praparn, I'm facing the same problem on a QEMU VM with kubeadm 1.8.1 and with the --apiserver-advertise-address=0.0.0.0 parameter changed.

OS is CentOS 7.

@luxas
Member

luxas commented Oct 12, 2017

@praparn @dimitrijezivkovic If you think you've found a new issue with v1.8.1, please create a new issue with more details.

@vglisin

vglisin commented Oct 17, 2017

Same problem with 1.8. Any possible fix for:
"Failed to connect to API Server "XXXX:6443": there is no JWS signed token in the cluster-info ConfigMap. This token id "fb0a7d" is invalid for this cluster, can't connect".
That same token was OK last week.
Any possible, logical explanation or workaround? Next year you will have this working 100%?

@luxas
Member

luxas commented Oct 17, 2017

@vglisin That is because the token has expired. We have already communicated this policy in the kubeadm v1.7 CLI output and in the release notes: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.8.md#behavioral-changes

The default Bootstrap Token created with kubeadm init v1.8 expires and is deleted after 24 hours by default to limit the exposure of the valuable credential. You can create a new Bootstrap Token with kubeadm token create or make the default token permanently valid by specifying --token-ttl 0 to kubeadm init. The default token can later be deleted with kubeadm token delete.
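
For anyone hitting the expired-token case, a short sketch of the relevant commands, run on the master (the token IDs/values are whatever kubeadm prints for your cluster):

# List the bootstrap tokens currently known to the cluster.
kubeadm token list

# Create a fresh token to use with kubeadm join.
kubeadm token create

# Alternatively, at init time, make the default token non-expiring (use with care).
kubeadm init --token-ttl 0

# Delete a token that is no longer needed.
kubeadm token delete <token-id>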

Note that the issue you're describing is vastly different from the topic of this issue. That's why I asked you to open new issues instead of commenting on old, resolved ones.

Also, I want you to keep in mind that this is open source. If you find things that are sub-optimal, no one is going to stop you from contributing a good change.

@nelsonfassis

That was my problem, @luxas. I missed that piece of information and was trying to join with an expired token. Thank you :)

@vhosakot

Thanks @luxas. kubeadm init --token-ttl 0 works for me. I'll use it as a workaround.

@mlushpenko

@luxas same here. In case you are using kubespray, do the following to check whether the problem is exactly that:

On the master node, run kubeadm token create and copy the generated token.

On the worker node, edit /etc/kubernetes/kubeadm-client.conf and put the new token into the token field.

Then run kubeadm join --config /etc/kubernetes/kubeadm-client.conf --ignore-preflight-errors=all and it should join the cluster.
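
A consolidated sketch of those steps (the config path is the kubespray one mentioned above; replace <new-token> with the value printed by kubeadm token create, and editing the file with sed is just one illustrative option):

# On the master node: create a fresh bootstrap token and copy the printed value.
kubeadm token create

# On the worker node: put the new token into the token field of the kubeadm
# client config used for joining.
sudo sed -i 's/^\(\s*token:\).*/\1 <new-token>/' /etc/kubernetes/kubeadm-client.conf

# Retry the join with that config.
sudo kubeadm join --config /etc/kubernetes/kubeadm-client.conf --ignore-preflight-errors=all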

mlushpenko added a commit to mlushpenko/kubespray that referenced this issue Feb 5, 2018
Even though kubeadm_token_ttl=0 is set, which means the kubeadm token never expires, it is not present in `kubeadm token list` after the cluster is provisioned (at least after it has been running for some time), and there is an issue about this (kubernetes/kubeadm#335), so we need to create a new temporary token during the cluster upgrade.
mlushpenko added a commit to mlushpenko/kubespray that referenced this issue Feb 9, 2018
Even though kubeadm_token_ttl=0 is set, which means the kubeadm token never expires, it is not present in `kubeadm token list` after the cluster is provisioned (at least after it has been running for some time), and there is an issue about this (kubernetes/kubeadm#335), so we need to create a new temporary token during the cluster upgrade.
@ratulb

ratulb commented Jan 14, 2021

[…] The cluster-info ConfigMap does not yet contain a JWS signature for token ID "w30hqq", will try again
I0114 15:02:45.146194 5300 round_trippers.go:445] GET https://10.128.0.57:80/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s 200 OK in 7 milliseconds
I0114 15:02:45.146496 5300 token.go:221] [discovery] The cluster-info ConfigMap does not yet contain a JWS signature for token ID "w30hqq", will try again
I0114 15:02:51.009632 5300 round_trippers.go:445] GET https://10.128.0.57:80/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s 200 OK in 6 milliseconds
I0114 15:02:51.009999 5300 token.go:221] [discovery] The cluster-info ConfigMap does not yet contain a JWS signature for token ID "w30hqq", will try again

kubeadm version: &version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:25:59Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

@ratulb

ratulb commented Jan 14, 2021

kubeadm init --token-ttl 0 - has no effect.

@ratulb

ratulb commented Jan 14, 2021

I am facing this issue intermittently. While joining multiple control-plane nodes in a loop, one or two fail this way while the others succeed.

I am on cri-containerd-cni 1.3.4.

@neolit123
Member

I0114 15:02:51.009999 5300 token.go:221] [discovery] The cluster-info ConfigMap does not yet contain a JWS signature for token ID "w30hqq", will try again

There is a controller that is responsible for adding the bootstrap token signatures to "cluster-info". kubeadm waits for that to happen for a while. If the signature is never added, there must be a problem elsewhere, e.g. in the controller in question or the controller-manager.
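
One way to check that (a sketch; the controller-manager pod name suffix depends on your control-plane node name):

# The bootstrap-signer runs inside the kube-controller-manager; look for
# errors from bootstrapsigner.go in its logs.
kubectl -n kube-system logs kube-controller-manager-<node-name> | grep -i bootstrapsigner

# Also confirm the signer's RBAC Role exists in kube-public (see the earlier comments).
kubectl -n kube-public get role system:controller:bootstrap-signer -o yaml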

@rossigee

Also, worth checking whether there are any validating webhooks configured that may be unreachable at the time. This could prevent the update to 'cluster-info'.

kubectl get validatingwebhookconfiguration

If there are problems, it should show up as related errors in the API server logs.

W0724 12:38:49.791136       1 dispatcher.go:170] Failed calling webhook, failing open mutate.kyverno.svc: failed calling webhook "mutate.kyverno.svc": Post "https://kyverno-svc.kyverno.svc:443/mutate?timeout=3s": dial tcp 10.2.201.182:443: i/o timeout
E0724 12:38:49.791533       1 dispatcher.go:171] failed calling webhook "mutate.kyverno.svc": Post "https://kyverno-svc.kyverno.svc:443/mutate?timeout=3s": dial tcp 10.2.201.182:443: i/o timeout
I0724 12:38:50.128717       1 trace.go:205] Trace[956883512]: "Call mutating webhook" configuration:kyverno-resource-mutating-webhook-cfg,webhook:mutate.kyverno.svc,resource:coordination.k8s.io/v1, Resource=leases,subresource:,operation:UPDATE,UID:7ae21e36-6841-4461-8a21-635325ea3799 (24-Jul-2021 12:38:47.127) (total time: 3001ms):
