JWS token not being created in cluster-info ConfigMap #335
Comments
I have the same issue; I also had to roll back to 1.6.6.
Facing the same problem as above. Minions are not able to join the cluster and keep failing with:
Make sure we don't use Kubernetes 1.7 until [1] is fixed or we know a workaround for it. [1] kubernetes/kubeadm#335
Tested on a different Mac with the same Vagrant setup - this one bootstrapped successfully. Not sure what differences there could be, aside from the computer where it works being older and slower (which always leads to suspicions of some sort of race condition).
What do the logs of the controller-manager say in the faulty deployment? I'm having trouble reproducing this...
Just ran the bootstrap on the computer where it was failing - it failed again. Here is the controller-manager log from the failing deployment: https://gist.github.com/erhudy/65029423cfbe35983c32ff69d2eec0c8
By way of comparison, here are the controller-manager logs from a successful deployment, immediately after kubeadm joins the first worker to the master: https://gist.github.com/erhudy/102af7fe0394edcfae49c75c9192e187
No question about it:
What does
Strangely enough, while rebuilding the environment again on the computer where it's been consistently failing, it actually joined a worker successfully to the master, so I had to destroy the environment and rebuild it again to get a failure. There definitely seems to be something timing-related going on.
cc @kubernetes/sig-auth-bugs Seems like it sometimes takes a long time to create the auto-bootstrapped RBAC rules... @erhudy The API server is responsible for creating the RBAC rules specified here: https://github.com/kubernetes/kubernetes/tree/master/plugin/pkg/auth/authorizer/rbac It seems like the API server somehow doesn't do that for you (at least not fast enough), which results in a broken state where the BootstrapSigner can't sign the cluster-info ConfigMap, so joining fails. As a workaround, here is what the rule should look like:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
annotations:
rbac.authorization.kubernetes.io/autoupdate: "true"
labels:
kubernetes.io/bootstrapping: rbac-defaults
name: system:controller:bootstrap-signer
namespace: kube-public
rules:
- apiGroups:
- ""
resources:
- configmaps
verbs:
- get
- list
- watch
- apiGroups:
- ""
resourceNames:
- cluster-info
resources:
- configmaps
verbs:
- update
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
- update

Applying that to a faulty deployment should fix it...
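A minimal sketch of checking for and applying that workaround, assuming the Role manifest above is saved locally as bootstrap-signer-role.yaml (the filename is just an example) and kubectl is pointed at the affected cluster:

# Is the bootstrap-signer Role present in kube-public?
kubectl -n kube-public get role system:controller:bootstrap-signer

# If it is missing, apply the manifest from the comment above
kubectl apply -f bootstrap-signer-role.yaml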
If the signer only attempts once, it should wait until the server is healthy (via /healthz) before attempting. If it is done via a controller loop, it should requeue on failure.
The apiserver log would be helpful in that case, as well as the /healthz status.
@erhudy ^
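A hedged sketch of gathering both, assuming a default kubeadm layout where the API server runs as a static pod on the master and serves on port 6443 (adjust the address and pod name for the actual deployment):

# /healthz status of the API server
curl -k https://127.0.0.1:6443/healthz

# API server logs from the static pod (the pod name includes the node name)
kubectl -n kube-system get pods | grep kube-apiserver
kubectl -n kube-system logs kube-apiserver-<master-node-name>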
I'm also hitting this: kubeadm/k8s 1.7.0 on GCE/Ubuntu. I could work around it by applying the missing Role AND RoleBinding to the kube-public namespace.

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
annotations:
rbac.authorization.kubernetes.io/autoupdate: "true"
labels:
kubernetes.io/bootstrapping: rbac-defaults
name: system:controller:bootstrap-signer
namespace: kube-public
rules:
- apiGroups:
- ""
resources:
- configmaps
verbs:
- get
- list
- watch
- apiGroups:
- ""
resourceNames:
- cluster-info
resources:
- configmaps
verbs:
- update
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
- update
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
annotations:
rbac.authorization.kubernetes.io/autoupdate: "true"
labels:
kubernetes.io/bootstrapping: rbac-defaults
name: system:controller:bootstrap-signer
namespace: kube-public
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: system:controller:bootstrap-signer
subjects:
- kind: ServiceAccount
name: bootstrap-signer
namespace: kube-system
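Once both objects are applied, a rough way to verify the signer has done its work and retry the join (token, master address, and node names are placeholders, not values from this thread):

# The signed token shows up as a data key named jws-kubeconfig-<token-id>
kubectl -n kube-public get configmap cluster-info -o yaml | grep jws-kubeconfig-

# Then retry the join from the worker (1.7-era syntax)
kubeadm join --token <token> <master-ip>:6443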
healthz status while the join attempts from the worker are ongoing and failing:
Looks like there's something causing the kube-public namespace to not be created in time?
Namespace doesn't exist at reconcile time:
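A rough way to check that from the outside, assuming kubectl access to the master:

# Does kube-public exist yet, and when was it created?
kubectl get namespace kube-public -o yaml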
Fixed by kubernetes/kubernetes#48480
The kube-public namespace is created by the bootstrap controller, which can race with storage post-start hooks. |
I suppose you're talking about this code: https://github.com/kubernetes/kubernetes/blob/master/pkg/master/controller.go#L148 Yeah, very unlucky that our e2e CI didn't catch this race condition a single time :/ Thanks to @erhudy @alexpekurovsky @shekharupland and @Dirbaio we are now aware of it and could fix the race condition between the controller-manager and apiserver post-start hooks 👍!
…leased When nodes try to join a master, they can fail because cluster-info is not updated with the expected tokens. To work around that, add the required Role and RoleBinding to let the token signer do its work in time. See kubernetes/kubeadm#335 for details about the workaround. Signed-off-by: Roman Mohr <rmohr@redhat.com>
#335 (comment) worked for me too. Thanks for the fix and the workaround.
There is a race condition in k8s 1.7.0 that prevents it from working with kubicle. Sometimes the worker nodes are unable to join the cluster. The k8s bug is here: kubernetes/kubeadm#335 (comment) and will be fixed in k8s 1.7.1. In the meantime we fix the k8s version to 1.6.7 which is known to work well. Signed-off-by: Mark Ryan <mark.d.ryan@intel.com>
An update from my lab tests with three different scenarios (VirtualBox, Google Cloud, VMware on-premise): we face this problem only on Oracle VirtualBox, and only with a different value for "--apiserver-advertise-address". Is that the issue?
Hi, the OS is CentOS 7.
@praparn @dimitrijezivkovic If you think you've found a new issue with v1.8.1, please create a new issue with more details.
Same problem with 1.8. Any possible fix with:
@vglisin That is because the token has expired. We have already announced this policy in the kubeadm v1.7 CLI output and in the release notes: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.8.md#behavioral-changes
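A quick way to check for that situation and mint a fresh token (commands as in kubeadm v1.8; run on the master):

# An empty list, or an expired TTL, means the original bootstrap token is gone
kubeadm token list

# Create a new token to use with kubeadm join
kubeadm token create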
Note that the issue you're describing is vastly different from the topic of this issue. That's why I asked you to open new issues instead of commenting on old, resolved ones. Also, please keep in mind that this is open source: if you find things that are sub-optimal, no one is going to stop you from contributing a good change.
That was my problem @luxas, I missed that piece of information and was trying to join with an expired token. Thank you :)
Thanks @luxas.
@luxas same here. In case you are using kubespray, do the following to check whether the problem is exactly that: on the master node, run this command:
On worker node, edit Then, run: |
Even though kubeadm_token_ttl=0 is set, which means the kubeadm token never expires, it is not present in `kubeadm token list` after the cluster is provisioned (at least after it has been running for some time), and there is an issue regarding this (kubernetes/kubeadm#335), so we need to create a new temporary token during the cluster upgrade.
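For illustration, a temporary token with a bounded lifetime could be created like this (the 1h TTL is just an example value, not something specified in the thread):

kubeadm token create --ttl 1h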
gnature for token ID "w30hqq", will try again

kubeadm version: &version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c6
kubeadm init --token-ttl 0 has no effect.
I am facing this issue intermittently. While joining multiple control-plane nodes in a loop, one or two fail while the others succeed. I am on cri-containerd-cni 1.3.4.
There is a controller that is responsible for adding the bootstrap tokens in "cluster-info". kubeadm waits for that to happen for a while. If the token is never added, there must be a problem elsewhere - e.g. the controller in question or the controller-manager.
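If that happens, a hedged first check is the controller-manager's own logs (the pod name is an assumption based on the default static-pod naming):

# Look for token-signing or bootstrap-related errors in the controller-manager
kubectl -n kube-system logs kube-controller-manager-<node-name> | grep -i -e token -e bootstrap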
Also, it is worth checking whether there are any validating webhooks configured that may be unreachable at the time. This could prevent the update to 'cluster-info'.
If there are problems, they should show up as related errors in the API server logs.
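A hedged way to check for that (the API server pod name is an assumption based on the default static-pod naming):

# Any admission webhooks that could block ConfigMap updates?
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Related errors should appear in the API server logs
kubectl -n kube-system logs kube-apiserver-<node-name> | grep -i -e webhook -e cluster-info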
Versions

kubeadm version (use kubeadm version): 1.7.0, commit d3ada0119e776222f11ec7945e6d860061339aad

Environment:
- Kubernetes version (use kubectl version): 1.7.0, commit d3ada0119e776222f11ec7945e6d860061339aad
- Kernel (e.g. uname -a): 4.4.0-81-generic

What happened?
The current version of kubeadm does not appear to be inserting the JWS token into the cluster-info ConfigMap. I tried providing it a token that I want it to use (the mode used by the Vagrantfile referenced above), and when that failed, resetting kubeadm and re-running init while allowing it to generate the token itself. Both modes failed. The consequence of this is that joining nodes to the master is not possible unless the JWS token is manually created and inserted into the cluster-info ConfigMap.

Rolling back to 1.6.6 (in the Vagrantfile, modifying the package installation line to apt-get install -y docker.io kubelet=1.6.6-00 kubeadm=1.6.6-00 kubectl=1.6.6-00 kubernetes-cni) causes everything to function as expected.

When I compared the config maps generated by 1.6.6 versus 1.7.0, the JWS key is indeed missing from 1.7.0. In 1.6.6, under the top-level data key, there was a key beginning with jws-kubeconfig-, with its value being a JWS token. No such key exists when the cluster is bootstrapped by kubeadm 1.7.0.

What you expected to happen?
Joining workers to the master should be possible in 1.7.0 without manually editing the cluster-info ConfigMap.
How to reproduce it (as minimally and precisely as possible)?
Run the Vagrantfile from https://github.com/erhudy/kubeadm-vagrant with vagrant up. When it attempts to join the first worker, kubeadm will fail with the error message "there is no JWS signed token in the cluster-info ConfigMap".

Anything else we need to know?
No.
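For reference, a hedged way to inspect what the report describes, assuming kubectl access to the master: on a healthy cluster the ConfigMap's data section contains both a kubeconfig entry and a jws-kubeconfig-<token-id> entry, while on an affected 1.7.0 cluster the latter is missing.

kubectl -n kube-public get configmap cluster-info -o yaml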