Failing secret webhook causes unexpected behavior #4266
@Oats87 can you post the complete log? I am not familiar with that specific error and I'm not finding it anywhere in either the K3s or RKE2 codebase. I can't think of what secret we would need to create in order for the kube-controller-manager manifest to be dropped. The creation of static pod manifests is a completely standalone operation without any dependencies on the apiserver being available.
I suspect that this is coming from the dynamiclistener secret store; it is failing to update the cert secret, and for some reason that’s preventing the supervisor listener from coming up all the way, so it is blocking on the ready check before starting the rest of the control-plane components.
It appears that this is being fixed on the Rancher side by rancher/webhook#240 - it seems that it was not intended for the downstream cluster webhook to include rules requiring it to be called on creation of secrets.
I am testing with a bogus webhook that blocks secret creates:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: rancher.cattle.io
webhooks:
- admissionReviewVersions:
  - v1
  - v1beta1
  clientConfig:
    url: https://httpbin.org/status/502
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: rancher.cattle.io.secrets
  namespaceSelector: {}
  objectSelector: {}
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - secrets
    scope: Namespaced
  sideEffects: NoneOnDryRun
  timeoutSeconds: 5

When node registration is blocked by the failing webhook, the error on the server doesn't make it clear what's going on:
The error log on the agent does indicate that the password was rejected:
I am not able to get RKE2 stuck in a situation where it won't recreate the control-plane static pods. I do see an error from dynamiclistener when it tries to create the secret, but this is just a warning, and all subsequent writes use Update which works since it is not blocked by the webhook: https://github.com/rancher/dynamiclistener/blob/2b62d5cc694d566dd8f3f67eb7b0f6bb46266a65/storage/kubernetes/controller.go#L108
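For reference, here is a minimal sketch of that create-once-then-always-update pattern using client-go. This is not dynamiclistener's actual code and the function names are illustrative, but it shows why only the initial CREATE is exposed to a webhook that only matches secret creation:

package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureSecret attempts the initial CREATE. If a failing admission webhook
// rejects it, the error is only logged as a warning; nothing blocks on it.
func ensureSecret(ctx context.Context, client kubernetes.Interface, secret *corev1.Secret) {
	_, err := client.CoreV1().Secrets(secret.Namespace).Create(ctx, secret, metav1.CreateOptions{})
	if err != nil && !apierrors.IsAlreadyExists(err) {
		log.Printf("warning: failed to create secret %s/%s: %v", secret.Namespace, secret.Name, err)
	}
}

// saveSecret is used for all subsequent writes. Because it issues an UPDATE,
// it is not matched by a webhook whose rules only cover the CREATE operation.
func saveSecret(ctx context.Context, client kubernetes.Interface, secret *corev1.Secret) error {
	_, err := client.CoreV1().Secrets(secret.Namespace).Update(ctx, secret, metav1.UpdateOptions{})
	return err
}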
The static pods all start up fine:

systemd-node-1:/ # kubectl get pod -n kube-system -l tier=control-plane -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cloud-controller-manager-systemd-node-1 1/1 Running 1 (5m41s ago) 5m17s 172.17.0.3 systemd-node-1 <none> <none>
etcd-systemd-node-1 1/1 Running 1 (5m41s ago) 13m 172.17.0.3 systemd-node-1 <none> <none>
kube-apiserver-systemd-node-1 1/1 Running 1 (5m41s ago) 13m 172.17.0.3 systemd-node-1 <none> <none>
kube-controller-manager-systemd-node-1 1/1 Running 1 (5m41s ago) 5m17s 172.17.0.3 systemd-node-1 <none> <none>
kube-proxy-systemd-node-1 1/1 Running 1 (5m41s ago) 5m18s 172.17.0.3 systemd-node-1 <none> <none>
kube-proxy-systemd-node-2 1/1 Running 0 9m52s 172.17.0.4 systemd-node-2 <none> <none>
kube-scheduler-systemd-node-1 1/1 Running 1 (5m41s ago) 5m17s 172.17.0.3 systemd-node-1 <none> <none>
For the node password creation case, I think we could address this by soft-failing with a warning if there is an error when creating the node password secret. I would defer to @macedogm on the security implications of this. Node password secrets are a protection that we have added to ensure that one node cannot impersonate another by joining the cluster with an existing name. The attacker would need to know the node's password, in addition to its hostname. If we just retried the secret creation in the background, instead of failing immediately, this could address the concern around webhooks blocking node joins, while still eventually ensuring that the secret is created once the new node is up and the webhook outage is resolved.
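Purely as a sketch of what "retry the secret creation in the background" could look like (this is not the actual K3s/RKE2 implementation; the secret name, namespace, key, and retry interval are all made up):

package main

import (
	"context"
	"time"

	"github.com/sirupsen/logrus"
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// ensureNodePasswordSecret is a stand-in for whatever persists the node
// password hash; the secret name, namespace, and key are placeholders.
func ensureNodePasswordSecret(ctx context.Context, client kubernetes.Interface, nodeName, passwordHash string) error {
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: nodeName + ".node-password", Namespace: "kube-system"},
		StringData: map[string]string{"hash": passwordHash},
	}
	_, err := client.CoreV1().Secrets(secret.Namespace).Create(ctx, secret, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return nil
	}
	return err
}

// retryNodePasswordSecret lets the node join proceed immediately while the
// secret creation is retried in the background until it succeeds, e.g. once
// the blocking webhook is healthy again.
func retryNodePasswordSecret(ctx context.Context, client kubernetes.Interface, nodeName, passwordHash string) {
	go wait.PollUntilContextCancel(ctx, 15*time.Second, true, func(ctx context.Context) (bool, error) {
		if err := ensureNodePasswordSecret(ctx, client, nodeName, passwordHash); err != nil {
			logrus.Warnf("failed to create node password secret for %s, will retry: %v", nodeName, err)
			return false, nil // keep retrying while the webhook outage persists
		}
		return true, nil
	})
}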
@brandond I wasn't able to understand exactly what behavior will happen when the password creation fails. Will RKE2 proceed by creating the node with an empty password, completely fail to create the node, or go into an unknown state? Nevertheless, I believe that the right approach should be:
@macedogm the workflow is:
It is essentially a shared secret that agents must continue to reuse when connecting to the cluster. Failing to save this does not materially weaken the core Kubernetes security model, but it does potentially allow multiple nodes to join with the same node name, or allow an attacker that has compromised one node to configure it to rejoin the cluster as another node in order to attract privileged workloads (although there are many other more serious attacks that someone could mount if they had control over an agent node).
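As a concrete illustration of that model (not the actual K3s/RKE2 code), the check boils down to comparing the password presented by the agent against a hash stored in a per-node secret; the secret name, namespace, and key below are placeholders:

package main

import (
	"context"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func hashPassword(password string) []byte {
	sum := sha256.Sum256([]byte(password))
	return []byte(hex.EncodeToString(sum[:]))
}

// verifyNodePassword enforces the shared-secret-per-node scheme described above.
func verifyNodePassword(ctx context.Context, client kubernetes.Interface, nodeName, password string) error {
	secret, err := client.CoreV1().Secrets("kube-system").Get(ctx, nodeName+".node-password", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// First join for this node name: no stored password yet. This is the
		// point where the server creates the secret, and where a failing
		// CREATE webhook gets in the way.
		return nil
	}
	if err != nil {
		return err
	}
	// On every later join, the agent must present the same password it used
	// originally; otherwise another host could impersonate this node name.
	if subtle.ConstantTimeCompare(secret.Data["hash"], hashPassword(password)) != 1 {
		return fmt.Errorf("node password validation failed for %q", nodeName)
	}
	return nil
}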
@brandond thanks for the explanation.
Agree.
Based on the previous comment, there are much worse attacks that can happen, but given that we provide at least this basic protection of checking the
Apologies for making this thread longer, but I would just like to be sure that by
There is a chicken-and-egg issue here, in that we need to allow hosts to join while the webhook is down, because there may not be any host for the webhook pod to run on until a new node is joined to the cluster. The proposal is to allow the node to join even when the secret is still pending creation, with an awareness of the fact that this does create a window where the node password protection is not enforced - the period of time between the node joining, and the webhook coming up and allowing the secret to be created.
It's a tricky situation indeed. I guess this mainly affects RKE2 in connection with Rancher, unless other webhooks that provide the same kind of validation are configured. Do you see this happening in other scenarios outside of Rancher? I fail to see a way to 100% mitigate this without causing a possible deadlock. Suppose a possible timing attack happens and a malicious user adds a second node that has the same name as a previously added node. I believe that such an attack would actually be blocked by K8s due to the name uniqueness property, right? If this is true, would it still make sense to implement some kind of "post-webhook is running again" validation to warn about such situations? I'm just being overzealous, because as we discussed before, if such a type of attack happens, the malicious user already has certain privileges in the environment that would allow them to possibly execute other kinds of attacks. Not sure if a note in the docs would be needed, given the K8s property mentioned above.
Yeah, that's probably reasonable. There are currently other code paths that just warn if the node password can't be validated - such as when the etcd or apiserver nodes are starting up, and we cannot access secrets yet. If there is a failure, we just log it. We could probably enhance all of these to create a Kubernetes event or something?
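For what it's worth, a hedged sketch of emitting such an event with client-go's event recorder; the component name and reason string below are invented, not anything RKE2 uses today:

package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// newNodePasswordRecorder wires up a client-go event recorder; the component
// name "rke2-supervisor" is illustrative only.
func newNodePasswordRecorder(client kubernetes.Interface) record.EventRecorder {
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: client.CoreV1().Events("")})
	return broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "rke2-supervisor"})
}

// warnNodePassword emits a Warning event against the Node object instead of
// only logging, so operators can see the failure with `kubectl get events`.
func warnNodePassword(recorder record.EventRecorder, node *corev1.Node, err error) {
	recorder.Eventf(node, corev1.EventTypeWarning, "NodePasswordSecret",
		"could not create or validate node password secret: %v", err)
}

Something along these lines would surface the failure to cluster operators without requiring them to dig through the service logs on each node.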
That would be amazing if we could do it. I guess a
How should we proceed from here, please?
PR is opened on the K3s side. |
/backport v1.26.6+rke2r1 |
/backport v1.25.11+rke2r1 |
/backport v1.24.15+rke2r1 |
Validated on
Environmental Info:
RKE2 Version: v1.26.4+rke2r1
Node(s) CPU architecture, OS, and Version:
Not relevant.
Cluster Configuration:
1 server node
Describe the bug:
The existence of a set of failing webhook configurations in a cluster blocks the RKE2 server from starting up successfully, but RKE2 server still happily starts up, pretending like nothing is wrong.
Steps To Reproduce:
- Configure a failing admission webhook that blocks secret writes
- Delete the kube-controller-manager.yaml static pod manifest from /var/lib/rancher/rke2/agent/pod-manifests
- Restart rke2-server
- kube-controller-manager never comes back online

Expected behavior:
rke2-server errors out on startup, starts failing, or something more useful than just silently failing.

Actual behavior:
rke2-server silently fails creating the kube-controller-manager manifest and the coordinator is stuck without proper feedback as to the fact that rke2-server is not really healthy, beyond looking at symptoms.

Additional context / logs:
I hit this when manually validating:
The rke2-server log message was:

There is a corresponding rancher/rancher issue filed here: rancher/rancher#41613, but while this will fix this instance of the "storm", it is not a holistic solution, as there may be other webhook configurations that manipulate secrets (OPA?).