
Restarted pods request new CSR instead of using existing cert in k8s secret #35

Open
irlevesque opened this issue Feb 13, 2020 · 8 comments

@irlevesque

What happened:

Pods aren't annotated due to a webhook service availability issue:

...
E0213 16:11:52.169236       1 certificate_manager.go:396] Certificate request was not signed: timed out waiting for the condition
E0213 16:27:20.169637       1 certificate_manager.go:396] Certificate request was not signed: timed out waiting for the condition
2020/02/13 16:30:29 http: TLS handshake error from 10.244.29.0:60094: no serving certificate available for the webhook, is the CSR approved?

This appears to happen when the pod restarts, causing mass CSR requests:

NAME        AGE     REQUESTOR                                                                          CONDITION
csr-28lr7   3h8m    system:serviceaccount:kube-system:aws-pod-id-production-aws-pod-identity-webhook   Pending
csr-2fk56   4h53m   system:serviceaccount:kube-system:aws-pod-id-production-aws-pod-identity-webhook   Pending
csr-2jvt7   3h39m   system:serviceaccount:kube-system:aws-pod-id-production-aws-pod-identity-webhook   Pending
...
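(The list above is kubectl get csr output. The pending requests can be approved in bulk with something like the one-liner below; note that it approves every listed CSR, so it's only safe if nothing else is pending in the cluster.)

      kubectl get csr -o name | xargs kubectl certificate approve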

What you expected to happen:

Re-use of the existing cert stored in the Kubernetes secret.

How to reproduce it (as minimally and precisely as possible):

      /webhook
      --in-cluster
      --namespace=kube-system
      --service-name=aws-pod-id-production-aws-pod-identity-webhook
      --tls-secret=pod-identity-webhook
      --annotation-prefix=iam.amazonaws.com
      --token-audience=sts.amazonaws.com
      --logtostderr

Environment:

  • AWS Region: us-east-1
  • EKS Platform version (if using EKS, run aws eks describe-cluster --name <name> --query cluster.platformVersion): NA
  • Kubernetes version (if using EKS, run aws eks describe-cluster --name <name> --query cluster.version): 1.15.6
  • Webhook Version: v0.1.0
@micahhausler
Member

Did you approve an initial CSR for the webhook? This is a required step on install. The Makefile does this for you on the initial install, but if you installed via other means, that step may have been missed.
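If it helps, the manual equivalent of what the Makefile does is roughly the following (the CSR name is a placeholder, take it from the kubectl get csr output):

      kubectl get csr
      kubectl certificate approve <csr-name>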

@irlevesque
Author

Yes, I did approve the initial CSR, as well as all subsequent CSRs that are created.

@irlevesque
Author

I should note that this is happening on two separate clusters that we manage.

@micahhausler
Member

Can you post an approved and issued CSR as YAML? This should include the CSR and public cert (both base64-encoded).

How many CSRs are you seeing generated in an hour?

Are you approving and getting the certificate signed within 15 minutes of the CSR request?
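Something like this should produce what I'm after (csr-28lr7 is just one of the names from your list above; spec.request holds the base64-encoded CSR and status.certificate the issued cert once it has been approved and signed):

      kubectl get csr csr-28lr7 -o yaml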

@irlevesque
Author

I have a suspicion about what's causing this issue.

When the webhook pod restarts while the cluster is under heavy load, the k8s API server may be temporarily degraded or unavailable. This leaves the pod unable to retrieve the stored secret, which kicks off the routine that creates a new CSR and stores a new secret, overwriting the existing (valid) cert.

I'll try to get some logs for the next time this happens. A workaround may be to configure the app for the "out of cluster" tlsKeyFile and tlsCertFile, which can be managed out-of-band.
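A rough sketch of what I mean (the --tls-cert/--tls-key flag names are assumed from the tlsKeyFile/tlsCertFile settings and should be checked against /webhook --help; the secret name and file names are placeholders):

      # create the secret from an out-of-band key pair (file names are placeholders)
      kubectl -n kube-system create secret tls webhook-tls \
        --cert=webhook.crt --key=webhook.key

      # mount that secret at /etc/webhook/certs in the deployment, then run the
      # webhook against the files instead of the CSR/secret flow
      /webhook \
        --in-cluster=false \
        --tls-cert=/etc/webhook/certs/tls.crt \
        --tls-key=/etc/webhook/certs/tls.key \
        --namespace=kube-system \
        --service-name=aws-pod-id-production-aws-pod-identity-webhook \
        --annotation-prefix=iam.amazonaws.com \
        --token-audience=sts.amazonaws.com \
        --logtostderr

The serving cert would still need to be valid for the service DNS name (aws-pod-id-production-aws-pod-identity-webhook.kube-system.svc), and the CA that signed it has to end up in the MutatingWebhookConfiguration's caBundle.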

@irlevesque
Author

Just caught this happening again, and sure enough the logs seem to indicate that this is the scenario playing out:

E0225 20:44:03.691595       1 store.go:58] Error fetching secret: Get https://10.96.0.1:443/api/v1/namespaces/kube-system/secrets/pod-identity-webhook: dial tcp 10.96.0.1:443: i/o timeout

I0225 20:44:03.691998       1 main.go:173] Creating server

E0225 20:44:03.692317       1 reflector.go:126] pkg/mod/k8s.io/client-go@v11.0.1-0.20190606204521-b8faab9c5193+incompatible/tools/cache/reflector.go:94: Failed to list *v1.ServiceAccount: Get https://10.96.0.1:443/api/v1/serviceaccounts?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout

E0225 20:44:12.419005       1 reflector.go:270] pkg/mod/k8s.io/client-go@v11.0.1-0.20190606204521-b8faab9c5193+incompatible/tools/cache/reflector.go:94: Failed to watch *v1.ServiceAccount: Get https://10.96.0.1:443/api/v1/serviceaccounts?resourceVersion=264646980&timeoutSeconds=427&watch=true: dial tcp 10.96.0.1:443: connect: connection refused

E0225 20:44:12.419224       1 reflector.go:270] pkg/mod/k8s.io/client-go@v11.0.1-0.20190606204521-b8faab9c5193+incompatible/tools/cache/reflector.go:94: Failed to watch *v1beta1.CertificateSigningRequest: Get https://10.96.0.1:443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?fieldSelector=metadata.name%3Dcsr-4mhht&resourceVersion=264647936&timeout=7m11s&timeoutSeconds=431&watch=true: dial tcp 10.96.0.1:443: connect: connection refused

2020/02/25 20:44:13 http: TLS handshake error from 172.19.201.214:37378: no serving certificate available for the webhook, is the CSR approved?

@mattsawyer77

hi @irlevesque, we're having what seems to be the exact same issue.

> I'll try to get some logs for the next time this happens. A workaround may be to configure the app for the "out of cluster" tlsKeyFile and tlsCertFile, which can be managed out-of-band.

Out of curiosity, did you find a workaround for this yet?

@mcristina422

An example of out-of-band management with tlsKeyFile and tlsCertFile is here: #94 (comment)
