
Restarted pods request new CSR instead of using existing cert in k8s secret #35

Open
irlevesque opened this issue Feb 13, 2020 · 8 comments

@irlevesque

What happened:

Pods aren't annotated due to a webhook service availability issue:

...
E0213 16:11:52.169236       1 certificate_manager.go:396] Certificate request was not signed: timed out waiting for the condition
E0213 16:27:20.169637       1 certificate_manager.go:396] Certificate request was not signed: timed out waiting for the condition
2020/02/13 16:30:29 http: TLS handshake error from 10.244.29.0:60094: no serving certificate available for the webhook, is the CSR approved?

This appears to happen when the pod restarts, causing mass CSR requests:

NAME        AGE     REQUESTOR                                                                          CONDITION
csr-28lr7   3h8m    system:serviceaccount:kube-system:aws-pod-id-production-aws-pod-identity-webhook   Pending
csr-2fk56   4h53m   system:serviceaccount:kube-system:aws-pod-id-production-aws-pod-identity-webhook   Pending
csr-2jvt7   3h39m   system:serviceaccount:kube-system:aws-pod-id-production-aws-pod-identity-webhook   Pending
...
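(The list above is kubectl get csr output. The pending requests can be approved in bulk with something like the one-liner below; note that it approves every listed CSR, so it's only safe if nothing else is pending in the cluster.)

      kubectl get csr -o name | xargs kubectl certificate approve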

What you expected to happen:

Re-use of the existing cert stored in the Kubernetes secret.

How to reproduce it (as minimally and precisely as possible):

      /webhook
      --in-cluster
      --namespace=kube-system
      --service-name=aws-pod-id-production-aws-pod-identity-webhook
      --tls-secret=pod-identity-webhook
      --annotation-prefix=iam.amazonaws.com
      --token-audience=sts.amazonaws.com
      --logtostderr

Environment:

  • AWS Region: us-east-1
  • EKS Platform version (if using EKS, run aws eks describe-cluster --name <name> --query cluster.platformVersion): NA
  • Kubernetes version (if using EKS, run aws eks describe-cluster --name <name> --query cluster.version): 1.15.6
  • Webhook Version: v0.1.0
@micahhausler
Member

Did you approve an initial CSR for the webhook? This is a required step on install. The Makefile does this for you on the initial install, but if you installed via other means, that step may have been missed.
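If it helps, the manual equivalent of what the Makefile does is roughly the following (the CSR name is a placeholder, take it from the kubectl get csr output):

      kubectl get csr
      kubectl certificate approve <csr-name>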

@irlevesque
Author

Yes, I did approve the initial CSR, as well as all subsequent CSRs that are created.

@irlevesque
Author

I should note that this is happening on two separate clusters that we manage.

@micahhausler
Member

Can you post an approved and issued CSR as YAML? This should include the CSR and public cert (both base64-encoded).

How many CSRs are you seeing generated in an hour?

Are you approving and getting the certificate signed within 15 minutes of the CSR request?
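Something like this should produce what I'm after (csr-28lr7 is just one of the names from your list above; spec.request holds the base64-encoded CSR and status.certificate the issued cert once it has been approved and signed):

      kubectl get csr csr-28lr7 -o yaml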

@irlevesque
Author

I have a suspicion about what's causing this issue.

When the webhook pod restarts while the cluster is under heavy load, the k8s API server may be temporarily degraded or unavailable. This leaves the pod unable to retrieve the stored secret, which kicks off the routine that creates a new CSR and stores a new secret, overwriting the existing (valid) cert.

I'll try to get some logs for the next time this happens. A workaround may be to configure the app for the "out of cluster" tlsKeyFile and tlsCertFile, which can be managed out-of-band.
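A rough sketch of what I mean (the --tls-cert/--tls-key flag names are assumed from the tlsKeyFile/tlsCertFile settings and should be checked against /webhook --help; the secret name and file names are placeholders):

      # create the secret from an out-of-band key pair (file names are placeholders)
      kubectl -n kube-system create secret tls webhook-tls \
        --cert=webhook.crt --key=webhook.key

      # mount that secret at /etc/webhook/certs in the deployment, then run the
      # webhook against the files instead of the CSR/secret flow
      /webhook \
        --in-cluster=false \
        --tls-cert=/etc/webhook/certs/tls.crt \
        --tls-key=/etc/webhook/certs/tls.key \
        --namespace=kube-system \
        --service-name=aws-pod-id-production-aws-pod-identity-webhook \
        --annotation-prefix=iam.amazonaws.com \
        --token-audience=sts.amazonaws.com \
        --logtostderr

The serving cert would still need to be valid for the service DNS name (aws-pod-id-production-aws-pod-identity-webhook.kube-system.svc), and the CA that signed it has to end up in the MutatingWebhookConfiguration's caBundle.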

@irlevesque
Author

Just caught this happening again, and sure enough the logs seem to indicate that this is the scenario playing out:

E0225 20:44:03.691595       1 store.go:58] Error fetching secret: Get https://10.96.0.1:443/api/v1/namespaces/kube-system/secrets/pod-identity-webhook: dial tcp 10.96.0.1:443: i/o timeout

I0225 20:44:03.691998       1 main.go:173] Creating server

E0225 20:44:03.692317       1 reflector.go:126] pkg/mod/k8s.io/client-go@v11.0.1-0.20190606204521-b8faab9c5193+incompatible/tools/cache/reflector.go:94: Failed to list *v1.ServiceAccount: Get https://10.96.0.1:443/api/v1/serviceaccounts?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout

E0225 20:44:12.419005       1 reflector.go:270] pkg/mod/k8s.io/client-go@v11.0.1-0.20190606204521-b8faab9c5193+incompatible/tools/cache/reflector.go:94: Failed to watch *v1.ServiceAccount: Get https://10.96.0.1:443/api/v1/serviceaccounts?resourceVersion=264646980&timeoutSeconds=427&watch=true: dial tcp 10.96.0.1:443: connect: connection refused

E0225 20:44:12.419224       1 reflector.go:270] pkg/mod/k8s.io/client-go@v11.0.1-0.20190606204521-b8faab9c5193+incompatible/tools/cache/reflector.go:94: Failed to watch *v1beta1.CertificateSigningRequest: Get https://10.96.0.1:443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?fieldSelector=metadata.name%3Dcsr-4mhht&resourceVersion=264647936&timeout=7m11s&timeoutSeconds=431&watch=true: dial tcp 10.96.0.1:443: connect: connection refused

2020/02/25 20:44:13 http: TLS handshake error from 172.19.201.214:37378: no serving certificate available for the webhook, is the CSR approved?

@mattsawyer77

hi @irlevesque, we're having what seems to be the exact same issue.

> I'll try to get some logs for the next time this happens. A workaround may be to configure the app for the "out of cluster" tlsKeyFile and tlsCertFile, which can be managed out-of-band.

Out of curiosity, did you find a workaround for this yet?

@mcristina422

An example of out-of-band management with tlsKeyFile and tlsCertFile is here: #94 (comment)
