VirtualGateways should not break with a missing TLS certificate #6957
This issue actually spawned two production incidents last week, so this one is pretty high priority for Xero. It would be great if this could be looked at sooner.
Related issue: https://solo-io.zendesk.com/agent/tickets/2123
There are a few dimensions to this issue, and I want to make sure we break them down so we can better understand the challenges and then come up with a desired solution. To me there are a few aspects at play:
**Delayed Certificate Issuance**

Can you help me understand how this plays a role here? I would imagine that cert-manager (as an example) doesn't delete the secret, but instead modifies it in place when there is an update. I'm happy to learn more about the internals of what takes place, but can you share what the expected order of events is and where in there Secrets are being deleted?

**Deleting Secrets That Are Valid**

I agree that we should not support this. There are a few cases that I think are worth considering:
**Proxy restarts causing differing behavior in the new pod from the previous one**

I observed a similar behavior when configuring a single VS. I believe this is because, if you invalidate the VS (and thus the filter chain), the listener is no longer valid and is not updated. As a result, the "old", previously working TLS config is still stored in memory in the proxy. When a new pod starts up, it requests the config from gloo and doesn't pick up any listeners, because the xDS cache has no valid listeners. However, I noticed that when I had multiple virtual services, invalidating one resulted in the listener being updated by removing the now-invalid filter chain; since there were still >0 valid filter chains, the listener kept functioning. This to me feels like the proper behavior, but the open question is how we want to handle the case where a listener has 0 valid virtual hosts (because each has an invalid TLS config).

**The listener should keep functioning for all connections where a proper cert exists**

To my understanding, this is the current behavior. It feels like the open questions are:
I feel that there is some context missing here, as you mention a secret being deleted, which I don't see in this issue (maybe it's part of the zendesk ticket that is linked, but I do not have access to it), so our situation might be a bit different, but I think this is similar to an issue we're having. Let me know if you prefer that I create a separate issue.

**Delayed Certificate Issuance**

AFAIK this is only an issue when creating a new VirtualService and associated cert, not on deletion or updates. The way nginx-ingress solves this problem is by serving a default self-signed certificate if the secret cannot be found.

**Deleting Secrets That Are Valid**

I haven't had this issue. It seems reasonable to have a validation webhook on secret deletion, but, as you mentioned, with a flag to disable it for backwards compatibility. One issue I have with the validation webhook approach is that it introduces a notion of dependency (the VirtualService has to be deleted first, then the secret). This can make simple workflows harder, e.g. a single `kubectl delete -f` on a file containing both resources.

Note that this is also true at creation time with some of the existing validation hooks. Here's an example of applying a single yaml file that contains a VirtualService and a cert-manager Certificate.
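Roughly, such a combined manifest looks like the sketch below (hostnames and resource names are illustrative); applied in one shot with strict validation (as in the reproduction further down, `alwaysAcceptResources=false` and `allowWarnings=false`), the VirtualService references a Secret that cert-manager has not created yet, so it gets rejected.

```yaml
# Illustrative sketch: a Certificate and a VirtualService applied from one file.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app.example.com
  namespace: gloo-system
spec:
  secretName: app.example.com   # the Secret only exists once issuance completes
  dnsNames:
    - app.example.com
  issuerRef:
    name: my-ca-issuer
    kind: ClusterIssuer
---
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: app
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - app.example.com
    routes:
      - matchers:
          - prefix: /
        directResponseAction:
          status: 200
          body: hello
  sslConfig:
    secretRef:
      name: app.example.com     # referenced before the Secret exists
      namespace: gloo-system
    sniDomains: ["app.example.com"]
```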
To work around this, one would have to either:
Serving a default cert when the secret cannot be found would be a simple way to solve this issue. The validation webhook could maybe just throw a warning instead of failing?

**Proxy restarts causing differing behavior in the new pod from the previous one / The listener should keep functioning for all connections where a proper cert exists**

Merging these two questions, as I'm not sure which one this falls into, but what happened to us was that the gloo Gateway was in an error state (because the certificate couldn't be fetched and thus the TLS secret couldn't be found).
(This was tested on gloo OSS 1.13.30.) This can be easily reproduced by creating a VirtualService referencing a secret that doesn't exist (with …). I do have other https VirtualServices that have a good config, so it's not that we ended up with 0 valid routes. Here are the logs on gateway-proxy when this happens (I removed the deprecation warnings):
Logs for the gloo pod (note that the timestamps won't match, as I had to restart the pod again to capture the logs):
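For anyone reproducing this, the equivalent logs and resource status can be pulled with something like the following (deployment names assume a default `gloo-system` install):

```sh
# Proxy (Envoy) logs and control-plane logs
kubectl logs -n gloo-system deploy/gateway-proxy
kubectl logs -n gloo-system deploy/gloo
# glooctl summarizes resources that Gloo has rejected or errored
glooctl check
```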
I just found out that cert-manager is able to issue a temporary certificate by setting the `cert-manager.io/issue-temporary-certificate: "true"` annotation on the Certificate. This works nicely as a workaround, and it actually makes a lot more sense for cert-manager to do that rather than gloo. I think there's still work to do on the gloo side:
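As a sketch of that workaround (resource names are illustrative, mirroring the reproduction below), the annotation goes on the Certificate itself:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test-1.solo.io
  namespace: gloo-system
  annotations:
    # cert-manager keeps a temporary self-signed certificate in the Secret
    # until the real certificate is issued, so the Secret is never missing.
    cert-manager.io/issue-temporary-certificate: "true"
spec:
  secretName: test-1.solo.io
  dnsNames:
    - test-1.solo.io
  issuerRef:
    name: my-ca-issuer
    kind: ClusterIssuer
```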
Another related issue
@sam-heilbron to look at this issue for scope/estimation.
Zendesk ticket #3753 has been linked to this issue.
**Reproduction**

I have been able to reproduce this with the following steps (assuming a running cluster):

```sh
kubectl create namespace cert-manager
kubectl apply --validate=false -f https://github.com/cert-manager/cert-manager/releases/download/v1.11.0/cert-manager.yaml
helm repo add gloo https://storage.googleapis.com/solo-public-helm
helm repo update
helm install -n gloo-system gloo gloo/gloo --set gateway.persistProxySpec=true --set gateway.validation.alwaysAcceptResources=false --set gateway.validation.allowWarnings=false
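# Create a self-signed root CA with cert-manager and a ClusterIssuer that signs with it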
kubectl apply -f - << EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: selfsigned-issuer
spec:
selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: my-selfsigned-ca
namespace: cert-manager
spec:
isCA: true
commonName: my-selfsigned-ca
secretName: root-secret
privateKey:
algorithm: ECDSA
size: 256
issuerRef:
name: selfsigned-issuer
kind: ClusterIssuer
group: cert-manager.io
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: my-ca-issuer
spec:
ca:
secretName: root-secret
EOF
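# Issue a certificate for test-1.solo.io and create a VirtualService whose sslConfig references its Secret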
kubectl apply -f - << EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: test-1.solo.io
namespace: gloo-system
spec:
secretName: test-1.solo.io
dnsNames:
- test-1.solo.io
issuerRef:
name: my-ca-issuer
kind: ClusterIssuer
---
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
name: direct-response-ssl-1
namespace: gloo-system
spec:
virtualHost:
domains:
- test-1.solo.io
- test-1.solo.io:8443
routes:
- matchers:
- prefix: /
directResponseAction:
status: 200
body: you did it, number 1!
sslConfig:
secretRef:
name: test-1.solo.io
namespace: gloo-system
oneWayTls: true
sniDomains: [ "test-1.solo.io" ]
EOF
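# Port-forward the proxy and confirm the TLS route answers with the issued certificate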
kubectl port-forward -n gloo-system deploy/gateway-proxy 8443
curl -vk https://test-1.solo.io:8443
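# Relax validation so that deleting the Certificate and Secret below is not blocked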
kubectl patch settings -n gloo-system default --type merge -p '{"spec":{"gateway":{"validation":{"alwaysAccept":true}}}}'
sleep 1
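# Delete the Certificate and its Secret while the VirtualService still references them, then watch the route break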
kubectl delete Certificate -n gloo-system test-1.solo.io && kubectl delete secret -n gloo-system test-1.solo.io
watch curl -vk https://test-1.solo.io:8443
```

**An odd state**
This work will be released in the following versions:
- Gloo Edge:
- Gloo Edge Enterprise:

This work will not be available in 1.14 versions of Gloo Edge because the validating webhook behavior was significantly different, and I was unable to reproduce the issue on 1.14.
This work is merged into all relevant versions as detailed in the comment above.
**Gloo Edge Version**

1.11.x (latest stable)

**Kubernetes Version**

1.23.x

**Describe the bug**
At times, when certificate issuance in cert-manager is slow (for example because the DNS challenge isn't solved yet) and the TLS secret is therefore not yet available, Gloo Edge will go into a bad state when mounting the TLS config. This is especially true if the proxy restarts, as this causes the HTTPS listener to disappear and will cause an outage.
The listener should keep functioning for all connections where a proper cert exists.
**Steps to reproduce the bug**

Observed when deleting a referenced certificate.

**Expected Behavior**

Expect the proxy to behave correctly for all other connections where there is a valid certificate.

**Additional Context**

No response