Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persist TLS certificate and key of antrea-controller #5955

Merged
merged 1 commit into from
Feb 7, 2024

Conversation

tnqn
Copy link
Member

@tnqn tnqn commented Feb 1, 2024

In clusters where upgrade is performed with rolling update of Nodes and images of new versions are only available on new Nodes, the deployment strategy of antrea-controller is set to RollingUpdate to prevent antrea-controller Pod from being deleted immediately when the deployment is updated, leading to a period in which no antrea-controller is running. However, it also causes two instances of antrea-controller to run simultaneously in a short time, making it possible that the old instance overrides the CA bundle stored in antrea-ca ConfigMap, APIServices, and Webhooks, while the new instance won't update them again.

The commit makes two changes to fix the problem:

  1. CACertController will periodically sync the CA cert to improve the robustness.

  2. Self-signed TLS certificate and key of antrea-controller will be stored in a Secret and will be reused after restarting controller. This makes running multiple antrea-controller instances simultaneously possible and makes restart of antrea-controller smoother as antrea-agents don't need to retrieve a new CA bundle most of the time.

Besides, the change is helpful for implementing high-availability of antrea-controller in the future.

@tnqn tnqn added the action/release-note Indicates a PR that should be included in release notes. label Feb 1, 2024
@tnqn tnqn force-pushed the tolerate-ephemeral-multi-controllers branch 7 times, most recently from d003e37 to a34f717 Compare February 2, 2024 16:22
@tnqn tnqn marked this pull request as ready for review February 2, 2024 16:27
}
remainingDuration := certs[0].NotAfter.Sub(p.clock.Now())
if remainingDuration < p.caConfig.MinValidationDuration {
klog.InfoS("The remaining duration of the TLS certificate and key is less than max rotate duration", "remaining", remainingDuration, "max", p.caConfig.MinValidationDuration)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the log message still appropriate after replacing MaxRotateDuration with MinValidationDuration (same for the "max" log key)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for reminding, forgot updating this one

ServiceName string
PairName string
// MinValidationDuration is the minimal remaining validation duration for the self-signed certificate. It must be
// rotated once shorter than the duration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// rotated once shorter than the duration.
// rotated once the time until the certificate expires becomes shorter than this duration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -113,7 +113,7 @@ func getCaConfig(isLeader bool, controllerNs string) *certificate.CAConfig {
MutationWebhookSelector: getWebhookLabel(isLeader, controllerNs),
ValidatingWebhookSelector: getWebhookLabel(isLeader, controllerNs),
CertReadyTimeout: 2 * time.Minute,
MaxRotateDuration: time.Hour * (24 * 365),
MinValidationDuration: time.Hour * 24 * 90, // Rotate the certificate 90 days in advance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I feel like MinValidDuration would be a more "correct" name

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, I was thinking valid but somehow wrote validation.

Comment on lines 80 to 81
secureServing.ServerCert.CertKey.CertFile = path.Join(caConfig.SelfSignedCertDir, caConfig.PairName+".crt")
secureServing.ServerCert.CertKey.KeyFile = path.Join(caConfig.SelfSignedCertDir, caConfig.PairName+".key")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the path/filepath package should be used here instead of path, given that these are actual filepaths AFAIK

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

caConfig := &CAConfig{
TLSSecretName: tlsSecretName,
SelfSignedCertDir: t.TempDir(),
MinValidationDuration: maxRotateDuration,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're using the maxRotateDuration term here still, yet you use the parameter value to initialize MinValidationDuration. But the 2 mean different things right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, forgot to update after renaming it

Comment on lines +98 to +108
provider.secretInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) { provider.enqueue() },
UpdateFunc: func(_, _ interface{}) { provider.enqueue() },
DeleteFunc: func(obj interface{}) { provider.enqueue() },
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to check my understanding: normally there is no reason for the secret to be modified and for these handlers to be invoked, as the Controller should be the only entity modifying the secret?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the corner case that the certificate needs rotation when upgrading. With the handling, regardless of which instance updates the secret first, the other one will switch to it and stop generating a new one.
In the future when HA is implemented, we should only let the active instance rotate the certificate, and the standby instances should refresh its certificate immediately, so the even handlers are also needed for that case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a comment would be a good idea?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

secretBeforeRestart, err := data.clientset.CoreV1().Secrets(tlsSecretNamespace).Get(context.TODO(), tlsSecretName, metav1.GetOptions{})
require.NoError(t, err)

testCert(t, data, string(secretBeforeRestart.Data[certificate.TLSCertFile]), true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so in the self-signed certificate case, the published CA bundle (published to the ConfigMap) is equal to the certificate itself (stored in the Secret)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, self-signed certificate util always include both CA cert and signed cert in server cert. We can choose to only publish CA cert for server authentication, but never made the change. We discussed this before in #5135 (comment)

@antoninbas
Copy link
Contributor

@tnqn

However, it also causes two instances of antrea-controller to run simultaneously in a short time, making it possible that the old instance overrides the CA bundle stored in antrea-ca ConfigMap, APIServices, and Webhooks, while the new instance won't update them again.

Should we backport this then? It seems that there is no automatic recovery in this case, until the Controller is manually restarted. Could this explain some issues we saw over the years, with APIServices / Webhooks not working because of TLS issues?

@@ -171,6 +169,7 @@ func (c *CACertController) syncConversionWebhooks(caCert []byte) error {
crdDef.Spec.Conversion.Webhook.ClientConfig.CABundle = caCert
}
if updated {
klog.InfoS("Syncing CA certificate with CRD that have conversion webhooks", "name", crdDef.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.InfoS("Syncing CA certificate with CRD that have conversion webhooks", "name", crdDef.Name)
klog.InfoS("Syncing CA certificate with CRD that have conversion webhook", "name", crdDef.Name)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a CRD can have multiple conversion webhooks.

}

func (p *selfSignedCertProvider) enqueue() {
// The key can be anything as we only have single item.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// The key can be anything as we only have single item.
// The key can be anything as we only have a single item.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

var err error
var secret *corev1.Secret
// If Secret is specified, we should prioritize it.
if p.caConfig.TLSSecretName != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe define a new variable hasTLSSecretName=p.caConfig.TLSSecretName != "" which can be used in following codes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel they are equivalently readable and there are only two such checks. It seems a bit verbose to me to define a variable.

if p.shouldRotateCertificate(cert) {
klog.InfoS("Generating self signed cert")
if cert, key, err = generateSelfSignedCertKey(p.caConfig.ServiceName, loopbackAddresses, GetAntreaServerNames(p.caConfig.ServiceName)); err != nil {
return fmt.Errorf("unable to generate self signed cert: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "self-signed cert" is a more common one. Probably replace the self signed with self-signed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

if bytes.Equal(cert, p.cert) && bytes.Equal(key, p.key) {
return nil
}
klog.InfoS("Writing certificate and key to cert directory")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.InfoS("Writing certificate and key to cert directory")
klog.InfoS("Writing certificate and key to the cert directory")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

func TestSelfSignedCertProviderRun(t *testing.T) {
secretNamespace := env.GetAntreaNamespace()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we can simply set this as "kube-system" considering it's for testing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the correctness of this test would be subject to the value of defaultAntreaNamespace.
I would prefer not to introduce obsecure dependency unnecessarily. The provider is implemented to not assume the Pod must run in "kube-system", the test code shouldn't assume that too. To make it more clear, I have changed the code to set the env var in the beginning of the test.

@tnqn tnqn force-pushed the tolerate-ephemeral-multi-controllers branch from a34f717 to c2bbf66 Compare February 4, 2024 10:04
@tnqn
Copy link
Member Author

tnqn commented Feb 4, 2024

However, it also causes two instances of antrea-controller to run simultaneously in a short time, making it possible that the old instance overrides the CA bundle stored in antrea-ca ConfigMap, APIServices, and Webhooks, while the new instance won't update them again.

Should we backport this then? It seems that there is no automatic recovery in this case, until the Controller is manually restarted. Could this explain some issues we saw over the years, with APIServices / Webhooks not working because of TLS issues?

There was a time the certificate was overridden by the deployment tool kapp when it's used. I don't know what was the cause for other occurrences. Even for the latest report, we only know there was an old controller overwriting the CA bundle with its own one generated two days ago, but this shouldn't happen as overwriting the ConfigMap is one-time job and an old instance will only do it again after 1/2 of the valid duration (half a year) or it never succeeded to update all resources' CA bundle in the last two days.

The change is compatible when upgrading from an old controller to patched controller, as well as patched controller to patched controller. However, it may cause issue if upgrading from patch controller to unpatched controller. For example, if we release 1.14.3 with the patch, and users upgrade 1.14.3 to 1.15.0, there would be an issue. Not sure if we can assume users will always upgrade to latest patch version of a minor release. But it's pretty sure this only happens when the deployment strategy is changed to RollingUpdate.

@tnqn tnqn force-pushed the tolerate-ephemeral-multi-controllers branch from c2bbf66 to af8b178 Compare February 5, 2024 15:31
@antoninbas
Copy link
Contributor

But it's pretty sure this only happens when the deployment strategy is changed to RollingUpdate.

Thanks, I missed that earlier. Didn't realize it could only happen for custom installations.

antoninbas
antoninbas previously approved these changes Feb 5, 2024
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +98 to +108
provider.secretInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) { provider.enqueue() },
UpdateFunc: func(_, _ interface{}) { provider.enqueue() },
DeleteFunc: func(obj interface{}) { provider.enqueue() },
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a comment would be a good idea?

}

func newVerifyOptions(caBundle []byte) *x509.VerifyOptions {
// We don't really use the CA bundle to verify clients, this is just to follow DynamicFileCAContent.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to double-check: with this, you mean that the generated cert is never used for client-side auth, only server-side auth?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generated cert is used for client-side server authentication. The server, antrea-controller, doesn't validate clients' TLS using this cert and it can't use it because none of the client cert is signed by it. It uses token to validate clients.

I added the comment because it sets the key usage to "ExtKeyUsageClientAuth" which might be confusing for people who wonders how can this CA bundle be used for server-side client authentication as we never generate any client cert. It's to follow DynamicFileCAContent's implementation.The verify option will never be used actually.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only server-side auth

That was poor phrasing on my part. I indeed meant server authentication on the client-side...

pkg/apiserver/certificate/selfsignedcert_provider_test.go Outdated Show resolved Hide resolved
In clusters where upgrade is performed with rolling update of Nodes and
images of new versions are only available on new Nodes, the deployment
strategy of antrea-controller is set to RollingUpdate to prevent
antrea-controller Pod from being deleted immediately when the deployment
is updated, leading to a period in which no antrea-controller is
running. However, it also causes two instances of antrea-controller to
run simultaneously in a short time, making it possible that the old
instance overrides the CA bundle stored in antrea-ca ConfigMap,
APIServices, and Webhooks, while the new instance won't update them
again.

The commit makes two changes to fix the problem:

1. CACertController will periodically sync the CA cert to improve the
   robustness.

2. Self-signed TLS certificate and key of antrea-controller will be
   stored in a Secret and will be reused after restarting controller.
   This makes running multiple antrea-controller instances
   simultaneously possible and makes restart of antrea-controller
   smoother as antrea-agents don't need to retrieve a new CA bundle most
   of the time.

Besides, the change is helpful for implementing high-availability of
antrea-controller in the future.

Signed-off-by: Quan Tian <qtian@vmware.com>
@tnqn
Copy link
Member Author

tnqn commented Feb 7, 2024

/test-all

@tnqn tnqn merged commit f877806 into antrea-io:main Feb 7, 2024
51 of 54 checks passed
@tnqn tnqn deleted the tolerate-ephemeral-multi-controllers branch February 7, 2024 13:33
tnqn added a commit to tnqn/antrea that referenced this pull request Mar 6, 2024
In clusters where upgrade is performed with rolling update of Nodes and
images of new versions are only available on new Nodes, the deployment
strategy of antrea-controller is set to RollingUpdate to prevent
antrea-controller Pod from being deleted immediately when the deployment
is updated, leading to a period in which no antrea-controller is
running. However, it also causes two instances of antrea-controller to
run simultaneously in a short time, making it possible that the old
instance overrides the CA bundle stored in antrea-ca ConfigMap,
APIServices, and Webhooks, while the new instance won't update them
again.

The commit makes two changes to fix the problem:

1. CACertController will periodically sync the CA cert to improve the
   robustness.

2. Self-signed TLS certificate and key of antrea-controller will be
   stored in a Secret and will be reused after restarting controller.
   This makes running multiple antrea-controller instances
   simultaneously possible and makes restart of antrea-controller
   smoother as antrea-agents don't need to retrieve a new CA bundle most
   of the time.

Besides, the change is helpful for implementing high-availability of
antrea-controller in the future.

Signed-off-by: Quan Tian <qtian@vmware.com>
tnqn added a commit that referenced this pull request Mar 7, 2024
In clusters where upgrade is performed with rolling update of Nodes and
images of new versions are only available on new Nodes, the deployment
strategy of antrea-controller is set to RollingUpdate to prevent
antrea-controller Pod from being deleted immediately when the deployment
is updated, leading to a period in which no antrea-controller is
running. However, it also causes two instances of antrea-controller to
run simultaneously in a short time, making it possible that the old
instance overrides the CA bundle stored in antrea-ca ConfigMap,
APIServices, and Webhooks, while the new instance won't update them
again.

The commit makes two changes to fix the problem:

1. CACertController will periodically sync the CA cert to improve the
   robustness.

2. Self-signed TLS certificate and key of antrea-controller will be
   stored in a Secret and will be reused after restarting controller.
   This makes running multiple antrea-controller instances
   simultaneously possible and makes restart of antrea-controller
   smoother as antrea-agents don't need to retrieve a new CA bundle most
   of the time.

Besides, the change is helpful for implementing high-availability of
antrea-controller in the future.

Signed-off-by: Quan Tian <qtian@vmware.com>
tnqn added a commit to tnqn/antrea that referenced this pull request Mar 7, 2024
In clusters where upgrade is performed with rolling update of Nodes and
images of new versions are only available on new Nodes, the deployment
strategy of antrea-controller is set to RollingUpdate to prevent
antrea-controller Pod from being deleted immediately when the deployment
is updated, leading to a period in which no antrea-controller is
running. However, it also causes two instances of antrea-controller to
run simultaneously in a short time, making it possible that the old
instance overrides the CA bundle stored in antrea-ca ConfigMap,
APIServices, and Webhooks, while the new instance won't update them
again.

The commit makes two changes to fix the problem:

1. CACertController will periodically sync the CA cert to improve the
   robustness.

2. Self-signed TLS certificate and key of antrea-controller will be
   stored in a Secret and will be reused after restarting controller.
   This makes running multiple antrea-controller instances
   simultaneously possible and makes restart of antrea-controller
   smoother as antrea-agents don't need to retrieve a new CA bundle most
   of the time.

Besides, the change is helpful for implementing high-availability of
antrea-controller in the future.

Signed-off-by: Quan Tian <qtian@vmware.com>
tnqn added a commit to tnqn/antrea that referenced this pull request Mar 7, 2024
In clusters where upgrade is performed with rolling update of Nodes and
images of new versions are only available on new Nodes, the deployment
strategy of antrea-controller is set to RollingUpdate to prevent
antrea-controller Pod from being deleted immediately when the deployment
is updated, leading to a period in which no antrea-controller is
running. However, it also causes two instances of antrea-controller to
run simultaneously in a short time, making it possible that the old
instance overrides the CA bundle stored in antrea-ca ConfigMap,
APIServices, and Webhooks, while the new instance won't update them
again.

The commit makes two changes to fix the problem:

1. CACertController will periodically sync the CA cert to improve the
   robustness.

2. Self-signed TLS certificate and key of antrea-controller will be
   stored in a Secret and will be reused after restarting controller.
   This makes running multiple antrea-controller instances
   simultaneously possible and makes restart of antrea-controller
   smoother as antrea-agents don't need to retrieve a new CA bundle most
   of the time.

Besides, the change is helpful for implementing high-availability of
antrea-controller in the future.

Signed-off-by: Quan Tian <qtian@vmware.com>
tnqn added a commit that referenced this pull request Mar 8, 2024
In clusters where upgrade is performed with rolling update of Nodes and
images of new versions are only available on new Nodes, the deployment
strategy of antrea-controller is set to RollingUpdate to prevent
antrea-controller Pod from being deleted immediately when the deployment
is updated, leading to a period in which no antrea-controller is
running. However, it also causes two instances of antrea-controller to
run simultaneously in a short time, making it possible that the old
instance overrides the CA bundle stored in antrea-ca ConfigMap,
APIServices, and Webhooks, while the new instance won't update them
again.

The commit makes two changes to fix the problem:

1. CACertController will periodically sync the CA cert to improve the
   robustness.

2. Self-signed TLS certificate and key of antrea-controller will be
   stored in a Secret and will be reused after restarting controller.
   This makes running multiple antrea-controller instances
   simultaneously possible and makes restart of antrea-controller
   smoother as antrea-agents don't need to retrieve a new CA bundle most
   of the time.

Besides, the change is helpful for implementing high-availability of
antrea-controller in the future.

Signed-off-by: Quan Tian <qtian@vmware.com>
tnqn added a commit that referenced this pull request Mar 8, 2024
In clusters where upgrade is performed with rolling update of Nodes and
images of new versions are only available on new Nodes, the deployment
strategy of antrea-controller is set to RollingUpdate to prevent
antrea-controller Pod from being deleted immediately when the deployment
is updated, leading to a period in which no antrea-controller is
running. However, it also causes two instances of antrea-controller to
run simultaneously in a short time, making it possible that the old
instance overrides the CA bundle stored in antrea-ca ConfigMap,
APIServices, and Webhooks, while the new instance won't update them
again.

The commit makes two changes to fix the problem:

1. CACertController will periodically sync the CA cert to improve the
   robustness.

2. Self-signed TLS certificate and key of antrea-controller will be
   stored in a Secret and will be reused after restarting controller.
   This makes running multiple antrea-controller instances
   simultaneously possible and makes restart of antrea-controller
   smoother as antrea-agents don't need to retrieve a new CA bundle most
   of the time.

Besides, the change is helpful for implementing high-availability of
antrea-controller in the future.

Signed-off-by: Quan Tian <qtian@vmware.com>
@tnqn tnqn added the kind/bug Categorizes issue or PR as related to a bug. label Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/release-note Indicates a PR that should be included in release notes. kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants