API-1802: cert-rotation: allow specifying multiple target certs in CertRotationController #1722
base: master
Conversation
vrutkovs force-pushed from 9130e0c to d10d787
@vrutkovs: This pull request references API-1802 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/cc @tkashem @p0lyn0mial
/lgtm
a) RotatedSigningCASecret creates the signer CA secret,
b) CABundleConfigMap creates a configmap, and
c) RotatedSelfSignedCertKeySecret creates a secret.
I like the idea of a controller doing one thing; can we explore the idea of individual controllers?
a) SignerCAController: this controller manages the signer secret object.
b) CABundleController: it watches the secret object from a) and creates and manages a single configmap.
c) CertKeySecretController: it watches objects from a) and b) and creates and manages a secret with a cert/key pair.
With this, we can have N instances of CertKeySecretController, where each instance derives its cert/key from a single instance of a) and b).
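The proposed decomposition could be sketched roughly as follows. All type and function names here are hypothetical stand-ins (not actual library-go API), and plain functions stand in for the informer-driven sync loops a real controller would have:

```go
package main

import "fmt"

// Hypothetical sketch of the proposed split: each "controller" reconciles
// exactly one output from its watched inputs.

type signerCA struct{ pem string }
type caBundle struct{ pems []string }
type certKey struct{ cert, key string }

// SignerCAController: manages the signer secret.
func reconcileSignerCA() signerCA { return signerCA{pem: "signer-v1"} }

// CABundleController: watches the signer secret and manages one configmap.
func reconcileCABundle(s signerCA, prev caBundle) caBundle {
	for _, p := range prev.pems {
		if p == s.pem {
			return prev // signer already bundled; reconcile is idempotent
		}
	}
	return caBundle{pems: append(prev.pems, s.pem)}
}

// CertKeySecretController: watches a) and b) and manages one cert/key secret.
// N instances of this can share a single signer and bundle.
func reconcileCertKey(s signerCA, target string) certKey {
	return certKey{cert: target + "-cert-from-" + s.pem, key: target + "-key"}
}

func main() {
	signer := reconcileSignerCA()
	bundle := reconcileCABundle(signer, caBundle{})
	for _, target := range []string{"serving", "client"} {
		fmt.Println(reconcileCertKey(signer, target).cert, len(bundle.pems))
	}
}
```

The point of the sketch is the data flow: b) only reads a)'s output, and each c) instance only reads a) and b), so N leaf-cert controllers never write to each other's resources.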
WithPostStartHooks(
	c.targetCertRecheckerPostRunHook,
).
ToController("CertRotationController", recorder.WithComponentSuffix("cert-rotation-controller").WithComponentSuffix(name))
MultipleTargetCertRotationController, so we have two distinct names?
Good idea
	}
	}(ch)
}
I would like to avoid making any runtime behavioral changes to NewCertRotationController if possible. We have the following options:
a) completely separate implementations: CertRotationController for NewCertRotationController, and MultipleTargetCertRotationController for NewCertRotationControllerMultipleTargets.
b) abstract out the targetCertRecheckerPostRunHook implementations: singleTargetCertRecheckerPostRunHook and multiTargetCertRecheckerPostRunHook. This way we can reuse CertRotationController for both single and multiple targets; NewCertRotationControllerMultipleTargets will use multiTargetCertRecheckerPostRunHook.
c) can we have a single channel <-chan time.Time shared by multiple instances of CertCreator (for example ServingRotation)? Then the logic inside targetCertRecheckerPostRunHook does not need to change at all.
I prefer c), if doable.
Reworked this to make it look like b). Added a test which verifies goroutines don't leak.
I don't quite understand what c) is meant for - make the controller accept a single channel reused across all CertCreators? Not sure what the benefit of that would be.
targetRefresh := refresher.RecheckChannel()
aggregateTargetRefresher := make(chan struct{})
for _, ch := range targetRefreshers {
	go func(c <-chan struct{}) {
go func does not have the same guarantee as go wait.Until(func() {}, time.Minute, ctx.Done())
for _, ch := range targetRefreshers {
	go func(c <-chan struct{}) {
		for msg := range c {
			aggregateTargetRefresher <- msg
goroutine leaking: this needs a <-ctx.Done() case; could that be a problem for integration tests that check for goroutine leaks?
ctx.Done would close them iiuc, but yeah, worth adding a unit test which uses the run hook and ensures no goroutines are leaking
That's possible, but the goal is to create target certs; signers and CA bundles are merely prerequisites to it. We could have separate signer/CA controllers, but we might end up with signer certs not producing any target certs, or CA bundles without any new signers, etc.
vrutkovs force-pushed from d10d787 to a528215
New changes are detected. LGTM label has been removed.
vrutkovs force-pushed from a528215 to d3c0949
// Ensure both target certs have been called exactly three times:
// initial sync and two hook calls for target certs
// TODO[vrutkovs]: informers make unpredictable number of calls
Not sure how to tackle that - or how to make sure two hook syncs were included in a regular informer sync. informerFactory and NewCertRotationControllerMultipleTargets promise to sync every minute, but it happens much more often.
An alternative design would be to have a controller per certificate – this would actually preserve the current behaviour. I prefer having a single controller per certificate as it is easier to debug, report status, retry on error, and reason about. Thoughts?

Our issue is that both the …

I think the least invasive change would be to make both …

Another idea that comes to mind (I think Abu suggested the same) is to turn both …
Similar to Abu's idea in #1722 (review)? That would prevent races, but would make the sequence of events for a proper rollout complicated (three different controllers would need to be properly synced so that the CA bundle is updated before the target cert, etc.). It may also lead to "orphan" controllers managing a CA bundle without a target cert.
Yes, this issue still potentially remains. The PR focuses on solving a much more widespread issue of multiple target certs
I don't know if it's feasible, as it's not just thread-safety but also process-safety we're concerned about.
Don't we do it all the time? The aggregator controller waits until a service is created before it wires an HTTP handler. What I like about these and other controllers is that when you look inside, you will see that these controllers are reconciling a single resource. They are simply reading their prerequisites and reacting to any changes before reconciling. In our case, it would boil down to three separate controllers that reconcile their resources, where the second and the third controller read the crypto material from the lister before reconciling. For example: NewCertRotationController would:
When there are issues with the signerCA you go to …
Yes - this is what this controller is doing in the end. So adding three more controllers for each resource won't magically solve anything.
It also needs …
otherwise it would rely on "let's wait a minute until the next sync to catch it", which means the signer is already updated but the CA is not - so it's unclear when other components are allowed to use the updated signer. This won't qualify as a simple bugfix two weeks before feature freeze - this is a full-blown rework tbf.
Having a single controller for reconciling a resource would solve the race condition we are facing.
yes, usually controllers react to changes made to other resources and we should do the same.
yes, I agree it would require more code. The race we are facing is not new; it was introduced a long time ago and it wasn't obvious. I think that having a controller per resource leads to a simpler program in the end.
Discussed in the meeting - this looks good enough for 4.17; later it needs to be reworked into more granular controllers, and the CA bundle controller should support handling multiple signers, …
go wait.Until(func() {
	for {
		select {
		case <-refresher.RecheckChannel():
So if we have N targets, that means N goroutines, each of which is waiting to receive from a channel to be notified of an event. The producer (the entity that writes to a channel) will probably write once, when the host name changes. It is also possible that a channel is shared among the N instances, depending on how we instantiate the controller.
Let's say we have three targets, {a, b, c}; when the goroutine for c receives from its channel it will add a key to the queue, which will trigger the sync method, and the sync method applies to all targets. It's not a huge issue, but it seems to be unnecessary complexity we can avoid?
A couple of questions with this mechanism:
- what should happen when the channel is closed, should we treat this as "an event occurred"? Probably not; maybe we should check with case _, ok := <-refresher.RecheckChannel()?
- the controller syncs every 1m; do we really need this channel based mechanism to trigger sync? Are we losing anything other than a few seconds of promptness if we don't have this channel based trigger?
If we really need this mechanism, I would suggest abstracting it:
// this abstracts the work queue used by the controller
type queueAdder interface {
	Add()
}

type channelBasedTrigger struct {
	ch    chan time.Time
	queue queueAdder // or we can even use 'f func()'
}

// Event is used by the producer
func (t channelBasedTrigger) Event() { t.ch <- time.Now() }

func (t channelBasedTrigger) Run(ctx context.Context) {
	go wait.Until(func() {
		for {
			select {
			case _, ok := <-t.ch: // don't shadow t; a closed channel is not an event
				if ok {
					t.queue.Add() // or f()
				}
			case <-ctx.Done():
				return
			}
		}
	}, time.Minute, ctx.Done())
}
This way we can keep the logic of the channel based trigger mechanism co-located, more testable, and more generic too. Also, we can use one instance of channelBasedTrigger to inform N controllers.
There might be some constructs in the apimachinery util/wait package that will allow us to achieve this.
> what should happen when the channel is closed, should we treat this as "an event occurred"?

iiuc both the struct implementing the TargetCertRechecker interface and the controllers are created in NewCertRotationController, so the channel is never closed while the controller is still running.

> If we really need this mechanism, i would suggest abstracting this mechanism:

So far we use it in just one ServingRotation struct, so probably there is an easier way out.
@vrutkovs Do we know exactly where the race condition occurs? Also, it would be nice to have a unit test so that we can validate whether the fix we are going to apply works. I’ve opened #1753 as an alternative to this PR.
Usually it's various CA bundles being updated incorrectly, especially those which are used in several targets.
I don't think it's really an alternative; we can have both. 1753 alone won't solve it (you still need a process-safe way for it).
@vrutkovs could you be more specific? Do we know what exactly is broken? I have checked the code for updating the CA bundles and it looks good.
See this job history - it intermittently passes or fails. This job is a good example. Here kube-apiserver won't start as …
So …
and the kube-apiserver log is full of …
The code is correct, but it's not thread-safe and also doesn't protect from several processes updating the same CA configmap. This PR would eliminate process races; #1753 would ensure …
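The multi-process race being described is the classic lost update on a shared configmap. A schematic, sequential illustration (not actual operator code; real clients would detect this via resourceVersion conflicts on update):

```go
package main

import "fmt"

// blindWrites simulates two actors that each read the CA bundle, append
// their own signer, and write the result back without optimistic
// concurrency. The second write clobbers the first, and a still-valid
// signer silently disappears from the bundle.
func blindWrites() []string {
	bundle := []string{"old-signer"}

	// Both actors read the same stale snapshot...
	readA := append([]string{}, bundle...)
	readB := append([]string{}, bundle...)

	// ...then each writes back its own modification.
	bundle = append(readA, "signer-A") // actor A's update
	bundle = append(readB, "signer-B") // actor B's blind write loses signer-A

	return bundle
}

func main() {
	fmt.Println(blindWrites()) // [old-signer signer-B]
}
```

This is why thread-safety inside one process is not enough: two operator processes performing this read-modify-write on the same configmap need conflict detection or a single writer.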
this is the part I would like to understand deeper; the updating code seems to be removing duplicates and expired certificates. I mean, I understand how the code can lose a new certificate, but I don't understand how it could drop the old (non-expired) certificates.
Both PRs seem to be removing the race within a single process. I still think the best way forward would be to create separate controllers.
I would rather bet on the apply to be the culprit here: …

Just to also leave a passive aggressive (wink) reminder that I would also like to see our tests going green as a confirmation our side works correctly, to expedite this whole story...

My two cents on the overall discussion so far: we've tried to get away from the CertRotationController very early on because we needed multiple signers and multiple leaf certificates per signer in etcd. We figured it's also cheap enough to always re-create the leaf cert configuration entirely by node listing: …

Which invalidates the need for …

I personally find the procedural and explicit one-to-many easier to debug. I don't quite get why you need to multi-thread those with goroutines or even need multiple control loops. How many leaf certs do you expect per signer CA that requires parallelizing and locking the reconciliation across the bundles?
Correct, this code removes expired signers - but the previous signer has not expired yet. It's being refreshed at 80% of its lifetime, so it still has 20% of its lifetime to be active (that's 2.5 months of a one-year validity).
Sure, this PR is just an interim fix for us to make it until the 4.17 feature freeze. Once we establish one way of achieving process and thread safety - and have e2e tests passing - we can experiment with more substantial code rework.
We don't; no one is happy about the current codebase. However I'd prefer to stabilize the existing codebase before performing any significant rework.
It is not about multi-threading for performance reasons. It is about having a single control loop per resource. I think this is already a well-established pattern upstream. I think that having a single controller that manages a resource is easy to understand and debug.
This is a crucial piece of code, and I wouldn't rush it. Besides, we still need to fix kubelet, client-go, and tons of other things before the platform will be able to recover itself from expired certificates. Thus, I don't see the point in developing temporary solutions. We are not dealing with an escalation that requires an immediate fix. I would rather implement a proper fix or not fix it at all.
so you propose to rewrite the entire codebase to fit some upstream pattern? :) sounds great 👍
I'm proposing wrapping …
The new …
Then, when we need to compose these controllers, we would only have a single instance of …
Does it make sense to you as well?
This is what we agreed to a few weeks back and no one is debating that choice. The immediate question is why all of this is being discussed in an unrelated PR with a temporary fix for 4.17 - and why it has been stalled for several weeks already.
We kind of have to. If the rework is significant and 4.18 branches we won't be able to backport it.
For indefinite suspend period - yes. For 90 days / 1 year on SNO - no, not really, we already recover with approved manual steps.
We do, this feature (limited to 90 days etc.) is on the 4.17 plan.
vrutkovs force-pushed from d3c0949 to 126e202
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: dinhxuanvu, vrutkovs The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
…Controller
Instead of defining several controllers managing the same signer/CA bundle pair and different target certs, the same controller can accept a list of target certs to create.
vrutkovs force-pushed from 126e202 to b2298d8
vrutkovs force-pushed from b2298d8 to 6a2734e
@vrutkovs: The following test failed, say …
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Instead of defining several controllers managing the same signer/CA bundle pair and different target certs, the same controller can accept a list of target certs to create.
Tested with
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=300","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=600","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=900","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=1200","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=1500","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=1800","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=2100","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"
workflow-test openshift-e2e-cert-rotation-suspend-sno 4.17,https://github.com/openshift/cluster-kube-apiserver-operator/pull/1669,https://github.com/openshift/cluster-etcd-operator/pull/1284 "CLUSTER_AGE_DAYS=2400","CLUSTER_AGE_STEP=300","PACKET_OS=rocky_9"