Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NO-JIRA: certrotation: improve error messages #1720

Conversation

vrutkovs
Copy link
Member

@vrutkovs vrutkovs commented Apr 17, 2024

  • Explicitly log "secret/configmap doesn't exist" events
  • when CA bundle is updated log which signer update caused it

Tested in this job with shortened rotation periods.
This Loki query shows new events:

  • SignerUpdateRequired openshift-kube-apiserver-operator node-system-admin-signer requires a new signing cert/key pair: secret doesn't exist instead of "missing notBefore annotation"
  • SignerUpdateRequired openshift-kube-apiserver-operator node-system-admin-signer requires a new signing cert/key pair: past refresh time (80% of validity): 2024-06-24 11:46:13.8 +0000 UTC
  • CABundleUpdateRequired openshift-kube-controller-manager-operator csr-controller-signer-ca requires a new cert: signer update openshift-kube-controller-manager-operator/csr-signer

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 17, 2024
@openshift-ci openshift-ci bot requested review from hexfusion and stlaz April 17, 2024 08:25
@vrutkovs vrutkovs changed the title WIP certrotation: improve error messages NO-JIRA: certrotation: improve error messages Apr 17, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 17, 2024
@openshift-ci-robot
Copy link

@vrutkovs: This pull request explicitly references no jira issue.

In response to this:

  • Explicitly log "secret/configmap doesn't exist" events
  • when CA bundle is updated log which signer update caused it

Tested in this job with shortened rotation periods.
This Loki query shows new events:

  • SignerUpdateRequired openshift-kube-apiserver-operator node-system-admin-signer requires a new signing cert/key pair: secret doesn't exist instead of "missing notBefore annotation"
  • CABundleUpdateRequired openshift-etcd etcd-metrics-ca-bundle requires a new cert: signer update openshift-config/etcd-metric-signer

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 17, 2024
@vrutkovs vrutkovs force-pushed the certrotation-better-error-messages branch 2 times, most recently from f2f23fb to 85a077e Compare April 17, 2024 14:08
@@ -67,6 +67,7 @@ func (c RotatedSigningCASecret) EnsureSigningCertKeyPair(ctx context.Context) (*
if err != nil && !apierrors.IsNotFound(err) {
return nil, false, err
}
var signerExists = true
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required to pass unit tests, which do delete+create (and don't set ResourceVersion afterwards). It would be reworked after ApplySecretDoNotUse is removed

@vrutkovs
Copy link
Member Author

vrutkovs commented May 9, 2024

/cc @tkashem @p0lyn0mial

@openshift-ci openshift-ci bot requested review from p0lyn0mial and tkashem May 9, 2024 06:44
Copy link
Contributor

@tkashem tkashem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend refactoring the code in a way that makes it easier to prepare the error messages, but that would be a long term goal.
This PR is a nice stop gap, it looks good to me except for a few minor comments

if originalCABundleConfigMap == nil {
reason = "configmap doesn't exist"
} else if originalCABundleConfigMap.Data == nil {
reason = "configmap is empty"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a minor suggestions:

  • can we show the name and namespace of the object that does not exist for reason "configmap doesn't exist" and "configmap is empty"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error would be appended to the message with namespace/name: openshift-kube-apiserver-operator node-system-admin-signer requires a new signing cert/key pair: secret doesn't exist in SignerUpdateRequired event

} else if !equality.Semantic.DeepEqual(originalCABundleConfigMap.Data, caBundleConfigMap.Data) {
reason = fmt.Sprintf("signer update %s", signingCertKeyPairLocation)
}
c.EventRecorder.Eventf("CABundleUpdateRequired", "%q in %q requires a new cert: %s", c.Name, c.Namespace, reason)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: what does the c.Name and c.Namespace mean in relation to the object in the reason, or are we saying that the CA bundle configmap located in c.Namespace/c.Name requires a new cert?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, correct, that would be object namespace / name of the target configmap/secret

@@ -153,7 +159,7 @@ func needNewSigningCertKeyPair(annotations map[string]string, refresh time.Durat
validity := notAfter.Sub(notBefore)
at80Percent := notAfter.Add(-validity / 5)
if time.Now().After(at80Percent) {
return true, fmt.Sprintf("past its latest possible time %v", at80Percent)
return true, fmt.Sprintf("past refresh time (80%% of validity): %v", at80Percent)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do you think it will help if we also include the validity in the message?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this would show happy path, where certs are updated at 80%, as the cert can be force refreshed in the future

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is dead code actually, I haven't seen this message at all

func needNewTargetCertKeyPair(annotations map[string]string, signer *crypto.CA, caBundleCerts []*x509.Certificate, refresh time.Duration, refreshOnlyWhenExpired bool) string {
func needNewTargetCertKeyPair(secret *corev1.Secret, signer *crypto.CA, caBundleCerts []*x509.Certificate, refresh time.Duration, refreshOnlyWhenExpired bool) string {
if secret.ResourceVersion == "" {
return "secret doesn't exist"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relying on the "ResourceVersion" being empty to indicate that the object does not exist may lead to bugs in the future, sometimes we set the resource version of an existing object to empty string before we send it to the apiserver for update
The more reliable way for us to know whether an object exists is when we get a NotFoundError from the apiserver, and use the not found error to produce this error message seems more intuitive to me, thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, implemented

@vrutkovs vrutkovs force-pushed the certrotation-better-error-messages branch from 85a077e to 6a9056b Compare May 9, 2024 13:26
* Explicitly log "secret/configmap doesn't exist" events
* when secret is refreshed differenciate between refresh on 80% of validity
  and custom refresh time
* when CA bundle is updated log which signer update caused it
@vrutkovs vrutkovs force-pushed the certrotation-better-error-messages branch from 6a9056b to 9a49878 Compare June 24, 2024 06:51
@vrutkovs
Copy link
Member Author

/cc @tjungblu

This changes signatures for EnsureConfigMapCABundle and NeedNewTargetCertKeyPair used in cluster-etcd-operator

@openshift-ci openshift-ci bot requested a review from tjungblu June 24, 2024 07:18
@tjungblu
Copy link
Contributor

ACK

@@ -41,7 +41,7 @@ type CABundleConfigMap struct {
EventRecorder events.Recorder
}

func (c CABundleConfigMap) EnsureConfigMapCABundle(ctx context.Context, signingCertKeyPair *crypto.CA) ([]*x509.Certificate, error) {
func (c CABundleConfigMap) EnsureConfigMapCABundle(ctx context.Context, signingCertKeyPair *crypto.CA, signingCertKeyPairLocation string) ([]*x509.Certificate, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

silly question while going through the breaking change in CEO, isn't that simply ca.Namespace/ca.Name at this point?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a location of the signer used to build this CA, so its secret.Namespace/secret.Name

@tkashem
Copy link
Contributor

tkashem commented Jul 2, 2024

/lgtm

Thanks!

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 2, 2024
@tkashem
Copy link
Contributor

tkashem commented Jul 2, 2024

/cc @soltysh (for approval)

Copy link
Member

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Copy link
Contributor

openshift-ci bot commented Jul 11, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: soltysh, tkashem, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 11, 2024
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD b941412 and 2 for PR HEAD 9a49878 in total

Copy link
Contributor

openshift-ci bot commented Jul 11, 2024

@vrutkovs: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 737dc0f into openshift:master Jul 11, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants