
reconciler/managed: add crossplane_resource_drift_seconds metric #489

Closed
wants to merge 2 commits

Conversation

Contributor

@sttts sttts commented Jul 27, 2023

Description of your changes

Adds a metric crossplane_resource_drift_seconds that records the time since the previous reconcile when a resource is found to be out of sync.
As we don't record the last reconcile time on the object (we cannot update an object on every observe), we count from the last reconcile seen in the same process. This obviously means that no metric is emitted for the first reconcile after a provider restart.
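For illustration, a minimal sketch of that idea, assuming an in-memory map keyed by resource name; the type, field, and label choices below are illustrative and not necessarily what this PR implements:

package managed

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// drift records how long a resource had been in sync before a reconcile
// found it out of sync again.
var drift = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: "crossplane",
	Name:      "resource_drift_seconds",
	Help:      "ALPHA: How long since the previous successful reconcile when a resource was found to be out of sync.",
}, []string{"group", "kind"})

// driftRecorder keeps the last in-sync observation per resource in process
// memory only, which is why no metric is emitted for the first reconcile
// after a provider restart.
type driftRecorder struct {
	mu       sync.Mutex
	lastSync map[string]time.Time
	gvk      schema.GroupVersionKind
}

// recordUnchanged notes a reconcile that found the resource in sync.
func (r *driftRecorder) recordUnchanged(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.lastSync == nil {
		r.lastSync = map[string]time.Time{}
	}
	r.lastSync[name] = time.Now()
}

// recordUpdate notes a reconcile that had to correct drift, emitting the time
// elapsed since the last in-sync observation seen by this process.
func (r *driftRecorder) recordUpdate(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if last, ok := r.lastSync[name]; ok {
		drift.WithLabelValues(r.gvk.Group, r.gvk.Kind).Observe(time.Since(last).Seconds())
	}
}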

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

@sttts sttts requested review from a team as code owners July 27, 2023 18:21
@sttts sttts force-pushed the sttts-drift-metric branch 2 times, most recently from 29e3853 to cc7e838 on July 27, 2023 18:32
@sttts sttts force-pushed the sttts-drift-metric branch 5 times, most recently from 22de3f8 to 40ac673 on August 2, 2023 08:54
Member

@turkenh turkenh left a comment

Thank you @sttts, this is a great step toward improving observability 💪

Left a couple of comments, but none of them is blocking.

Could you please also populate the "How has this code been tested" section in the PR description 🙏

},
})
if err != nil {
return err
Member

nit: should we wrap these (this and the above) errors with more details?

Contributor Author

fixed
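For reference, a sketch of the kind of wrapping being asked for, assuming crossplane-runtime's errors package (github.com/crossplane/crossplane-runtime/pkg/errors); the message text is illustrative:

if err != nil {
	// Wrap adds context so the caller can tell which step of the drift
	// recorder setup failed.
	return errors.Wrap(err, "cannot add event handler to drift recorder informer")
}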

drift = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: subSystem,
	Name:      "resource_drift_seconds",
	Help:      "ALPHA: How long since the previous reconcile when a resource was found to be out of sync; excludes restart of the provider",
Member

since the previous reconcile

Should this be "since the previous successful reconcile" or even "since the previous successful reconcile where the resource was observed as synced" ?

Contributor Author

fixed

@@ -1106,6 +1121,9 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco
return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
}

// record the drift after the successful update.
r.driftRecorder.recordUpdate(managed.GetName())
Member

Should we also reset the timer here (i.e. call recordUnchanged)? I believe we rely on an upcoming reconciliation which we expect to happen after a successful update, right?

Contributor Author

There would be an "unchanged" reconcile afterwards anyway. But I agree, we can reset directly.

Contributor Author

done
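Building on the sketch in the PR description, the agreed behaviour could look roughly like this (a sketch, not the merged code): recordUpdate both emits the observation and resets the in-memory timestamp itself.

func (r *driftRecorder) recordUpdate(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if last, ok := r.lastSync[name]; ok {
		drift.WithLabelValues(r.gvk.Group, r.gvk.Kind).Observe(time.Since(last).Seconds())
	}
	if r.lastSync == nil {
		r.lastSync = map[string]time.Time{}
	}
	// Reset directly so the next drift is measured from this correction,
	// rather than relying on a follow-up "unchanged" reconcile.
	r.lastSync[name] = time.Now()
}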

@sttts sttts force-pushed the sttts-drift-metric branch 4 times, most recently from f4d41ab to 6123931 on August 3, 2023 07:24
Contributor

@ulucinar ulucinar left a comment

Thanks @sttts. I will bump the crossplane-runtime version across the official providers once this PR is merged.

func (r *driftRecorder) Start(ctx context.Context) error {
	inf, err := r.cluster.GetCache().GetInformerForKind(ctx, r.gvk)
	if err != nil {
		return err
Contributor

Should we wrap the error here for more context?

Contributor Author

done
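A sketch of what Start might look like after this feedback; it assumes the driftRecorder fields from the earlier sketch, a controller-runtime version whose Informer.AddEventHandler returns a registration and an error, and a hypothetical forget helper that drops a map entry. Assumed imports: context, kcache "k8s.io/client-go/tools/cache", "sigs.k8s.io/controller-runtime/pkg/client", and crossplane-runtime's errors package.

func (r *driftRecorder) Start(ctx context.Context) error {
	inf, err := r.cluster.GetCache().GetInformerForKind(ctx, r.gvk)
	if err != nil {
		return errors.Wrap(err, "cannot get informer for drift recorder")
	}

	// Forget deleted resources so the in-memory map does not grow without bound.
	_, err = inf.AddEventHandler(kcache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			if o, ok := obj.(client.Object); ok {
				r.forget(o.GetName()) // hypothetical helper that removes the map entry
			}
		},
	})
	if err != nil {
		return errors.Wrap(err, "cannot add event handler to drift recorder informer")
	}

	<-ctx.Done()
	return nil
}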

@@ -1106,6 +1121,9 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco
return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
}

// record the drift after the successful update.
r.driftRecorder.recordUpdate(managed.GetName())
Contributor

The Upjet runtime signals a resource as up-to-date in certain circumstances without actually making sure that the external resource is up-to-date:

  • If an async operation is still ongoing at the time the resource is being reconciled
  • If the native provider's TTL has expired and we have not yet drained the runner

The first case in particular is common because we run the Terraform operations asynchronously. In these cases the recorded metric will reflect a lower bound on the intended value.

The proposed changes are good from my perspective. The cases mentioned above can be considered violations of the contract between a provider and the managed reconciler. Just dropping this note here to raise awareness.

Contributor Author

Hmm, maybe we should add an error type that external.Update can return to signal "come back later"?

Contributor Author

background: this metric is really only useful if it is an upper bound.

Contributor Author

@ulucinar see second commit. Is that reasonable?
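For readers who cannot see the second commit, a hedged sketch of what such a "come back later" error type could look like; the actual RetryAfterError may differ (assumed imports: fmt, time):

// RetryAfterError tells the managed reconciler that the external system could
// not be brought to (or observed in) its final state yet, and that it should
// simply come back after the given interval.
type RetryAfterError struct {
	after time.Duration
}

// NewRetryAfterError returns an error asking the reconciler to retry later.
func NewRetryAfterError(after time.Duration) RetryAfterError {
	return RetryAfterError{after: after}
}

func (e RetryAfterError) Error() string {
	return fmt.Sprintf("retry after %s", e.after)
}

// RetryAfter returns how long to wait before the next reconcile attempt.
func (e RetryAfterError) RetryAfter() time.Duration {
	return e.after
}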

Signed-off-by: Dr. Stefan Schimanski <stefan.schimanski@upbound.io>
Signed-off-by: Dr. Stefan Schimanski <stefan.schimanski@upbound.io>
var subSystem = "crossplane"

var (
	drift = prometheus.NewHistogramVec(prometheus.HistogramOpts{
Member

Do we have a precedent of using the prometheus client elsewhere? It might be nice to use Opencensus (e.g. to have a single metrics library if we wanted to add tracing in future). I'm guessing we may not get a choice in the matter, since we presumably need to extend the metrics controller-runtime is exposing (and I think it uses the Prometheus client per kubernetes-sigs/controller-runtime#305).

Contributor Author

I intentionally did what controller-runtime does. Not sure what you are suggesting here?

Member

I'm not really suggesting anything. 🙂 I'm essentially asking whether using Opencensus instead would be possible and advisable, but I think I know the answer is "no, not possible".
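For context, controller-runtime serves its metrics from a Prometheus registry, so a custom collector is typically registered along these lines (a sketch; where exactly the registration happens in this PR is an assumption):

import "sigs.k8s.io/controller-runtime/pkg/metrics"

func init() {
	// metrics.Registry is the registry behind controller-runtime's metrics
	// endpoint; collectors from other metrics libraries cannot be added to it.
	metrics.Registry.MustRegister(drift)
}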

@@ -681,6 +721,11 @@ func NewReconciler(m manager.Manager, of resource.ManagedKind, o ...ReconcilerOp
ro(r)
}

if err := m.Add(&r.driftRecorder); err != nil {
r.log.Info("unable to register drift recorder with controller manager", "error", err)
Member

Should this be a debug log? What would a typical user do with/about this log when they saw it?

Contributor Author

@sttts sttts Aug 9, 2023

I would even promote it to an error if our logging interface allowed it. This signals that something is broken; it's not debug output.

Contributor Author

Alternative: we add an error return value, which would be a breaking change.

Member

Some rationale as to why we don't have error log level: https://dave.cheney.net/2015/11/05/lets-talk-about-logging

Member

@negz negz Aug 10, 2023

I think the second part of my original comment still stands though - how would a user fix this? What would cause this? In the spirit of the better errors guide - what does "unable to register drift recorder" mean?

Keeping at info level is fine with me if it's meaningful and actionable to the user. Channeling the good errors guide, I don't personally think I'd know what "unable to register drift recorder" meant if I hadn't read this PR, or what I should do about it.

@@ -892,7 +937,13 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco
log = log.WithValues("deletion-timestamp", managed.GetDeletionTimestamp())

if observation.ResourceExists && policy.ShouldDelete() {
if err := external.Delete(externalCtx, managed); err != nil {
err := external.Delete(externalCtx, managed)
var retryAfterErr RetryAfterError
Member

Nit: This long variable name is redundant given its limited scope and the fact that its type RetryAfterError is written immediately after its name.

https://github.com/crossplane/crossplane/tree/master/contributing#use-descriptive-variable-names-sparingly

Contributor Author

fixed
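After the rename, the check might read roughly like this (a sketch; the surrounding requeue logic is simplified, and errors.As has standard-library semantics):

err := external.Delete(externalCtx, managed)
var rae RetryAfterError
if errors.As(err, &rae) {
	// The external client asked us to come back later rather than treating
	// this as a failure, so requeue after the requested interval.
	return reconcile.Result{RequeueAfter: rae.RetryAfter()}, nil
}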

cluster cluster.Cluster
}

var _ manager.Runnable = &driftRecorder{}
Member

Nit: I prefer to add these "does it satisfy the interface" checks to test files, since they are a kind of test.

Contributor Author

I would argue the inverse: they are for the reader, i.e. documentation. Testing is a side effect.

Member

@negz negz Aug 10, 2023

They're both, and besides documentation (e.g. Godoc examples) can go in test files too. Not a hill I'll die on, but precedent elsewhere in Crossplane is to put these in test files.

@negz
Member

negz commented May 31, 2024

This was merged via #683.

@negz negz closed this May 31, 2024