
reconciler/managed: add crossplane_resource_drift_seconds metric #489

Closed
wants to merge 2 commits

Conversation

Contributor

@sttts sttts commented Jul 27, 2023

Description of your changes

Adds a metric crossplane_resource_drift_seconds that records the time since the previous reconcile when a resource is found to be out of sync.
As we don't record the last reconcile time on the object (we cannot update an object on every observe), we count from the last reconcile seen in the same process. This obviously means that no metric is emitted for the first reconcile after a provider restart.
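For illustration, a minimal sketch of that idea, assuming an in-memory map keyed by resource name; the type, field, and label choices below are illustrative and not necessarily what this PR implements:

package managed

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// drift records how long a resource had been in sync before a reconcile
// found it out of sync again.
var drift = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: "crossplane",
	Name:      "resource_drift_seconds",
	Help:      "ALPHA: How long since the previous successful reconcile when a resource was found to be out of sync.",
}, []string{"group", "kind"})

// driftRecorder keeps the last in-sync observation per resource in process
// memory only, which is why no metric is emitted for the first reconcile
// after a provider restart.
type driftRecorder struct {
	mu       sync.Mutex
	lastSync map[string]time.Time
	gvk      schema.GroupVersionKind
}

// recordUnchanged notes a reconcile that found the resource in sync.
func (r *driftRecorder) recordUnchanged(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.lastSync == nil {
		r.lastSync = map[string]time.Time{}
	}
	r.lastSync[name] = time.Now()
}

// recordUpdate notes a reconcile that had to correct drift, emitting the time
// elapsed since the last in-sync observation seen by this process.
func (r *driftRecorder) recordUpdate(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if last, ok := r.lastSync[name]; ok {
		drift.WithLabelValues(r.gvk.Group, r.gvk.Kind).Observe(time.Since(last).Seconds())
	}
}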

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

@sttts sttts requested review from a team as code owners July 27, 2023 18:21
@sttts sttts force-pushed the sttts-drift-metric branch 2 times, most recently from 29e3853 to cc7e838 on July 27, 2023 18:32
@sttts sttts force-pushed the sttts-drift-metric branch 5 times, most recently from 22de3f8 to 40ac673 on August 2, 2023 08:54
Member

@turkenh turkenh left a comment

Thank you @sttts, this is a great step toward improving observability 💪

Left a couple of comments, but none of them is blocking.

Could you please also populate the "How has this code been tested" section in the PR description 🙏

},
})
if err != nil {
return err
Member

nit: should we wrap these (this and the above) errors with more details?

Contributor Author

fixed
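For reference, a sketch of the kind of wrapping being asked for, assuming crossplane-runtime's errors package (github.com/crossplane/crossplane-runtime/pkg/errors); the message text is illustrative:

if err != nil {
	// Wrap adds context so the caller can tell which step of the drift
	// recorder setup failed.
	return errors.Wrap(err, "cannot add event handler to drift recorder informer")
}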

drift = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: subSystem,
	Name:      "resource_drift_seconds",
	Help:      "ALPHA: How long since the previous reconcile when a resource was found to be out of sync; excludes restart of the provider",
Member

since the previous reconcile

Should this be "since the previous successful reconcile" or even "since the previous successful reconcile where the resource was observed as synced" ?

Contributor Author

fixed

@@ -1106,6 +1121,9 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco
return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
}

// record the drift after the successful update.
r.driftRecorder.recordUpdate(managed.GetName())
Member

Should we also reset the timer here (i.e. call recordUnchanged)? I believe we rely on an upcoming reconciliation which we expect to happen after a successful update, right?

Contributor Author

There would be an "unchanged" reconcile afterwards anyway. But I agree, we can reset directly.

Contributor Author

done
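Building on the sketch in the PR description, the agreed behaviour could look roughly like this (a sketch, not the merged code): recordUpdate both emits the observation and resets the in-memory timestamp itself.

func (r *driftRecorder) recordUpdate(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if last, ok := r.lastSync[name]; ok {
		drift.WithLabelValues(r.gvk.Group, r.gvk.Kind).Observe(time.Since(last).Seconds())
	}
	if r.lastSync == nil {
		r.lastSync = map[string]time.Time{}
	}
	// Reset directly so the next drift is measured from this correction,
	// rather than relying on a follow-up "unchanged" reconcile.
	r.lastSync[name] = time.Now()
}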

@sttts sttts force-pushed the sttts-drift-metric branch 4 times, most recently from f4d41ab to 6123931 on August 3, 2023 07:24
Contributor

@ulucinar ulucinar left a comment

Thanks @sttts. I will bump the crossplane-runtime version across the official providers once this PR is merged.

func (r *driftRecorder) Start(ctx context.Context) error {
	inf, err := r.cluster.GetCache().GetInformerForKind(ctx, r.gvk)
	if err != nil {
		return err
Contributor

Should we wrap the error here for more context?

Contributor Author

done
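A sketch of what Start might look like after this feedback; it assumes the driftRecorder fields from the earlier sketch, a controller-runtime version whose Informer.AddEventHandler returns a registration and an error, and a hypothetical forget helper that drops a map entry. Assumed imports: context, kcache "k8s.io/client-go/tools/cache", "sigs.k8s.io/controller-runtime/pkg/client", and crossplane-runtime's errors package.

func (r *driftRecorder) Start(ctx context.Context) error {
	inf, err := r.cluster.GetCache().GetInformerForKind(ctx, r.gvk)
	if err != nil {
		return errors.Wrap(err, "cannot get informer for drift recorder")
	}

	// Forget deleted resources so the in-memory map does not grow without bound.
	_, err = inf.AddEventHandler(kcache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			if o, ok := obj.(client.Object); ok {
				r.forget(o.GetName()) // hypothetical helper that removes the map entry
			}
		},
	})
	if err != nil {
		return errors.Wrap(err, "cannot add event handler to drift recorder informer")
	}

	<-ctx.Done()
	return nil
}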

@@ -1106,6 +1121,9 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco
return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
}

// record the drift after the successful update.
r.driftRecorder.recordUpdate(managed.GetName())
Contributor

The Upjet runtime signals a resource as up-to-date in certain circumstances without actually making sure that the external resource is up-to-date:

  • If an async operation is still ongoing at the time the resource is being reconciled
  • If the native provider's TTL has expired and we have not yet drained the runner

The first case in particular is common because we run the Terraform operations asynchronously. In these cases the recorded metric will reflect a lower bound on the intended value.

The proposed changes are good from my perspective. The cases mentioned above can be considered violations of the contract between a provider and the managed reconciler. Just dropping this note here to raise awareness.

Contributor Author

Hmm, maybe we should add an error type that external.Update can return to signal "come back later"?

Contributor Author

background: this metric is really only useful if it is an upper bound.

Contributor Author

@ulucinar see second commit. Is that reasonable?
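For readers who cannot see the second commit, a hedged sketch of what such a "come back later" error type could look like; the actual RetryAfterError may differ (assumed imports: fmt, time):

// RetryAfterError tells the managed reconciler that the external system could
// not be brought to (or observed in) its final state yet, and that it should
// simply come back after the given interval.
type RetryAfterError struct {
	after time.Duration
}

// NewRetryAfterError returns an error asking the reconciler to retry later.
func NewRetryAfterError(after time.Duration) RetryAfterError {
	return RetryAfterError{after: after}
}

func (e RetryAfterError) Error() string {
	return fmt.Sprintf("retry after %s", e.after)
}

// RetryAfter returns how long to wait before the next reconcile attempt.
func (e RetryAfterError) RetryAfter() time.Duration {
	return e.after
}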

Signed-off-by: Dr. Stefan Schimanski <stefan.schimanski@upbound.io>
Signed-off-by: Dr. Stefan Schimanski <stefan.schimanski@upbound.io>
var subSystem = "crossplane"

var (
	drift = prometheus.NewHistogramVec(prometheus.HistogramOpts{
Member

Do we have a precedent of using the prometheus client elsewhere? It might be nice to use Opencensus (e.g. to have a single metrics library if we wanted to add tracing in future). I'm guessing we may not get a choice in the matter, since we presumably need to extend the metrics controller-runtime is exposing (and I think it uses the Prometheus client per kubernetes-sigs/controller-runtime#305).

Contributor Author

I intentionally did what controller-runtime does. Not sure what you are suggesting here?

Member

I'm not really suggesting anything. 🙂 I'm essentially asking whether using Opencensus instead would be possible and advisable, but I think I know the answer is "no, not possible".
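For context, controller-runtime serves its metrics from a Prometheus registry, so a custom collector is typically registered along these lines (a sketch; where exactly the registration happens in this PR is an assumption):

import "sigs.k8s.io/controller-runtime/pkg/metrics"

func init() {
	// metrics.Registry is the registry behind controller-runtime's metrics
	// endpoint; collectors from other metrics libraries cannot be added to it.
	metrics.Registry.MustRegister(drift)
}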

@@ -681,6 +721,11 @@ func NewReconciler(m manager.Manager, of resource.ManagedKind, o ...ReconcilerOp
ro(r)
}

if err := m.Add(&r.driftRecorder); err != nil {
r.log.Info("unable to register drift recorder with controller manager", "error", err)
Member

Should this be a debug log? What would a typical user do with/about this log when they saw it?

Contributor Author

@sttts sttts Aug 9, 2023

I would even promote it to an error if our logging interface allowed it. This signals that something is broken; it's not debug output.

Contributor Author

Alternative: we add an error return value, which would be a breaking change.

Member

Some rationale as to why we don't have error log level: https://dave.cheney.net/2015/11/05/lets-talk-about-logging

Member

@negz negz Aug 10, 2023

I think the second part of my original comment still stands though - how would a user fix this? What would cause this? In the spirit of the better errors guide - what does "unable to register drift recorder" mean?

Keeping at info level is fine with me if it's meaningful and actionable to the user. Channeling the good errors guide, I don't personally think I'd know what "unable to register drift recorder" meant if I hadn't read this PR, or what I should do about it.

@@ -892,7 +937,13 @@ func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reco
log = log.WithValues("deletion-timestamp", managed.GetDeletionTimestamp())

if observation.ResourceExists && policy.ShouldDelete() {
if err := external.Delete(externalCtx, managed); err != nil {
err := external.Delete(externalCtx, managed)
var retryAfterErr RetryAfterError
Member

Nit: This long variable name is redundant given its limited scope and the fact that its type RetryAfterError is written immediately after its name.

https://github.com/crossplane/crossplane/tree/master/contributing#use-descriptive-variable-names-sparingly

Contributor Author

fixed
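After the rename, the check might read roughly like this (a sketch; the surrounding requeue logic is simplified, and errors.As has standard-library semantics):

err := external.Delete(externalCtx, managed)
var rae RetryAfterError
if errors.As(err, &rae) {
	// The external client asked us to come back later rather than treating
	// this as a failure, so requeue after the requested interval.
	return reconcile.Result{RequeueAfter: rae.RetryAfter()}, nil
}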

cluster cluster.Cluster
}

var _ manager.Runnable = &driftRecorder{}
Member

Nit: I prefer to add these "does it satisfy the interface" checks to test files, since they are a kind of test.

Contributor Author

I would argue the inverse: they are for the reader, i.e. documentation. Testing is a side effect.

Member

@negz negz Aug 10, 2023

They're both, and besides documentation (e.g. Godoc examples) can go in test files too. Not a hill I'll die on, but precedent elsewhere in Crossplane is to put these in test files.

@negz
Member

negz commented May 31, 2024

This was merged via #683.

@negz negz closed this May 31, 2024