Add upjet runtime Prometheus metrics #170

ulucinar · 2023-03-03T10:08:45Z

Description of your changes

Fixes #167

This PR adds the following Prometheus metrics to the upjet runtime. These are upjet runtime metrics, meaning that they are exposed by a provider while reconciling its managed resources via upjet:

upjet_terraform_cli_duration: This is a histogram metric and reports statistics, in seconds, on how long it takes a Terraform CLI invocation to complete.
upjet_terraform_active_cli_invocations: This is a gauge metric and it's the number of active (running) Terraform CLI invocations.
upjet_terraform_running_processes: This is a gauge metric and it's the number of running Terraform CLI and Terraform provider processes.
upjet_resource_ttr: This is a histogram metric and it measures, in seconds, the time-to-readiness for managed resources.

terraform.Operation.MarkStart now atomically checks for any previous ongoing operation before starting a new one, and
terraform.Operation.{Start,End}Time no longer return pointers that could potentially be used to modify the shared state outside of critical sections.

The following labels are available for the exposed runtime metrics:

upjet_terraform_cli_duration: subcommand and mode.
- subcommand: The terraform subcommand that's run, e.g., init, apply, plan, destroy, etc.
- mode: The execution mode of the Terraform CLI, either sync (the CLI was invoked synchronously as part of a reconcile loop) or async (the CLI was invoked asynchronously, and the reconciler goroutine will poll and collect the results later).
upjet_terraform_active_cli_invocations: subcommand and mode.
- subcommand: The terraform subcommand that's run, e.g., init, apply, plan, destroy, etc.
- mode: The execution mode of the Terraform CLI, either sync (the CLI was invoked synchronously as part of a reconcile loop) or async (the CLI was invoked asynchronously, and the reconciler goroutine will poll and collect the results later).
upjet_terraform_running_processes: type
- type: Either cli for Terraform CLI processes (the terraform process) or provider for Terraform provider processes. Please note that this is a best-effort metric that may not precisely catch and report all relevant processes. We may improve this in the future if needed, for example by watching the fork system calls. Even in its current form, it may prove useful for spotting rogue Terraform provider processes.
upjet_resource_ttr: group, version, kind
- group, version, kind labels record the API group, version and kind for the managed resource, whose time-to-readiness measurement is captured.

Notes on the concurrency-related changes:

terraform.Operation.MarkStart now atomically checks for ongoing async operations and reserves the "operation slot" (by recording the start time): Previously, we checked whether there was an ongoing async operation in one critical section, exited that critical section, and then entered another critical section to do the reservation, as follows:

	if w.LastOperation.IsRunning() {
		return errors.Errorf("%s operation that started at %s is still running", w.LastOperation.Type, w.LastOperation.StartTime().String())
	}
	w.LastOperation.MarkStart("apply")

From a theoretical perspective this does not look right but in fact, the above section is never executed by two concurrent goroutines (on the same operation) and thus is safe, as long as the controller-runtime behaves according to this assumption. But nevertheless, this PR proposes to change MarkStart so that it atomically checks and reserves the slot because:

It's conceptually simpler. The relevant section above now looks like:

	if !w.LastOperation.MarkStart("apply") {
		return errors.Errorf("%s operation that started at %s is still running", w.LastOperation.Type, w.LastOperation.StartTime().String())
	}

It's easier to reason about and to prove its correctness as we align with the theory. You don't need to make assumptions about how the controller-runtime behaves and in the very unlikely case that this assumption does not hold (because of a bug in client-go or controller-runtime, or because of a change in upjet), we will still be safe.

terraform.Operation.{Start,End}Time no longer return pointers that could potentially be used to modify the shared state outside of critical sections: Not sure if this has practical implications but again from a theoretical point of view, it's good practice to read the data in a critical section, make a copy of it, and return that snapshot copy so that its clients will not have a chance to modify the shared state outside of a critical section.

I have:

Read and followed Crossplane's contribution process.
Run make reviewable to ensure this PR is ready for review.
We'd also like to document what custom metrics are exposed from upjet runtime
Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

10 userpool.cognitoidp resources from upbound/provider-aws were provisioned, reconciled with a poll interval of 1m twice after acquiring the Ready=True status condition, and they were finally destroyed. Here are some sample screenshots from the Prometheus UI:

upjet_terraform_active_cli_invocations gauge metric showing the sync & async terraform init/apply/plan/destroy invocations:

upjet_terraform_running_processes gauge metric showing both cli and provider labels:

upjet_terraform_cli_duration histogram metric, showing average Terraform CLI running times for the last 5m:

The medians (0.5-quantiles) for these observations aggregated by the mode and Terraform subcommand being invoked:

upjet_resource_ttr histogram metric, showing average resource TTR for the last 10m:

The median (0.5-quantile) for these TTR observations:

- upjet_terraform_cli_duration: Reports statistics, in seconds, on how long it takes a Terraform CLI invocation to complete
- upjet_terraform_active_cli_invocations: The number of active (running) Terraform CLI invocations
- upjet_terraform_running_processes: The number of running Terraform CLI and Terraform provider processes
- upjet_resource_ttr: Measures, in seconds, the time-to-readiness for managed resources
- terraform.Operation.MarkStart now atomically checks for any previous ongoing operation before starting a new one
- terraform.Operation.{Start,End}Time no longer return pointers that could potentially be used to modify the shared state outside of critical sections.

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>

sergenyalcin

Thanks @ulucinar LGTM!

Piotr1215

@ulucinar this looks really great! However, I'm by no means a Prometheus expert.
I asked @AaronME to take a look as well and tagged him for the review.
I think the question we had was around high cardinality metrics, especially the gauge ones that collect frequently. What I can recommend is to see how the metrics are collected and if some of them are too frequent, maybe aggregate over them.
I'm approving to get this unblocked and we can keep iterating.

ulucinar · 2023-03-07T15:19:33Z

Hi @sergenyalcin, @Piotr1215,
Thank you for the reviews.

> I think the question we had was around high cardinality metrics, especially the gauge ones that collect frequently.

Yes, correct. When the PR was first opened, it contained workspace and uid labels for the relevant metrics (for example the upjet_terraform_active_cli_invocations metric, which is always in the context of a Terraform workspace). But in a cluster with 1000s of MRs, this would result in 1000s of data series for the same metric. After experimenting more with the metrics, I've decided to remove those labels.

AaronME · 2023-03-07T18:59:05Z

Thanks for looping me in, @Piotr1215 !

@ulucinar This looks good to me! Thank you for tackling this.

negz · 2023-03-07T20:12:40Z

pkg/metrics/metrics.go

+package metrics
+
+import (
+	"github.com/prometheus/client_golang/prometheus"


Consider this non-blocking, but I'd prefer to use OpenTelemetry to expose Prom metrics. I believe we mostly use Otel for Upbound things internally, and it would open a path to use one SDK for all observability (i.e. traces and logs too).

We've held off on this in the past waiting to see what controller-runtime would do per kubernetes-sigs/controller-runtime#305.

Thanks @negz for the pointer. Makes sense to me.
Let's proceed with the Prometheus metrics for now as the controller-runtime still makes use of them. I have not checked if it's possible with OpenTelemetry metrics but it was convenient to register upjet's custom metrics with the controller-runtime's registry. What do you think?

Opened #171 to track this. Thank you @negz for bringing this up. Let's track it there.

ulucinar requested review from sergenyalcin and Piotr1215 March 3, 2023 10:08

ulucinar force-pushed the fix-167 branch from ca8151e to a2c0934 Compare March 7, 2023 02:11

ulucinar force-pushed the fix-167 branch from a2c0934 to 7a3f20a Compare March 7, 2023 03:10

sergenyalcin approved these changes Mar 7, 2023

Piotr1215 requested a review from AaronME March 7, 2023 11:48

Piotr1215 approved these changes Mar 7, 2023

AaronME approved these changes Mar 7, 2023

negz reviewed Mar 7, 2023

ulucinar mentioned this pull request Mar 8, 2023

Use OpenTelemetry to expose custom Prometheus metrics #171

Closed

ulucinar merged commit 5377e5d into crossplane:main Mar 8, 2023

ulucinar deleted the fix-167 branch March 8, 2023 10:32

ulucinar mentioned this pull request Mar 8, 2023

Consume upjet with custom metrics crossplane-contrib/provider-upjet-aws#597

Merged

1 task

This was referenced Mar 8, 2023

Consume upjet with custom metrics crossplane-contrib/provider-upjet-azure#406

Merged

Consume upjet with custom metrics crossplane-contrib/provider-upjet-gcp#250

Merged

Consume upjet with custom metrics crossplane-contrib/provider-upjet-azuread#30

Merged

ulucinar mentioned this pull request Mar 31, 2023

Add an introductory Upjet-based provider monitoring guide #181

Merged

3 tasks
