Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add configsync_sync_generation metric resource tag #763

Merged
merged 1 commit into from
Jul 27, 2023

Conversation

karlkfi
Copy link
Contributor

@karlkfi karlkfi commented Jul 25, 2023

  • Add configsync.gke.io/sync-generation reconciler deployment/pod label, populated by the reconciler-manager
  • Add CONFIGSYNC_SYNC_GENERATION env var on the otel-agent container, populated by the k8s downward API from the pod label
  • Add configsync.sync.generation metric resource attribute on the otel-agent config, populated from the env var
  • Add attribute filter to delete the configsync.sync.generation label in the otel-collector config, when sending to Monarch
  • Change the e2e tests to use the sync metric labels, instead of the pod name. This allows the metrics validation to tolerate pod replacement due to rescheduling, oomkill, or autoscaling

This PR blocks VPA: #691

Motivation

The e2e tests have been using pod name + source commit to filter metrics queries. But pod name was really just a proxy for RSync generation, so that we could match expected metrics with the most recent RSync spec and Git/OCI/Helm source changes. However, it turns out using pod name is still occasionally flaky, because the pod might be replaced or rescheduled even when the RSync spec didn't change (e.g. node replacement, rescheduling, oomkill, or autoscaling). Using RSync generation instead of pod name makes the e2e tests less flaky and makes it possible to add VPA and memory limits, which would otherwise cause most e2e tests to fail metrics validation.

Note: configsync.sync.generation is a metric resource attribute, NOT a normal metric label. So it will not be sent to either Monarch or Cloud Monitoring, because those providers do not support custom resource attributes without hard-coded support and a custom resource type registered with Google. However, custom resources attributes ARE being sent to Prometheus, because of the resource_to_telemetry_conversion: { enabled: true } setting, which converts them to metric labels. So configsync.sync.generation becomes configsync_sync_generation in Prometheus metrics. This is how the e2e tests use them for query filtering.

@karlkfi
Copy link
Contributor Author

karlkfi commented Jul 25, 2023

We could aggregate generation to hide it from GCM, but I'm waiting for #748 to implement that for type first. I think we can merge this and add that aggregation later.

pkg/metrics/otel.go Show resolved Hide resolved
pkg/metrics/tagkeys.go Show resolved Hide resolved
@karlkfi karlkfi force-pushed the karl-gen-label branch 2 times, most recently from 7b0d50a to 58d8e34 Compare July 27, 2023 02:54
- Add configsync.gke.io/sync-generation reconciler deployment/pod label,
  populated by the reconciler-manager
- Add CONFIGSYNC_SYNC_GENERATION env var on the otel-agent container,
  populated by the k8s downward API from the pod label
- Add configsync.sync.generation metric resource label on the otel-agent
  config, populated from the env var. This resource label is then
  converted to a metric label for Prometheus, but not Monarch or
  Cloud Monitoring, which ignore custom resource labels.
- Change the e2e tests to use the sync metric labels, instead of the
  pod name. This allows the metrics validation to tolerate pod
  replacement due to rescheduling, oomkill, or autoscaling
Copy link
Contributor

@tiffanny29631 tiffanny29631 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tiffanny29631

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit b7ac44e into GoogleContainerTools:main Jul 27, 2023
6 checks passed
@karlkfi karlkfi deleted the karl-gen-label branch July 27, 2023 21:24
@janetkuo
Copy link
Contributor

@karlkfi would you update https://cloud.google.com/anthos-config-management/docs/reference/labels-and-annotations to include configsync.gke.io/sync-generation?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants