Check stable metrics aren't changed #1836

Closed

Conversation

@CatherineF-dev (Contributor)

Check stable metrics are stable

What this PR does / why we need it: Guarantee that metric names and labels aren't changed for stable metrics.

How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)
does not change cardinality

Which issue(s) this PR fixes:
Fixes #1833

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 17, 2022
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 17, 2022
@CatherineF-dev (Contributor, Author)

cc @dgrisonnet

@dgrisonnet (Member) left a comment:

I generally like the idea behind this PR, but it is quite challenging to bring to kube-state-metrics since the metrics are generated at runtime, so you might not get all the labels by default.

One solution could be to pass a runtime object to the generator that satisfies all the conditions needed to generate metrics with all the labels, but that means we would have to make sure the object is updated every time a new condition is added, so it isn't very safe.

Another solution, which would be safer, is to statically visit the generators to build a metrics object. From the generator itself we could get all the metadata of the metrics, and we could get the full list of labels by walking through the generator function and finding assignments to LabelKeys. That would be a sizable amount of work, but it could be reused both to generate and to test the stable metrics, so that would be my suggestion (maybe also to generate some docs in the future).
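For illustration, a minimal sketch of that kind of static visit using Go's go/parser and go/ast packages; the file path and the assumption that label keys appear as string literals inside a LabelKeys composite literal are illustrative, not how every KSM store is actually written:

package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

func main() {
	fset := token.NewFileSet()
	// Illustrative path; each KSM resource has a store file like this.
	f, err := parser.ParseFile(fset, "internal/store/pod.go", nil, 0)
	if err != nil {
		panic(err)
	}
	ast.Inspect(f, func(n ast.Node) bool {
		// Look for `LabelKeys: []string{...}` fields in composite literals.
		kv, ok := n.(*ast.KeyValueExpr)
		if !ok {
			return true
		}
		if key, ok := kv.Key.(*ast.Ident); !ok || key.Name != "LabelKeys" {
			return true
		}
		if lit, ok := kv.Value.(*ast.CompositeLit); ok {
			for _, elt := range lit.Elts {
				if s, ok := elt.(*ast.BasicLit); ok && s.Kind == token.STRING {
					fmt.Println("label key:", s.Value)
				}
			}
		}
		return true
	})
}

A real tool would also need to resolve cases where the value assigned to LabelKeys is a variable rather than a literal, which is the hard part discussed later in this thread.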

--

With regards to the actual content of the PR, would you mind first opening a PR to introduce the stability framework and the help text annotation? That would make it easier to review. As I commented in the PR, I am generally more in favor of using the Kubernetes framework compared to the very basic one we have in KSM, so feel free to go with the component-base one.

@@ -0,0 +1,108 @@
/*
Copyright 2016 The Kubernetes Authors All rights reserved.
Member:

update to 2022

const (
	// UNKNOWN is the stability level for metrics that haven't been migrated to
	// NewFamilyGeneratorWithStability; their actual level can be found in the documentation.
	UNKNOWN StabilityLevel = "UNKNOWN"
Member:

Instead of adding an unknown stability level, I would be in favor of defaulting to experimental, in a similar way to what we are doing in component-base.
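A rough sketch of that defaulting, mirroring the component-base approach; the EXPERIMENTAL constant and the setDefaults helper are hypothetical names, not existing KSM API:

// Hypothetical sketch: default an unset stability level to experimental.
const EXPERIMENTAL StabilityLevel = "EXPERIMENTAL"

// setDefaults treats an unset stability level as experimental rather than UNKNOWN.
func (sl *StabilityLevel) setDefaults() {
	if *sl == "" {
		*sl = EXPERIMENTAL
	}
}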

// NewFamilyGenerator creates new FamilyGenerator instances.
func NewFamilyGenerator(name string, help string, metricType metric.Type, deprecatedVersion string, generateFunc func(obj interface{}) *metric.Family) *FamilyGenerator {
// StabilityLevel represents the API guarantees for a given defined metric.
type StabilityLevel string
Member:

To align with Kubernetes metrics stability framework we could directly reuse the upstream stability type: https://github.com/kubernetes/kubernetes/blob/ea0764452222146c47ec826977f49d7001b0ea8c/staging/src/k8s.io/component-base/metrics/opts.go#L64-L74
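If KSM went that route, the local type could become an alias of the upstream one, roughly like this (a sketch, assuming an import of k8s.io/component-base/metrics is acceptable here; the package name is illustrative):

package metric

import basemetrics "k8s.io/component-base/metrics"

// StabilityLevel reuses the upstream component-base type instead of redefining it.
type StabilityLevel = basemetrics.StabilityLevel

// Re-export the upstream constants this package needs.
const (
	ALPHA  = basemetrics.ALPHA
	STABLE = basemetrics.STABLE
)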

@CatherineF-dev (Contributor, Author)

Sure! Thanks!

Agree on visiting the generators to build a metrics object.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2022
@CatherineF-dev (Contributor, Author)

CatherineF-dev commented Oct 15, 2022

It's not easy to do static analysis on KSM metrics. But e2e tests expose most stable metrics: 152/172 are exposed, as shown in #1853.

So the plan is to use e2e tests to validate stable metrics (suggested by han). We will also support SKIP_METRICS_LIST, since future stable metrics might only be exposed when certain conditions occur. For metrics that aren't exposed in e2e tests, we can use unit tests (see the sketch after the list below).

These metrics aren't exposed in e2e tests:

  • kube_cronjob_spec_starting_deadline_seconds
  • kube_cronjob_status_last_schedule_time
  • kube_ingress_tls
  • kube_job_complete
  • kube_job_failed
  • kube_job_spec_active_deadline_seconds
  • kube_job_status_completion_time
  • kube_node_spec_taint
  • kube_persistentvolume_claim_ref
  • kube_pod_completion_time
  • kube_pod_init_container_info
  • kube_pod_init_container_status_ready
  • kube_pod_init_container_status_restarts_total
  • kube_pod_init_container_status_running
  • kube_pod_init_container_status_terminated
  • kube_pod_init_container_status_waiting
  • kube_pod_spec_volumes_persistentvolumeclaims_info
  • kube_pod_spec_volumes_persistentvolumeclaims_readonly
  • kube_pod_status_unschedulable
  • kube_service_spec_external_ip
  • kube_service_status_load_balancer_ingress
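A minimal sketch of the comparison this plan implies; the function and variable names are hypothetical, not the final test code:

package stable

import "fmt"

// compareStableMetrics fails if an expected stable metric is missing from the
// e2e scrape, unless it is listed in skip (those are covered by unit tests).
func compareStableMetrics(collected, expected []string, skip map[string]bool) error {
	exposed := make(map[string]bool, len(collected))
	for _, name := range collected {
		exposed[name] = true
	}
	for _, name := range expected {
		if skip[name] {
			continue
		}
		if !exposed[name] {
			return fmt.Errorf("stable metric %q is not exposed in e2e tests", name)
		}
	}
	return nil
}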

Add presubmit to check stable metrics are stable
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 19, 2022
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 19, 2022
@CatherineF-dev CatherineF-dev force-pushed the check-stable-metrics branch 8 times, most recently from 3b84d9b to f16ee89 Compare October 19, 2022 19:22
@CatherineF-dev
Copy link
Contributor Author

cc @rexagod @dgrisonnet @logicalhan — this PR, which adds a presubmit check for stable metrics, is ready for review.

t.Fatalf("Can't read stable metrics from file. err = %v", err)
}

err = compare(collectedStableMetrics, *expectedStableMetrics, skipStableMetrics)
Member:

Honestly, I would just sort the metrics, output a YAML string, and do a simple string comparison.

This is not only simpler, but when you generate the YAML string you'll end up with deterministic ordering, which makes the diff more pleasant.
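That could look roughly like this (a sketch; the Metric struct and the sigs.k8s.io/yaml dependency are assumptions, not the PR's actual types):

package stable

import (
	"sort"

	"sigs.k8s.io/yaml"
)

// Metric stands in for whatever struct the test collects per metric.
type Metric struct {
	Name   string   `json:"name"`
	Labels []string `json:"labels"`
}

// toCanonicalYAML sorts metrics and their labels before marshaling, so two
// equal sets always serialize to the same string and diffs are deterministic.
func toCanonicalYAML(metrics []Metric) (string, error) {
	for i := range metrics {
		sort.Strings(metrics[i].Labels)
	}
	sort.Slice(metrics, func(i, j int) bool { return metrics[i].Name < metrics[j].Name })
	out, err := yaml.Marshal(metrics)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

The test then reduces to comparing two strings, and any mismatch produces a readable, stable diff.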

(Resolved review threads on tests/e2e.sh and tests/stable/check_test.go)
@CatherineF-dev (Contributor, Author)

Ping~

(More resolved review threads on tests/stable/check_test.go)
@@ -0,0 +1,23 @@
These stable metrics aren't covered in stablemetrics.yaml, because they aren't
Member:

Can you add more documentation on how to run this stuff?

@CatherineF-dev CatherineF-dev force-pushed the check-stable-metrics branch 2 times, most recently from 01d1960 to db9c59a Compare November 1, 2022 01:03
@logicalhan (Member) left a comment:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 3, 2022
@CatherineF-dev (Contributor, Author)

CatherineF-dev commented Nov 3, 2022

/hold

Updating documentation.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 3, 2022
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 20, 2022
@k8s-ci-robot (Contributor)

New changes are detected. LGTM label has been removed.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: CatherineF-dev, logicalhan
Once this PR has been reviewed and has the lgtm label, please assign mrueg for approval by writing /assign @mrueg in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@CatherineF-dev (Contributor, Author)

Hi Damien, could you help review as well? Thanks! cc @dgrisonnet

@@ -0,0 +1,23 @@
These stable metrics aren't covered in stablemetrics.yaml, because they aren't
@dgrisonnet (Member) commented Jan 10, 2023:

This is a bit problematic since ideally we would want a system that verifies that none of the stable metrics change.

Another problem we might encounter is that some labels will not be verified because they are not present in the e2e tests. I don't have any example of stable metrics like that in mind, but we do for experimental ones: https://github.com/kubernetes/kube-state-metrics/blob/main/internal/store/endpointslice.go#L77-L148.
If we were to verify that metric today in the e2e tests, we would have to make sure that the endpointslice object is complete, which is very error-prone.

I recall we discussed implementing a static analysis solution to make sure that we were going over all the metrics and labels; was that not possible in the end?

@CatherineF-dev (Contributor, Author) commented Jan 11, 2023:

a static analysis solution

It's hard to use static analysis on the current kube-state-metrics repo, because labels are too dynamic.
For example:

  1. It uses labelKeys = append(labelKeys, "ready") in kube_endpointslice_endpoints.
  2. Labels come from another function.

One possible way to make static analysis easier is to require defining all possible labels first.

var labels = [...]string{"label1", "label2" /* ... */} // a fixed-size array rather than a slice, so it can't be appended to later

Member:

  1. There are two things that are difficult to handle in that example:
    1.1 Having an intermediate labelKeys variable means that we can't simply have the tool look for the labelKeys field assignment, but might need to also build a constant version of the variable assigned to the field. An idea could be to have the tool first look for the labelKeys field assignment and, in case the RValue is a variable and not a constant, iterate over the code again to build a const version of that variable.
    1.2 append should be the only CallExpr that we will ever see, so having a case in the tool just for it would make sense to me. The idea would be to replace append(labelKeys, "ready") by building a temporary slice and putting the "ready" string inside of it, to convert the variable slice into a constant one for our static analysis.
  2. We won't have to handle this scenario because it only affects _labels and _annotations metrics, which have variable labels by design and as such shouldn't have any stability guarantees.

One possible way to make static analysis easier is to require defining all possible labels first.

This could make sense, but I am afraid it might make label value assignment error-prone, since we would have to use indexes instead of just appending the new labels.

The only option I see today to catch all the labels of all the metrics is via static analysis, and if we don't want to make it intrusive for developers (to avoid errors), we will need to make the tool intelligent.

For the static analysis I am thinking that the following could be viable:

  1. Create an AST of each store and iterate over every generator's code
  2. Find the labelKeys field assignment
  3. If the RValue isn't constant, reiterate over the generators to build a constant version of the variable. That should be possible because even though the slices are dynamic, their values are all constant (see the sketch below).
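For step 3, a sketch of how the tool might fold append(labelKeys, "ready") calls back into a constant slice; collectAppendedKeys is a hypothetical helper operating on an already-parsed function body:

package main

import (
	"go/ast"
	"go/token"
	"strings"
)

// collectAppendedKeys gathers string literals passed to append(<varName>, ...)
// so the dynamic slice can be treated as a constant set of label keys.
func collectAppendedKeys(body *ast.BlockStmt, varName string) []string {
	var keys []string
	ast.Inspect(body, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		fn, ok := call.Fun.(*ast.Ident)
		if !ok || fn.Name != "append" || len(call.Args) < 2 {
			return true
		}
		if target, ok := call.Args[0].(*ast.Ident); !ok || target.Name != varName {
			return true
		}
		for _, arg := range call.Args[1:] {
			if lit, ok := arg.(*ast.BasicLit); ok && lit.Kind == token.STRING {
				keys = append(keys, strings.Trim(lit.Value, `"`))
			}
		}
		return true
	})
	return keys
}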

Writing this, I noticed another scenario we would need to handle, the default labels:

descEndpointSliceLabelsDefaultLabels = []string{"endpointslice"}

These should always be defined at the top level of each store as a constant slice, so we should be able to easily add them to each metric.

This is becoming quite complex, so if you have any other suggestions I would gladly hear them, but I don't think we can avoid static analysis if we want to build consistent guarantees.

@CatherineF-dev (Contributor, Author):

I don't think that we can avoid static analysis if we want to build consistent guarantees.

Agree

This could make sense but I am afraid that this might make label values assignation error-prone since we would have to use the indexes instead of just appending the new labels.

Could you provide an example for this case?

Member:

Could you provide an example for this case?

I think my reasoning was based on examples such as https://github.com/kubernetes/kube-state-metrics/blob/main/internal/store/endpointslice.go#L77-L148. In these cases order is super important since we build the list of label values one label at a time. I thought that if we were to define label keys upfront, future addition of labels might mess with existing ordering.

But thinking about it again we could transform the code into something like:

var labelKeys = []string{"ready", "serving" /* ... */} // effectively constant: never appended to after this point

// Conditions.Ready and Conditions.Serving are *bool, hence the FormatBool conversion.
ready := ""
if ep.Conditions.Ready != nil {
	ready = strconv.FormatBool(*ep.Conditions.Ready)
}

serving := ""
if ep.Conditions.Serving != nil {
	serving = strconv.FormatBool(*ep.Conditions.Serving)
}

m = append(m, &metric.Metric{
	LabelKeys:   labelKeys,
	LabelValues: []string{ready, serving},
	Value:       1,
})

The original reasoning for coding it the way shown in the example I shared above was to reduce the network footprint by removing empty labels. But it could be argued that this doesn't have much impact, and we could do that processing later on when we build the HTTP response.

So I think your idea of defining const label keys everywhere makes sense and would very much ease static analysis.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 10, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 9, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closed this PR.

In response to the /close comment above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
