Check stable metrics aren't changed #1836

Closed

Conversation

@CatherineF-dev (Contributor)

Check stable metrics are stable

What this PR does / why we need it: Guarantee that metric names and labels aren't changed for stable metrics.

How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)
does not change cardinality

Which issue(s) this PR fixes:
Fixes #1833

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 17, 2022
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 17, 2022
@CatherineF-dev (Contributor, Author)

cc @dgrisonnet

@dgrisonnet (Member) left a comment:

I generally like the idea behind this PR, but it is quite challenging to bring to kube-state-metrics since the metrics are generated at runtime, so you might not get all the labels by default.

One solution could be to pass a runtime object to the generator that satisfies all the conditions needed to generate metrics with all the labels, but that means we would have to make sure the object is updated every time a new condition is added, so it isn't very safe.

Another solution, which would be safer, is to statically visit the generators to build a metrics object. From the generator itself we could get all the metadata of the metrics, and we could get the full list of labels by walking through the generator function and finding assignments to LabelKeys. That would be a sizable amount of work, but it could be reused both to generate and to test the stable metrics, so that would be my suggestion (maybe also to generate some docs in the future).
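For illustration, a minimal sketch of that kind of static visit using Go's go/parser and go/ast packages; the file path and the assumption that label keys appear as string literals inside a LabelKeys composite literal are illustrative, not how every KSM store is actually written:

package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

func main() {
	fset := token.NewFileSet()
	// Illustrative path; each KSM resource has a store file like this.
	f, err := parser.ParseFile(fset, "internal/store/pod.go", nil, 0)
	if err != nil {
		panic(err)
	}
	ast.Inspect(f, func(n ast.Node) bool {
		// Look for `LabelKeys: []string{...}` fields in composite literals.
		kv, ok := n.(*ast.KeyValueExpr)
		if !ok {
			return true
		}
		if key, ok := kv.Key.(*ast.Ident); !ok || key.Name != "LabelKeys" {
			return true
		}
		if lit, ok := kv.Value.(*ast.CompositeLit); ok {
			for _, elt := range lit.Elts {
				if s, ok := elt.(*ast.BasicLit); ok && s.Kind == token.STRING {
					fmt.Println("label key:", s.Value)
				}
			}
		}
		return true
	})
}

A real tool would also need to resolve cases where the value assigned to LabelKeys is a variable rather than a literal, which is the hard part discussed later in this thread.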

--

With regards to the actual content of the PR, would you mind first opening a PR to introduce the stability framework and the help text annotation? That would make it easier to review. As I commented in the PR, I am generally more in favor of using the Kubernetes framework compared to the very basic one we have in KSM, so feel free to go with the component-base one.

@@ -0,0 +1,108 @@
/*
Copyright 2016 The Kubernetes Authors All rights reserved.
Member:

update to 2022

const (
	// UNKNOWN is the stability level for metrics that haven't been migrated to
	// NewFamilyGeneratorWithStability; their actual level can be found in the documentation.
	UNKNOWN StabilityLevel = "UNKNOWN"
Member:

Instead of adding an unknown stability level, I would be in favor of defaulting to experimental, in a similar way to what we are doing in component-base.
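A rough sketch of that defaulting, mirroring the component-base approach; the EXPERIMENTAL constant and the setDefaults helper are hypothetical names, not existing KSM API:

// Hypothetical sketch: default an unset stability level to experimental.
const EXPERIMENTAL StabilityLevel = "EXPERIMENTAL"

// setDefaults treats an unset stability level as experimental rather than UNKNOWN.
func (sl *StabilityLevel) setDefaults() {
	if *sl == "" {
		*sl = EXPERIMENTAL
	}
}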

// NewFamilyGenerator creates new FamilyGenerator instances.
func NewFamilyGenerator(name string, help string, metricType metric.Type, deprecatedVersion string, generateFunc func(obj interface{}) *metric.Family) *FamilyGenerator {
// StabilityLevel represents the API guarantees for a given defined metric.
type StabilityLevel string
Member:

To align with Kubernetes metrics stability framework we could directly reuse the upstream stability type: https://github.com/kubernetes/kubernetes/blob/ea0764452222146c47ec826977f49d7001b0ea8c/staging/src/k8s.io/component-base/metrics/opts.go#L64-L74
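If KSM went that route, the local type could become an alias of the upstream one, roughly like this (a sketch, assuming an import of k8s.io/component-base/metrics is acceptable here; the package name is illustrative):

package metric

import basemetrics "k8s.io/component-base/metrics"

// StabilityLevel reuses the upstream component-base type instead of redefining it.
type StabilityLevel = basemetrics.StabilityLevel

// Re-export the upstream constants this package needs.
const (
	ALPHA  = basemetrics.ALPHA
	STABLE = basemetrics.STABLE
)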

@CatherineF-dev (Contributor, Author)

Sure! Thanks!

Agree on visiting the generators to build a metrics object.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2022
@CatherineF-dev (Contributor, Author)

CatherineF-dev commented Oct 15, 2022

It's not easy to do static analysis on KSM metrics. But e2e tests expose most stable metrics: 152/172 are exposed, as shown in #1853.

So the plan is to use e2e tests to validate stable metrics (suggested by han). We will also support SKIP_METRICS_LIST, since future stable metrics might only be exposed when certain conditions occur. For metrics that aren't exposed in e2e tests, we can use unit tests (see the sketch after the list below).

These metrics aren't exposed in e2e tests:

  • kube_cronjob_spec_starting_deadline_seconds
  • kube_cronjob_status_last_schedule_time
  • kube_ingress_tls
  • kube_job_complete
  • kube_job_failed
  • kube_job_spec_active_deadline_seconds
  • kube_job_status_completion_time
  • kube_node_spec_taint
  • kube_persistentvolume_claim_ref
  • kube_pod_completion_time
  • kube_pod_init_container_info
  • kube_pod_init_container_status_ready
  • kube_pod_init_container_status_restarts_total
  • kube_pod_init_container_status_running
  • kube_pod_init_container_status_terminated
  • kube_pod_init_container_status_waiting
  • kube_pod_spec_volumes_persistentvolumeclaims_info
  • kube_pod_spec_volumes_persistentvolumeclaims_readonly
  • kube_pod_status_unschedulable
  • kube_service_spec_external_ip
  • kube_service_status_load_balancer_ingress
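A minimal sketch of the comparison this plan implies; the function and variable names are hypothetical, not the final test code:

package stable

import "fmt"

// compareStableMetrics fails if an expected stable metric is missing from the
// e2e scrape, unless it is listed in skip (those are covered by unit tests).
func compareStableMetrics(collected, expected []string, skip map[string]bool) error {
	exposed := make(map[string]bool, len(collected))
	for _, name := range collected {
		exposed[name] = true
	}
	for _, name := range expected {
		if skip[name] {
			continue
		}
		if !exposed[name] {
			return fmt.Errorf("stable metric %q is not exposed in e2e tests", name)
		}
	}
	return nil
}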

Add presubmit to check stable metrics are stable
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 19, 2022
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 19, 2022
@CatherineF-dev CatherineF-dev force-pushed the check-stable-metrics branch 8 times, most recently from 3b84d9b to f16ee89 Compare October 19, 2022 19:22
@CatherineF-dev
Copy link
Contributor Author

cc @rexagod @dgrisonnet @logicalhan — this PR, which adds a presubmit check for stable metrics, is ready for review.

t.Fatalf("Can't read stable metrics from file. err = %v", err)
}

err = compare(collectedStableMetrics, *expectedStableMetrics, skipStableMetrics)
Member:

Honestly, I would just sort the metrics, output a YAML string, and do a simple string comparison.

This is not only simpler, but when you generate the YAML string you'll end up with deterministic ordering, which makes the diff more pleasant.
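That could look roughly like this (a sketch; the Metric struct and the sigs.k8s.io/yaml dependency are assumptions, not the PR's actual types):

package stable

import (
	"sort"

	"sigs.k8s.io/yaml"
)

// Metric stands in for whatever struct the test collects per metric.
type Metric struct {
	Name   string   `json:"name"`
	Labels []string `json:"labels"`
}

// toCanonicalYAML sorts metrics and their labels before marshaling, so two
// equal sets always serialize to the same string and diffs are deterministic.
func toCanonicalYAML(metrics []Metric) (string, error) {
	for i := range metrics {
		sort.Strings(metrics[i].Labels)
	}
	sort.Slice(metrics, func(i, j int) bool { return metrics[i].Name < metrics[j].Name })
	out, err := yaml.Marshal(metrics)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

The test then reduces to comparing two strings, and any mismatch produces a readable, stable diff.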

(Resolved review threads on tests/e2e.sh and tests/stable/check_test.go)
@CatherineF-dev (Contributor, Author)

Ping~

(More resolved review threads on tests/stable/check_test.go)
@@ -0,0 +1,23 @@
These stable metrics aren't covered in stablemetrics.yaml, because they aren't
Member:

Can you add more documentation on how to run this stuff?

@CatherineF-dev CatherineF-dev force-pushed the check-stable-metrics branch 2 times, most recently from 01d1960 to db9c59a Compare November 1, 2022 01:03
@logicalhan (Member) left a comment:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 3, 2022
@CatherineF-dev (Contributor, Author)

CatherineF-dev commented Nov 3, 2022

/hold

Updating documentation.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 3, 2022
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 20, 2022
@k8s-ci-robot (Contributor)

New changes are detected. LGTM label has been removed.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: CatherineF-dev, logicalhan
Once this PR has been reviewed and has the lgtm label, please assign mrueg for approval by writing /assign @mrueg in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@CatherineF-dev (Contributor, Author)

Hi Damien, could you help review as well? Thanks! cc @dgrisonnet

@@ -0,0 +1,23 @@
These stable metrics aren't covered in stablemetrics.yaml, because they aren't
@dgrisonnet (Member) commented Jan 10, 2023:

This is a bit problematic since ideally we would want a system that verifies that none of the stable metrics change.

Another problem we might encounter is that some labels will not be verified because they are not present in the e2e tests. I don't have any example of stable metrics like that in mind, but we do for experimental ones: https://github.com/kubernetes/kube-state-metrics/blob/main/internal/store/endpointslice.go#L77-L148.
If we were to verify that metric today in the e2e tests, we would have to make sure that the endpointslice object is complete, which is very error-prone.

I recall we discussed implementing a static analysis solution to make sure that we were going over all the metrics and labels; was that not possible in the end?

@CatherineF-dev (Contributor, Author) commented Jan 11, 2023:

a static analysis solution

It's hard to use static analysis on the current kube-state-metrics repo, because labels are too dynamic.
For example:

  1. It uses labelKeys = append(labelKeys, "ready") in kube_endpointslice_endpoints.
  2. Labels come from another function.

One possible way to make static analysis easier is to require defining all possible labels first.

var labels = [...]string{"label1", "label2" /* ... */} // a fixed-size array rather than a slice, so it can't be appended to later

Member:

  1. There are two things that are difficult to handle in that example:
    1.1 Having an intermediate labelKeys variable means that we can't simply have the tool look for the labelKeys field assignment, but might need to also build a constant version of the variable assigned to the field. An idea could be to have the tool first look for the labelKeys field assignment and, in case the RValue is a variable and not a constant, iterate over the code again to build a const version of that variable.
    1.2 append should be the only CallExpr that we will ever see, so having a case in the tool just for it would make sense to me. The idea would be to replace append(labelKeys, "ready") by building a temporary slice and putting the "ready" string inside of it, to convert the variable slice into a constant one for our static analysis.
  2. We won't have to handle this scenario because it only affects _labels and _annotations metrics, which have variable labels by design and as such shouldn't have any stability guarantees.

One possible way to make static analysis easier is to require defining all possible labels first.

This could make sense, but I am afraid it might make label value assignment error-prone, since we would have to use indexes instead of just appending the new labels.

The only option I see today to catch all the labels of all the metrics is via static analysis, and if we don't want to make it intrusive for developers (to avoid errors), we will need to make the tool intelligent.

For the static analysis I am thinking that the following could be viable:

  1. Create an AST of each store and iterate over every generator's code
  2. Find the labelKeys field assignment
  3. If the RValue isn't constant, reiterate over the generators to build a constant version of the variable. That should be possible because even though the slices are dynamic, their values are all constant (see the sketch below).
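For step 3, a sketch of how the tool might fold append(labelKeys, "ready") calls back into a constant slice; collectAppendedKeys is a hypothetical helper operating on an already-parsed function body:

package main

import (
	"go/ast"
	"go/token"
	"strings"
)

// collectAppendedKeys gathers string literals passed to append(<varName>, ...)
// so the dynamic slice can be treated as a constant set of label keys.
func collectAppendedKeys(body *ast.BlockStmt, varName string) []string {
	var keys []string
	ast.Inspect(body, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		fn, ok := call.Fun.(*ast.Ident)
		if !ok || fn.Name != "append" || len(call.Args) < 2 {
			return true
		}
		if target, ok := call.Args[0].(*ast.Ident); !ok || target.Name != varName {
			return true
		}
		for _, arg := range call.Args[1:] {
			if lit, ok := arg.(*ast.BasicLit); ok && lit.Kind == token.STRING {
				keys = append(keys, strings.Trim(lit.Value, `"`))
			}
		}
		return true
	})
	return keys
}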

Writing this, I noticed another scenario we would need to handle, the default labels:

descEndpointSliceLabelsDefaultLabels = []string{"endpointslice"}

These should always be defined at the top level of each store as a constant slice, so we should be able to easily add them to each metric.

This is becoming quite complex, so if you have any other suggestions I would gladly hear them, but I don't think we can avoid static analysis if we want to build consistent guarantees.

@CatherineF-dev (Contributor, Author):

I don't think that we can avoid static analysis if we want to build consistent guarantees.

Agree

This could make sense but I am afraid that this might make label values assignation error-prone since we would have to use the indexes instead of just appending the new labels.

Could you provide an example for this case?

Member:

Could you provide an example for this case?

I think my reasoning was based on examples such as https://github.com/kubernetes/kube-state-metrics/blob/main/internal/store/endpointslice.go#L77-L148. In these cases order is super important since we build the list of label values one label at a time. I thought that if we were to define label keys upfront, future addition of labels might mess with existing ordering.

But thinking about it again we could transform the code into something like:

var labelKeys = []string{"ready", "serving" /* ... */} // effectively constant: never appended to after this point

// Conditions.Ready and Conditions.Serving are *bool, hence the FormatBool conversion.
ready := ""
if ep.Conditions.Ready != nil {
	ready = strconv.FormatBool(*ep.Conditions.Ready)
}

serving := ""
if ep.Conditions.Serving != nil {
	serving = strconv.FormatBool(*ep.Conditions.Serving)
}

m = append(m, &metric.Metric{
	LabelKeys:   labelKeys,
	LabelValues: []string{ready, serving},
	Value:       1,
})

The original reasoning for coding it the way shown in the example I shared above was to reduce the network footprint by removing empty labels. But it could be argued that this doesn't have much impact, and we could do that processing later on when we build the HTTP response.

So I think your idea of defining const label keys everywhere makes sense and would very much ease static analysis.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 10, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 9, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closed this PR.

In response to the /close comment above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
