Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: support empty groups again, e.g. Node #2435

Closed
wants to merge 1 commit into from

Conversation

robbat2
Copy link

@robbat2 robbat2 commented Jun 28, 2024

PR#1851 introduced accidental breakage for custom metrics on resources with an empty empty, e.g. Node.

Fix it, and add tests to catch the next breakage.

Reference: #1851
Reference: #2434
Signed-off-by: Robin H. Johnson rjohnson@coreweave.com

demo_timestamp{customresource_group="",customresource_kind="Node",customresource_version="v1",node="kind-control-plane"} 1.719618433e+09

What this PR does / why we need it:
Fixes breakage introduced in v2.9.0

How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)
Does not change cardinality.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2434

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 28, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 28, 2024
@k8s-ci-robot
Copy link
Contributor

Welcome @robbat2!

It looks like this is your first PR to kubernetes/kube-state-metrics 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kube-state-metrics has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: robbat2
Once this PR has been reviewed and has the lgtm label, please assign mrueg for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 28, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 28, 2024
PR#1851 introduced accidental breakage for custom metrics on resources
with an empty empty, e.g. Node.

Fix it, and add tests to catch the next breakage.

```
demo_timestamp{customresource_group="",customresource_kind="Node",customresource_version="v1",node="kind-control-plane"} 1.719618433e+09
```

Reference: kubernetes#1851
Reference: kubernetes#2434
Signed-off-by: Robin H. Johnson <rjohnson@coreweave.com>
@dgrisonnet
Copy link
Member

All custom resources have an API group. Resources who don't are native Kubernetes resources for which kube-state-metrics already provides a safe set of metrics that is frequently updated to support new users requests. The ability to extend the metrics exposed by kube-state-metrics was never meant for natives resources and that is not something we wanted to support going forward.

Some additional context about that decision was added in #2044.

@rexagod
Copy link
Member

rexagod commented Jul 2, 2024

@daveoy
Copy link

daveoy commented Jul 2, 2024

Could you point me in the direction of docs showing functionality such as this, then?

          - groupVersionKind:
              kind: Node
              version: v1
            labelsFromPath:
              node:
              - metadata
              - name
            metricNamePrefix: ""
            metrics:
            - name: last_timestamp_of_thing_that_happened
              help: last timestamp thing
              each:
                gauge:
                  path:
                  - metadata
                  - annotations
                  valueFrom:
                  - timestamp-value-annotation
                type: Gauge

this would twist the dimensions of a label from a node resource to a timestamp gauge value in a new metric which is extremely useful, in addition to it being both possible AND encouraged for custom resources. seems like a bit of a shame to not allow users to do this sort of thing with core/builtin resources?

@rexagod
Copy link
Member

rexagod commented Jul 2, 2024

Native resources are supported in-tree only, by design, and thus cannot leak into the CRS (Custom Resource State) machinery. If you believe something needs to be added for them, please open a PR, just as the docs say in the aforementioned link. I'd be happy to take a look.

@daveoy
Copy link

daveoy commented Jul 2, 2024

just so im clear, this functionality isn't available for native resources somewhere else within ksm and we've just been trying to shoehorn the custom resource version of it into native resources?

is there any plan to replicate this functionality for native resources? or are we limited to the metrics and their dimensions as provided by this project for native resources?

again, it seems a bit of a waste of great available functionality "restricting" this sort of metrics manipulation to only custom resources

@robbat2
Copy link
Author

robbat2 commented Jul 2, 2024

As @daveoy notes, one of our use cases is exporting a timestamp value from a Node annotation. This is not possible via KSM since v2.9.0. It was previously available and worked in in v2.8.2 - from that perspective you broke existing functionality.

Another use case we have is splitting out a very large set of Node labels.

We have Nodes with 100+ labels, over multiple namespaces, and even with metric-labels-allowlist mechanism, this causes cardinality problems in Prometheus.

We want to export each of those namespaces of labels as a distinct metric - to reduce the complexity of kube_node_labels, as the metrics consumers don't need all of them, just their own namespaces.

@mrueg
Copy link
Member

mrueg commented Jul 2, 2024

As @daveoy notes, one of our use cases is exporting a timestamp value from a Node annotation. This is not possible via KSM since v2.9.0. It was previously available and worked in in v2.8.2 - from that perspective you broke existing functionality.

The custom resource metrics feature is still experimental (See:
https://github.com/kubernetes/kube-state-metrics/blob/main/docs/README.md#metrics-from-custom-resources ) so functionality might change over time as the feature evolves.

Please do not rely on an experimental feature in production systems.

@robbat2
Copy link
Author

robbat2 commented Jul 2, 2024

Here's example of what we're doing to export the labels in something beyond kube_node_labels

resources:
  - groupVersionKind:
      kind: Node
      version: v1
    labelsFromPath:
      node: [metadata, name]
    metricNamePrefix: ""
    metrics:
      - name: kube_node_ns1_info
        help: ns1 example labels
        each:
          type: Info
          info:
            path: [metadata]
            labelsFromPath:
              ns1_example_com_foo: [labels, "ns1.example.com/foo"]
              ns1_example_com_bar: [labels, "ns1.example.com/bar"]
              ns1_example_com_baz: [labels, "ns1.example.com/baz"]
      - name: kube_node_ns2_info
        help: ns2 example labels
        each:
          type: Info
          info:
            path: [metadata]
            labelsFromPath:
              ns2_example_com_a: [labels, "ns2.example.com/a"]
              ns2_example_com_b: [labels, "ns2.example.com/b"]
              ns2_example_com_c: [labels, "ns2.example.com/c"]

@robbat2
Copy link
Author

robbat2 commented Jul 4, 2024

@mrueg the text says See [Custom Resource State Metrics](https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/extend/customresourcestate-metrics.md) for experimental support for custom resources.

These are core resources, and therefore not included in that definition.

If there's a better way to export the metrics from our use cases, please do share, because right now the v2.8.2 functionality looks to be the most fitting solution.

To recap:

  • export parts of a core resource as a custom metric series: subsets of node & pod labels & annotations as metric labels in a distinct series, node annotation values as gauge values

What else does the PR need, in terms of supporting code and tests, to show this is valid & safe functionality? I did already include unit testing & E2E testing to show it works within the scope of the existing test suites.

@bh-tt
Copy link

bh-tt commented Jul 5, 2024

I opened the initial discussion asking how to get metrics for the core ResourceQuota resources. That resource specifically is designed to be extendable to all kinds of resource types (not just the standard cpu, memory and ephemeralStorage but for example GPU cores). In that discussion I also mentioned that supporting all those different options would likely be a significant maintenance burden on the project, aside from creating a variable number of metric names.

I understand the logic of not allowing the customresource setup to use core resources, but this usecase absolutely needs support. We frequently run into problems with resourcequotas blocking pod creation as our developers migrate more stuff to kubernetes. We would like to have monitoring alerting our IT department whenever resource usage goes > 80% of the quota, but this is currently not possible with KSM.

@daveoy
Copy link

daveoy commented Jul 10, 2024

Are there office hours we can join to discuss this? Would love to come to some sort of agreement on how to move forward with allowing custom metrics for core resources. @mrueg / @rexagod please let me know.

@mrueg
Copy link
Member

mrueg commented Jul 10, 2024

Are there office hours we can join to discuss this? Would love to come to some sort of agreement on how to move forward with allowing custom metrics for core resources. @mrueg / @rexagod please let me know.

There are SIG-Instrumentation office hours that you can join if you want to discuss this further. I am unable to attend those though as they conflict with my day job.

@mrueg
Copy link
Member

mrueg commented Jul 10, 2024

@mrueg the text says See [Custom Resource State Metrics](https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/extend/customresourcestate-metrics.md) for experimental support for custom resources.

These are core resources, and therefore not included in that definition.

Core Resources are not experimental, but the whole feature custom resource metrics feature is marked experimental.

      --custom-resource-state-config string        Inline Custom Resource State Metrics config YAML (experimental)
      --custom-resource-state-config-file string   Path to a Custom Resource State Metrics config file (experimental)
      --custom-resource-state-only                 Only provide Custom Resource State metrics (experimental)

If there's a better way to export the metrics from our use cases, please do share, because right now the v2.8.2 functionality looks to be the most fitting solution.

As Kubernetes API exposes a json endpoint, an implementation with https://github.com/prometheus-community/json_exporter might be possible.

@daveoy
Copy link

daveoy commented Jul 10, 2024

@dgrisonnet
Copy link
Member

Based on the discussion we had on slack, are we ok with closing this?

@logicalhan
Copy link
Member

/close

@k8s-ci-robot
Copy link
Contributor

@logicalhan: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

custom metrics for resources w/ group="" broken, e.g. Node
8 participants