
Duplicate kube_horizontalpodautoscaler_spec_target_metric #2403

Open
dmitriishaburov opened this issue May 28, 2024 · 6 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@dmitriishaburov

What happened:

Duplicate kube_horizontalpodautoscaler_spec_target_metric series are causing issues with Prometheus 2.52.0+.
Due to the new duplicate-sample detection mechanism, I'm seeing the following errors in Prometheus:

ts=2024-05-28T07:15:26.730Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/prometheus/kube-prometheus-stack-kube-state-metrics/0 target=http://172.17.60.213:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=12

After checking kube-state-metrics, I found that the metrics for the HPA are duplicated:

cat metrics | grep 'kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",metric_name="cpu",metric_target_type="utilization"}'
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",metric_name="cpu",metric_target_type="utilization"} 100
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",metric_name="cpu",metric_target_type="utilization"} 100
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",metric_name="cpu",metric_target_type="utilization"} 100

The HPA itself has 3 separate entries for CPU, which differ only in the container field:

k get hpa dummy -oyaml
  metrics:
  - containerResource:
      container: dummy-0
      name: cpu
      target:
        averageUtilization: 100
        type: Utilization
    type: ContainerResource
  - containerResource:
      container: dummy-1
      name: cpu
      target:
        averageUtilization: 100
        type: Utilization
    type: ContainerResource
  - containerResource:
      container: dummy-2
      name: cpu
      target:
        averageUtilization: 100
        type: Utilization
    type: ContainerResource

What you expected to happen:
kube-state-metrics should not produce duplicate metrics, probably by adding a container label (?)
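For illustration only (a container label does not exist on this metric today; this is a hypothetical shape, not the project's decision), the three samples above would become distinct series:

```
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",container="dummy-0",metric_name="cpu",metric_target_type="utilization"} 100
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",container="dummy-1",metric_name="cpu",metric_target_type="utilization"} 100
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",container="dummy-2",metric_name="cpu",metric_target_type="utilization"} 100
```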

How to reproduce it (as minimally and precisely as possible):

  • Create an HPA as posted above
  • Use Prometheus 2.52.0+ as scraper

Environment:

  • kube-state-metrics version: v2.12.0
  • Kubernetes version (use kubectl version): v1.28.9-eks-036c24b
  • Cloud provider or hardware configuration: EKS
@dmitriishaburov dmitriishaburov added the kind/bug Categorizes issue or PR as related to a bug. label May 28, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 28, 2024
sansalva commented Jun 7, 2024

The main issue is that when using multiple ContainerResource triggers (with the same resource name and target type), you don't get a distinct metric for each one; the same metric is generated with the same label values. e.g.:

metrics:
    - type: ContainerResource
      containerResource:
        name: cpu
        container: main
        target:
          type: Utilization
          averageUtilization: 40
    - type: ContainerResource
      containerResource:
        name: cpu
        container: istio-proxy
        target:
          type: Utilization
          averageUtilization: 90

you get all the target_metric series duplicated, making it impossible to identify which one applies to which container. Also, when Prometheus scrapes these, it only keeps one of each, losing information.

kube_horizontalpodautoscaler_spec_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="utilization"} 40
kube_horizontalpodautoscaler_spec_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="utilization"} 90
kube_horizontalpodautoscaler_status_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="average"} 1.442
kube_horizontalpodautoscaler_status_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="utilization"} 24
kube_horizontalpodautoscaler_status_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="average"} 0.836
kube_horizontalpodautoscaler_status_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="utilization"} 69

As @dmitriishaburov suggested, one option would be to add a container label. However, we need to consider what value it would have for type=Resource, and any placeholder value we choose could collide with a real container name. We could also add a type label (with values resource and container_resource) to resolve that ambiguity.

Another option would be to have separate metrics for type=Resource and type=ContainerResource. The problem I see with this approach is that it splits information about the same concept into two different metrics, and it also makes building charts more complicated, because you have to duplicate the queries to cover both types in every scenario.
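To make the collision concrete, here is a toy Python sketch (illustrative only, not kube-state-metrics code; label and trigger names are taken from the examples above) of why Prometheus treats samples with identical label sets as one series, and how a hypothetical container label would keep them distinct:

```python
# A time series is identified by its label set: samples whose labels are
# identical collapse into one series, which is exactly the duplication
# Prometheus 2.52+ now rejects at scrape time.

def series_key(labels: dict) -> tuple:
    """A series is keyed by its sorted (label, value) pairs."""
    return tuple(sorted(labels.items()))

# The three ContainerResource triggers from the HPA spec in this issue.
hpa_triggers = [
    {"container": "dummy-0", "name": "cpu", "target": 100},
    {"container": "dummy-1", "name": "cpu", "target": 100},
    {"container": "dummy-2", "name": "cpu", "target": 100},
]

def collect(triggers, with_container_label: bool) -> dict:
    """Build the series a scraper would keep for these triggers."""
    series = {}
    for t in triggers:
        labels = {
            "namespace": "dummy",
            "horizontalpodautoscaler": "dummy",
            "metric_name": t["name"],
            "metric_target_type": "utilization",
        }
        if with_container_label:
            labels["container"] = t["container"]  # hypothetical new label
        series[series_key(labels)] = t["target"]  # identical keys collapse
    return series

print(len(collect(hpa_triggers, with_container_label=False)))  # prints 1
print(len(collect(hpa_triggers, with_container_label=True)))   # prints 3
```

Without the extra label, the three exposed samples share one identity and two of them are dropped; with it, each trigger maps to its own series.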

@dgrisonnet

/assign
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 8, 2024

dgrisonnet commented Aug 8, 2024

Looking at the HPA API, the problem should exist for both ContainerResource and Object metrics, because in both scenarios we don't differentiate based on the object that is targeted by the rule.
For ContainerResource we would need to separate by container: https://github.com/kubernetes/kubernetes/blob/v1.30.3/pkg/apis/autoscaling/types.go#L303, and for Object metrics we would need to separate by the object's fully qualified name: https://github.com/kubernetes/kubernetes/blob/v1.30.3/pkg/apis/autoscaling/types.go#L262.

As you rightfully mentioned @sansalva, there are two options: either we keep the same metrics and add new labels, or we split the metric into new ones for each metric type. For the first one, the fact that some metric types don't reference objects is fine, since we don't have to expose those labels for them. But it would still add some complexity to the metric.

I like the second option: since the metric is still experimental, we can make changes to it. It will surely break some users, but we didn't provide guarantees for this metric, so as long as we make a smooth transition it should be fine.
The benefit of this approach is that each metric will be specialized for the metric type it targets, so its labels will make sense for that type, and we won't run into scenarios where some labels are unused by some types.

it splits the information about the same concept into 2 different metrics

Although I can see your point about sharing a mutual base concept, the metric types are essentially different APIs at this point. Each has its own way of referencing its target resource. Pods, Custom, and External are still very similar APIs, but the newer Object and ContainerResource are completely different. Because of that, I think it is reasonable to split the metric into more dedicated ones.

making building charts more complicated because you will have to duplicate the queries to consider both types in every scenario

Not necessarily; you could use a PromQL query to combine the different metrics.
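As a sketch, assuming the split produced two metrics (the names below are hypothetical, nothing has been decided), a single chart query could union both series sets with PromQL's `or` operator:

```promql
# Hypothetical post-split metric names; `or` keeps series from either side
kube_horizontalpodautoscaler_spec_target_metric
  or
kube_horizontalpodautoscaler_spec_target_container_metric
```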

@dgrisonnet

@dmitriishaburov @sansalva if any of you want to try to take a stab at this and fix the issue, feel free to assign it to yourself by typing /assign. If you need any help along the way feel free to reach out. I will put this issue help wanted for now.

/help
/unassign

@k8s-ci-robot

@dgrisonnet:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

@dmitriishaburov @sansalva if any of you want to try to take a stab at this and fix the issue, feel free to assign it to yourself by typing /assign. If you need any help along the way feel free to reach out. I will put this issue help wanted for now.

/help
/unassign

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Aug 8, 2024
@hsunwenfang

As you rightfully mentioned @sansalva there are two options, either we keep the same metrics and add new labels or split the metric into new ones for each metric type. […] I like the second option since the metric is still experimental so we can make changes to it. […]

Hi,

For the second option, does that mean that instead of referencing the same
metricName and metricTarget (as Pods, Custom, and External do),
the newer Object and ContainerResource types should use additional variables like
containerName = string(m.ContainerResource.Container)
to separate the metrics?

thx!
