
Duplicate kube_horizontalpodautoscaler_spec_target_metric #2403

Open
dmitriishaburov opened this issue May 28, 2024 · 6 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@dmitriishaburov

What happened:

Duplicate kube_horizontalpodautoscaler_spec_target_metric series are causing issues with Prometheus 2.52.0+.
Due to the new duplicate-sample detection mechanism, I'm seeing the following errors in Prometheus:

ts=2024-05-28T07:15:26.730Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/prometheus/kube-prometheus-stack-kube-state-metrics/0 target=http://172.17.60.213:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=12

After checking kube-state-metrics, I found that the metrics for the HPA are duplicated:

cat metrics | grep 'kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",metric_name="cpu",metric_target_type="utilization"}'
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",metric_name="cpu",metric_target_type="utilization"} 100
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",metric_name="cpu",metric_target_type="utilization"} 100
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",metric_name="cpu",metric_target_type="utilization"} 100

The HPA itself has 3 separate entries for CPU, which differ only in the container field:

k get hpa dummy -oyaml
  metrics:
  - containerResource:
      container: dummy-0
      name: cpu
      target:
        averageUtilization: 100
        type: Utilization
    type: ContainerResource
  - containerResource:
      container: dummy-1
      name: cpu
      target:
        averageUtilization: 100
        type: Utilization
    type: ContainerResource
  - containerResource:
      container: dummy-2
      name: cpu
      target:
        averageUtilization: 100
        type: Utilization
    type: ContainerResource

What you expected to happen:
kube-state-metrics should not produce duplicate metrics, probably by adding a container label (?)
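For illustration only (a container label does not exist on this metric today; this is a hypothetical shape, not the project's decision), the three samples above would become distinct series:

```
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",container="dummy-0",metric_name="cpu",metric_target_type="utilization"} 100
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",container="dummy-1",metric_name="cpu",metric_target_type="utilization"} 100
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",container="dummy-2",metric_name="cpu",metric_target_type="utilization"} 100
```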

How to reproduce it (as minimally and precisely as possible):

  • Create an HPA as posted above
  • Use Prometheus 2.52.0+ as scraper

Environment:

  • kube-state-metrics version: v2.12.0
  • Kubernetes version (use kubectl version): v1.28.9-eks-036c24b
  • Cloud provider or hardware configuration: EKS
@dmitriishaburov dmitriishaburov added the kind/bug Categorizes issue or PR as related to a bug. label May 28, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 28, 2024
sansalva commented Jun 7, 2024

The main issue is that when using multiple ContainerResource triggers (with the same resource name and target type), you don't get a distinct metric for each one; the same metric is generated with the same label values. e.g.:

metrics:
    - type: ContainerResource
      containerResource:
        name: cpu
        container: main
        target:
          type: Utilization
          averageUtilization: 40
    - type: ContainerResource
      containerResource:
        name: cpu
        container: istio-proxy
        target:
          type: Utilization
          averageUtilization: 90

you get all the target_metric series duplicated, making it impossible to identify which one applies to which container. Also, when Prometheus scrapes these, it only keeps one of each, losing information.

kube_horizontalpodautoscaler_spec_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="utilization"} 40
kube_horizontalpodautoscaler_spec_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="utilization"} 90
kube_horizontalpodautoscaler_status_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="average"} 1.442
kube_horizontalpodautoscaler_status_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="utilization"} 24
kube_horizontalpodautoscaler_status_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="average"} 0.836
kube_horizontalpodautoscaler_status_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="utilization"} 69

As @dmitriishaburov suggested, one option would be to add a container label. However, we need to consider what value it would have for type=Resource, and any placeholder value we choose could collide with a real container name. We could also add a type label (with values resource and container_resource) to resolve that ambiguity.

Another option would be to have separate metrics for type=Resource and type=ContainerResource. The problem I see with this approach is that it splits information about the same concept into two different metrics, and it also makes building charts more complicated, because you have to duplicate the queries to cover both types in every scenario.
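To make the collision concrete, here is a toy Python sketch (illustrative only, not kube-state-metrics code; label and trigger names are taken from the examples above) of why Prometheus treats samples with identical label sets as one series, and how a hypothetical container label would keep them distinct:

```python
# A time series is identified by its label set: samples whose labels are
# identical collapse into one series, which is exactly the duplication
# Prometheus 2.52+ now rejects at scrape time.

def series_key(labels: dict) -> tuple:
    """A series is keyed by its sorted (label, value) pairs."""
    return tuple(sorted(labels.items()))

# The three ContainerResource triggers from the HPA spec in this issue.
hpa_triggers = [
    {"container": "dummy-0", "name": "cpu", "target": 100},
    {"container": "dummy-1", "name": "cpu", "target": 100},
    {"container": "dummy-2", "name": "cpu", "target": 100},
]

def collect(triggers, with_container_label: bool) -> dict:
    """Build the series a scraper would keep for these triggers."""
    series = {}
    for t in triggers:
        labels = {
            "namespace": "dummy",
            "horizontalpodautoscaler": "dummy",
            "metric_name": t["name"],
            "metric_target_type": "utilization",
        }
        if with_container_label:
            labels["container"] = t["container"]  # hypothetical new label
        series[series_key(labels)] = t["target"]  # identical keys collapse
    return series

print(len(collect(hpa_triggers, with_container_label=False)))  # prints 1
print(len(collect(hpa_triggers, with_container_label=True)))   # prints 3
```

Without the extra label, the three exposed samples share one identity and two of them are dropped; with it, each trigger maps to its own series.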

@dgrisonnet

/assign
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 8, 2024

dgrisonnet commented Aug 8, 2024

Looking at the HPA API, the problem should exist for both ContainerResource and Object metrics, because in both scenarios we don't differentiate based on the object that is targeted by the rule.
For ContainerResource we would need to separate by container: https://github.com/kubernetes/kubernetes/blob/v1.30.3/pkg/apis/autoscaling/types.go#L303, and for Object metrics we would need to separate by the object's fully qualified name: https://github.com/kubernetes/kubernetes/blob/v1.30.3/pkg/apis/autoscaling/types.go#L262.

As you rightfully mentioned @sansalva, there are two options: either we keep the same metrics and add new labels, or we split the metric into new ones for each metric type. For the first one, the fact that some metric types don't reference objects is fine, since we don't have to expose those labels for them. But it would still add some complexity to the metric.

I like the second option: since the metric is still experimental, we can make changes to it. It will surely break some users, but we didn't provide guarantees for this metric, so as long as we make a smooth transition it should be fine.
The benefit of this approach is that each metric will be specialized for the metric type it targets, so its labels will make sense for that type, and we won't run into scenarios where some labels are unused by some types.

it splits the information about the same concept into 2 different metrics

Although I can see your point about sharing a mutual base concept, the metric types are essentially different APIs at this point. Each has its own way of referencing its target resource. Pods, Custom, and External are still very similar APIs, but the newer Object and ContainerResource are completely different. Because of that, I think it is reasonable to split the metric into more dedicated ones.

making building charts more complicated because you will have to duplicate the queries to consider both types in every scenario

Not necessarily; you could use a PromQL query to combine the different metrics.
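As a sketch, assuming the split produced two metrics (the names below are hypothetical, nothing has been decided), a single chart query could union both series sets with PromQL's `or` operator:

```promql
# Hypothetical post-split metric names; `or` keeps series from either side
kube_horizontalpodautoscaler_spec_target_metric
  or
kube_horizontalpodautoscaler_spec_target_container_metric
```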

@dgrisonnet

@dmitriishaburov @sansalva if any of you want to try to take a stab at this and fix the issue, feel free to assign it to yourself by typing /assign. If you need any help along the way feel free to reach out. I will put this issue help wanted for now.

/help
/unassign

@k8s-ci-robot

@dgrisonnet:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

@dmitriishaburov @sansalva if any of you want to try to take a stab at this and fix the issue, feel free to assign it to yourself by typing /assign. If you need any help along the way feel free to reach out. I will put this issue help wanted for now.

/help
/unassign

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Aug 8, 2024
@hsunwenfang

As you rightfully mentioned @sansalva there are two options, either we keep the same metrics and add new labels or split the metric into new ones for each metric type. […] I like the second option since the metric is still experimental so we can make changes to it. […]

Hi,

For the second option, does that mean that instead of referencing the same
metricName and metricTarget (as Pods, Custom, and External do),
the newer Object and ContainerResource types should use additional variables like
containerName = string(m.ContainerResource.Container)
to separate the metrics?

thx!
