Incorrect Metric Type for HPA Scaling #3286

Open
liaddrori1 opened this issue Aug 15, 2024 · 1 comment
Labels
bug Something isn't working

@liaddrori1
📚 The doc issue

In the kubernetes/autoscale.md file, the current implementation uses the ts_queue_latency_microseconds metric to drive the Horizontal Pod Autoscaler (HPA). This metric is a counter: it only ever increases and never decreases. An HPA comparing a cumulative counter against a fixed target will therefore keep scaling up as the counter grows and will never scale back down when the load decreases.
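
For illustration, compare the raw counter with its rate in Prometheus; a minimal sketch, assuming the metric is scraped as-is (the 5-minute window is just an example, not a value from the repo):

      # Raw counter: monotonically increasing, so any fixed HPA target
      # is eventually exceeded and stays exceeded.
      ts_queue_latency_microseconds

      # Per-second rate over a 5-minute window: rises and falls with load,
      # so an HPA target can be crossed in both directions.
      rate(ts_queue_latency_microseconds[5m])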

Suggest a potential alternative/fix

To resolve this issue, it is recommended to use the rate of the counter metric over a time interval to enable both scaling up and down effectively.

  1. Use the Rate Function:

    • Utilize the rate function in Prometheus to calculate the rate of change of the ts_queue_latency_microseconds metric. This provides a per-second average rate of increase over a specified time window (e.g., 5 minutes).
  2. Modify the Prometheus Adapter Configuration:

    • Update the configuration to transform the counter metric into a rate-based metric (a quick Prometheus sanity check for the resulting series is sketched after this list). Here’s how the configuration should look:

      rules:
      - seriesQuery: 'ts_queue_latency_microseconds'
        resources:
          overrides:
            namespace:
              resource: namespace
            pod:
              resource: pod
        name:
          matches: "^(.*)_microseconds$"
          as: "${1}_per_second"
        metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[5m])) by (<<.GroupBy>>)'
  3. Modify the HPA Configuration:

    • Update the metrics section in the hpa.yaml file to use the rate of the metric:
      metrics:
        - type: Pods
          pods:
            metric:
              name: ts_queue_latency_per_second
            target:
              type: AverageValue
              averageValue: 1000000  # Target per-pod rate (microseconds of queue latency accrued per second); tune to your workload
  4. Update Documentation:

    • Update the documentation in kubernetes/autoscale.md to reflect these changes and provide guidance on selecting appropriate target values based on the rate metric.
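
As referenced in step 2, you can run the adapter’s underlying query directly against Prometheus to confirm the new series behaves as expected before wiring it into the HPA. A minimal sketch; the per-pod grouping mirrors what the adapter’s metricsQuery produces:

      # Per-pod rate that the adapter will expose as ts_queue_latency_per_second
      sum(rate(ts_queue_latency_microseconds[5m])) by (pod)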

Why This Is Better

Using the rate of the counter metric allows the HPA to make scaling decisions based on the actual rate of change in queue latency rather than the cumulative value. This approach enables the HPA to scale pods up when the rate of incoming requests increases and scale down when the rate decreases, providing more responsive and efficient scaling behavior.

Example:

  • Current Configuration: If ts_queue_latency_microseconds is used directly, the HPA sees a value that only ever increases, so it keeps scaling up and never scales down.
  • Proposed Configuration: By using sum(rate(ts_queue_latency_microseconds[5m])), the HPA sees the rate at which latency is accumulating. For instance, if the average per-pod rate rises to 7,000,000 per second (well above the 1,000,000 target), the HPA will add pods. When the rate falls back below the target, it will scale down, allowing the system to adapt dynamically to load changes.
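
To make the scaling behavior concrete, here is a worked example using the standard HPA scaling formula (the pod counts and rates below are illustrative assumptions, not values from the repo):

      desiredReplicas = ceil(currentReplicas * currentAverageValue / targetAverageValue)

      # Scale up: 3 pods averaging 2,500,000/s against the 1,000,000 target
      ceil(3 * 2500000 / 1000000) = ceil(7.5) = 8 pods

      # Scale down: 8 pods averaging 400,000/s
      ceil(8 * 400000 / 1000000) = ceil(3.2) = 4 pods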

This improvement ensures better resource utilization and cost efficiency by aligning the number of pods with the actual workload.

@yardenhoch

@mreso (Collaborator) commented Aug 19, 2024

Thanks for flagging this @liaddrori1
@namannandan do you have bandwidth to look at this?

mreso added the bug label Aug 19, 2024