ScaleNone expected but scaleIn occurs #529

Closed · jonathanlambert-iadvize opened this issue Sep 28, 2021 · 2 comments · Fixed by #537

Comments

jonathanlambert-iadvize commented Sep 28, 2021

Hello,

We encountered an issue with the scenario where a single policy has two checks: ScaleIn and ScaleNone should resolve to ScaleNone (https://www.nomadproject.io/docs/autoscaling/internals/checks#scalein-and-scalenone-scalenone).

My policy:

scaling "cluster_node_app_LOW" {
  enabled = true
  min     = 6
  max     = 40
  policy {
    cooldown            = "5m"
    evaluation_interval = "2m"

    check "LOW_memory_allocated_percentage" {
      source = "nomad-apm"
      query  = "percentage-allocated_memory"
      strategy "threshold" {
        within_bounds_trigger = 1
        upper_bound           = 75
        lower_bound           = 0
        delta                 = -1
      }
    }

    check "LOW_cpu_allocated_percentage" {
      source = "nomad-apm"
      query  = "percentage-allocated_cpu"
      strategy "threshold" {
        within_bounds_trigger = 1
        upper_bound           = 75
        lower_bound           = 0
        delta                 = -1
      }
    }

      target "aws-asg" {
        dry-run             = "false"
        aws_asg_name        = "node_app"
        node_class          = "app"
        node_drain_deadline = "2m"
        node_purge          = "true"
    }
  }
}
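
For context, each threshold check fires its delta only when enough of the queried data points fall inside the configured bounds; otherwise it should result in no action. Roughly, the semantics look like this sketch (illustrative names, and the bound inclusivity here is an assumption, not the actual plugin source):

// thresholdDelta sketches the threshold strategy semantics: apply the
// configured delta only when at least withinBoundsTrigger data points
// fall inside the bounds; otherwise report "no action".
// Illustrative only; the bound inclusivity is an assumption.
func thresholdDelta(points []float64, lower, upper float64, withinBoundsTrigger int, delta int64) (int64, bool) {
	within := 0
	for _, p := range points {
		if p >= lower && p < upper {
			within++
		}
	}
	if within < withinBoundsTrigger {
		return 0, false // not enough data points within bounds => no action
	}
	return delta, true
}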

Here is an extract of the logs:

2021-09-28T13:43:16.197Z [DEBUG] internal_plugin.nomad-apm: collected node pool resource data: allocated_cpu=307666 allocated_memory=536466 allocatable_cpu=332500 allocatable_memory=1076880
2021-09-28T13:43:16.198Z [TRACE] policy_eval.worker.check_handler: metric result: check=LOW_cpu_allocated_percentage id=1a4b47f2-026e-6cf8-aa0a-6117abbbd741 policy_id=badcf8cd-1373-efc7-7444-a6aa0d1aa134 queue=cluster source=nomad-apm strategy=threshold target=aws-asg ts="2021-09-28 13:43:16.198019281 +0000 UTC m=+622.419012236" value=92.53112781954887
2021-09-28T13:43:16.198Z [DEBUG] policy_eval.worker.check_handler: calculating new count: check=LOW_cpu_allocated_percentage id=1a4b47f2-026e-6cf8-aa0a-6117abbbd741 policy_id=badcf8cd-1373-efc7-7444-a6aa0d1aa134 queue=cluster source=nomad-apm strategy=threshold target=aws-asg count=36
2021-09-28T13:43:16.198Z [TRACE] internal_plugin.threshold: checking how many data points are within bounds: actionType=delta check_name=LOW_cpu_allocated_percentage current_count=36 lower_bound=0 upper_bound=75
2021-09-28T13:43:16.198Z [TRACE] internal_plugin.threshold: found 0 data points within bounds: actionType=delta check_name=LOW_cpu_allocated_percentage current_count=36 lower_bound=0 upper_bound=75
2021-09-28T13:43:16.198Z [TRACE] internal_plugin.threshold: not enough data points within bounds: actionType=delta check_name=LOW_cpu_allocated_percentage current_count=36 lower_bound=0 upper_bound=75
2021-09-28T13:43:16.198Z [DEBUG] policy_eval.worker.check_handler: received policy check for evaluation: check=LOW_memory_allocated_percentage id=1a4b47f2-026e-6cf8-aa0a-6117abbbd741 policy_id=badcf8cd-1373-efc7-7444-a6aa0d1aa134 queue=cluster source=nomad-apm strategy=threshold target=aws-asg
2021-09-28T13:43:16.198Z [DEBUG] policy_eval.worker.check_handler: querying source: check=LOW_memory_allocated_percentage id=1a4b47f2-026e-6cf8-aa0a-6117abbbd741 policy_id=badcf8cd-1373-efc7-7444-a6aa0d1aa134 queue=cluster source=nomad-apm strategy=threshold target=aws-asg query=node_percentage-allocated_memory/app/class source=nomad-apm
2021-09-28T13:43:16.198Z [DEBUG] internal_plugin.nomad-apm: performing node pool APM query: query=node_percentage-allocated_memory/app/class
2021-09-28T13:43:17.535Z [DEBUG] internal_plugin.nomad-apm: collected node pool resource data: allocated_cpu=307666 allocated_memory=536466 allocatable_cpu=332500 allocatable_memory=1076880
2021-09-28T13:43:17.535Z [TRACE] policy_eval.worker.check_handler: metric result: check=LOW_memory_allocated_percentage id=1a4b47f2-026e-6cf8-aa0a-6117abbbd741 policy_id=badcf8cd-1373-efc7-7444-a6aa0d1aa134 queue=cluster source=nomad-apm strategy=threshold target=aws-asg ts="2021-09-28 13:43:17.53552563 +0000 UTC m=+623.756518575" value=49.81669266770671
2021-09-28T13:43:17.535Z [DEBUG] policy_eval.worker.check_handler: calculating new count: check=LOW_memory_allocated_percentage id=1a4b47f2-026e-6cf8-aa0a-6117abbbd741 policy_id=badcf8cd-1373-efc7-7444-a6aa0d1aa134 queue=cluster source=nomad-apm strategy=threshold target=aws-asg count=36
2021-09-28T13:43:17.535Z [TRACE] internal_plugin.threshold: checking how many data points are within bounds: actionType=delta check_name=LOW_memory_allocated_percentage current_count=36 lower_bound=0 upper_bound=75
2021-09-28T13:43:17.535Z [TRACE] internal_plugin.threshold: found 1 data points within bounds: actionType=delta check_name=LOW_memory_allocated_percentage current_count=36 lower_bound=0 upper_bound=75
2021-09-28T13:43:17.535Z [TRACE] internal_plugin.threshold: calculating new count: actionType=delta check_name=LOW_memory_allocated_percentage current_count=36 lower_bound=0 upper_bound=75
2021-09-28T13:43:17.535Z [TRACE] internal_plugin.threshold: calculated scaling strategy results: actionType=delta check_name=LOW_memory_allocated_percentage current_count=36 lower_bound=0 upper_bound=75 new_count=35 direction=down
2021-09-28T13:43:17.535Z [TRACE] policy_eval.worker: check LOW_memory_allocated_percentage selected: id=1a4b47f2-026e-6cf8-aa0a-6117abbbd741 policy_id=badcf8cd-1373-efc7-7444-a6aa0d1aa134 queue=cluster target=aws-asg direction=down count=35
2021-09-28T13:43:17.535Z [INFO]  policy_eval.worker: scaling target: id=1a4b47f2-026e-6cf8-aa0a-6117abbbd741 policy_id=badcf8cd-1373-efc7-7444-a6aa0d1aa134 queue=cluster target=aws-asg from=36 to=35 reason="scaling down because metric is within bounds" meta=map[nomad_policy_id:badcf8cd-1373-efc7-7444-a6aa0d1aa134]

We are not expecting a ScaleIn here because the CPU check has no data points within bounds.
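
Per the check resolution rule linked above, an explicit ScaleNone from one check should win over a ScaleIn from another. As a minimal sketch with illustrative names (not the autoscaler's real API):

// Sketch of the documented resolution between two check directions;
// the type and constant names are illustrative, not the real SDK.
type Direction int

const (
	None Direction = iota // ScaleNone
	Down                  // ScaleIn
	Up                    // ScaleOut
)

func combine(a, b Direction) Direction {
	if a == Up || b == Up {
		return Up // ScaleOut wins over everything else
	}
	if a == None || b == None {
		return None // ScaleIn and ScaleNone => ScaleNone
	}
	return Down
}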

We think the problem could be in the threshold plugin, on the early-return path that is taken when there are not enough data points within bounds.

Something like this would help to return an eval with an explicit direction instead of a null value:

eval.Action.Direction = sdk.ScaleDirectionNone
return eval, nil
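
In context, that early return would then look roughly like this (the surrounding shape is assumed; only the direction assignment and the return are from the suggestion above):

// Assumed shape of the plugin's "not enough data points" path; only
// the direction assignment and return are from the suggestion above.
if within < withinBoundsTrigger {
	// Without an explicit direction, the worker cannot distinguish
	// "take no action" (ScaleNone) from "no result", so the other
	// check's ScaleIn wins the comparison.
	eval.Action.Direction = sdk.ScaleDirectionNone
	return eval, nil
}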

Let me know if you need more info.

Thank you for your support.

jrasell (Member) commented Oct 28, 2021

Hi @jonathanlambert-iadvize and thanks for the amazing detail in this issue. I'll label it up and hopefully we can take a look into this soon.

lgfa29 (Contributor) commented Nov 12, 2021

Thank you @jonathanlambert-iadvize, your analysis was spot on 🙂
