No scaling execution with multiple checks #565

Closed
gjpin opened this issue Feb 15, 2022 · 5 comments · Fixed by #567
Labels
stage/accepted theme/policy-eval Policy broker, workers and evaluation type/bug

Comments

gjpin commented Feb 15, 2022

Hello everyone!

My issue seems similar to #560, but since that issue does not provide full logs I'm not entirely sure it is the same.

Nomad version: 1.2.6
Nomad autoscaler version: 0.3.5
hcloud (Hetzner) scaling plugin

Issue description:
I've set two checks within the same policy, both using the threshold strategy. However, the checks do not get executed.
If I set a single check instead, it does get executed.

These two checks do not conflict with each other: they both look at the same Prometheus query result, but have different lower/upper bounds (0-30 vs. 70-100).

With the agent log level set to "TRACE" and the query result being roughly ~3, the logs show that the "high-cpu-allocated" check does not return any result, as expected:

[TRACE] internal_plugin.threshold: checking how many data points are within bounds: actionType=delta check_name=high_cpu_allocated current_count=2 lower_bound=70 upper_bound=100

[TRACE] internal_plugin.threshold: found 0 data points within bounds: actionType=delta check_name=high_cpu_allocated current_count=2 lower_bound=70 upper_bound=100

[DEBUG] policy_eval.worker.check_handler: nothing to do: check=high_cpu_allocated id=79da3c93-d92b-0ad6-caad-9fa1a71f65a0 policy_id=f17aeb9d-177c-4414-f4bc-201118128d25 queue=cluster source=prometheus strategy=threshold target=nomad-hcloud-autoscaler

However, the "low-cpu-allocated" check should have led to a scaling execution:

[TRACE] policy_eval.worker.check_handler: metric result: check=low_cpu_allocated id=79da3c93-d92b-0ad6-caad-9fa1a71f65a0 policy_id=f17aeb9d-177c-4414-f4bc-201118128d25 queue=cluster source=prometheus strategy=threshold target=nomad-hcloud-autoscaler ts="2022-02-15 22:20:54 +0000 UTC" value=2.7120908483633936

[TRACE] internal_plugin.threshold: checking how many data points are within bounds: actionType=delta check_name=low_cpu_allocated current_count=2 lower_bound=0 upper_bound=30

[TRACE] internal_plugin.threshold: found 61 data points within bounds: actionType=delta check_name=low_cpu_allocated current_count=2 lower_bound=0 upper_bound=30

[TRACE] internal_plugin.threshold: calculating new count: actionType=delta check_name=low_cpu_allocated current_count=2 lower_bound=0 upper_bound=30

[TRACE] internal_plugin.threshold: calculated scaling strategy results: actionType=delta check_name=low_cpu_allocated current_count=2 lower_bound=0 upper_bound=30 new_count=1 direction=down

[DEBUG] policy_eval.worker: no checks need to be executed: id=79da3c93-d92b-0ad6-caad-9fa1a71f65a0 policy_id=f17aeb9d-177c-4414-f4bc-201118128d25 queue=cluster target=nomad-hcloud-autoscaler

[DEBUG] policy_eval.broker: ack eval: eval_id=fd2b3d03-9607-d32b-3f96-700afb338cde token=859d45ff-d936-58d8-1cb4-f72bb2d19b03 eval_id=fd2b3d03-9607-d32b-3f96-700afb338cde token=859d45ff-d936-58d8-1cb4-f72bb2d19b03

[DEBUG] policy_eval.broker: eval ack'd: policy_id=f17aeb9d-177c-4414-f4bc-201118128d25

[DEBUG] policy_eval.broker: dequeue eval: queue=cluster

[DEBUG] policy_eval.broker: waiting for eval: queue=cluster

Agent configuration:

plugin_dir = "/local/plugins/"

http {
  bind_address = "0.0.0.0"
  bind_port    = {{ env "NOMAD_PORT_nomad_autoscaler" }}
}

policy {
  dir                         = "/local/policies"
  default_cooldown            = "5m"
  default_evaluation_interval = "10s"
}

nomad {
  address     = "https://nomad.service.nbg1.consul:4646"
  region      = "germany"
  namespace   = "default"

  ca_cert     = "/secrets/nomad/ca.crt"
  client_cert = "/secrets/nomad/agent.crt"
  client_key  = "/secrets/nomad/agent.key"
  
  token       = "NOMAD-ACL-TOKEN"
}

telemetry {
  prometheus_metrics = true
  disable_hostname   = true
}

apm "prometheus" {
  driver = "prometheus"
  config = {
    address = "http://{{ range service "prometheus" }}{{ .Address }}:{{ .Port }}{{ end }}"
  }
}

strategy "threshold" {
  driver = "threshold"
}

target "nomad-hcloud-autoscaler" {
  driver = "nomad-hcloud-autoscaler"
  config = {
    hcloud_token = "TOKEN"
  }
}

Policy:

scaling "hcloud_cluster" {
  enabled = true
  min     = 1
  max     = 3

  policy {
    cooldown            = "5m"
    evaluation_interval = "5m"

    check "high-cpu-allocated" {
      source       = "prometheus"
      query        = "sum(nomad_client_allocated_cpu{node_class=\"nomad_clients\"} * 100 / (nomad_client_allocated_cpu{node_class=\"nomad_clients\"} + nomad_client_unallocated_cpu{node_class=\"nomad_clients\"})) / count(nomad_client_unallocated_cpu{node_class=\"nomad_clients\"})"
      query_window = "1m"

      strategy "threshold" {
        upper_bound = 100
        lower_bound = 70
        delta       = 1
      }
    }

    check "low-cpu-allocated" {
      source       = "prometheus"
      query        = "sum(nomad_client_allocated_cpu{node_class=\"nomad_clients\"} * 100 / (nomad_client_allocated_cpu{node_class=\"nomad_clients\"} + nomad_client_unallocated_cpu{node_class=\"nomad_clients\"})) / count(nomad_client_unallocated_cpu{node_class=\"nomad_clients\"})"
      query_window = "1m"

      strategy "threshold" {
        upper_bound = 30
        lower_bound = 0
        delta       = -1
      }
    }

    target "nomad-hcloud-autoscaler" {
      dry-run    = false
      node_class = "nomad_clients"
      datacenter = "nbg1"

      hcloud_name_prefix = "nomad-autoscaler-node"
      hcloud_location    = "nbg1"
      hcloud_server_type = "cpx11"
      hcloud_image       = "IMAGE-ID"
      hcloud_user_data   = ""
      hcloud_ssh_keys    = "default"
      hcloud_networks    = "hashi_network"
      hcloud_labels      = "type=hashi-client,node=autoscaler"

      node_drain_deadline           = "15m"
      node_drain_ignore_system_jobs = true
      node_purge                    = true
      node_selector_strategy        = "least_busy"
    }
  }
}

What could be the reason for the scaling not being executed?
I've read the documentation multiple times and I'm not sure what I am missing.

Thank you in advance!

@lgfa29 lgfa29 added stage/accepted theme/policy-eval Policy broker, workers and evaluation type/bug labels Feb 16, 2022

lgfa29 commented Feb 16, 2022

Thanks for the extra logs @gjpin, they helped me understand the problem better. First, a bit of context.

The Autoscaler operates on the assumption that checks are independent, and a final decision is made by looking at their impact in isolation. Each check can return one of four results: scale up, scale down, none, or nil.

up, down, and none follow our safest choice approach, where the action that leaves the most infrastructure running wins. The nil result is a little different, and basically means that the check result should be ignored.
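To make that resolution concrete, here is a minimal Go sketch of the safest choice idea as described above; the names (checkResult, pickSafest) are hypothetical and this is not the Autoscaler's actual code.

package main

import "fmt"

type checkResult int

// Ordered so that a larger value leaves more infrastructure running.
const (
    resultNil  checkResult = iota // ignore this check entirely
    resultDown                    // scale in
    resultNone                    // keep the current count
    resultUp                      // scale out
)

// pickSafest returns the non-nil result that leaves the most infrastructure.
func pickSafest(results []checkResult) checkResult {
    winner := resultNil
    for _, r := range results {
        if r == resultNil {
            continue // an ignored check never influences the decision
        }
        if r > winner {
            winner = r
        }
    }
    return winner
}

func main() {
    // One check wants to scale down, the other has nothing to do:
    // "none" wins, so nothing happens (the situation in the logs above).
    fmt.Println(pickSafest([]checkResult{resultDown, resultNone})) // prints 2 (none)

    // If the empty check had returned nil instead, the scale-down would win.
    fmt.Println(pickSafest([]checkResult{resultDown, resultNil})) // prints 1 (down)
}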

There was a bug in the threshold strategy where it would return nil instead of none when there wasn't enough data to make a decision, but that's not safe! If we don't have enough data to make a decision we should try to keep things as they are (or at least increase).

The problem with this change is that it assumed, as I mentioned earlier, that all checks are independent, but that's not always true. In your use case (and also the example we have in our docs 😬), the checks are not independent: if one is triggered then the other one is, unavoidably, not. It will not have enough data to make a decision because all of the observations fall within the other check's bounds.
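To see why one of two checks over the same query will always come up empty, here is a rough sketch of the within-bounds counting shown in the TRACE lines above; countWithinBounds is a hypothetical name, and the real threshold plugin differs in detail.

package main

import "fmt"

// countWithinBounds returns how many metric samples fall inside
// [lower, upper], mirroring the "found N data points within bounds" logs.
func countWithinBounds(samples []float64, lower, upper float64) int {
    n := 0
    for _, v := range samples {
        if v >= lower && v <= upper {
            n++
        }
    }
    return n
}

func main() {
    // With CPU allocation hovering around ~3%, every sample sits inside
    // the low-cpu check's 0-30 bounds and outside the high-cpu 70-100 bounds.
    samples := []float64{2.7, 2.9, 3.1, 2.8}

    fmt.Println(countWithinBounds(samples, 70, 100)) // 0 -> high-cpu check has nothing to act on
    fmt.Println(countWithinBounds(samples, 0, 30))   // 4 -> low-cpu check wants to scale down
}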

So I think we need to make this configurable in the threshold strategy. If you have checks that are looking at the same data (or even just correlated data), it should return nil, like it did before, to allow the check with the actual data to succeed.


lgfa29 commented Feb 18, 2022

Hi @gjpin

0.3.6 has been released with a change that I think will fix this problem. I'm still writing the docs, but what you need to do is add a group config to the checks that query the same data. For example:

check "high-cpu-allocated" {
  source       = "prometheus"
  query        = "..."
  query_window = "1m"
  group        = "cpu-allocated"

  strategy "threshold" {
    upper_bound = 100
    lower_bound = 70
    delta       = 1
  }
}

check "low-cpu-allocated" {
  source       = "prometheus"
  query        = "..."
  query_window = "1m"
  group        = "cpu-allocated"

  strategy "threshold" {
    upper_bound = 30
    lower_bound = 0
    delta       = -1
  }
}

Give it a try and let us know how it goes 🙂

@goutham-sabapathy

@lgfa29 thanks, it did help, but it's not mentioned anywhere in the HashiCorp documentation yet (a quiet fix and hard to find 🥇).

HashiCorp needs to do a lot better in terms of documentation.


lgfa29 commented Sep 22, 2022

Hi @goutham-sabapathy 👋

I'm glad to hear the new feature worked for you. We are aware of documentation gaps and we're tracking them in #574. While that's definitely not ideal, the assertion that we need to do "a lot better" seems a little exaggerated and unproductive to the discussion. Writing thoughtful and well-organized documentation takes time and effort and, unfortunately, we haven't had the chance to take on these items.

As we approach the general availability of Nomad 1.4.0, I'm planning on closing these documentation gaps. If there's anything else you would like to see documented, feel free to add a comment in #574.

@goutham-sabapathy

🙃
