Add policy mutators to prevent cluster scaling to zero with the Nomad APM #534

lgfa29 · 2021-11-11T19:51:22Z

The Nomad APM uses the Nomad API to retrieve information about allocations and nodes. Usually, the choice of metric source should not impact the choice of target defined in a policy, but using the Nomad APM to scale a cluster to zero makes it impossible to scale back up: once the cluster reaches zero clients, there is nothing left to query for metrics.

There is also no meaningful default value to return in this case. The metrics supported by the Nomad APM for cluster scaling are the percentage of CPU and memory allocated. For zero clients these values are 0% and 100% at the same time, and different policy strategies would require interpreting them one way or the other. Or it may not matter at all!

For example, using the target-value strategy with target = 70 (i.e., 70% CPU usage as target) would never scale back up from zero, since the next count to be calculated is always zero under either interpretation (next count = 0/70 * 0 or next count = 100/70 * 0).
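The arithmetic above can be sketched in a few lines of Go. This is a simplified model for illustration, not the plugin's actual code: the nextCount function is hypothetical, and it assumes the strategy multiplies the current count by metric/target and rounds up.

```go
package main

import (
	"fmt"
	"math"
)

// nextCount is a hypothetical, simplified model of a target-value style
// calculation: the scaling factor is metric/target, and the next count is
// the current count multiplied by that factor, rounded up.
func nextCount(current int64, metric, target float64) int64 {
	factor := metric / target
	return int64(math.Ceil(float64(current) * factor))
}

func main() {
	const target = 70.0

	// With zero clients, neither reading of the metric (0% or 100%) helps:
	// any factor multiplied by a current count of zero is still zero.
	fmt.Println(nextCount(0, 0, target))
	fmt.Println(nextCount(0, 100, target))

	// With at least one client, the same factor can raise the count.
	fmt.Println(nextCount(1, 100, target))
}
```

Because the current count is a multiplicand, no choice of default metric value can move the result away from zero, which is why keeping at least one client matters for this strategy.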

The root of the problem is that the Nomad APM was never intended to be used in this scenario. Its goal is to provide a quick way to scale apps and clusters based on memory and CPU usage. More advanced scenarios, where these metrics are not useful, require a different APM.

The On-demand Batch Job Cluster Autoscaling tutorial provides an example of how to scale a Nomad cluster to zero and back up using the Prometheus APM.

There are a few possible solutions to avoid this problem.

Validate policies for this condition

Policy validation is currently done by policy sources, but they don't take plugin-specific aspects into consideration. Validating a policy to cover this scenario would be possible, but a failed validation would prevent the policy from being evaluated altogether. A log message is emitted in these situations, but such messages are easy for operators to miss.

Query servers for metrics

With no clients to be queried, servers present an option as a metric source. But this setup requires significant additional logic, since the /v1/metrics endpoint only returns values specific to the agent being queried, so every server would need to be queried. Scraping server metrics with tools like Datadog or Prometheus is the recommended way to consume these values, so in this scenario a different APM plugin should be used.

Another potential problem is that the Nomad Autoscaler, when running as an allocation within Nomad, may not have direct access to servers.

Modify policies to never scale to zero

This option is similar to the validation approach in that plugin-aware checks are performed on each policy. The advantage is that, instead of the policy never being evaluated at all, changing min = 0 to min = 1 keeps the policy active without the cluster ever reaching zero clients.

Given that expanding the Nomad APM plugin to query more metrics is not feasible, modifying the policies to prevent scaling to zero seems to be the best approach.

This PR adds the concept of policy mutators that are used by policy handlers to modify incoming policies if necessary. If a policy is modified, a log line is emitted to notify operators:

2021-11-11T13:37:54.946-0500 [INFO]  file_policy_source: starting file policy monitor: file=bin/policies/cluster.hcl name=policy-test policy_id=9d121526-7660-dc19-463b-5b2fee1a6b6d
2021-11-11T13:37:54.946-0500 [INFO]  policy_manager.policy_handler: policy modified: policy_id=9d121526-7660-dc19-463b-5b2fee1a6b6d modification="min value set to 1 since scaling cluster to 0 is not supported by the Nomad APM"

This log line may also be easy for operators to miss, but mutators don't prevent the policy from being evaluated, so missing it should be safe.
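A minimal sketch of the mutator idea, for illustration only. The Policy struct, the Mutator interface, the nomadAPMMutator type, and the "nomad-apm" plugin name used in the check are all assumptions made for this example and are not taken from the PR's code; only the modification message is copied from the log output above.

```go
package main

import "fmt"

// Policy is a hypothetical, pared-down stand-in for a scaling policy.
type Policy struct {
	ID        string
	Min       int64
	Max       int64
	APM       string // name of the APM plugin used by the policy (assumed field)
	IsCluster bool   // true when the policy scales cluster clients (assumed field)
}

// Mutator is a hypothetical interface: it may modify the policy in place
// and returns one human-readable description per modification.
type Mutator interface {
	MutatePolicy(p *Policy) []string
}

// nomadAPMMutator raises min from 0 to 1 for cluster policies that use the
// Nomad APM, so the cluster never reaches zero clients.
type nomadAPMMutator struct{}

func (nomadAPMMutator) MutatePolicy(p *Policy) []string {
	if !p.IsCluster || p.APM != "nomad-apm" || p.Min != 0 {
		return nil
	}
	p.Min = 1
	return []string{"min value set to 1 since scaling cluster to 0 is not supported by the Nomad APM"}
}

func main() {
	p := &Policy{ID: "policy-test", Min: 0, Max: 10, APM: "nomad-apm", IsCluster: true}

	// A policy handler would run every mutator on an incoming policy and
	// log each modification instead of rejecting the policy.
	for _, m := range []Mutator{nomadAPMMutator{}} {
		for _, modification := range m.MutatePolicy(p) {
			fmt.Printf("policy modified: policy_id=%s modification=%q\n", p.ID, modification)
		}
	}
	fmt.Println("min is now", p.Min)
}
```

Returning a list of modifications rather than an error is what distinguishes this approach from validation: the policy stays active, and the handler only has to report what changed.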

Closes #530 #424

jrasell

Wonderful change and comments, thanks!

add policy mutators to prevent cluster scaling to zero with the Nomad…

34879d8

… APM

lgfa29 requested review from gogococo, jazzyfresh and jrasell as code owners November 11, 2021 19:51

changelog: add entry for #534

51ade01

This was referenced Nov 11, 2021

docs: add note about the Nomad APM autoscaling plugin and scaling cluster to zero hashicorp/nomad#11494

Merged

Cluster autoscale fails to scale up from 0 nodes #530

Closed

Scaling up from 0 not possible #424

Closed

jrasell approved these changes Nov 15, 2021

fix unfinished comment

f187785

lgfa29 merged commit 247d706 into main Nov 15, 2021

lgfa29 deleted the fix-nomad-apm-zero-cluster-scaling branch November 15, 2021 16:48
