Add policy mutators to prevent cluster scaling to zero with the Nomad APM #534

Merged
lgfa29 merged 3 commits into main from fix-nomad-apm-zero-cluster-scaling on Nov 15, 2021

@lgfa29 lgfa29 commented Nov 11, 2021

The Nomad APM uses the Nomad API to retrieve information about allocations and nodes. Usually, the choice of metric source should not impact the choice of target defined in a policy, but using the Nomad APM to scale a cluster to zero makes it impossible to scale back up: once the cluster reaches zero clients, there is nothing left to query for metrics.

There is also no meaningful default value to return in this case. The metrics supported by the Nomad APM for cluster scaling are the percentages of CPU and memory allocated. With zero clients these values are 0% and 100% at the same time, and different policy strategies would require one interpretation or the other. Or it may not matter at all!

For example, using the target-value strategy with target = 70 (i.e., 70% CPU usage as target) would never scale back up from zero, since the next count is the metric-to-target ratio multiplied by the current count and is therefore always zero (next count = 0/70 * 0 or next count = 100/70 * 0).
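The arithmetic above can be sketched in Go. This is only an illustration, assuming the strategy multiplies the current count by metric/target and rounds up; `nextCount` is a hypothetical helper, not a function from the Nomad Autoscaler codebase.

```go
package main

import (
	"fmt"
	"math"
)

// nextCount is a hypothetical helper that mirrors the formula quoted above:
// next count = (metric / target) * current count, rounded up.
func nextCount(current int64, metric, target float64) int64 {
	factor := metric / target
	return int64(math.Ceil(float64(current) * factor))
}

func main() {
	// With zero clients the result is zero whether the metric is read
	// as 0% or as 100% allocated.
	fmt.Println(nextCount(0, 0, 70))
	fmt.Println(nextCount(0, 100, 70))

	// With at least one client the same metric does produce a scale-up.
	fmt.Println(nextCount(1, 100, 70))
}
```

Under these assumptions, any policy whose current count is zero stays at zero regardless of the metric value, which is why keeping at least one client matters.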

The root of the problem is that the Nomad APM was never intended to be used in this scenario. Its goal is to provide a quick way to scale apps and clusters based on memory and CPU usage. More advanced scenarios, where these metrics are not useful, require a different APM.

The On-demand Batch Job Cluster Autoscaling tutorial provides an example of how to scale a Nomad cluster to zero and back up using the Prometheus APM.

There are a few possible solutions to avoid this problem.

Validate policies for this condition

Policy validation is currently done by policy sources, but it doesn't take plugin-specific aspects into consideration. Validating a policy to cover this scenario would be possible, but a failed validation would prevent the policy from being evaluated altogether. A log message is emitted in these situations, but it is easy for operators to miss.

Query servers for metrics

With no clients to be queried, servers present an option for metric source. But this setup requires significant additional logic since the /v1/metrics endpoint only returns values specific to the agent being queried. Iterating over all servers and scraping their metrics with tools like Datadog or Prometheus is the recommended way to consume these values, so a different APM plugin should be used in this scenario.

Another potential problem is that the Nomad Autoscaler, when running as an allocation within Nomad, may not have direct access to servers.

Modify policies to never scale to zero

This option is similar to the validation approach in that plugin-aware checks are performed on each policy. The advantage is that, instead of the policy never being evaluated at all, changing min = 0 to min = 1 keeps the policy active while ensuring the cluster never reaches zero clients.

Given that expanding the Nomad APM plugin to query more metrics is not feasible, modifying the policies to prevent scaling to zero seems to be the best approach.

This PR adds the concept of policy mutators that are used by policy handlers to modify incoming policies if necessary. If a policy is modified, a log line is emitted to notify operators:

2021-11-11T13:37:54.946-0500 [INFO]  file_policy_source: starting file policy monitor: file=bin/policies/cluster.hcl name=policy-test policy_id=9d121526-7660-dc19-463b-5b2fee1a6b6d
2021-11-11T13:37:54.946-0500 [INFO]  policy_manager.policy_handler: policy modified: policy_id=9d121526-7660-dc19-463b-5b2fee1a6b6d modification="min value set to 1 since scaling cluster to 0 is not supported by the Nomad APM"

Although this log line can be easy to miss, mutators don't prevent the policy from being evaluated, so missing it should be safe.
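As a rough sketch of the idea, and not the code in this PR, a mutator could look like the following. Every type, field, and function name here (`ScalingPolicy`, `Mutation`, `Mutator`, `nomadAPMMutator`, `applyMutators`, and the `"cluster"` and `"nomad-apm"` string values) is a simplified assumption made for illustration; the real autoscaler types differ.

```go
package main

import "fmt"

// ScalingPolicy is a simplified, hypothetical stand-in for the
// autoscaler's policy type.
type ScalingPolicy struct {
	ID     string
	Min    int64
	Max    int64
	Type   string // assumed values: "cluster" or "horizontal"
	Source string // assumed name of the APM plugin used by the policy check
}

// Mutation records why a policy was changed so the handler can log it.
type Mutation struct {
	Reason string
}

// Mutator modifies a policy in place and reports what it changed.
type Mutator interface {
	MutatePolicy(p *ScalingPolicy) []Mutation
}

// nomadAPMMutator raises min from 0 to 1 for cluster policies that
// rely on the Nomad APM.
type nomadAPMMutator struct{}

func (nomadAPMMutator) MutatePolicy(p *ScalingPolicy) []Mutation {
	if p.Type != "cluster" || p.Source != "nomad-apm" || p.Min != 0 {
		return nil
	}
	p.Min = 1
	return []Mutation{{
		Reason: "min value set to 1 since scaling cluster to 0 is not supported by the Nomad APM",
	}}
}

// applyMutators runs every mutator and collects the resulting mutations.
func applyMutators(p *ScalingPolicy, mutators ...Mutator) []Mutation {
	var all []Mutation
	for _, m := range mutators {
		all = append(all, m.MutatePolicy(p)...)
	}
	return all
}

func main() {
	p := &ScalingPolicy{ID: "example", Min: 0, Max: 10, Type: "cluster", Source: "nomad-apm"}
	for _, m := range applyMutators(p, nomadAPMMutator{}) {
		fmt.Printf("policy modified: policy_id=%s modification=%q\n", p.ID, m.Reason)
	}
	fmt.Println("min is now", p.Min)
}
```

The point of the design is that the policy is still returned and evaluated after mutation, unlike a failed validation, which would drop it.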

Closes #530 #424

@jrasell jrasell left a comment

Wonderful change and comments, thanks!

@lgfa29 lgfa29 merged commit 247d706 into main Nov 15, 2021
@lgfa29 lgfa29 deleted the fix-nomad-apm-zero-cluster-scaling branch November 15, 2021 16:48
Development

Successfully merging this pull request may close these issues.

Cluster autoscale fails to scale up from 0 nodes