Add policy mutators to prevent cluster scaling to zero with the Nomad APM #534
The Nomad APM uses the Nomad API to retrieve information about allocations and nodes. Usually, the choice of metric source should not affect the choice of target defined in a policy. However, when the Nomad APM is used to scale a cluster down to `0` clients, the cluster cannot scale back up: once it reaches zero clients, there is nothing left to query for metrics.
There is also no meaningful default value to return in this case. The metrics supported by the Nomad APM for cluster scaling are the percentages of CPU and memory allocated. With zero clients these values are 0% and 100% at the same time, and different policy strategies would require interpreting them one way or the other. Or the choice may not matter at all!
For example, using the `target-value` strategy with `target = 70` (i.e., 70% CPU usage as target) would never scale back up from zero, since the next count to be calculated is also always zero (`next count = 0/70 * 0` or `next count = 100/70 * 0`).
The root of the problem is that the Nomad APM was never intended to be used in this scenario. Its goal is to provide a quick way to scale apps and clusters based on memory and CPU usage. More advanced scenarios, where these metrics are not useful, require a different APM.
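As a rough illustration of the `target-value` arithmetic in the example above, the following Go sketch computes the next count as `metric / target * current count`. The function name `nextCount` and the rounding with `math.Ceil` are assumptions made for this sketch, not the plugin's actual code.

```go
package main

import (
	"fmt"
	"math"
)

// nextCount sketches the target-value arithmetic from the example:
// next count = metric / target * current count.
// Rounding up is an assumption of this sketch.
func nextCount(metric, target float64, current int64) int64 {
	factor := metric / target
	return int64(math.Ceil(factor * float64(current)))
}

func main() {
	const target = 70.0

	// With zero clients the result is zero whether the missing metric
	// is interpreted as 0% or as 100% allocated.
	fmt.Println(nextCount(0, target, 0))
	fmt.Println(nextCount(100, target, 0))

	// A non-empty cluster above the target can still grow.
	fmt.Println(nextCount(87.5, target, 4))
}
```

Whatever value is chosen for the metric, multiplying by a current count of zero keeps the cluster at zero.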
The On-demand Batch Job Cluster Autoscaling tutorial provides an example of how to scale a Nomad cluster to zero and back up using the Prometheus APM.
There are a few possible solutions to avoid this problem.
Validate policies for this condition
Policy validation is currently done by policy sources, but it doesn't take plugin-specific aspects into consideration. Validating a policy to cover this scenario would be possible, but a failed validation would prevent the policy from being evaluated altogether. A log message is emitted in these situations, but such messages are easy for operators to miss.
Query servers for metrics
With no clients to be queried, servers present an option for metric source. But this setup requires significant additional logic since the `/v1/metrics` endpoint only returns values specific to the agent being queried. Iterating over all servers and scraping their metrics using tools like Datadog or Prometheus is the recommended way to consume these values, so in this scenario a different APM plugin should be used.
Another potential problem is that the Nomad Autoscaler, when running as an allocation within Nomad, may not have direct access to servers.
Modify policies to never scale to zero
This option is similar to the validation approach, in that plugin-aware checks are performed on each policy. The advantage is that, instead of never performing any evaluations at all, modifying the policy to change `min = 0` to `min = 1` keeps the policy active without ever reaching zero clients.
Given that expanding the Nomad APM plugin to query more metrics is not feasible, modifying the policies to prevent scaling to zero seems to be the best approach.
This PR adds the concept of policy mutators that are used by policy handlers to modify incoming policies if necessary. If a policy is modified, a log line is emitted to notify operators:
Although this log line may be easy to miss, mutators don't prevent the policy from being evaluated, so missing it should be safe.
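The mutator idea can be sketched in a few lines of Go. This is a minimal, self-contained illustration and not the code added by this PR: the types `ScalingPolicy`, `Mutation`, and `Mutator`, the `noScaleToZero` mutator, the `applyMutators` helper, and the `"nomad-apm"` and `"cluster"` strings are all hypothetical names chosen for the example.

```go
package main

import "fmt"

// ScalingPolicy is a pared-down, hypothetical stand-in for a scaling policy.
type ScalingPolicy struct {
	ID      string
	Min     int64
	Max     int64
	Type    string   // "cluster" or "horizontal" in this sketch
	Sources []string // metric sources used by the policy's checks
}

// Mutation describes one change a mutator made to a policy.
type Mutation struct {
	Reason string
}

// Mutator modifies an incoming policy in place and reports what it changed.
type Mutator interface {
	MutatePolicy(p *ScalingPolicy) []Mutation
}

// noScaleToZero raises min from 0 to 1 for cluster policies that
// read metrics from the (assumed) "nomad-apm" source.
type noScaleToZero struct{}

func (noScaleToZero) MutatePolicy(p *ScalingPolicy) []Mutation {
	if p.Type != "cluster" || p.Min != 0 {
		return nil
	}
	for _, s := range p.Sources {
		if s == "nomad-apm" {
			p.Min = 1
			return []Mutation{{Reason: "min raised from 0 to 1 to avoid scaling the cluster to zero"}}
		}
	}
	return nil
}

// applyMutators runs every mutator and collects the reported changes,
// which a policy handler could then log for operators.
func applyMutators(p *ScalingPolicy, ms ...Mutator) []Mutation {
	var all []Mutation
	for _, m := range ms {
		all = append(all, m.MutatePolicy(p)...)
	}
	return all
}

func main() {
	p := &ScalingPolicy{ID: "example", Min: 0, Max: 10, Type: "cluster", Sources: []string{"nomad-apm"}}
	for _, m := range applyMutators(p, noScaleToZero{}) {
		fmt.Printf("policy %s modified: %s\n", p.ID, m.Reason)
	}
	fmt.Println("min is now", p.Min)
}
```

Because the mutator edits the policy instead of rejecting it, the policy keeps being evaluated even if nobody reads the log output.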
Closes #530 #424