There are a few API calls within the nomad-autoscaler codebase that leverage the blocking queries feature of the Nomad API. Specifically around listing scaling policies, reading scaling policies, and reading jobs' scaling statuses:
As-is, these WaitTime params have a statically defined value of five minutes (300000ms).
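To make the request shape concrete, here is a minimal stdlib-only sketch of how a blocking query URL with `index` and `wait` parameters could be assembled. It mirrors the query strings visible in the logged errors, but the function name `blockingQueryURL`, its signature, and the example host are assumptions for illustration; this is not the autoscaler's actual code, which goes through the Nomad API client.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// blockingQueryURL builds a blocking-query URL of the form
// <addr><path>?index=<n>&namespace=<ns>&wait=<ms>ms.
// Hypothetical helper, for illustration only.
func blockingQueryURL(addr, path string, index uint64, wait time.Duration, namespace string) (string, error) {
	u, err := url.Parse(addr)
	if err != nil {
		return "", err
	}
	u.Path = path

	q := url.Values{}
	q.Set("index", strconv.FormatUint(index, 10))
	q.Set("namespace", namespace)
	// The wait time is sent in milliseconds, e.g. 5 minutes -> "300000ms".
	q.Set("wait", fmt.Sprintf("%dms", wait.Milliseconds()))
	u.RawQuery = q.Encode() // Encode sorts keys: index, namespace, wait

	return u.String(), nil
}

func main() {
	// Example host and namespace are placeholders.
	u, err := blockingQueryURL("https://nomad.example.com:4646", "/v1/scaling/policies", 47368483, 5*time.Minute, "default")
	if err != nil {
		panic(err)
	}
	fmt.Println(u)
}
```

With a five minute wait, the server may hold this request open for the full duration if nothing changes past the given index.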
nomad-autoscaler => Nomad API Connectivity
Our deployment of nomad-autoscaler communicates with its associated Nomad clusters via an AWS load balancer that has an "idle timeout" of 60 seconds (AWS's default).
Issue
This static five-minute wait time, combined with our 60-second LB idle timeout, leads to failed requests any time the underlying Nomad state does not change within the idle timeout. E.g., no scaling policy modifications or the like within a minute of initiating a list scaling policy request.
Within our autoscaler deployment's logs, this issue manifests in entries like the following:
List Policies
{
"message": "encountered an error monitoring policy IDs",
"attributes": {
"@level": "error",
"service": "nomad-autoscaler",
"@module": "policy_manager",
"source": "stderr",
"error": "failed to call the Nomad list policies API: Get \"https://<our_nomad>:4646/v1/scaling/policies?index=47368483&namespace=<our_namespace>&wait=300000ms\": EOF"
}
}
Get Policy
{
"message": "encountered an error monitoring policy",
"attributes": {
"@level": "error",
"@timestamp": "2023-11-02T15:14:09.268724Z",
"policy_id": "83e13b15-0150-4333-2ed8-f9aa2f7b74c4",
"service": "nomad-autoscaler",
"@module": "policy_manager.policy_handler",
"source": "stderr",
"error": "failed to get policy: Get \"https://<our_nomad>:4646/v1/scaling/policy/83e13b15-0150-4333-2ed8-f9aa2f7b74c4?index=47299148&namespace=<our_namespace>&wait=300000ms\": EOF"
}
}
There is also the added nuance of the random wait / 16 "jitter" delay called out in the Nomad API docs. Even if the LB idle timeout were five minutes rather than one minute, we might still see occasional errors: Nomad might have changes that "unblock" a query after 4.9999 minutes of waiting and then take an additional 0-18.75 seconds to return a response 🙃. This may sound silly to call out, but it did lead to some confusion on my part when my first attempt at updating the WaitTime param values to 1 minute didn't resolve all of our logged autoscaler errors.
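The jitter arithmetic can be sketched as follows. This assumes only the "up to wait / 16" upper bound described in the Nomad API docs; the helper names `maxBlockingDuration` and `maxSafeWait` are hypothetical and not part of any Nomad or autoscaler API.

```go
package main

import (
	"fmt"
	"time"
)

// maxBlockingDuration returns the longest a blocking query may stay open:
// the requested wait plus up to wait/16 of server-side jitter.
func maxBlockingDuration(wait time.Duration) time.Duration {
	return wait + wait/16
}

// maxSafeWait returns the largest wait whose worst case (wait + wait/16)
// still fits inside idleTimeout, i.e. wait <= idleTimeout * 16/17.
// A real deployment would likely subtract a further margin for network latency.
func maxSafeWait(idleTimeout time.Duration) time.Duration {
	return idleTimeout * 16 / 17
}

func main() {
	fmt.Println(maxBlockingDuration(5 * time.Minute)) // 5m18.75s
	fmt.Println(maxBlockingDuration(time.Minute))     // 1m3.75s
	fmt.Println(maxSafeWait(60 * time.Second))        // roughly 56.47s
}
```

Under these assumptions, a 1 minute wait can take up to 63.75 seconds in the worst case, which still exceeds a 60 second idle timeout and is consistent with the residual errors described above.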
Desired Outcome
Allow the nomad-autoscaler WaitTime values to be configurable. Perhaps as some sort of blocking_query_wait_time_duration option under the nomad {} configuration block?
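As a rough sketch of what handling such an option might look like, the snippet below resolves a user-supplied duration string and falls back to the current five minute default when it is unset. The option name is the one proposed above; the function `resolveWaitTime`, the constant name, and the validation rules are assumptions, not existing autoscaler code.

```go
package main

import (
	"fmt"
	"time"
)

// defaultBlockingQueryWaitTime mirrors the current static value of five minutes.
const defaultBlockingQueryWaitTime = 5 * time.Minute

// resolveWaitTime parses the raw value of the proposed (hypothetical)
// blocking_query_wait_time_duration option. An empty value keeps the default.
func resolveWaitTime(raw string) (time.Duration, error) {
	if raw == "" {
		return defaultBlockingQueryWaitTime, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid blocking query wait time %q: %w", raw, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("blocking query wait time must be positive, got %s", d)
	}
	return d, nil
}

func main() {
	for _, raw := range []string{"", "50s", "not-a-duration"} {
		d, err := resolveWaitTime(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q -> %s\n", raw, d)
	}
}
```

Keeping the five minute default would leave existing deployments unchanged, while operators behind a 60 second idle timeout could set something like `"50s"`.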
Hi @jeffwecan and thanks for raising this issue with the great detail included. I would agree this would be a useful addition, to help operators better control this aspect of the autoscaler internals.