
Static WaitTime Query Values + Server-side Timeouts Lead to Spurious "EOF" Errors #754

Closed
jeffwecan opened this issue Nov 2, 2023 · 1 comment · Fixed by #755
jeffwecan commented Nov 2, 2023

Background Context

Blocking Queries + WaitTime

There are a few API calls within the nomad-autoscaler codebase that leverage the blocking queries feature of the Nomad API, specifically around listing scaling policies, reading scaling policies, and reading jobs' scaling statuses:

~/src/hashicorp find nomad-autoscaler -name '*.go' -exec grep -nr -C2 WaitTime {} \;
nomad-autoscaler/plugins/builtin/target/nomad/plugin/state.go-169-      q := &api.QueryOptions{
nomad-autoscaler/plugins/builtin/target/nomad/plugin/state.go-170-              Namespace: jsh.namespace,
nomad-autoscaler/plugins/builtin/target/nomad/plugin/state.go:171:              WaitTime:  5 * time.Minute,
nomad-autoscaler/plugins/builtin/target/nomad/plugin/state.go-172-              WaitIndex: 1,
nomad-autoscaler/plugins/builtin/target/nomad/plugin/state.go-173-      }
nomad-autoscaler/policy/nomad/source.go-90-     s.log.Debug("starting policy blocking query watcher")
nomad-autoscaler/policy/nomad/source.go-91-
nomad-autoscaler/policy/nomad/source.go:92:     q := &api.QueryOptions{WaitTime: 5 * time.Minute, WaitIndex: 1}
nomad-autoscaler/policy/nomad/source.go-93-
nomad-autoscaler/policy/nomad/source.go-94-     for {
--
nomad-autoscaler/policy/nomad/source.go-180-    log.Trace("starting policy blocking query watcher")
nomad-autoscaler/policy/nomad/source.go-181-
nomad-autoscaler/policy/nomad/source.go:182:    q := &api.QueryOptions{WaitTime: 5 * time.Minute, WaitIndex: 1}
nomad-autoscaler/policy/nomad/source.go-183-    for {
nomad-autoscaler/policy/nomad/source.go-184-            var (

As-is, these WaitTime params have a statically defined value of five minutes (300000ms).
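
For reference, here is a minimal sketch of the blocking-query loop those call sites follow, using QueryOptions from the github.com/hashicorp/nomad/api package. The client variable and the handlePolicies helper are illustrative placeholders, not the autoscaler's actual code:

// Sketch only: client is an *api.Client; handlePolicies stands in for the
// autoscaler's own processing of the returned policies.
func watchPolicies(client *api.Client, namespace string) error {
	q := &api.QueryOptions{
		Namespace: namespace,
		WaitTime:  5 * time.Minute, // serialized as wait=300000ms in the request URL
		WaitIndex: 1,
	}
	for {
		policies, meta, err := client.Scaling().ListPolicies(q)
		if err != nil {
			// With an LB idle timeout shorter than WaitTime, this is where
			// the spurious "EOF" errors shown below surface.
			return err
		}
		// Only re-process when the Raft index actually advances.
		if meta.LastIndex > q.WaitIndex {
			q.WaitIndex = meta.LastIndex
			handlePolicies(policies)
		}
	}
}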

nomad-autoscaler => Nomad API Connectivity

Our deployment of nomad-autoscaler communicates with its associated Nomad clusters via an AWS load balancer that has an "idle timeout" of 60 seconds (AWS's default).

Issue

This static five-minute wait time, combined with our LB's 60-second idle timeout, leads to failed requests whenever there are no changes to the underlying Nomad state within the idle timeout window, e.g. no scaling policy modifications or the like within a minute of initiating a list scaling policies request.

Within our autoscaler deployment's logs, this issue manifests in entries like the following:

List Policies
{
	"message": "encountered an error monitoring policy IDs",
	"attributes": {
		"@level": "error",
		"service": "nomad-autoscaler",
		"@module": "policy_manager",
		"source": "stderr",
		"error": "failed to call the Nomad list policies API: Get \"https://<our_nomad>:4646/v1/scaling/policies?index=47368483&namespace=<our_namespace>&wait=300000ms\": EOF",
	}
}
Get Policy
{
	"message": "encountered an error monitoring policy",
	"attributes": {
		"@level": "error",
		"@timestamp": "2023-11-02T15:14:09.268724Z",
		"policy_id": "83e13b15-0150-4333-2ed8-f9aa2f7b74c4",
		"service": "nomad-autoscaler",
		"@module": "policy_manager.policy_handler",
		"source": "stderr",
		"error": "failed to get policy: Get \"https://<our_nomad>:4646/v1/scaling/policy/83e13b15-0150-4333-2ed8-f9aa2f7b74c4?index=47299148&namespace=<our_namespace>&wait=300000ms\": EOF",
	}
}
Get Job Scale Status
{
	"message": "failed to get target status",
	"attributes": {
		"@level": "warn",
		"@timestamp": "2023-11-02T11:30:27.287723Z",
		"policy_id": "83e13b15-0150-4333-2ed8-f9aa2f7b74c4",
		"service": "nomad-autoscaler",
		"@module": "policy_manager.policy_handler",
		"source": "stderr",
		"error": "Get \"https://<our_nomad>:4646/v1/job/infra-experiencer/scale?index=47299822&namespace=cloud-infrastructure-api&wait=300000ms\": EOF",
	}
}
{
	"message": "Get \"https://<our_nomad>:4646/v1/job/infra-experiencer/scale?index=47299822&namespace=<our_namespace>&wait=300000ms\": EOF",
	"attributes": {
		"@level": "error",
		"@timestamp": "2023-11-02T11:30:27.287780Z",
		"policy_id": "83e13b15-0150-4333-2ed8-f9aa2f7b74c4",
		"service": "nomad-autoscaler",
		"@module": "policy_manager.policy_handler",
		"source": "stderr",
	}
}

There is the added nuance of the random wait / 16 "jitter" delay called out in the Nomad API docs, which is also relevant here. Even if the LB idle timeout were five minutes rather than one minute, we might still see occasional errors, as Nomad might have changes that "unblock" a query after 4.9999 minutes of waiting and then take an additional 0-18.75 seconds to return a response 🙃. That may sound silly to call out, but it did lead to some confusion on my part when my first attempt at updating the WaitTime param values to one minute didn't resolve all of our logged autoscaler errors.
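
To make that jitter arithmetic concrete, a quick back-of-the-envelope sketch (just the numbers from the Nomad API docs, not autoscaler code):

package main

import (
	"fmt"
	"time"
)

func main() {
	waitTime := 5 * time.Minute
	// The server can add up to wait/16 of random jitter on top of the
	// supplied maximum wait time.
	maxJitter := waitTime / 16
	fmt.Println(maxJitter)            // 18.75s
	fmt.Println(waitTime + maxJitter) // 5m18.75s, which must stay under the LB idle timeout
}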

Desired Outcome

Make the nomad-autoscaler WaitTime values configurable, perhaps via some sort of blocking_query_wait_time_duration option under the nomad {} configuration block?
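
For illustration only, a rough sketch of how such an option could be plumbed into the existing call sites; the blocking_query_wait_time_duration name and the config struct below are hypothetical, not existing autoscaler code:

// Hypothetical option under the nomad {} block, e.g.
//   nomad { blocking_query_wait_time_duration = "50s" }
type nomadConfig struct {
	BlockingQueryWaitTime time.Duration
}

func queryOptions(cfg nomadConfig, namespace string) *api.QueryOptions {
	wait := cfg.BlockingQueryWaitTime
	if wait == 0 {
		wait = 5 * time.Minute // fall back to today's hard-coded default
	}
	return &api.QueryOptions{
		Namespace: namespace,
		WaitTime:  wait,
		WaitIndex: 1,
	}
}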


jrasell commented Nov 3, 2023

Hi @jeffwecan, and thanks for raising this issue with such great detail. I agree this would be a useful addition to help operators better control this aspect of the autoscaler's internals.
