
Static WaitTime Query Values + Server-side Timeouts Lead to Spurious "EOF" Errors #754

Closed
jeffwecan opened this issue Nov 2, 2023 · 1 comment · Fixed by #755
jeffwecan commented Nov 2, 2023

Background Context

Blocking Queries + WaitTime

There are a few API calls within the nomad-autoscaler codebase that leverage the blocking queries feature of the Nomad API, specifically around listing scaling policies, reading scaling policies, and reading jobs' scaling statuses:

~/src/hashicorp find nomad-autoscaler -name '*.go' -exec grep -nr -C2 WaitTime {} \;
nomad-autoscaler/plugins/builtin/target/nomad/plugin/state.go-169-      q := &api.QueryOptions{
nomad-autoscaler/plugins/builtin/target/nomad/plugin/state.go-170-              Namespace: jsh.namespace,
nomad-autoscaler/plugins/builtin/target/nomad/plugin/state.go:171:              WaitTime:  5 * time.Minute,
nomad-autoscaler/plugins/builtin/target/nomad/plugin/state.go-172-              WaitIndex: 1,
nomad-autoscaler/plugins/builtin/target/nomad/plugin/state.go-173-      }
nomad-autoscaler/policy/nomad/source.go-90-     s.log.Debug("starting policy blocking query watcher")
nomad-autoscaler/policy/nomad/source.go-91-
nomad-autoscaler/policy/nomad/source.go:92:     q := &api.QueryOptions{WaitTime: 5 * time.Minute, WaitIndex: 1}
nomad-autoscaler/policy/nomad/source.go-93-
nomad-autoscaler/policy/nomad/source.go-94-     for {
--
nomad-autoscaler/policy/nomad/source.go-180-    log.Trace("starting policy blocking query watcher")
nomad-autoscaler/policy/nomad/source.go-181-
nomad-autoscaler/policy/nomad/source.go:182:    q := &api.QueryOptions{WaitTime: 5 * time.Minute, WaitIndex: 1}
nomad-autoscaler/policy/nomad/source.go-183-    for {
nomad-autoscaler/policy/nomad/source.go-184-            var (

As-is, these WaitTime params have a statically defined value of five minutes (300000ms).
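
For reference, here is a minimal sketch of the blocking-query loop those call sites follow, using QueryOptions from the github.com/hashicorp/nomad/api package. The client variable and the handlePolicies helper are illustrative placeholders, not the autoscaler's actual code:

// Sketch only: client is an *api.Client; handlePolicies stands in for the
// autoscaler's own processing of the returned policies.
func watchPolicies(client *api.Client, namespace string) error {
	q := &api.QueryOptions{
		Namespace: namespace,
		WaitTime:  5 * time.Minute, // serialized as wait=300000ms in the request URL
		WaitIndex: 1,
	}
	for {
		policies, meta, err := client.Scaling().ListPolicies(q)
		if err != nil {
			// With an LB idle timeout shorter than WaitTime, this is where
			// the spurious "EOF" errors shown below surface.
			return err
		}
		// Only re-process when the Raft index actually advances.
		if meta.LastIndex > q.WaitIndex {
			q.WaitIndex = meta.LastIndex
			handlePolicies(policies)
		}
	}
}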

nomad-autoscaler => Nomad API Connectivity

Our deployment of nomad-autoscaler communicates with its associated Nomad clusters via an AWS load balancer that has an "idle timeout" of 60 seconds (AWS's default).

Issue

This static five-minute wait time, combined with our LB's 60-second idle timeout, leads to failed requests whenever there are no changes to the underlying Nomad state within the idle timeout window, e.g. no scaling policy modifications or the like within a minute of initiating a list scaling policies request.

Within our autoscaler deployment's logs, this issue manifests in entries like the following:

List Policies
{
	"message": "encountered an error monitoring policy IDs",
	"attributes": {
		"@level": "error",
		"service": "nomad-autoscaler",
		"@module": "policy_manager",
		"source": "stderr",
		"error": "failed to call the Nomad list policies API: Get \"https://<our_nomad>:4646/v1/scaling/policies?index=47368483&namespace=<our_namespace>&wait=300000ms\": EOF",
	}
}
Get Policy
{
	"message": "encountered an error monitoring policy",
	"attributes": {
		"@level": "error",
		"@timestamp": "2023-11-02T15:14:09.268724Z",
		"policy_id": "83e13b15-0150-4333-2ed8-f9aa2f7b74c4",
		"service": "nomad-autoscaler",
		"@module": "policy_manager.policy_handler",
		"source": "stderr",
		"error": "failed to get policy: Get \"https://<our_nomad>:4646/v1/scaling/policy/83e13b15-0150-4333-2ed8-f9aa2f7b74c4?index=47299148&namespace=<our_namespace>&wait=300000ms\": EOF",
	}
}
Get Job Scale Status
{
	"message": "failed to get target status",
	"attributes": {
		"@level": "warn",
		"@timestamp": "2023-11-02T11:30:27.287723Z",
		"policy_id": "83e13b15-0150-4333-2ed8-f9aa2f7b74c4",
		"service": "nomad-autoscaler",
		"@module": "policy_manager.policy_handler",
		"source": "stderr",
		"error": "Get \"https://<our_nomad>:4646/v1/job/infra-experiencer/scale?index=47299822&namespace=cloud-infrastructure-api&wait=300000ms\": EOF",
	}
}
{
	"message": "Get \"https://<our_nomad>:4646/v1/job/infra-experiencer/scale?index=47299822&namespace=<our_namespace>&wait=300000ms\": EOF",
	"attributes": {
		"@level": "error",
		"@timestamp": "2023-11-02T11:30:27.287780Z",
		"policy_id": "83e13b15-0150-4333-2ed8-f9aa2f7b74c4",
		"service": "nomad-autoscaler",
		"@module": "policy_manager.policy_handler",
		"source": "stderr",
	}
}

There is the added nuance of the random wait / 16 "jitter" delay called out in the Nomad API docs, which is also relevant here. Even if the LB idle timeout were five minutes rather than one minute, we might still see occasional errors, as Nomad might have changes that "unblock" a query after 4.9999 minutes of waiting and then take an additional 0-18.75 seconds to return a response 🙃. That may sound silly to call out, but it did lead to some confusion on my part when my first attempt at updating the WaitTime param values to one minute didn't resolve all of our logged autoscaler errors.
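
To make that jitter arithmetic concrete, a quick back-of-the-envelope sketch (just the numbers from the Nomad API docs, not autoscaler code):

package main

import (
	"fmt"
	"time"
)

func main() {
	waitTime := 5 * time.Minute
	// The server can add up to wait/16 of random jitter on top of the
	// supplied maximum wait time.
	maxJitter := waitTime / 16
	fmt.Println(maxJitter)            // 18.75s
	fmt.Println(waitTime + maxJitter) // 5m18.75s, which must stay under the LB idle timeout
}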

Desired Outcome

Make the nomad-autoscaler WaitTime values configurable, perhaps via some sort of blocking_query_wait_time_duration option under the nomad {} configuration block?
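
For illustration only, a rough sketch of how such an option could be plumbed into the existing call sites; the blocking_query_wait_time_duration name and the config struct below are hypothetical, not existing autoscaler code:

// Hypothetical option under the nomad {} block, e.g.
//   nomad { blocking_query_wait_time_duration = "50s" }
type nomadConfig struct {
	BlockingQueryWaitTime time.Duration
}

func queryOptions(cfg nomadConfig, namespace string) *api.QueryOptions {
	wait := cfg.BlockingQueryWaitTime
	if wait == 0 {
		wait = 5 * time.Minute // fall back to today's hard-coded default
	}
	return &api.QueryOptions{
		Namespace: namespace,
		WaitTime:  wait,
		WaitIndex: 1,
	}
}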


jrasell commented Nov 3, 2023

Hi @jeffwecan, and thanks for raising this issue with such great detail. I agree this would be a useful addition to help operators better control this aspect of the autoscaler's internals.
