Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autoscaler blocked #303

Closed
jorgemarey opened this issue Oct 29, 2020 · 3 comments
Closed

Autoscaler blocked #303

jorgemarey opened this issue Oct 29, 2020 · 3 comments
Labels

Comments

@jorgemarey
Copy link
Contributor

Autoscaler version: 0.1.0

This is a log post, but I'll try to give as much detail as I can.

Hi, we found that sometimes the autoscaler gets blocked and doesn't perform scaling actions. We don't know if this is general or only for one policy as what I'm going to describe happened when we're testing only with one job and one policy who had two checks. Maybe it's related to #218, but I'm not sure. After we saw the autoscaler blocked (no more logs in the output) we killed it and get the goroutine dump to see what happened. We saw the following (removed some goroutines that we're not important):

goroutine 11 [select, 41 minutes]:
github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler(0xc00016f340, 0x156eba0, 0xc00016f380, 0xc000288600)
	/home/circleci/project/project/agent/agent.go:75 +0x119
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).Run
	/home/circleci/project/project/agent/agent.go:66 +0x35c

goroutine 606885 [chan send, 1439 minutes]:
github.com/hashicorp/nomad-autoscaler/policy/nomad.(*Source).MonitorPolicy(0xc0002352e0, 0x156eba0, 0xc00016e580, 0xc00041e0c0, 0x24, 0xc00009e360, 0xc00009e480, 0xc00009e300)
	/home/circleci/project/project/policy/nomad/source.go:157 +0x67f
created by github.com/hashicorp/nomad-autoscaler/policy.(*Handler).Run
	/home/circleci/project/project/policy/handler.go:107 +0x250

goroutine 615171 [select, 1397 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy(0xc0002963e0, 0x156eba0, 0xc00016f380, 0xc00049bce0)
	/home/circleci/project/project/policy/worker.go:114 +0xd4a
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler
	/home/circleci/project/project/agent/agent.go:81 +0x75

goroutine 615891 [select, 1386 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy(0xc00000d520, 0x156eba0, 0xc00016f380, 0xc00049bce0)
	/home/circleci/project/project/policy/worker.go:114 +0xd4a
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler
	/home/circleci/project/project/agent/agent.go:81 +0x75

goroutine 621239 [select, 1297 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy(0xc000297560, 0x156eba0, 0xc00016f380, 0xc00049bce0)
	/home/circleci/project/project/policy/worker.go:114 +0xd4a
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler
	/home/circleci/project/project/agent/agent.go:81 +0x75

goroutine 621270 [select, 1297 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*checkHandler).start(0xc00016ee00, 0x156eba0, 0xc00016ed40)
	/home/circleci/project/project/policy/worker.go:312 +0x10ac
created by github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy
	/home/circleci/project/project/policy/worker.go:51 +0x298

goroutine 618878 [select, 1331 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy(0xc000297280, 0x156eba0, 0xc00016f380, 0xc00049bce0)
	/home/circleci/project/project/policy/worker.go:114 +0xd4a
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler
	/home/circleci/project/project/agent/agent.go:81 +0x75

goroutine 611890 [select, 1443 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy(0xc0003c2be0, 0x156eba0, 0xc00016f380, 0xc0000311a0)
	/home/circleci/project/project/policy/worker.go:114 +0xd4a
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler
	/home/circleci/project/project/agent/agent.go:81 +0x75

goroutine 611955 [select, 1443 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*checkHandler).start(0xc00039fbc0, 0x156eba0, 0xc00039fb40)
	/home/circleci/project/project/policy/worker.go:312 +0x10ac
created by github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy
	/home/circleci/project/project/policy/worker.go:51 +0x298

goroutine 619739 [select, 1320 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy(0xc0005fae40, 0x156eba0, 0xc00016f380, 0xc00049bce0)
	/home/circleci/project/project/policy/worker.go:114 +0xd4a
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler
	/home/circleci/project/project/agent/agent.go:81 +0x75

goroutine 621268 [select, 1297 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy(0xc0003c2360, 0x156eba0, 0xc00016f380, 0xc00049bce0)
	/home/circleci/project/project/policy/worker.go:114 +0xd4a
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler
	/home/circleci/project/project/agent/agent.go:81 +0x75

goroutine 615893 [select, 1386 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*checkHandler).start(0xc00039e700, 0x156eba0, 0xc00039e640)
	/home/circleci/project/project/policy/worker.go:312 +0x10ac
created by github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy
	/home/circleci/project/project/policy/worker.go:51 +0x298

goroutine 611953 [select, 1443 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy(0xc0003c3380, 0x156eba0, 0xc00016f380, 0xc0000311a0)
	/home/circleci/project/project/policy/worker.go:114 +0xd4a
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler
	/home/circleci/project/project/agent/agent.go:81 +0x75

goroutine 618854 [select, 1330 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*checkHandler).start(0xc00039e280, 0x156eba0, 0xc00039e200)
	/home/circleci/project/project/policy/worker.go:312 +0x10ac
created by github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy
	/home/circleci/project/project/policy/worker.go:51 +0x298

goroutine 618852 [select, 1330 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy(0xc000508020, 0x156eba0, 0xc00016f380, 0xc00049bce0)
	/home/circleci/project/project/policy/worker.go:114 +0xd4a
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler
	/home/circleci/project/project/agent/agent.go:81 +0x75

goroutine 615173 [select, 1397 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*checkHandler).start(0xc00031c580, 0x156eba0, 0xc00031c500)
	/home/circleci/project/project/policy/worker.go:312 +0x10ac
created by github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy
	/home/circleci/project/project/policy/worker.go:51 +0x298

goroutine 620967 [select, 1298 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*checkHandler).start(0xc00003f8c0, 0x156eba0, 0xc00003f800)
	/home/circleci/project/project/policy/worker.go:312 +0x10ac
created by github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy
	/home/circleci/project/project/policy/worker.go:51 +0x298

goroutine 620965 [select, 1298 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy(0xc0003c2040, 0x156eba0, 0xc00016f380, 0xc00049bce0)
	/home/circleci/project/project/policy/worker.go:114 +0xd4a
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler
	/home/circleci/project/project/agent/agent.go:81 +0x75

goroutine 621245 [select, 1297 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy(0xc0002976c0, 0x156eba0, 0xc00016f380, 0xc00049bce0)
	/home/circleci/project/project/policy/worker.go:114 +0xd4a
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler
	/home/circleci/project/project/agent/agent.go:81 +0x75

goroutine 621241 [select, 1297 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*checkHandler).start(0xc00031cd00, 0x156eba0, 0xc00031cc80)
	/home/circleci/project/project/policy/worker.go:312 +0x10ac
created by github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy
	/home/circleci/project/project/policy/worker.go:51 +0x298

goroutine 618961 [select, 1330 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*checkHandler).start(0xc00016ea00, 0x156eba0, 0xc00016e980)
	/home/circleci/project/project/policy/worker.go:312 +0x10ac
created by github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy
	/home/circleci/project/project/policy/worker.go:51 +0x298

goroutine 618880 [select, 1331 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*checkHandler).start(0xc00016fc40, 0x156eba0, 0xc00016fac0)
	/home/circleci/project/project/policy/worker.go:312 +0x10ac
created by github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy
	/home/circleci/project/project/policy/worker.go:51 +0x298

goroutine 618927 [select, 1330 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy(0xc0003c2920, 0x156eba0, 0xc00016f380, 0xc00049bce0)
	/home/circleci/project/project/policy/worker.go:114 +0xd4a
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler
	/home/circleci/project/project/agent/agent.go:81 +0x75

goroutine 619741 [select, 1320 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*checkHandler).start(0xc00016edc0, 0x156eba0, 0xc00016ec40)
	/home/circleci/project/project/policy/worker.go:312 +0x10ac
created by github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy
	/home/circleci/project/project/policy/worker.go:51 +0x298

goroutine 611918 [select, 1443 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*checkHandler).start(0xc0002e2cc0, 0x156eba0, 0xc0002e2c40)
	/home/circleci/project/project/policy/worker.go:312 +0x10ac
created by github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy
	/home/circleci/project/project/policy/worker.go:51 +0x298

goroutine 611892 [select, 1443 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*checkHandler).start(0xc00039f440, 0x156eba0, 0xc00039f3c0)
	/home/circleci/project/project/policy/worker.go:312 +0x10ac
created by github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy
	/home/circleci/project/project/policy/worker.go:51 +0x298

goroutine 611916 [select, 1443 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy(0xc00000dbc0, 0x156eba0, 0xc00016f380, 0xc0000311a0)
	/home/circleci/project/project/policy/worker.go:114 +0xd4a
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).runEvalHandler
	/home/circleci/project/project/agent/agent.go:81 +0x75

goroutine 621247 [select, 1297 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*checkHandler).start(0xc00031d080, 0x156eba0, 0xc00031d000)
	/home/circleci/project/project/policy/worker.go:312 +0x10ac
created by github.com/hashicorp/nomad-autoscaler/policy.(*Worker).HandlePolicy
	/home/circleci/project/project/policy/worker.go:51 +0x298

Here there're 4 kinds on goroutine blocks.

There're 13 goroutines blocked on policy/worker.go:114 (Worker.HandlePolicy) and another 13 on worker.go:312 (checkHandler.start)

Looking at the code it seems like if the worker reached past the proceedCh send select before the checkHandler reached the proceedCh receive.

for _, handler := range checks {
select {
case handler.proceedCh <- handler == winningHandler:
default:
}
}
// Block until winning handler returns.
select {
case <-ctx.Done():
logger.Info("policy evaluation canceled")

// Send result back and wait to see if we should proceed.
h.resultCh <- result
select {
case <-ctx.Done():
return
case proceed := <-h.proceedCh:
if !proceed {
logger.Debug("check not selected")
return
}
}

That it's what makes both goroutines block there indefinitely. It's kind of weird because for the worker to reach that select it has to previously receive a result from each check and that's made just before the checkHandler select on proceedCh but I guess that depending on the execution it could happen.


Other of the blocking points it's on source.go:157 (Source.MonitorPolicy)

if err != nil {
req.ErrCh <- fmt.Errorf("failed to get policy: %v", err)
time.Sleep(10 * time.Second)
continue
}

And the last one is on agent.go:75 (Agent.runEvalHandler)

func (a *Agent) runEvalHandler(ctx context.Context, evalCh chan *policy.Evaluation) {
for {
select {
case <-ctx.Done():
a.logger.Info("context closed, shutting down eval handler")
return
case policyEval := <-evalCh:
w := policy.NewWorker(a.logger, a.pluginManager, a.policyManager)
go w.HandlePolicy(ctx, policyEval.Policy)
}
}
}

It seems that it can't send the the error on the channel (line 118) and no evaluations are received (line 155) because something is blocking on this select

case err := <-h.errCh:
// In case of error, log the error message and loop around.
// Handlers never stop running unless ctx.Done() or doneCh is
// closed.
// multierror.Error objects are logged differently to allow for a
// more structured output.
merr, ok := err.(*multierror.Error)
if ok && len(merr.Errors) > 1 {
// Transform Errors into a slice of strings to avoid logging
// empty objects when using JSON format.
errors := make([]string, len(merr.Errors))
for i, e := range merr.Errors {
errors[i] = e.Error()
}
h.log.Error(errors[0], "errors", errors[1:])
} else {
h.log.Error(err.Error())
}
continue
case p := <-h.ch:
h.updateHandler(currentPolicy, &p)
currentPolicy = &p
case <-h.ticker.C:
eval, err := h.handleTick(ctx, currentPolicy)
if err != nil {
if err == context.Canceled {
// Context was canceled, return to stop the handler.
return
}
h.log.Error(err.Error())
continue
}
if eval != nil {
evalCh <- eval
}


These are the last logs of the autoscaler. Aftter 2020-10-28T14:13:53.013Z there're no more logs until I sent the kill signal at
around 2020-10-28T14:54:00.

2020-10-28T14:13:43.003Z [INFO]  agent.worker: received policy for evaluation: policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf target=nomad-target
2020-10-28T14:13:43.003Z [INFO]  agent.worker.check_handler: received policy check for evaluation: check=requests_per_instance policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target
2020-10-28T14:13:43.004Z [INFO]  agent.worker.check_handler: fetching current count: check=requests_per_instance policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target
2020-10-28T14:13:43.004Z [INFO]  agent.worker.check_handler: querying source: check=requests_per_instance policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target query="scalar(ceil(service:http_requests_per_instance:rate30s{nomad_job='demo-scaling-test', nomad_group='api',local_cluster='demo-scaling-test'}))"
2020-10-28T14:13:43.004Z [INFO]  agent.worker.check_handler: received policy check for evaluation: check=cpu_usage policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target
2020-10-28T14:13:43.004Z [INFO]  agent.worker.check_handler: fetching current count: check=cpu_usage policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target
2020-10-28T14:13:43.004Z [INFO]  agent.worker.check_handler: querying source: check=cpu_usage policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target query="scalar(avg(avg_over_time(allocation:cpu_usage:percent{exported_job='demo-scaling-test', task='api'}[1m])))"
2020-10-28T14:13:43.007Z [INFO]  agent.worker.check_handler: calculating new count: check=requests_per_instance policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target count=2 metric=0
2020-10-28T14:13:43.007Z [INFO]  agent.worker.check_handler: calculating new count: check=cpu_usage policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target count=2 metric=0.4280823916196823
2020-10-28T14:13:43.007Z [INFO]  agent.worker.check_handler: nothing to do: check=cpu_usage policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target from=2 to=2
2020-10-28T14:13:43.007Z [INFO]  agent.worker.check_handler: nothing to do: check=requests_per_instance policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target from=2 to=2
2020-10-28T14:13:43.007Z [INFO]  agent.worker: no checks need to be executed: policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf target=nomad-target
2020-10-28T14:13:53.004Z [INFO]  agent.worker: received policy for evaluation: policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf target=nomad-target
2020-10-28T14:13:53.004Z [INFO]  agent.worker.check_handler: received policy check for evaluation: check=requests_per_instance policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target
2020-10-28T14:13:53.004Z [INFO]  agent.worker.check_handler: fetching current count: check=requests_per_instance policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target
2020-10-28T14:13:53.004Z [INFO]  agent.worker.check_handler: querying source: check=requests_per_instance policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target query="scalar(ceil(service:http_requests_per_instance:rate30s{nomad_job='demo-scaling-test', nomad_group='api',local_cluster='demo-scaling-test'}))"
2020-10-28T14:13:53.004Z [INFO]  agent.worker.check_handler: received policy check for evaluation: check=cpu_usage policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target
2020-10-28T14:13:53.004Z [INFO]  agent.worker.check_handler: fetching current count: check=cpu_usage policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target
2020-10-28T14:13:53.004Z [INFO]  agent.worker.check_handler: querying source: check=cpu_usage policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target query="scalar(avg(avg_over_time(allocation:cpu_usage:percent{exported_job='demo-scaling-test', task='api'}[1m])))"
2020-10-28T14:13:53.013Z [INFO]  agent.worker.check_handler: calculating new count: check=requests_per_instance policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target count=2 metric=0
2020-10-28T14:13:53.013Z [INFO]  agent.worker.check_handler: nothing to do: check=requests_per_instance policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target from=2 to=2
2020-10-28T14:13:53.013Z [INFO]  agent.worker.check_handler: calculating new count: check=cpu_usage policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target count=2 metric=0.359704169134299
2020-10-28T14:13:53.013Z [INFO]  agent.worker.check_handler: nothing to do: check=cpu_usage policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf source=prometheus strategy=target-value target=nomad-target from=2 to=2
2020-10-28T14:13:53.013Z [INFO]  agent.worker: no checks need to be executed: policy_id=2270f70e-ffed-780f-81f1-fbf4303cceaf target=nomad-target

I don't know if there're two different issues here or it's all the same. I'll comment if we see something else.

@jorgemarey
Copy link
Contributor Author

Looking a bit more in detail the goroutine dump. I found out this ones:

goroutine 8 [semacquire, 9 minutes]:
sync.runtime_SemacquireMutex(0xc00016f3e4, 0x0, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00016f3e0)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
sync.(*RWMutex).Lock(0xc00016f3e0)
	/usr/local/go/src/sync/rwmutex.go:98 +0x97
github.com/hashicorp/nomad-autoscaler/plugins/builtin/target/nomad/plugin.(*TargetPlugin).garbageCollect(0xc00016f3c0)
	/home/circleci/project/project/plugins/builtin/target/nomad/plugin/plugin.go:185 +0xd3
github.com/hashicorp/nomad-autoscaler/plugins/builtin/target/nomad/plugin.(*TargetPlugin).garbageCollectionLoop(0xc00016f3c0)
	/home/circleci/project/project/plugins/builtin/target/nomad/plugin/plugin.go:173 +0xa6
created by github.com/hashicorp/nomad-autoscaler/plugins/builtin/target/nomad/plugin.(*TargetPlugin).SetConfig
	/home/circleci/project/project/plugins/builtin/target/nomad/plugin/plugin.go:83 +0x11c

goroutine 10 [select, 9 minutes]:
github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run(0xc0000b9900, 0x156eba0, 0xc00016f380, 0xc000288600)
	/home/circleci/project/project/policy/manager.go:61 +0x672
created by github.com/hashicorp/nomad-autoscaler/agent.(*Agent).Run
	/home/circleci/project/project/agent/agent.go:63 +0x313

goroutine 686002 [chan receive, 9 minutes]:
github.com/hashicorp/nomad-autoscaler/plugins/builtin/target/nomad/plugin.(*TargetPlugin).Status(0xc00016f3c0, 0xc00050ac00, 0x0, 0x0, 0x0)
	/home/circleci/project/project/plugins/builtin/target/nomad/plugin/plugin.go:154 +0x46c
github.com/hashicorp/nomad-autoscaler/policy.(*Handler).generateEvaluation(0xc0002e8360, 0xc0002dc960, 0x1f1e300, 0xc000326240, 0xc000381f10)
	/home/circleci/project/project/policy/handler.go:269 +0x196
github.com/hashicorp/nomad-autoscaler/policy.(*Handler).handleTick(0xc0002e8360, 0x156eba0, 0xc00016f380, 0xc0002dc960, 0xc00039e901, 0xc0004e40c0, 0x24)
	/home/circleci/project/project/policy/handler.go:189 +0xd5
github.com/hashicorp/nomad-autoscaler/policy.(*Handler).Run(0xc0002e8360, 0x156eba0, 0xc00016f380, 0xc000288600)
	/home/circleci/project/project/policy/handler.go:144 +0x6ed
github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run.func1(0xc0002e8360, 0x156eba0, 0xc00016f380, 0xc000288600, 0xc0000b9900, 0xc0004e40c0, 0x24)
	/home/circleci/project/project/policy/manager.go:105 +0x4d
created by github.com/hashicorp/nomad-autoscaler/policy.(*Manager).Run
	/home/circleci/project/project/policy/manager.go:104 +0xb06

The last one of that could be the one tha's causing the select on handler.go block ( nomad-autoscaler/policy.(*Handler).handleTick ). But the times doesn't match, this was only blocked 9 minutes while the goroutine waiting for the errCh to be available was waiting 1439 minutes.

@lgfa29
Copy link
Contributor

lgfa29 commented Oct 29, 2020

Thank you so much for the detailed report @jorgemarey.

I think your assessment is right and a race condition will trigger this deadlock where the worker and the handler are both stuck waiting on each other.

The 0.2.0-beta2 we released yesterday would mitigate this slightly because we now have multiple workers handling policies. But they could still get stuck, it's just that there would be other workers to keep picking up policies.

In hindsight, this design choice of blocking/unblocking the handler Goroutines was not great. We will try and change it for the next release.

In the mean time, would it possible for you to test it agains 0.2.0-beta2?

@lgfa29
Copy link
Contributor

lgfa29 commented Jan 28, 2021

Closed by #354

@lgfa29 lgfa29 closed this as completed Jan 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants