Autoscaler blocked #303
Comments
Looking a bit more in detail at the goroutine dump, I found these ones:

The last one of those could be the one that's causing the select in handler.go to block.
Thank you so much for the detailed report @jorgemarey. I think your assessment is right: a race condition will trigger this deadlock, where the worker and the handler are both stuck waiting on each other. In hindsight, this design choice of blocking/unblocking the handler goroutines was not great. We will try to change it for the next release. In the meantime, would it be possible for you to test it against
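For illustration, one common way to break this kind of lock-step handshake is to give the proceed channel a one-slot buffer, so the sender can never wedge on it. This is only a sketch of the general idea, not necessarily what the actual change does:

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx := context.Background()

	// Hypothetical: a one-slot buffer removes the rendezvous requirement,
	// so the worker's send succeeds even if no handler is receiving yet.
	proceedCh := make(chan bool, 1)

	// Worker side: with the buffer, this send can no longer block the worker.
	select {
	case <-ctx.Done():
	case proceedCh <- true:
	}

	// Handler side: picks up the buffered value whenever it gets here.
	select {
	case <-ctx.Done():
	case proceed := <-proceedCh:
		fmt.Println("proceed:", proceed)
	}
}
```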
Closed by #354
Autoscaler version: 0.1.0
This is a long post, but I'll try to give as much detail as I can.
Hi, we found that sometimes the autoscaler gets blocked and doesn't perform scaling actions. We don't know if this is general or only for one policy, as what I'm going to describe happened while we were testing with only one job and one policy that had two checks. Maybe it's related to #218, but I'm not sure. After we saw the autoscaler blocked (no more logs in the output), we killed it and got a goroutine dump to see what happened. We saw the following (I removed some goroutines that weren't important):
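(As a side note on capturing this kind of dump: an uncaught SIGQUIT, e.g. `kill -QUIT <pid>`, makes the Go runtime print all goroutine stacks before exiting. A process can also write a dump on demand without dying; a minimal sketch using the standard `runtime/pprof` API:

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	// Dump all goroutine stacks to stderr when SIGUSR1 arrives, without
	// killing the process. (An uncaught SIGQUIT also dumps, but then exits.)
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGUSR1)
	go func() {
		for range sigCh {
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()

	select {} // stand-in for the program's real work
}
```
)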
There are 4 kinds of goroutine blocks here.

There are 13 goroutines blocked on `policy/worker.go:114` (`Worker.HandlePolicy`) and another 13 on `worker.go:312` (`checkHandler.start`). Looking at the code, it seems as if the worker got past the `proceedCh` send select before the checkHandler reached the `proceedCh` receive.

nomad-autoscaler/policy/worker.go
Lines 106 to 116 in 037dba2

nomad-autoscaler/policy/worker.go
Lines 310 to 320 in 037dba2
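To make the failure mode concrete, here is a minimal, hypothetical reproduction of that kind of circular wait on unbuffered channels (not the actual autoscaler code): each side is blocked on a send that the other side will never come back to receive.

```go
package main

func main() {
	resultCh := make(chan string) // checkHandler -> worker
	proceedCh := make(chan bool)  // worker -> checkHandler

	// "checkHandler": report results, then wait for permission to proceed.
	go func() {
		resultCh <- "check-a" // received by the worker below
		resultCh <- "check-b" // blocks forever: the worker stopped receiving
		<-proceedCh           // never reached
	}()

	// "worker": believes all results are in after one receive...
	<-resultCh

	// ...and moves on to the proceed phase while the handler is still stuck
	// on its second result send. Neither side can ever advance; the Go
	// runtime aborts with "all goroutines are asleep - deadlock!".
	proceedCh <- true
}
```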
That's what makes both goroutines block there indefinitely. It's kind of weird, because for the worker to reach that select it has to have previously received a result from each check, and that happens just before the checkHandler selects on `proceedCh`, but I guess that depending on the execution order it could happen.

Another of the blocking points is on `source.go:157` (`Source.MonitorPolicy`):

nomad-autoscaler/policy/nomad/source.go
Lines 155 to 161 in 037dba2
And the last one is on `agent.go:75` (`Agent.runEvalHandler`):

nomad-autoscaler/agent/agent.go
Lines 73 to 84 in 037dba2

It seems that it can't send the error on the channel (line 118) and no evaluations are received (line 155) because something is blocking on this select:

nomad-autoscaler/policy/handler.go
Lines 118 to 157 in 037dba2
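That chain is easy to reproduce in miniature: with unbuffered channels, a single wedged consumer stalls every sender upstream of it and starves every receiver downstream. A hedged sketch, whose names only loosely mirror the autoscaler's:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	policyCh := make(chan string) // source -> handler (unbuffered)
	evalCh := make(chan string)   // handler -> agent (unbuffered)
	stuck := make(chan struct{})  // never closed: stands in for the deadlock

	// "Source.MonitorPolicy": blocks forever on its send, because the
	// handler below never comes back to receive from policyCh.
	go func() {
		policyCh <- "updated policy"
	}()

	// "policy handler": wedged waiting on something that never happens, so
	// it neither drains policyCh nor feeds evalCh.
	go func() {
		<-stuck
		evalCh <- "eval" // never reached
	}()

	// "Agent.runEvalHandler": receives nothing, so no scaling ever runs.
	select {
	case ev := <-evalCh:
		fmt.Println("evaluation:", ev)
	case <-time.After(2 * time.Second):
		fmt.Println("no evaluations arrived: everything upstream is blocked")
	}
}
```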
These are the last logs of the autoscaler. After 2020-10-28T14:13:53.013Z there are no more logs until I sent the kill signal at around 2020-10-28T14:54:00.
I don't know if there are two different issues here or it's all the same. I'll comment if we see something else.