Failed to ACK policy evaluation #343
Comments
Here are the logs I captured the last time this issue occurred and our cluster queue stalled. (The above is just the most recent log message; I am not sure whether the queue stalled after it.)
Thanks for the report and all the log info. I am wondering if this is related to #303 🤔 I am currently redesigning how policies are evaluated to avoid this potential source of race conditions, where the worker gets stuck waiting for all checks to complete. I will have a better update in the coming days.
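For anyone following along, the failure mode described above (a worker blocked until every check reports back) can be illustrated with a minimal sketch. This is not the autoscaler's actual code: `checkFunc`, `collectChecks`, and `runDemo` are hypothetical names invented for the example, and the timeout value is arbitrary. It only shows the general idea of bounding the wait with a `context` deadline so that one hung check cannot stall the caller indefinitely.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// checkFunc is a hypothetical stand-in for a single policy check.
type checkFunc func(ctx context.Context) (string, error)

// collectChecks runs all checks concurrently and gathers their results,
// but stops waiting as soon as ctx is done instead of blocking forever.
func collectChecks(ctx context.Context, checks []checkFunc) ([]string, error) {
	type outcome struct {
		name string
		err  error
	}
	// Buffered so a late check can still send and exit after we stop reading.
	out := make(chan outcome, len(checks))
	for _, c := range checks {
		go func(c checkFunc) {
			name, err := c(ctx)
			out <- outcome{name, err}
		}(c)
	}

	results := make([]string, 0, len(checks))
	for range checks {
		select {
		case o := <-out:
			if o.err == nil {
				results = append(results, o.name)
			}
		case <-ctx.Done():
			// Give up on the remaining checks and report why.
			return results, ctx.Err()
		}
	}
	return results, nil
}

// runDemo runs one fast check and one check that simulates a hang,
// and reports how many results arrived and whether the wait timed out.
func runDemo() (int, bool) {
	fast := func(ctx context.Context) (string, error) { return "fast", nil }
	hung := func(ctx context.Context) (string, error) {
		time.Sleep(time.Hour) // simulates a check that never reports back
		return "hung", nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	results, err := collectChecks(ctx, []checkFunc{fast, hung})
	return len(results), errors.Is(err, context.DeadlineExceeded)
}

func main() {
	n, timedOut := runDemo()
	fmt.Printf("collected %d result(s), timed out: %v\n", n, timedOut)
}
```

Without the `ctx.Done()` case, the loop in `collectChecks` would wait on the hung check indefinitely, which is the kind of stall reported in this issue.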
@lgfa29 I notice in the linked issue you suggested trying Thanks! Looking forward to your update.
Thanks! That's good to know.
Closed by #354
We have seen this error a number of times in our autoscaler logs, and have recently observed that when it occurs, the affected queue (`horizontal`, `cluster`) no longer evaluates policies.

Instances of the error:

Logs around the most recent instance:
In specific cases, our cluster autoscaler failed to evaluate any further policies, and jobs were blocked in the cluster for a number of hours. After a restart of the `nomad-autoscaler`, everything behaved as usual. This has occurred roughly every 3-5 days, but the timing is not consistent. We have resorted to a nightly restart to work around it.

I don't currently have more debugging information or steps to reproduce this state, but am happy to provide anything I can to help debug.