Failed to ACK policy evaluation #343

baxor · 2021-01-12T17:49:49Z

We have seen this error a number of times in our autoscaler logs, and have recently observed that when it occurs, the queue (horizontal, cluster), will no longer evaluate policies when this occurs.

Instances of the error:

➜ grep -i "failed to ack" autoscaler.std*
autoscaler.stderr (1).0:2020-12-03T16:06:01.519Z [WARN]  policy_eval.worker: failed to ACK policy evaluation: eval_id=2e9340a9-3240-c790-595e-adb36ba79bf9 eval_token=be2334fc-f8e3-d2a7-808b-b63f4de84e90 id=f8384e60-d84d-55af-d87e-b5d74e27b225 policy_id=7dea9a82-f882-a3aa-b3e6-2d95507d4aee queue=cluster err="evaluation ID not found"
autoscaler.stderr (1).0:2020-12-03T16:15:58.400Z [WARN]  policy_eval.worker: failed to ACK policy evaluation: eval_id=113ddce4-9e9b-01a9-d9a7-3d08218bf63b eval_token=590da9b6-963a-2358-6b31-95f180245a8d id=94de3d7d-3ede-533c-0f4d-def99d6bf90b policy_id=7dea9a82-f882-a3aa-b3e6-2d95507d4aee queue=cluster err="evaluation ID not found"
autoscaler.stderr (2).0:2020-12-04T23:47:12.386Z [WARN]  policy_eval.worker: failed to ACK policy evaluation: eval_id=b5680ba8-c996-80cc-8c2a-737cb1aaefe2 eval_token=cfd5ee89-20e8-4013-a8cc-0063b72fb062 id=a126ed15-6217-a687-867f-f9b0031adec4 policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster err="evaluation ID not found"
autoscaler.stderr (2).0:2020-12-05T00:06:16.068Z [WARN]  policy_eval.worker: failed to ACK policy evaluation: eval_id=52b398ea-5ebc-0154-07f5-21abe73c1579 eval_token=6f5be4d6-f9be-ca6a-6eec-e84b5b10a1d6 id=4fa97a14-f9c3-2a2b-7c76-406d93efe9d6 policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster err="evaluation ID not found"
autoscaler.stderr (2).0:2020-12-05T00:39:38.401Z [WARN]  policy_eval.worker: failed to ACK policy evaluation: eval_id=d3788eb6-228f-9d0d-d1de-19228a49f585 eval_token=3e4d8f33-235b-43a2-f1ad-dde659a71a4c id=a126ed15-6217-a687-867f-f9b0031adec4 policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster err="evaluation ID not found"
autoscaler.stderr (2).0:2020-12-05T04:42:27.426Z [WARN]  policy_eval.worker: failed to ACK policy evaluation: eval_id=5515d069-302c-277c-ce22-6664c38b49f1 eval_token=6ca4403e-e205-5659-351b-ca8d16c29d4d id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster err="evaluation ID not found"
autoscaler.stderr (4).0:2020-12-07T17:10:16.775Z [WARN]  policy_eval.worker: failed to ACK policy evaluation: eval_id=0d426741-27aa-1fc5-f314-51430c780bb2 eval_token=8ce4916a-8dcc-a848-8616-1b06d7ab1ab7 id=759376f0-9d62-6d3d-d942-a3a9c76934a7 policy_id=44d409b2-5cc8-eb60-4a4c-8365659445ac queue=cluster err="evaluation ID not found"
autoscaler.stderr (4).0:2020-12-07T17:15:16.775Z [WARN]  policy_eval.worker: failed to ACK policy evaluation: eval_id=c1b211cf-6182-c48e-b9b8-0cded1b5df0b eval_token=d2856c99-995f-850b-6df2-8ff5db26dcd7 id=e0054cb7-6961-fc52-54d6-85ba83255c73 policy_id=44d409b2-5cc8-eb60-4a4c-8365659445ac queue=cluster err="evaluation ID not found"
autoscaler.stderr (7).0:2021-01-06T10:36:31.853Z [WARN]  policy_eval.worker: failed to ACK policy evaluation: eval_id=24f41b4e-a477-6e2d-863c-b975fc11c04b eval_token=7644f014-ae99-1742-405f-8ba6146ff9df id=254213eb-a7a8-a30c-e426-00147acd2687 policy_id=a0e7bccb-c7fc-594d-5941-90b185f125af queue=cluster err="evaluation ID not found"
autoscaler.stderr.3:2020-12-29T00:39:39.495Z [WARN]  policy_eval.worker: failed to ACK policy evaluation: eval_id=58a1a191-32a3-e7af-208f-e2c63e63bbf6 eval_token=9ad81d92-af35-660d-20ed-004957a411ff id=d668a357-7491-9383-7e6c-83d729af3a2d policy_id=c9632fa1-53ba-86c3-cdc0-9e2458db9102 queue=cluster err="evaluation ID not found"

logs around the most recent instance:

2294 2020-12-07T17:10:16.775Z [INFO]  policy_eval.worker.check_handler: successfully submitted scaling action to target: check=applications-scale-memory-utilization id=759376f0-9d62-6d3d-d942-a3a9c76934a7 policy_id=44d409b2-5cc8-eb60     -4a4c-8365659445ac queue=cluster source=datadog strategy=target-value target=aws-asg desired_count=8
2295 2020-12-07T17:10:16.775Z [INFO]  policy_eval.worker: policy evaluation complete: id=759376f0-9d62-6d3d-d942-a3a9c76934a7 policy_id=44d409b2-5cc8-eb60-4a4c-8365659445ac queue=cluster target=aws-asg
2296 2020-12-07T17:10:16.775Z [DEBUG] policy_eval.broker: ack eval: eval_id=0d426741-27aa-1fc5-f314-51430c780bb2 token=8ce4916a-8dcc-a848-8616-1b06d7ab1ab7 eval_id=0d426741-27aa-1fc5-f314-51430c780bb2 token=8ce4916a-8dcc-a848-8616-1b     06d7ab1ab7
2297 2020-12-07T17:10:16.775Z [WARN]  policy_eval.worker: failed to ACK policy evaluation: eval_id=0d426741-27aa-1fc5-f314-51430c780bb2 eval_token=8ce4916a-8dcc-a848-8616-1b06d7ab1ab7 id=759376f0-9d62-6d3d-d942-a3a9c76934a7 policy_id     =44d409b2-5cc8-eb60-4a4c-8365659445ac queue=cluster err="evaluation ID not found"
2298 2020-12-07T17:10:16.775Z [DEBUG] policy_eval.broker: dequeue eval: queue=cluster
2299 2020-12-07T17:10:16.775Z [DEBUG] policy_eval.broker: waiting for eval: queue=cluster
2300 2020-12-07T17:10:16.775Z [DEBUG] policy_manager.policy_handler: scaling policy has been placed into cooldown: policy_id=44d409b2-5cc8-eb60-4a4c-8365659445ac cooldown=5m0s
2301 2020-12-07T17:10:53.668Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2302 2020-12-07T17:11:53.668Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection

In specific cases, our cluster autoscaler failed to evaluate any further policies, and jobs became blocked for a number of hours in the cluster. Following a restart of the nomad-autoscaler, everything behaved as usual. This has occurred on roughly a 3-5 day interval, but the timing is not consistent. We have taken to a nightly restart to work around this.

I don't currently have more debugging information or steps to reproduce this state, but am happy to provide anything I can to help debug.

The text was updated successfully, but these errors were encountered:

baxor · 2021-01-12T17:51:55Z

Here are logs I captured from the last time this issue occurred and our cluster queue was stalled (the above is just the most recent log message, but am not sure if the queue was stalled after this)

2020-12-05T04:42:28.035Z [DEBUG] policy_eval.worker: received policy for evaluation: id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster target=aws-asg
2020-12-05T04:42:28.035Z [DEBUG] policy_eval.worker.check_handler: received policy check for evaluation: check=applications-scale-blocked-evaluations id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg
2020-12-05T04:42:28.035Z [DEBUG] policy_eval.worker.check_handler: received policy check for evaluation: check=applications-scale-memory-utilization id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg
2020-12-05T04:42:28.035Z [DEBUG] policy_eval.worker.check_handler: received policy check for evaluation: check=applications-scale-cpu-utilization id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg
2020-12-05T04:42:28.035Z [DEBUG] policy_eval.worker.check_handler: fetching current count: check=applications-scale-cpu-utilization id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg
2020-12-05T04:42:28.035Z [DEBUG] policy_eval.worker.check_handler: fetching current count: check=applications-scale-memory-utilization id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg
2020-12-05T04:42:28.035Z [DEBUG] policy_eval.worker.check_handler: fetching current count: check=applications-scale-blocked-evaluations id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg
2020-12-05T04:42:28.338Z [INFO]  policy_eval.worker: target not ready: id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster target=aws-asg
2020-12-05T04:42:28.338Z [DEBUG] policy_eval.broker: ack eval: eval_id=b0a6205d-b496-ae3d-652f-1b00fb247fbd token=5d769a9e-83e9-469f-e2f7-a67f793833ea eval_id=b0a6205d-b496-ae3d-652f-1b00fb247fbd token=5d769a9e-83e9-469f-e2f7-a67f793833ea
2020-12-05T04:42:28.338Z [DEBUG] policy_eval.broker: eval ack'd: policy_id=d2be2d99-1569-a853-7148-1a226d6ff438
2020-12-05T04:42:28.338Z [DEBUG] policy_eval.broker: dequeue eval: queue=cluster
2020-12-05T04:42:28.338Z [DEBUG] policy_eval.broker: waiting for eval: queue=cluster
2020-12-05T04:42:28.338Z [DEBUG] policy_eval.broker: waiting for eval: queue=cluster
2020-12-05T04:42:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:43:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:44:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:45:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:46:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:47:27.789Z [DEBUG] policy_manager.policy_handler: scaling policy has been placed into cooldown: policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 cooldown=3m18.573697569s
2020-12-05T04:47:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:48:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:49:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:50:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:50:46.946Z [DEBUG] policy_eval.broker: enqueue eval: eval_id=03027e27-4a3b-1213-7187-e8cf5ae65664 policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T04:50:46.946Z [DEBUG] policy_eval.broker: eval enqueued: eval_id=03027e27-4a3b-1213-7187-e8cf5ae65664 policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T04:50:46.946Z [DEBUG] policy_eval.broker: eval dequeued: queue=cluster eval_id=03027e27-4a3b-1213-7187-e8cf5ae65664 policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 token=92aa7e23-6928-3f44-0c36-e5cc81f32d11
2020-12-05T04:50:46.946Z [DEBUG] policy_eval.worker: received policy for evaluation: id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster target=aws-asg
2020-12-05T04:50:46.946Z [DEBUG] policy_eval.worker.check_handler: received policy check for evaluation: check=applications-scale-blocked-evaluations id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg
2020-12-05T04:50:46.946Z [DEBUG] policy_eval.worker.check_handler: fetching current count: check=applications-scale-blocked-evaluations id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg
2020-12-05T04:50:46.946Z [DEBUG] policy_eval.worker.check_handler: received policy check for evaluation: check=applications-scale-cpu-utilization id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg
2020-12-05T04:50:46.946Z [DEBUG] policy_eval.worker.check_handler: fetching current count: check=applications-scale-cpu-utilization id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg
2020-12-05T04:50:46.946Z [DEBUG] policy_eval.worker.check_handler: received policy check for evaluation: check=applications-scale-memory-utilization id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg
2020-12-05T04:50:46.947Z [DEBUG] policy_eval.worker.check_handler: fetching current count: check=applications-scale-memory-utilization id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg
2020-12-05T04:50:47.254Z [DEBUG] policy_eval.worker.check_handler: querying source: check=applications-scale-blocked-evaluations id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg query="max:nomad.nomad.blocked_evals.total_blocked{region:us-east-1, environment:staging }" source=datadog
2020-12-05T04:50:47.316Z [DEBUG] policy_eval.worker.check_handler: querying source: check=applications-scale-cpu-utilization id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg query="max:nomad.client.allocated.cpu{region:us-east-1, class:applications, environment:staging }*100/(max:nomad.client.allocated.cpu{region:us-east-1, class:applications, environment:staging }+max:nomad.client.unallocated.cpu{region:us-east-1, class:applications, environment:staging })" source=datadog
2020-12-05T04:50:47.333Z [DEBUG] policy_eval.worker.check_handler: calculating new count: check=applications-scale-blocked-evaluations id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg count=8
2020-12-05T04:50:47.386Z [DEBUG] policy_eval.worker.check_handler: querying source: check=applications-scale-memory-utilization id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg query="max:nomad.client.allocated.memory{region:us-east-1, class:applications, environment:staging }*100/(max:nomad.client.allocated.memory{region:us-east-1, class:applications, environment:staging }+max:nomad.client.unallocated.memory{region:us-east-1, class:applications, environment:staging })" source=datadog
2020-12-05T04:50:47.432Z [DEBUG] policy_eval.worker.check_handler: calculating new count: check=applications-scale-cpu-utilization id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg count=8
2020-12-05T04:50:47.491Z [DEBUG] policy_eval.worker.check_handler: calculating new count: check=applications-scale-memory-utilization id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg count=8
2020-12-05T04:50:47.491Z [DEBUG] policy_eval.worker.check_handler: nothing to do: check=applications-scale-memory-utilization id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg
2020-12-05T04:50:47.491Z [DEBUG] policy_eval.worker.check_handler: check not selected: check=applications-scale-cpu-utilization id=1ab22a8e-2171-5013-b341-8f5489574e5a policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster source=datadog strategy=target-value target=aws-asg
2020-12-05T04:51:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:51:40.946Z [DEBUG] policy_eval.broker: enqueue eval: eval_id=8766b21e-ad5b-6319-7457-67f6349a1f86 policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T04:51:40.946Z [DEBUG] policy_eval.broker: policy already enqueued: eval_id=8766b21e-ad5b-6319-7457-67f6349a1f86 policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T04:51:40.946Z [DEBUG] policy_eval.broker: eval enqueued: eval_id=8766b21e-ad5b-6319-7457-67f6349a1f86 policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T04:52:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:53:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:53:40.910Z [DEBUG] policy_eval.broker: enqueue eval: eval_id=06e55135-0033-c4c2-3616-16f0d3eb9b0c policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T04:53:40.910Z [DEBUG] policy_eval.broker: policy already enqueued: eval_id=06e55135-0033-c4c2-3616-16f0d3eb9b0c policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T04:53:40.910Z [DEBUG] policy_eval.broker: eval enqueued: eval_id=06e55135-0033-c4c2-3616-16f0d3eb9b0c policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T04:54:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:55:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:55:40.957Z [DEBUG] policy_eval.broker: enqueue eval: eval_id=ddc870e7-f390-58b0-a8b6-4e697dce6d7f policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T04:55:40.957Z [DEBUG] policy_eval.broker: policy already enqueued: eval_id=ddc870e7-f390-58b0-a8b6-4e697dce6d7f policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T04:55:40.957Z [DEBUG] policy_eval.broker: eval enqueued: eval_id=ddc870e7-f390-58b0-a8b6-4e697dce6d7f policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T04:55:46.946Z [DEBUG] policy_eval.broker: nack eval: eval_id=03027e27-4a3b-1213-7187-e8cf5ae65664 token=92aa7e23-6928-3f44-0c36-e5cc81f32d11
2020-12-05T04:55:46.946Z [WARN]  policy_eval.broker: eval delivery limit reached: eval_id=03027e27-4a3b-1213-7187-e8cf5ae65664 policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 token=92aa7e23-6928-3f44-0c36-e5cc81f32d11 count=1 limit=1
2020-12-05T04:56:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:57:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:57:40.898Z [DEBUG] policy_eval.broker: enqueue eval: eval_id=58278b08-40fe-a7d9-49e7-a2c9cafa3b95 policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T04:57:40.898Z [DEBUG] policy_eval.broker: eval enqueued: eval_id=58278b08-40fe-a7d9-49e7-a2c9cafa3b95 policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T04:58:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:59:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T04:59:40.854Z [DEBUG] policy_eval.broker: enqueue eval: eval_id=a18fd220-c37a-9234-f7bb-d86a7bef02f7 policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T04:59:40.854Z [DEBUG] policy_eval.broker: policy already enqueued: eval_id=a18fd220-c37a-9234-f7bb-d86a7bef02f7 policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T04:59:40.854Z [DEBUG] policy_eval.broker: new eval is newer, policy updated: eval_id=a18fd220-c37a-9234-f7bb-d86a7bef02f7 policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T05:00:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T05:01:40.354Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-12-05T05:01:40.829Z [DEBUG] policy_eval.broker: enqueue eval: eval_id=be030dc9-4c91-d327-30b6-4e1abbc5ba0b policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T05:01:40.829Z [DEBUG] policy_eval.broker: policy already enqueued: eval_id=be030dc9-4c91-d327-30b6-4e1abbc5ba0b policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=
2020-12-05T05:01:40.829Z [DEBUG] policy_eval.broker: new eval is newer, policy updated: eval_id=be030dc9-4c91-d327-30b6-4e1abbc5ba0b policy_id=d2be2d99-1569-a853-7148-1a226d6ff438 queue=cluster token=```

lgfa29 · 2021-01-12T20:24:01Z

Thanks for the report and all the log info. I am wondering if this is related to #303 🤔

I am currently redesigning how policies are evaluated to avoid this potential source of race condition, where the worker gets stuck waiting for all checks to complete.

I will have a better update in the coming days.

baxor · 2021-01-13T22:42:11Z

Thanks for the report and all the log info. I am wondering if this is related to #303 🤔

I am currently redesigning how policies are evaluated to avoid this potential source of race condition, where the worker gets stuck waiting for all checks to complete.

I will have a better update in the coming days.

@lgfa29 I notice in the linked issue you suggested trying v0.2.0-beta2 and realized it is worth noting that we were on v0.2.0-beta3 at the time of hitting this bug. Currently on v0.2.0 release, and will update here if we run into this again.

Thanks! looking forward to your update.

lgfa29 · 2021-01-14T02:21:52Z

Thanks! That's good to know.

0.2.0-beta2 had a small change that would potentially make this error happen less frequently, but it was not a complete fix.

lgfa29 · 2021-01-28T16:20:25Z

Closed by #354

lgfa29 added stage/needs-investigation theme/policy-eval Policy broker, workers and evaluation type/bug labels Jan 12, 2021

lgfa29 mentioned this issue Jan 26, 2021

Don't block worker while evaluating a policy #354

Merged

lgfa29 closed this as completed Jan 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failed to ACK policy evaluation #343

Failed to ACK policy evaluation #343

baxor commented Jan 12, 2021 •

edited

Loading

baxor commented Jan 12, 2021

lgfa29 commented Jan 12, 2021

baxor commented Jan 13, 2021 •

edited

Loading

lgfa29 commented Jan 14, 2021

lgfa29 commented Jan 28, 2021

Failed to ACK policy evaluation #343

Failed to ACK policy evaluation #343

Comments

baxor commented Jan 12, 2021 • edited Loading

baxor commented Jan 12, 2021

lgfa29 commented Jan 12, 2021

baxor commented Jan 13, 2021 • edited Loading

lgfa29 commented Jan 14, 2021

lgfa29 commented Jan 28, 2021

baxor commented Jan 12, 2021 •

edited

Loading

baxor commented Jan 13, 2021 •

edited

Loading