Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

policy-worker: evaluation worker sporadically doesn't get unblocked to continue #218

Closed
jrasell opened this issue Jul 14, 2020 · 0 comments · Fixed by #354
Closed

policy-worker: evaluation worker sporadically doesn't get unblocked to continue #218

jrasell opened this issue Jul 14, 2020 · 0 comments · Fixed by #354
Labels
Milestone

Comments

@jrasell
Copy link
Member

jrasell commented Jul 14, 2020

It seems that on occasions during multi-check policy evaluation runs, the policy evaluation worker will never become unblocked nor finish. In the logs below, we see an initial run for the policy 516f98cb-bd86-1be7-5fb5-10a945567de5 and at 2020-07-09T10:01:29.114Z see the strategy calculates a scale up is required. This log message however is the last until another policy is sent for evaluation indicating the routine has not continued forward as expected. Further in the logs at 2020-07-09T10:02:28.846Z the policy is again sent for evaluation, this time the process completes till resolution and the scaling action is triggered.

2020-07-09T10:01:28.880Z [INFO]  agent.worker: received policy for evaluation: policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 target=aws-asg
2020-07-09T10:01:28.880Z [INFO]  agent.worker.check_handler: received policy check for evaluation: check=mem_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg
2020-07-09T10:01:28.880Z [INFO]  agent.worker.check_handler: fetching current count: check=mem_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg
2020-07-09T10:01:28.880Z [INFO]  agent.worker.check_handler: received policy check for evaluation: check=cpu_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg
2020-07-09T10:01:28.880Z [INFO]  agent.worker.check_handler: fetching current count: check=cpu_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg
2020-07-09T10:01:29.085Z [INFO]  agent.worker.check_handler: querying source: check=mem_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg query=scalar(sum(nomad_client_allocated_memory{node_class="hashistack"}*100/(nomad_client_unallocated_memory{node_class="hashistack"}+nomad_client_allocated_memory{node_class="hashistack"}))/count(nomad_client_allocated_memory{node_class="hashistack"}))
2020-07-09T10:01:29.087Z [INFO]  agent.worker.check_handler: calculating new count: check=mem_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg count=1 metric=95.84402257567983
2020-07-09T10:01:29.087Z [TRACE] internal_plugin.target-value: calculated scaling strategy results: policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 current_count=1 new_count=2 metric_value=95.84402257567983 factor=1.3692003225097118 direction=up
2020-07-09T10:01:29.112Z [INFO]  agent.worker.check_handler: querying source: check=cpu_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg query=scalar(sum(nomad_client_allocated_cpu{node_class="hashistack"}*100/(nomad_client_unallocated_cpu{node_class="hashistack"}+nomad_client_allocated_cpu{node_class="hashistack"}))/count(nomad_client_allocated_cpu{node_class="hashistack"}))
2020-07-09T10:01:29.114Z [INFO]  agent.worker.check_handler: calculating new count: check=cpu_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg count=1 metric=47
2020-07-09T10:01:29.114Z [TRACE] internal_plugin.target-value: calculated scaling strategy results: policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 current_count=1 new_count=1 metric_value=47 factor=0.6714285714285714 direction=down
2020-07-09T10:01:29.114Z [INFO]  agent.worker.check_handler: nothing to do: check=cpu_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg
2020-07-09T10:01:29.114Z [TRACE] agent.worker: check mem_allocated_percentage selected: policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 target=aws-asg direction=up count=2
2020-07-09T10:01:58.630Z [TRACE] policy_manager.policy_handler: tick: policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2
2020-07-09T10:01:58.630Z [TRACE] policy_manager.policy_handler: getting target status: policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2
2020-07-09T10:01:58.630Z [TRACE] policy_manager.policy_handler: sending policy for evaluation: policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2
2020-07-09T10:01:58.630Z [INFO]  agent.worker: received policy for evaluation: policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2 target=nomad-target
2020-07-09T10:01:58.630Z [INFO]  agent.worker.check_handler: received policy check for evaluation: check=avg_sessions policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2 source=prometheus strategy=target-value target=nomad-target
2020-07-09T10:01:58.630Z [INFO]  agent.worker.check_handler: fetching current count: check=avg_sessions policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2 source=prometheus strategy=target-value target=nomad-target
2020-07-09T10:01:58.630Z [INFO]  agent.worker.check_handler: querying source: check=avg_sessions policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2 source=prometheus strategy=target-value target=nomad-target query=scalar(sum(traefik_entrypoint_open_connections{entrypoint="webapp"})/scalar(nomad_nomad_job_summary_running{task_group="demo"}))
2020-07-09T10:01:58.631Z [INFO]  agent.worker.check_handler: calculating new count: check=avg_sessions policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2 source=prometheus strategy=target-value target=nomad-target count=3 metric=10
2020-07-09T10:01:58.631Z [INFO]  agent.worker.check_handler: nothing to do: check=avg_sessions policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2 source=prometheus strategy=target-value target=nomad-target
2020-07-09T10:01:58.631Z [INFO]  agent.worker: no checks need to be executed: policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2 target=nomad-target
2020-07-09T10:02:25.103Z [DEBUG] internal_plugin.nomad-target: triggering run of handler garbage collection
2020-07-09T10:02:28.627Z [TRACE] policy_manager.policy_handler: tick: policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5
2020-07-09T10:02:28.627Z [TRACE] policy_manager.policy_handler: getting target status: policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5
2020-07-09T10:02:28.630Z [TRACE] policy_manager.policy_handler: tick: policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2
2020-07-09T10:02:28.630Z [TRACE] policy_manager.policy_handler: getting target status: policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2
2020-07-09T10:02:28.630Z [TRACE] policy_manager.policy_handler: sending policy for evaluation: policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2
2020-07-09T10:02:28.630Z [INFO]  agent.worker: received policy for evaluation: policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2 target=nomad-target
2020-07-09T10:02:28.630Z [INFO]  agent.worker.check_handler: received policy check for evaluation: check=avg_sessions policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2 source=prometheus strategy=target-value target=nomad-target
2020-07-09T10:02:28.630Z [INFO]  agent.worker.check_handler: fetching current count: check=avg_sessions policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2 source=prometheus strategy=target-value target=nomad-target
2020-07-09T10:02:28.630Z [INFO]  agent.worker.check_handler: querying source: check=avg_sessions policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2 source=prometheus strategy=target-value target=nomad-target query=scalar(sum(traefik_entrypoint_open_connections{entrypoint="webapp"})/scalar(nomad_nomad_job_summary_running{task_group="demo"}))
2020-07-09T10:02:28.632Z [INFO]  agent.worker.check_handler: calculating new count: check=avg_sessions policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2 source=prometheus strategy=target-value target=nomad-target count=3 metric=10
2020-07-09T10:02:28.633Z [INFO]  agent.worker.check_handler: nothing to do: check=avg_sessions policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2 source=prometheus strategy=target-value target=nomad-target
2020-07-09T10:02:28.633Z [INFO]  agent.worker: no checks need to be executed: policy_id=ffab910a-793c-440e-8cfb-53c067f27aa2 target=nomad-target
2020-07-09T10:02:28.845Z [TRACE] policy_manager.policy_handler: sending policy for evaluation: policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5
2020-07-09T10:02:28.846Z [INFO]  agent.worker: received policy for evaluation: policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 target=aws-asg
2020-07-09T10:02:28.846Z [INFO]  agent.worker.check_handler: received policy check for evaluation: check=mem_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg
2020-07-09T10:02:28.846Z [INFO]  agent.worker.check_handler: received policy check for evaluation: check=cpu_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg
2020-07-09T10:02:28.846Z [INFO]  agent.worker.check_handler: fetching current count: check=cpu_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg
2020-07-09T10:02:28.846Z [INFO]  agent.worker.check_handler: fetching current count: check=mem_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg
2020-07-09T10:02:29.039Z [INFO]  agent.worker.check_handler: querying source: check=cpu_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg query=scalar(sum(nomad_client_allocated_cpu{node_class="hashistack"}*100/(nomad_client_unallocated_cpu{node_class="hashistack"}+nomad_client_allocated_cpu{node_class="hashistack"}))/count(nomad_client_allocated_cpu{node_class="hashistack"}))
2020-07-09T10:02:29.040Z [INFO]  agent.worker.check_handler: calculating new count: check=cpu_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg count=1 metric=47
2020-07-09T10:02:29.040Z [TRACE] internal_plugin.target-value: calculated scaling strategy results: policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 current_count=1 new_count=1 metric_value=47 factor=0.6714285714285714 direction=down
2020-07-09T10:02:29.040Z [INFO]  agent.worker.check_handler: nothing to do: check=cpu_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg
2020-07-09T10:02:29.086Z [INFO]  agent.worker.check_handler: querying source: check=mem_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg query=scalar(sum(nomad_client_allocated_memory{node_class="hashistack"}*100/(nomad_client_unallocated_memory{node_class="hashistack"}+nomad_client_allocated_memory{node_class="hashistack"}))/count(nomad_client_allocated_memory{node_class="hashistack"}))
2020-07-09T10:02:29.088Z [INFO]  agent.worker.check_handler: calculating new count: check=mem_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg count=1 metric=95.84402257567983
2020-07-09T10:02:29.088Z [TRACE] internal_plugin.target-value: calculated scaling strategy results: policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 current_count=1 new_count=2 metric_value=95.84402257567983 factor=1.3692003225097118 direction=up
2020-07-09T10:02:29.088Z [TRACE] agent.worker: check mem_allocated_percentage selected: policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 target=aws-asg direction=up count=2
2020-07-09T10:02:29.088Z [INFO]  agent.worker.check_handler: scaling target: check=mem_allocated_percentage policy_id=516f98cb-bd86-1be7-5fb5-10a945567de5 source=prometheus strategy=target-value target=aws-asg from=1 to=2 reason="scaling up because factor is 1.369200" meta=map[]
@jrasell jrasell added the bug label Jul 14, 2020
@jrasell jrasell added this to the 0.1.1 milestone Jul 14, 2020
@jrasell jrasell modified the milestones: 0.1.1, 0.2.0 Sep 15, 2020
@lgfa29 lgfa29 added stage/needs-investigation theme/policy-eval Policy broker, workers and evaluation type/bug and removed bug labels Nov 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants