Improve LB rebuild behavior when health status changes occur #2874
Today, when health checks transition a host's status, or when new health status arrives via EDS (#2726), we do expensive (at least O(n^2)) rebuilds in various places of the host lists, healthy host lists, locality lists, subsets, WRR state, etc.

This doesn't scale well as the number of endpoints per cluster grows, particularly when health check intervals or EDS update intervals are short. This issue will track work on optimizing this behavior.
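To make the cost class concrete, here is a minimal sketch (hypothetical code, not Envoy's actual types or functions; endpoints are modeled as plain address strings) contrasting a linear-scan membership diff, which is O(n·m) across an update, with a hash-index diff that is O(n + m) expected:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// O(n*m): for each endpoint in the update, linearly scan the current list.
// This is the shape of work that gets expensive at ~2000 endpoints.
std::vector<std::string> newlyAddedNaive(const std::vector<std::string>& current,
                                         const std::vector<std::string>& update) {
  std::vector<std::string> added;
  for (const auto& ep : update) {
    bool exists = false;
    for (const auto& cur : current) {
      if (cur == ep) {
        exists = true;
        break;
      }
    }
    if (!exists) {
      added.push_back(ep);
    }
  }
  return added;
}

// O(n + m) expected: build a hash index over the current list once, then
// probe it once per updated endpoint.
std::vector<std::string> newlyAddedHashed(const std::vector<std::string>& current,
                                          const std::vector<std::string>& update) {
  const std::unordered_set<std::string> index(current.begin(), current.end());
  std::vector<std::string> added;
  for (const auto& ep : update) {
    if (index.find(ep) == index.end()) {
      added.push_back(ep);
    }
  }
  return added;
}
```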
@alyssawilk @zuercher FYI, tracking issue.
Signed-off-by: Harvey Tuch <htuch@google.com>
We're seeing this update code using 80+% of the total CPU capacity on 32-core hosts with ~2000 upstream endpoints in a cluster. Are the O(n^2)-time operations happening just in one thread (and pushed out as thread-local copies to all the workers), or are they happening in all the workers? From the CPU metrics, I suspect it's the latter (all the workers).
For what it's worth, we're also using the subset load balancer with three keys, each with two possible values. Based on CPU profiles, […]
@brian-pane please discuss in #3790. Right now it's happening on every thread. There are various ways we can improve this situation, but I don't think reverting this change is in the cards. If this is causing major issues then, as I said in the other issue, we can add an option to disable weighting support entirely.
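One direction for rolling forward (a hedged sketch of the general idea only, not Envoy's actual threading model or API; `Worker`, `HostListSnapshot`, and `onHealthChange` are all hypothetical names) is to perform the expensive rebuild once and hand each worker an immutable snapshot, so the per-thread cost becomes a pointer swap rather than a full rebuild:

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical immutable result of one rebuild.
struct HostListSnapshot {
  std::vector<std::string> healthy_hosts;
};

class Worker {
public:
  // Called from the main thread: a cheap atomic pointer swap per worker.
  void publish(std::shared_ptr<const HostListSnapshot> snap) {
    std::atomic_store(&snapshot_, std::move(snap));
  }

  // Called on the worker's own thread when load balancing.
  std::shared_ptr<const HostListSnapshot> current() const {
    return std::atomic_load(&snapshot_);
  }

private:
  std::shared_ptr<const HostListSnapshot> snapshot_;
};

// The expensive rebuild runs once; each worker only receives a
// reference-counted pointer to the shared immutable result.
void onHealthChange(std::vector<Worker>& workers,
                    std::vector<std::string> healthy) {
  auto snap = std::make_shared<const HostListSnapshot>(
      HostListSnapshot{std::move(healthy)});
  for (auto& w : workers) {
    w.publish(snap);
  }
}
```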
@mattklein123 you lost me on the part about "reverting this change." Unless I'm missing something, #2874 isn't proposing to revert anything, but rather to improve the algorithmic efficiency of the current implementation. |
Sorry, I was just saying that I don't want to revert this change, and would like to figure out solutions to roll forward. |
Got it, thanks. I'm on the same page: fixing forward rather than rolling back. |
Makes BaseDynamicClusterImpl::updateDynamicHostList O(n) rather than O(n^2). Instead of calling .erase() on list iterators as we find them, we swap with the end of the list and erase after iterating over the list. This shows a ~3x improvement in execution time in the included benchmark test.

Risk Level: Medium. No reordering happens to the endpoint list. Not runtime guarded.
Testing: New benchmark; existing unit tests pass (and cover the affected function).
Docs Changes: N/A
Release Notes: N/A
Relates to #2874 #11362

Signed-off-by: Phil Genera <pgenera@google.com>
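A minimal sketch of the swap-with-the-end pattern the PR describes (hypothetical types; `Host` and `marked_for_removal` stand in for the real cluster state). Note that this particular compaction does not preserve the relative order of surviving elements; the PR states the real change avoids reordering the endpoint list, and a stable compaction in the style of `std::remove_if` reaches the same O(n) bound while keeping order:

```cpp
#include <algorithm>
#include <memory>
#include <vector>

struct Host {
  bool marked_for_removal = false;
};
using HostSharedPtr = std::shared_ptr<Host>;

// O(n^2): each erase() in the middle of a vector shifts every later
// element, and the scan can trigger up to n such erases.
void removeNaive(std::vector<HostSharedPtr>& hosts) {
  for (auto it = hosts.begin(); it != hosts.end();) {
    if ((*it)->marked_for_removal) {
      it = hosts.erase(it); // O(n) shift on every removal.
    } else {
      ++it;
    }
  }
}

// O(n): swap each removable host toward the end, then erase the tail
// in a single call after the scan finishes.
void removeSwapAndTruncate(std::vector<HostSharedPtr>& hosts) {
  auto last = hosts.end();
  for (auto it = hosts.begin(); it != last;) {
    if ((*it)->marked_for_removal) {
      --last;
      std::iter_swap(it, last); // Re-examine the element swapped in.
    } else {
      ++it;
    }
  }
  hosts.erase(last, hosts.end()); // One erase for the whole tail.
}
```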
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.