Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve LB rebuild behavior when health status changes occur #2874

Closed
htuch opened this issue Mar 22, 2018 · 9 comments
Closed

Improve LB rebuild behavior when health status changes occur #2874

htuch opened this issue Mar 22, 2018 · 9 comments
Labels
area/perf enhancement Feature requests. Not bugs or questions. stale stalebot believes this issue/PR has not been touched recently

Comments

@htuch
Copy link
Member

htuch commented Mar 22, 2018

Today, when health checks transition host status or new health status arrives in EDS (#2726), we do expensive (at least O(n^2)) rebuilds in various places of the host lists, healthy host lists, locality lists, subsets, WRR, etc.

This isn't great when we scale the number of endpoints per cluster, if we have short health check intervals or if we have short EDS intervals. This issue will track work on optimizing this behavior.

@htuch htuch added enhancement Feature requests. Not bugs or questions. area/perf help wanted Needs help! labels Mar 22, 2018
@htuch
Copy link
Member Author

htuch commented Mar 22, 2018

@alyssawilk @zuercher FYI, tracking issue.

htuch added a commit to htuch/envoy that referenced this issue Mar 22, 2018
Signed-off-by: Harvey Tuch <htuch@google.com>
@rgs1 rgs1 mentioned this issue Jul 4, 2018
@brian-pane
Copy link
Contributor

We're seeing this update code using 80+% of the total CPU capacity on 32-core hosts with ~2000 upstream endpoints in a cluster. Are the O(n^2)-time operations happening just in one thread (and pushed out as thread-local copies to all the workers), or are they happening in all the workers? From the CPU metrics, I suspect it's the latter (all the workers).

@brian-pane
Copy link
Contributor

For what it's worth, we're also using the subset load balancer with three keys, each with two possible values. Based on CPU profiles, Envoy::Config::Metadata::metadataValue is a large contributor to the total run time of the rebuild.

@mattklein123
Copy link
Member

@brian-pane please discuss in #3790. Right now it's happening on every thread. There are various ways we can improve this situation, but I don't think reverting this change is in the cards. If this is causing major issues as I already said in the other issue, we can add an option to disable weighting support entirely.

@brian-pane
Copy link
Contributor

@mattklein123 you lost me on the part about "reverting this change." Unless I'm missing something, #2874 isn't proposing to revert anything, but rather to improve the algorithmic efficiency of the current implementation.

@mattklein123
Copy link
Member

Sorry, I was just saying that I don't want to revert this change, and would like to figure out solutions to roll forward.

@brian-pane
Copy link
Contributor

Got it, thanks. I'm on the same page: fixing forward rather than rolling back.

htuch pushed a commit that referenced this issue Jul 9, 2020
Makes BaseDynamicClusterImpl::updateDynamicHostList O(n) rather than O(n^2)

Instead of calling .erase() on list iterators as we find them, we swap with the end of the list and erase after iterating over the list. This shows a ~3x improvement in execution time in the included benchmark test.

Risk Level: Medium. No reordering happens to the endpoint list. Not runtime guarded.
Testing: New benchmark, existing unit tests pass (and cover the affected function).
Docs Changes: N/A
Release Notes: N/A

Relates to #2874 #11362

Signed-off-by: Phil Genera <pgenera@google.com>
scheler pushed a commit to scheler/envoy that referenced this issue Aug 4, 2020
Makes BaseDynamicClusterImpl::updateDynamicHostList O(n) rather than O(n^2)

Instead of calling .erase() on list iterators as we find them, we swap with the end of the list and erase after iterating over the list. This shows a ~3x improvement in execution time in the included benchmark test.

Risk Level: Medium. No reordering happens to the endpoint list. Not runtime guarded.
Testing: New benchmark, existing unit tests pass (and cover the affected function).
Docs Changes: N/A
Release Notes: N/A

Relates to envoyproxy#2874 envoyproxy#11362

Signed-off-by: Phil Genera <pgenera@google.com>
Signed-off-by: scheler <santosh.cheler@appdynamics.com>
@mattklein123 mattklein123 self-assigned this Dec 12, 2020
@mattklein123 mattklein123 removed their assignment Jul 19, 2022
@alyssawilk alyssawilk removed the help wanted Needs help! label Oct 16, 2024
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Nov 15, 2024
Copy link

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/perf enhancement Feature requests. Not bugs or questions. stale stalebot believes this issue/PR has not been touched recently
Projects
None yet
Development

No branches or pull requests

4 participants