Improve LB rebuild behavior when health status changes occur #2874
Today, when health checks transition a host's status, or when new health status arrives via EDS (#2726), we do expensive (at least O(n^2)) rebuilds in various places of the host lists, healthy host lists, locality lists, subsets, WRR state, etc.

This doesn't scale well as the number of endpoints per cluster grows, particularly when health check intervals or EDS update intervals are short. This issue will track work on optimizing this behavior.
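To make the cost class concrete, here is a minimal sketch (hypothetical code, not Envoy's actual types or functions; endpoints are modeled as plain address strings) contrasting a linear-scan membership diff, which is O(n·m) across an update, with a hash-index diff that is O(n + m) expected:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// O(n*m): for each endpoint in the update, linearly scan the current list.
// This is the shape of work that gets expensive at ~2000 endpoints.
std::vector<std::string> newlyAddedNaive(const std::vector<std::string>& current,
                                         const std::vector<std::string>& update) {
  std::vector<std::string> added;
  for (const auto& ep : update) {
    bool exists = false;
    for (const auto& cur : current) {
      if (cur == ep) {
        exists = true;
        break;
      }
    }
    if (!exists) {
      added.push_back(ep);
    }
  }
  return added;
}

// O(n + m) expected: build a hash index over the current list once, then
// probe it once per updated endpoint.
std::vector<std::string> newlyAddedHashed(const std::vector<std::string>& current,
                                          const std::vector<std::string>& update) {
  const std::unordered_set<std::string> index(current.begin(), current.end());
  std::vector<std::string> added;
  for (const auto& ep : update) {
    if (index.find(ep) == index.end()) {
      added.push_back(ep);
    }
  }
  return added;
}
```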
@alyssawilk @zuercher FYI, tracking issue.
Signed-off-by: Harvey Tuch <htuch@google.com>
We're seeing this update code using 80+% of the total CPU capacity on 32-core hosts with ~2000 upstream endpoints in a cluster. Are the O(n^2)-time operations happening just in one thread (and pushed out as thread-local copies to all the workers), or are they happening in all the workers? From the CPU metrics, I suspect it's the latter (all the workers).
For what it's worth, we're also using the subset load balancer with three keys, each with two possible values. Based on CPU profiles, […]
@brian-pane please discuss in #3790. Right now it's happening on every thread. There are various ways we can improve this situation, but I don't think reverting this change is in the cards. If this is causing major issues then, as I said in the other issue, we can add an option to disable weighting support entirely.
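One direction for rolling forward (a hedged sketch of the general idea only, not Envoy's actual threading model or API; `Worker`, `HostListSnapshot`, and `onHealthChange` are all hypothetical names) is to perform the expensive rebuild once and hand each worker an immutable snapshot, so the per-thread cost becomes a pointer swap rather than a full rebuild:

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical immutable result of one rebuild.
struct HostListSnapshot {
  std::vector<std::string> healthy_hosts;
};

class Worker {
public:
  // Called from the main thread: a cheap atomic pointer swap per worker.
  void publish(std::shared_ptr<const HostListSnapshot> snap) {
    std::atomic_store(&snapshot_, std::move(snap));
  }

  // Called on the worker's own thread when load balancing.
  std::shared_ptr<const HostListSnapshot> current() const {
    return std::atomic_load(&snapshot_);
  }

private:
  std::shared_ptr<const HostListSnapshot> snapshot_;
};

// The expensive rebuild runs once; each worker only receives a
// reference-counted pointer to the shared immutable result.
void onHealthChange(std::vector<Worker>& workers,
                    std::vector<std::string> healthy) {
  auto snap = std::make_shared<const HostListSnapshot>(
      HostListSnapshot{std::move(healthy)});
  for (auto& w : workers) {
    w.publish(snap);
  }
}
```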
@mattklein123 you lost me on the part about "reverting this change." Unless I'm missing something, #2874 isn't proposing to revert anything, but rather to improve the algorithmic efficiency of the current implementation. |
Sorry, I was just saying that I don't want to revert this change, and would like to figure out solutions to roll forward. |
Got it, thanks. I'm on the same page: fixing forward rather than rolling back. |
Makes BaseDynamicClusterImpl::updateDynamicHostList O(n) rather than O(n^2). Instead of calling .erase() on list iterators as we find them, we swap with the end of the list and erase after iterating over the list. This shows a ~3x improvement in execution time in the included benchmark test.

Risk Level: Medium. No reordering happens to the endpoint list. Not runtime guarded.
Testing: New benchmark; existing unit tests pass (and cover the affected function).
Docs Changes: N/A
Release Notes: N/A
Relates to #2874 #11362

Signed-off-by: Phil Genera <pgenera@google.com>
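A minimal sketch of the swap-with-the-end pattern the PR describes (hypothetical types; `Host` and `marked_for_removal` stand in for the real cluster state). Note that this particular compaction does not preserve the relative order of surviving elements; the PR states the real change avoids reordering the endpoint list, and a stable compaction in the style of `std::remove_if` reaches the same O(n) bound while keeping order:

```cpp
#include <algorithm>
#include <memory>
#include <vector>

struct Host {
  bool marked_for_removal = false;
};
using HostSharedPtr = std::shared_ptr<Host>;

// O(n^2): each erase() in the middle of a vector shifts every later
// element, and the scan can trigger up to n such erases.
void removeNaive(std::vector<HostSharedPtr>& hosts) {
  for (auto it = hosts.begin(); it != hosts.end();) {
    if ((*it)->marked_for_removal) {
      it = hosts.erase(it); // O(n) shift on every removal.
    } else {
      ++it;
    }
  }
}

// O(n): swap each removable host toward the end, then erase the tail
// in a single call after the scan finishes.
void removeSwapAndTruncate(std::vector<HostSharedPtr>& hosts) {
  auto last = hosts.end();
  for (auto it = hosts.begin(); it != last;) {
    if ((*it)->marked_for_removal) {
      --last;
      std::iter_swap(it, last); // Re-examine the element swapped in.
    } else {
      ++it;
    }
  }
  hosts.erase(last, hosts.end()); // One erase for the whole tail.
}
```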
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.