Do not redo healthchecks for endpoints with different priorities #35243
Comments
This seems similar to an issue resolved by #27900, which added preserving health-check information when an endpoint bounces between localities. cc @adisuissa
To the best of my knowledge there is a test that validates this, so it should be working. It was added as part of #14483, so unless something changed it should work for your Envoy version. I'm not sure if there's any other config knob that overrides the healthcheck status in this case.
Thank you for the response and the links to the code. I don't see anything obvious that would cause this, so I will work on a repro in a less complex environment.
Circling back to this: I'm unfamiliar with how to write unit tests or functional tests, so while I work that out I thought I would use the Envoy you can install via brew to build a simple repro case.

Envoy config: https://gist.github.com/shulin-sq/60d723009689b8f31400f87fff890393

I have a video of the repro, but it's too large to upload; in the screenshot you can see that the healthcheck failed even though the control plane only changed the priority and not the endpoint address. I also went back through about a year of Envoy builds that I have handy, as old as v1.25.4, and was able to reproduce the bug in our staging environment. Unfortunately I was unable to go back to when the fix was first introduced, so I cannot say whether it ever behaved the way I expected.
Digging more into this, it seems that all_hosts here is empty (envoy/source/common/upstream/upstream_impl.cc, line 2272 in ee56715). I'm wondering if this is a bug, or if I've set up the test incorrectly so that it looks like a new cluster is being added instead of just an endpoint changing.
OK, while debugging the code I was able to confirm that my repro is correct. The cluster churn is caused by the xds_cluster that I added as an example, which keeps trying to remove itself from the config.

So far, the issue appears to come from envoy/source/common/upstream/upstream_impl.cc, lines 975 to 996 in 5b80c5d, because all_hosts is populated by the crossPriorityHostMap (envoy/source/common/upstream/upstream_impl.cc, lines 2269 to 2272 in 5b80c5d).

crossPriorityHostMap does not have priority metadata in its internal storage. So, say we are flipping the priority of 34.194.112.169: it is constantly being added to and removed from crossPriorityHostMap.

Before I attempt a PR, I wonder whether anyone more familiar with this code can recommend how to introduce a fix. In languages I'm more familiar with, I would keep crossPriorityHostMap as a mapping of priorities to host addresses and, when fetching it, return a set of addresses.
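To make the failure mode above concrete, here is a minimal standalone sketch, not the actual Envoy code: the map type, `applyPriorityUpdate`, and the port in the address literal are illustrative. It shows why a map keyed only by address, with no priority metadata, loses a host when per-priority add/remove updates are applied in ascending priority order:

```cpp
// Simplified sketch of the problem described above -- NOT the actual Envoy code.
// A cross-priority map keyed only by "ip:port" cannot tell which priority a
// removal comes from, so a priority flip can evict a host that is still present.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for address -> HostSharedPtr; a bool replaces the Host object here.
using CrossPriorityMap = std::unordered_map<std::string, bool>;

// Each priority's update adds its new hosts and removes its departing hosts.
void applyPriorityUpdate(CrossPriorityMap& map,
                         const std::vector<std::string>& added,
                         const std::vector<std::string>& removed) {
  for (const auto& addr : added) {
    map.emplace(addr, true); // no-op if the address is already tracked
  }
  for (const auto& addr : removed) {
    map.erase(addr); // erases no matter which priority the host is leaving
  }
}

int main() {
  CrossPriorityMap map;
  const std::string a = "34.194.112.169:443"; // illustrative address from the repro

  map.emplace(a, true); // A is already tracked (it currently lives in priority 2)

  // A moves from priority 2 to priority 1; priorities are processed in ascending order.
  applyPriorityUpdate(map, /*added=*/{a}, /*removed=*/{}); // priority 1: add A (ignored, already present)
  applyPriorityUpdate(map, /*added=*/{}, /*removed=*/{a}); // priority 2: remove A -- A vanishes

  // Any later "have we seen this host before?" check now misses A,
  // so its health check starts over from scratch.
  std::cout << "A still tracked: " << map.count(a) << "\n"; // prints 0
  return 0;
}
```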
Alright, I think I have an idea of how to fix this and will submit a PR soon.
Commit Message: Use previously calculated healthcheck when endpoints move priority levels repeatedly.

This is done by changing how mutable_cross_priority_host_map_ tracks hosts. mutable_cross_priority_host_map_ is used to check whether a host already existed in the previous configuration, in which case its healthcheck should not be recalculated. This only worked some of the time: each priority update removes and then adds its changed hosts in a map of string (ip addr:port) to Host instance, which does not account for an endpoint having two Host representations in different priorities, as happens when an endpoint changes priority. There is an edge case where a host can be removed from mutable_cross_priority_host_map_ before the "should we skip active healthchecks" logic triggers. This PR fixes that by only removing an endpoint from mutable_cross_priority_host_map_ if that removal is executed from the lowest priority. The fix assumes that priority memberships are always calculated starting from the lowest number and moving to the highest.

Additional Description:
Risk Level: Medium?
Testing: see note at the bottom
Docs Changes: n/a
Release Notes: added a note to the changelog
Fixes #35243

### Explanation

This was a difficult bug to spot because it only happens when the priority changes multiple times. For example, consider this situation:

- we have an endpoint A that swaps between priorities 0, 1, and 2
- we assume that priorities are always processed starting from 0 and going up (e.g. 0, 1, 2, 3, 4, ...)
- mutable_cross_priority_host_map_ is the "list" in this situation that includes all endpoints from all priorities

When the priority number goes up, things are OK.

0 -> 1
- processing priority 0: remove A
- processing priority 1: add A

1 -> 2
- priority 1: remove A
- priority 2: add A

But things get weird when the numbers go down.

2 -> 1 (things are still peaceful here)
- priority 1: add A (but this gets ignored since A is already in the list)
- priority 2: remove A (!!!)

1 -> 0: at this point the list does not include A, so any logic that checks whether A exists in the cross-priority host map will fail, and A will be treated as a new endpoint.

Without the fix: https://gist.github.com/shulin-sq/adfb4268f5f199f054e908e3fd7afae8
With the fix: https://gist.github.com/shulin-sq/7779a341e598d81cfaeca447b0f582d1

### Testing

- functional test described in the issue: #35243
- unit test, confirmed that it fails before the fix is applied

Signed-off-by: Shulin Jia <shulin@squareup.com>
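As a companion to the worked example in the PR description, here is a hedged sketch of the behavior it describes, under the stated assumption that priorities are processed in ascending order. This is an illustration, not the Envoy implementation; `PriorityDelta`, `applyUpdatePass`, and the address literal are made up for the example:

```cpp
// Hedged sketch of the fix described above -- NOT the actual Envoy code.
// Idea: when processing a priority's removals, do not erase an address that a
// lower-numbered priority has already (re-)added in the same update pass, so a
// higher-priority removal cannot evict a host that a lower priority still owns.
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using CrossPriorityMap = std::unordered_map<std::string, bool>;

struct PriorityDelta {
  std::vector<std::string> added;
  std::vector<std::string> removed;
};

// Deltas are applied in ascending priority order (0, 1, 2, ...),
// matching the assumption called out in the PR description.
void applyUpdatePass(CrossPriorityMap& map, const std::vector<PriorityDelta>& deltas) {
  std::unordered_set<std::string> added_by_lower_priority;
  for (const auto& delta : deltas) {
    for (const auto& addr : delta.added) {
      map.emplace(addr, true);
      added_by_lower_priority.insert(addr);
    }
    for (const auto& addr : delta.removed) {
      // The key change: skip the erase if a lower priority re-added this host.
      if (added_by_lower_priority.count(addr) == 0) {
        map.erase(addr);
      }
    }
  }
}

int main() {
  CrossPriorityMap map;
  const std::string a = "34.194.112.169:443";
  map.emplace(a, true); // A was already health checked in a previous config

  // One EDS update moving A from priority 2 to priority 1:
  // priority 0 is untouched, priority 1 adds A, priority 2 removes A.
  std::vector<PriorityDelta> deltas = {
      {/*added=*/{}, /*removed=*/{}},  // priority 0
      {/*added=*/{a}, /*removed=*/{}}, // priority 1
      {/*added=*/{}, /*removed=*/{a}}, // priority 2
  };
  applyUpdatePass(map, deltas);

  std::cout << "A still tracked: " << map.count(a) << "\n"; // prints 1
  return 0;
}
```

With a guard like this, the 2 -> 1 flip in the worked example leaves A in the map, so the "should we skip active healthchecks" check still finds it and the existing health status carries over.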
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Title: Do not redo healthchecks for endpoints with different priorities
Description:
We use EDS to deliver endpoints that represent different priorities. Sometimes the priorities switch for an endpoint as we change which set of IPs should be prioritized in load balancing. During this action I've noticed that when an endpoint changes priority it becomes unhealthy.
Since both IPs in the cluster become unhealthy, this creates a scenario where, if I do not use panic routing, clients see a no_healthy_upstream 503 error. This seems avoidable, since in this example the IP 10.180.179.183 was already being healthchecked prior to the update.
Is it possible to carry over 10.180.179.183's healthcheck state even though its priority has changed?
Related issues might be: #4280
Envoy version: 1.31.3