Do not redo healthchecks for endpoints with different priorities #35243

Closed
shulin-sq opened this issue Jul 17, 2024 · 9 comments
Labels
area/health_checking bug stale stalebot believes this issue/PR has not been touched recently

Comments

@shulin-sq
Contributor

Title: Do not redo healthchecks for endpoints with different priorities

Description:

Describe the desired behavior, what scenario it enables and how it
would be used.

We use EDS to deliver endpoints that represent different priorities. Sometimes an endpoint's priority switches as we change which set of IPs should be prioritized in load balancing. I've noticed that when an endpoint changes priority, it becomes unhealthy.

# initially
myapp_cluster::10.180.179.183:443::health_flags::healthy
myapp_cluster::10.180.179.183:443::priority::0

# upon update of priorities
myapp_cluster::10.178.43.7:443::health_flags::/failed_active_hc
myapp_cluster::10.178.43.7:443::priority::0
myapp_cluster::10.180.179.183:443::health_flags::/failed_active_hc
myapp_cluster::10.180.179.183:443::priority::1

# eventually
myapp_cluster::10.178.43.7:443::health_flags::healthy
myapp_cluster::10.178.43.7:443::priority::0
myapp_cluster::10.180.179.183:443::health_flags::healthy
myapp_cluster::10.180.179.183:443::priority::1

Since both IPs in the cluster become unhealthy, this creates a scenario where, if I do not use panic routing, clients will see a no_healthy_upstream 503 error. This seems avoidable, since in this example the IP 10.180.179.183 was already being health-checked prior to the update.

Is it possible to carry over 10.180.179.183's healthcheck state even though its priority has changed?

A related issue might be #4280.
envoy version: 1.31.3

@shulin-sq shulin-sq added enhancement Feature requests. Not bugs or questions. triage Issue requires triage labels Jul 17, 2024
@KBaichoo KBaichoo added bug area/health_checking and removed enhancement Feature requests. Not bugs or questions. triage Issue requires triage labels Jul 18, 2024
@KBaichoo
Contributor

This seems similar to the issue resolved by #27900, which added preserving health-check information when an endpoint bounced between localities.

cc @adisuissa

@adisuissa
Contributor

To the best of my knowledge there is a test that validates this, so it should be working. It was added as part of #14483, so unless something changed it should work for your Envoy version.

I'm not sure if there's any other config knob that overrides the healthcheck status in this case.
Further investigation is probably needed, and I suggest starting with an integration test that reproduces the bug.

@shulin-sq
Contributor Author

shulin-sq commented Jul 20, 2024

Thank you for the response and the links to the code. I don't see anything obvious that would cause this so I will work on a repro in a less complex environment.

@shulin-sq
Contributor Author

shulin-sq commented Aug 6, 2024

@adisuissa @KBaichoo

circling back to this

I am unfamiliar with how to write unit or functional tests here, so while I work that out, I thought I would just use the Envoy you can install via brew to see if I can build a simple repro case.

envoy config: https://gist.github.com/shulin-sq/60d723009689b8f31400f87fff890393
control plane: shulin-sq/java-control-plane@fa78081, running the test control plane implementation (io.envoyproxy.controlplane.server.TestMain) with some modifications to flip the priority between 0 and 1
envoy version: 816188b86a0a52095b116b107f576324082c7c02/1.30.1/Distribution/RELEASE/BoringSSL

I have a video of the repro, but it's too large to upload. In the screenshot you can see that the health check failed even though, in the control plane, we only change the priority and not the endpoint address.

[Screenshot 2024-08-06 at 3:41:48 PM]

I also went back through about a year of Envoy builds that I have handy, as old as v1.25.4, and was able to reproduce the bug in our staging environment. Unfortunately, I was unable to go back as far as when the fix was first introduced, so I cannot say whether it ever behaved the way I expected.

@shulin-sq
Contributor Author

Digging more into this, it seems that all_hosts here is empty:

const bool existing_host_found = existing_host != all_hosts.end();

in my functional test shared above, which means the previous endpoint is never found.

I'm wondering if this is a bug, or if I've set up the test incorrectly in a way that makes it look like a new cluster is added instead of just the endpoint changing.

[2024-08-09 22:12:30.241][82232][info][upstream] [source/common/upstream/cds_api_helper.cc:32] cds: add 1 cluster(s), remove 1 cluster(s)
[2024-08-09 22:12:30.242][82232][debug][upstream] [source/common/upstream/cds_api_helper.cc:54] cds: add/update cluster 'httpbin' skipped
[2024-08-09 22:12:30.242][82232][info][upstream] [source/common/upstream/cds_api_helper.cc:71] cds: added/updated 0 cluster(s), skipped 1 unmodified cluster(s)
[2024-08-09 22:12:30.243][82232][debug][upstream] [source/common/upstream/upstream_impl.cc:469] transport socket match, socket default selected for host with address 34.194.112.169:80
[2024-08-09 22:12:30.244][82232][debug][upstream] [source/extensions/clusters/eds/eds.cc:428] EDS hosts or locality weights changed for cluster: httpbin current hosts 0 priority 0
[2024-08-09 22:12:30.244][82327][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82329][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82328][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82338][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82339][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82337][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82232][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82330][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82340][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82331][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82334][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82232][debug][upstream] [source/extensions/clusters/eds/eds.cc:428] EDS hosts or locality weights changed for cluster: httpbin current hosts 1 priority 1
[2024-08-09 22:12:30.244][82327][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.244][82329][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.244][82328][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.244][82330][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.244][82331][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82337][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82334][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82232][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82338][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82340][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82339][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
{"health_checker_type":"HTTP","host":{"socket_address":{"protocol":"TCP","address":"34.194.112.169","port_value":80,"resolver_name":"","ipv4_compat":false}},"cluster_name":"httpbin","add_healthy_event":{"first_check":true},"timestamp":"2024-08-09T22:12:25.243Z","locality":{"region":"","zone":"","sub_zone":""}}
{"health_checker_type":"HTTP","host":{"socket_address":{"protocol":"TCP","address":"34.194.112.169","port_value":80,"resolver_name":"","ipv4_compat":false}},"cluster_name":"httpbin","add_healthy_event":{"first_check":true},"timestamp":"2024-08-09T22:12:28.487Z","locality":{"region":"","zone":"","sub_zone":""}}

@shulin-sq
Contributor Author

OK, while debugging the code I was able to confirm that my repro is correct. The cluster churn is caused by the xds_cluster I added as an example, which keeps trying to remove itself from the config.

So far, it seems the issue is due to this code:

void MainPrioritySetImpl::updateCrossPriorityHostMap(const HostVector& hosts_added,
                                                     const HostVector& hosts_removed) {
  if (hosts_added.empty() && hosts_removed.empty()) {
    // No new hosts have been added and no old hosts have been removed.
    return;
  }

  // Since read_only_all_host_map_ may be shared by multiple threads, when the host set changes,
  // we cannot directly modify read_only_all_host_map_.
  if (mutable_cross_priority_host_map_ == nullptr) {
    // Copy old read only host map to mutable host map.
    mutable_cross_priority_host_map_ = std::make_shared<HostMap>(*const_cross_priority_host_map_);
  }

  for (const auto& host : hosts_removed) {
    mutable_cross_priority_host_map_->erase(addressToString(host->address()));
  }

  for (const auto& host : hosts_added) {
    mutable_cross_priority_host_map_->insert({addressToString(host->address()), host});
  }
}

[Screenshot 2024-08-12 at 9:01:11 PM]

because all_hosts is populated from the crossPriorityHostMap:

  for (const HostSharedPtr& host : new_hosts) {
    // To match a new host with an existing host means comparing their addresses.
    auto existing_host = all_hosts.find(addressToString(host->address()));
    const bool existing_host_found = existing_host != all_hosts.end();

crossPriorityHostMap does not store priority metadata internally. So if we are flipping the priority of 34.194.112.169, it is constantly being added to and removed from crossPriorityHostMap.

Before I attempt a PR, I wonder if anyone more familiar with this code would recommend how to introduce a fix. In languages I'm more familiar with, what I would do is keep crossPriorityHostMap as a mapping of priorities to host addresses, and when fetching it, return a set of addresses.

@shulin-sq
Contributor Author

Alright I think I have an idea on how to fix this and will submit a PR soon.

wbpcode pushed a commit that referenced this issue Sep 7, 2024
Commit Message: Use the previously calculated health check state when endpoints move between priority levels repeatedly. This is done by changing how mutable_cross_priority_host_map_ tracks hosts. mutable_cross_priority_host_map_ is used to check whether a host already existed in the previous configuration, in which case its health check should not be recalculated. This worked only some of the time, because each priority's update would remove and then add all of its changed hosts in a map of "ip:port" strings to Host instances. That did not account for an endpoint having two Host representatives in different priorities, as happens when an endpoint changes priority, and there is an edge case where a host can be removed from mutable_cross_priority_host_map_ before the "should we skip active health checks" logic triggers. This PR fixes that by only removing an endpoint from mutable_cross_priority_host_map_ if that removal is executed from the lowest priority. The fix assumes that priority memberships are always calculated starting from the lowest number and moving to the highest.

Additional Description: n/a
Risk Level: medium
Testing: see the note at the bottom
Docs Changes: n/a
Release Notes: added a note to the changelog
Fixes #35243


### Explanation
This was a difficult bug to spot because it only happens when the priority changes multiple times.

For example, let's consider this situation:
- we have an endpoint A that swaps between priorities 0, 1, and 2
- we assume that priorities are always processed starting from 0 and going up (e.g. 0, 1, 2, 3, 4, ...)
- mutable_cross_priority_host_map_ is the "list" in this situation that includes all endpoints from all priorities

When the priority number goes up, things are OK:

0 -> 1
processing priority 0: remove A
processing priority 1: add A

1 -> 2
priority 1: remove A
priority 2: add A

but things get weird when the numbers go down:

2 -> 1, things are still peaceful here
priority 1: add A (but this gets ignored since A is already in the list)
priority 2: remove A (!!!)

1 -> 0: at this point the list does not include A, so any logic that checks whether A exists in the cross-priority host map will fail, and A will be treated as a new endpoint.

without the fix:
https://gist.github.com/shulin-sq/adfb4268f5f199f054e908e3fd7afae8
with the fix:
https://gist.github.com/shulin-sq/7779a341e598d81cfaeca447b0f582d1

### Testing
* functional test described in the issue:
#35243
* unit test, confirmed that it fails before the fix is applied

Signed-off-by: Shulin Jia <shulin@squareup.com>

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Sep 18, 2024

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 25, 2024