Do not redo healthchecks for endpoints with different priorities #35243

Closed
shulin-sq opened this issue Jul 17, 2024 · 9 comments
Labels
area/health_checking bug stale stalebot believes this issue/PR has not been touched recently

Comments

@shulin-sq
Contributor

Title: Do not redo healthchecks for endpoints with different priorities

Description:

Describe the desired behavior, what scenario it enables and how it
would be used.

We use EDS to deliver endpoints that represent different priorities. Sometimes an endpoint's priority switches as we change which set of IPs should be prioritized in load balancing. I've noticed that when an endpoint changes priority, it becomes unhealthy.

# initially
myapp_cluster::10.180.179.183:443::health_flags::healthy
myapp_cluster::10.180.179.183:443::priority::0

# upon update of priorities
myapp_cluster::10.178.43.7:443::health_flags::/failed_active_hc
myapp_cluster::10.178.43.7:443::priority::0
myapp_cluster::10.180.179.183:443::health_flags::/failed_active_hc
myapp_cluster::10.180.179.183:443::priority::1

# eventually
myapp_cluster::10.178.43.7:443::health_flags::healthy
myapp_cluster::10.178.43.7:443::priority::0
myapp_cluster::10.180.179.183:443::health_flags::healthy
myapp_cluster::10.180.179.183:443::priority::1

Since both IPs in the cluster become unhealthy, this creates a scenario where, if I do not use panic routing, clients will see a no_healthy_upstream 503 error. This seems avoidable, since in this example the IP 10.180.179.183 was already being health-checked prior to the update.

Is it possible to carry over 10.180.179.183's healthcheck state even though its priority has changed?

A related issue might be #4280.
envoy version: 1.31.3

@shulin-sq shulin-sq added enhancement Feature requests. Not bugs or questions. triage Issue requires triage labels Jul 17, 2024
@KBaichoo KBaichoo added bug area/health_checking and removed enhancement Feature requests. Not bugs or questions. triage Issue requires triage labels Jul 18, 2024
@KBaichoo
Contributor

This seems similar to the issue resolved by #27900, which added preserving health-check information when an endpoint bounced between localities.

cc @adisuissa

@adisuissa
Contributor

To the best of my knowledge there is a test that validates this, so it should be working. It was added as part of #14483, so unless something changed it should work for your Envoy version.

I'm not sure if there's any other config knob that overrides the healthcheck status in this case.
Further investigation is probably needed, and I suggest starting with an integration test that reproduces the bug.

@shulin-sq
Contributor Author

shulin-sq commented Jul 20, 2024

Thank you for the response and the links to the code. I don't see anything obvious that would cause this so I will work on a repro in a less complex environment.

@shulin-sq
Contributor Author

shulin-sq commented Aug 6, 2024

@adisuissa @KBaichoo

circling back to this

I am unfamiliar with how to write unit or functional tests here, so while I work that out, I thought I would just use the Envoy you can install via brew to see if I can build a simple repro case.

envoy config: https://gist.github.com/shulin-sq/60d723009689b8f31400f87fff890393
control plane: shulin-sq/java-control-plane@fa78081, running the test control plane implementation (io.envoyproxy.controlplane.server.TestMain) with some modifications to flip the priority between 0 and 1
envoy version: 816188b86a0a52095b116b107f576324082c7c02/1.30.1/Distribution/RELEASE/BoringSSL

I have a video of the repro, but it's too large to upload. In the screenshot you can see that the health check failed even though, in the control plane, we only change the priority and not the endpoint address.

[Screenshot 2024-08-06 at 3:41:48 PM]

I also went back through about a year of Envoy builds that I have handy, as old as v1.25.4, and was able to reproduce the bug in our staging environment. Unfortunately, I was unable to go back as far as when the fix was first introduced, so I cannot say whether it ever behaved the way I expected.

@shulin-sq
Contributor Author

Digging more into this, it seems that all_hosts here is empty:

const bool existing_host_found = existing_host != all_hosts.end();

in my functional test shared above, which means the previous endpoint is never found.

I'm wondering if this is a bug, or if I've set up the test incorrectly in a way that makes it look like a new cluster is added instead of just the endpoint changing.

[2024-08-09 22:12:30.241][82232][info][upstream] [source/common/upstream/cds_api_helper.cc:32] cds: add 1 cluster(s), remove 1 cluster(s)
[2024-08-09 22:12:30.242][82232][debug][upstream] [source/common/upstream/cds_api_helper.cc:54] cds: add/update cluster 'httpbin' skipped
[2024-08-09 22:12:30.242][82232][info][upstream] [source/common/upstream/cds_api_helper.cc:71] cds: added/updated 0 cluster(s), skipped 1 unmodified cluster(s)
[2024-08-09 22:12:30.243][82232][debug][upstream] [source/common/upstream/upstream_impl.cc:469] transport socket match, socket default selected for host with address 34.194.112.169:80
[2024-08-09 22:12:30.244][82232][debug][upstream] [source/extensions/clusters/eds/eds.cc:428] EDS hosts or locality weights changed for cluster: httpbin current hosts 0 priority 0
[2024-08-09 22:12:30.244][82327][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82329][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82328][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82338][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82339][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82337][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82232][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82330][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82340][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82331][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82334][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82232][debug][upstream] [source/extensions/clusters/eds/eds.cc:428] EDS hosts or locality weights changed for cluster: httpbin current hosts 1 priority 1
[2024-08-09 22:12:30.244][82327][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.244][82329][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.244][82328][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.244][82330][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.244][82331][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82337][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82334][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82232][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82338][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82340][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82339][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
{"health_checker_type":"HTTP","host":{"socket_address":{"protocol":"TCP","address":"34.194.112.169","port_value":80,"resolver_name":"","ipv4_compat":false}},"cluster_name":"httpbin","add_healthy_event":{"first_check":true},"timestamp":"2024-08-09T22:12:25.243Z","locality":{"region":"","zone":"","sub_zone":""}}
{"health_checker_type":"HTTP","host":{"socket_address":{"protocol":"TCP","address":"34.194.112.169","port_value":80,"resolver_name":"","ipv4_compat":false}},"cluster_name":"httpbin","add_healthy_event":{"first_check":true},"timestamp":"2024-08-09T22:12:28.487Z","locality":{"region":"","zone":"","sub_zone":""}}

@shulin-sq
Contributor Author

OK, while debugging the code I was able to confirm that my repro is correct. The cluster churn is caused by the xds_cluster I added as an example, which keeps trying to remove itself from the config.

So far, it seems the issue is due to this code:

void MainPrioritySetImpl::updateCrossPriorityHostMap(const HostVector& hosts_added,
                                                     const HostVector& hosts_removed) {
  if (hosts_added.empty() && hosts_removed.empty()) {
    // No new hosts have been added and no old hosts have been removed.
    return;
  }

  // Since read_only_all_host_map_ may be shared by multiple threads, when the host set changes,
  // we cannot directly modify read_only_all_host_map_.
  if (mutable_cross_priority_host_map_ == nullptr) {
    // Copy old read only host map to mutable host map.
    mutable_cross_priority_host_map_ = std::make_shared<HostMap>(*const_cross_priority_host_map_);
  }

  for (const auto& host : hosts_removed) {
    mutable_cross_priority_host_map_->erase(addressToString(host->address()));
  }

  for (const auto& host : hosts_added) {
    mutable_cross_priority_host_map_->insert({addressToString(host->address()), host});
  }
}

[Screenshot 2024-08-12 at 9:01:11 PM]

because all_hosts is populated from the crossPriorityHostMap:

  for (const HostSharedPtr& host : new_hosts) {
    // To match a new host with an existing host means comparing their addresses.
    auto existing_host = all_hosts.find(addressToString(host->address()));
    const bool existing_host_found = existing_host != all_hosts.end();

crossPriorityHostMap does not store priority metadata internally. So if we are flipping the priority of 34.194.112.169, it is constantly being added to and removed from crossPriorityHostMap.

Before I attempt a PR, I wonder if anyone more familiar with this code would recommend how to introduce a fix. In languages I'm more familiar with, what I would do is keep crossPriorityHostMap as a mapping of priorities to host addresses, and when fetching it, return a set of addresses.

@shulin-sq
Contributor Author

Alright I think I have an idea on how to fix this and will submit a PR soon.

wbpcode pushed a commit that referenced this issue Sep 7, 2024
Commit Message: Use the previously calculated health check state when endpoints move between priority levels repeatedly. This is done by changing how mutable_cross_priority_host_map_ tracks hosts. mutable_cross_priority_host_map_ is used to check whether a host already existed in the previous configuration, in which case its health check should not be recalculated. This worked only some of the time, because each priority's update would remove and then add all of its changed hosts in a map of "ip:port" strings to Host instances. That did not account for an endpoint having two Host representatives in different priorities, as happens when an endpoint changes priority, and there is an edge case where a host can be removed from mutable_cross_priority_host_map_ before the "should we skip active health checks" logic triggers. This PR fixes that by only removing an endpoint from mutable_cross_priority_host_map_ if that removal is executed from the lowest priority. The fix assumes that priority memberships are always calculated starting from the lowest number and moving to the highest.

Additional Description: n/a
Risk Level: medium
Testing: see the note at the bottom
Docs Changes: n/a
Release Notes: added a note to the changelog
Fixes #35243


### Explanation
This was a difficult bug to spot because it only happens when the priority changes multiple times.

For example, let's consider this situation:
- we have an endpoint A that swaps between priorities 0, 1, and 2
- we assume that priorities are always processed starting from 0 and going up (e.g. 0, 1, 2, 3, 4, ...)
- mutable_cross_priority_host_map_ is the "list" in this situation that includes all endpoints from all priorities

When the priority number goes up, things are OK:

0 -> 1
processing priority 0: remove A
processing priority 1: add A

1 -> 2
priority 1: remove A
priority 2: add A

but things get weird when the numbers go down:

2 -> 1, things are still peaceful here
priority 1: add A (but this gets ignored since A is already in the list)
priority 2: remove A (!!!)

1 -> 0: at this point the list does not include A, so any logic that checks whether A exists in the cross-priority host map will fail, and A will be treated as a new endpoint.

without the fix:
https://gist.github.com/shulin-sq/adfb4268f5f199f054e908e3fd7afae8
with the fix:
https://gist.github.com/shulin-sq/7779a341e598d81cfaeca447b0f582d1

### Testing
* functional test described in the issue:
#35243
* unit test, confirmed that it fails before the fix is applied

Signed-off-by: Shulin Jia <shulin@squareup.com>

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Sep 18, 2024

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 25, 2024