Fix healthchecks that do not persist when priority is changed #35748
Envoy/Publish and verify (success)
Check has finished
Details
Check run finished (success ✔️)
The check run can be viewed here:
Envoy/Publish and verify (pr/35748/main@51d6ff0)
Check started by
Request (pr/35748/main@51d6ff0)
@shulin-sq 51d6ff0
#35748 merge
main@c198038
Fix healthchecks that do not persist when priority is changed
Commit Message: Use previously calculated healthcheck when endpoints move priority levels repeatedly. This is done by changing how mutable_cross_priority_host_map_ tracks hosts. mutable_cross_priority_host_map_ is used to track if a host already exists in the previous configuration, and its healthcheck should not be recalculated. This worked only some of the time because it would remove and then add all hosts that have changed in a priority, to a map of string (ip addr:port) to Host instance. However this did not account for when an endpoint had two Host representatives in different priorities, as is in the case when an endpoint changes priorities and there is an edge case where a host can be removed from mutable_cross_priority_host_map_ before the "should we skip activate healthchecks" logic triggers. This PR fixes that by only removing an endpoint from mutable_cross_priority_host_map_ if that removal is executed from the lowest priority. This fix does assume that memberships in priorities are always calculated starting from the lowest number to the highest.
Additional Description:
Risk Level: Med?
Testing: see note at the bottom
Docs Changes: n/a
Release Notes: added a note to the changelog
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue] #35243
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]Explanation
This was a difficult bug to spot because it only happens when priority changes multiple times
For example let's consider this situation:
- we have an endpoint A that swaps between priority 0, 1, and 2
- we assume that priority is always processed starting from 0 and going up (eg, 0, 1, 2, 3, 4... etc)
- mutable_cross_priority_host_map_ is the "list" in the situation that includes all endpoints from all priorities
When priority number goes up things are ok
0 -> 1
processing priority 0: remove A
processing priority 1: add A1 -> 2
priority 1: remove A
priority 2: add Abut things get weird when numbers go down
2 -> 1, things are still peaceful here
priority 1: add A (but this gets ignored since A is already in the list)
priority 2: remove A (!!!)1 -> 0, at this point the list does not include A, so any logic that checks if A exists in the cross priorty host map will fail and A will be considered as a new endpoint.
without the fix: https://gist.github.com/shulin-sq/adfb4268f5f199f054e908e3fd7afae8
with the fix:
https://gist.github.com/shulin-sq/7779a341e598d81cfaeca447b0f582d1Testing
- functional test described in the issue: #35243
- unit test, confirmed that it fails before the fix is applied
Environment
Request variables
Key | Value |
---|---|
ref | fb342f0 |
sha | 51d6ff0 |
pr | 35748 |
base-sha | c198038 |
actor | @shulin-sq |
message | Fix healthchecks that do not persist when priority is changed... |
started | 1725469727.805146 |
target-branch | main |
trusted | false |
Build image
Container image/s (as used in this CI run)
Key | Value |
---|---|
default | envoyproxy/envoy-build-ubuntu:f94a38f62220a2b017878b790b6ea98a0f6c5f9c |
mobile | envoyproxy/envoy-build-ubuntu:mobile-f94a38f62220a2b017878b790b6ea98a0f6c5f9c |
Version
Envoy version (as used in this CI run)
Key | Value |
---|---|
major | 1 |
minor | 32 |
patch | 0 |
dev | true |