Fix healthchecks that do not persist when priority is changed #35748
Conversation
Hi @shulin-sq, welcome and thank you for your contribution. We will try to review your Pull Request as quickly as possible. In the meantime, please take a look at the contribution guidelines if you have not done so already.
Thanks for tackling this!
Force-pushed from 430b3f6 to 993b5e1
@adisuissa thanks for pointing it out! I made some changes
Force-pushed from 993b5e1 to e2659f1
```diff
@@ -1403,6 +1403,133 @@ TEST_F(EdsTest, EndpointMovedToNewPriority) {
   EXPECT_FALSE(hosts[1]->healthFlagGet(Host::HealthFlag::FAILED_ACTIVE_HC));
   EXPECT_FALSE(hosts[1]->healthFlagGet(Host::HealthFlag::PENDING_DYNAMIC_REMOVAL));
 }
+
+// 1 -> 2
+// Moves all the endpoints to priority 1.
```
priority 2 right?
yes you're right. I'll just delete this line since I don't mention it below.
```diff
@@ -987,7 +988,18 @@ void MainPrioritySetImpl::updateCrossPriorityHostMap(const HostVector& hosts_add
   }

   for (const auto& host : hosts_removed) {
-    mutable_cross_priority_host_map_->erase(addressToString(host->address()));
+    auto existing_host = mutable_cross_priority_host_map_->find(addressToString(host->address()));
```
@adisuissa is there a preference for using find() vs [] in the project? https://github.com/abseil/abseil-cpp/blob/5a01d0f77e37493570e35aedaa95c4bcf1673c7c/absl/container/internal/raw_hash_set.h#L3361 find() might be O(n) as opposed to O(1), but I saw a few other places in the codebase using find() with maps as well.
Can you refer me to where you saw that the time complexity of find is O(n) and of the operator is O(1)?
I think both should be O(1) average complexity, and O(n) worst-case (due to hash conflicts), but I may be mistaken.
nit: const
https://github.com/abseil/abseil-cpp/blob/master/absl/container/internal/raw_hash_set.h#L3545 I don't know this code well and couldn't find any docs that explain this definitively, so I will defer to your recommendation. I would be fine with keeping find(), since other places use find() to access this map and it's an established pattern.
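For context on this sub-thread: the practical difference between the two is less about complexity (both are O(1) on average and O(n) worst case on hash maps) and more about semantics, since operator[] inserts a default-constructed value on a miss, which mutates the map. A minimal sketch using std::unordered_map, which has the same lookup semantics as absl::flat_hash_map:

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
  std::unordered_map<std::string, int> map{{"10.0.0.1:80", 1}};

  // find() is a pure lookup: a miss leaves the map unchanged.
  auto it = map.find("10.0.0.2:80");
  std::cout << std::boolalpha << (it == map.end()) << ' ' << map.size() << '\n';
  // prints: true 1

  // operator[] default-constructs and inserts a value on a miss, so it is
  // unsuitable for a read-only membership check.
  int v = map["10.0.0.2:80"];
  std::cout << v << ' ' << map.size() << '\n';  // prints: 0 2
}
```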
Thanks for tracking and fixing this issue.
Overall LGTM, modulo minor comments.
changelogs/current.yaml (Outdated)

```diff
@@ -2,6 +2,9 @@ date: Pending

 behavior_changes:
 # *Changes that are expected to cause an incompatibility if applicable; deployment changes are likely required*
+- area: upstream
```
this should probably move down to the "bug fixes" section.
nit: create a (const) temp var holding the addressToString(host->address()) result, and pass it here and below.
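In other words, something along these lines (the variable name is just for illustration, not the merged code):

```cpp
const std::string host_address = addressToString(host->address());
const auto existing_host = mutable_cross_priority_host_map_->find(host_address);
// ... later in the loop, pass host_address to erase() as well instead of
// recomputing addressToString(host->address()).
```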
```cpp
auto existing_host = mutable_cross_priority_host_map_->find(addressToString(host->address()));

if (existing_host != mutable_cross_priority_host_map_->end()) {
  // an earlier priority added this host, which indicates that the host moved from
```
style-nit: comments start with capital letter.
```cpp
if (existing_host->second->priority() < priority) {
  continue;
} else {
  mutable_cross_priority_host_map_->erase(addressToString(host->address()));
}
```
Please convert to `if (existing_host->second->priority() >= priority) ...` and remove the `continue` part.
```cpp
// 1 -> 2
```
I think this bug should have its own dedicated test (and please add a comment referencing the regression issue at the top of the test).
/wait
Force-pushed from a871846 to 94cfeab
/retest
@adisuissa @wbpcode I have updated the PR to address the comments from the last round.
@wbpcode gentle nudge to review
I will get some free time to review this PR this weekend. Thanks. :)
LGTM overall with only one comment. Thanks for your contribution.
/wait-any
```cpp
// An earlier priority added this host, which indicates that the host moved from
// a higher value priority to a lower value priority. We should not remove it from the
// cross priority map.
if (existing_host->second->priority() >= priority) {
```
When will the `>` case happen? Or should we only do the erasing when the priority is equal?
PS: I think this is a design problem of cross_priority_map which I can improve in the future.
Changed, thanks for spotting. I think the thing I was thinking of is what I described in the comment:

// An earlier priority added this host, which indicates that the host moved from
// a higher value priority to a lower value priority. We should not remove it from the
// cross priority map.

but with the operator flipped it's more obvious that we only need to handle deletions when the priorities are equal. I've also updated the comment accordingly.
Regarding the design of cross_priority_map: I did notice that because it's keyed on just the endpoint address, it seems not to capture differences in an endpoint's metadata, but given that I was only focused on the priority part I wasn't sure if there were any other bugs here.
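Putting this thread together, here is a minimal sketch of the shape the removal loop takes after the change (simplified, not the verbatim merged code; addressToString, the map, and priority are taken from the PR's diff):

```cpp
for (const auto& host : hosts_removed) {
  const std::string address = addressToString(host->address());
  const auto existing_host = mutable_cross_priority_host_map_->find(address);
  // Only erase when the removal comes from the priority that the tracked host
  // still belongs to. If the host moved to a lower priority, that priority has
  // already re-added it earlier in this pass, so the stale removal from the
  // old priority must leave the entry alone.
  if (existing_host != mutable_cross_priority_host_map_->end() &&
      existing_host->second->priority() == priority) {
    mutable_cross_priority_host_map_->erase(address);
  }
}
```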
Yeah, it was initially designed for the LB to look up an address quickly, which could tolerate minor inaccuracy. But now I need to re-think it.
Force-pushed from 94cfeab to 0ec9935
Force-pushed from 0ec9935 to fd7e2ba
LGTM. Thanks.
Could you merge main to kick the CI?
Force-pushed from fd7e2ba to 51d6ff0
@wbpcode done. Can you help with merging?
@wbpcode just checking in: I did a rebase from master. Did that not kick off the CI?
Commit Message: Use the previously calculated healthcheck when endpoints move priority levels repeatedly. This is done by changing how mutable_cross_priority_host_map_ tracks hosts. mutable_cross_priority_host_map_ is used to track whether a host already exists in the previous configuration, in which case its healthcheck should not be recalculated. This only worked some of the time, because each priority's update would remove and then re-add all of its changed hosts in a map of string (ip addr:port) to Host instance. However, this did not account for an endpoint having two Host representatives in different priorities, as is the case when an endpoint changes priority, and there is an edge case where a host can be removed from mutable_cross_priority_host_map_ before the "should we skip active healthchecks" logic triggers. This PR fixes that by only removing an endpoint from mutable_cross_priority_host_map_ if that removal is executed from the lowest priority. This fix assumes that memberships in priorities are always calculated starting from the lowest number to the highest.
Additional Description:
Risk Level: Med?
Testing: see note at the bottom
Docs Changes: n/a
Release Notes: added a note to the changelog
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue] #35243
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]
Explanation
This was a difficult bug to spot because it only happens when the priority changes multiple times.
For example, let's consider this situation.
When the priority number goes up, things are ok:
0 -> 1
processing priority 0: remove A
processing priority 1: add A
1 -> 2
processing priority 1: remove A
processing priority 2: add A
but things get weird when the numbers go down:
2 -> 1, things are still peaceful here
processing priority 1: add A (but this gets ignored since A is already in the list)
processing priority 2: remove A (!!!)
1 -> 0: at this point the map no longer includes A, so any logic that checks whether A exists in the cross priority host map will fail, and A will be considered a new endpoint.
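To make the sequence above concrete, here is a minimal, self-contained simulation of the pre-fix bookkeeping. A plain std::map from address to priority stands in for mutable_cross_priority_host_map_ (which really maps address strings to Host instances), and addHost/removeHost are hypothetical helpers, not the real API:

```cpp
#include <iostream>
#include <map>
#include <string>

// Stand-in for mutable_cross_priority_host_map_: address -> priority.
using CrossPriorityMap = std::map<std::string, unsigned>;

// Pre-fix behavior: adds are insert-if-absent, removals erase unconditionally.
void addHost(CrossPriorityMap& m, const std::string& addr, unsigned priority) {
  m.emplace(addr, priority);  // no-op if addr is already tracked
}
void removeHost(CrossPriorityMap& m, const std::string& addr) { m.erase(addr); }

int main() {
  CrossPriorityMap m;
  const std::string a = "10.0.0.1:80";
  addHost(m, a, 0);  // initial config: A at priority 0

  // 0 -> 1 (priorities are processed from lowest to highest)
  removeHost(m, a);  // processing priority 0: remove A
  addHost(m, a, 1);  // processing priority 1: add A

  // 1 -> 2
  removeHost(m, a);  // processing priority 1: remove A
  addHost(m, a, 2);  // processing priority 2: add A

  // 2 -> 1: the lower priority's add runs first and is ignored
  addHost(m, a, 1);  // processing priority 1: add A (no-op, already tracked)
  removeHost(m, a);  // processing priority 2: remove A (!!!)

  // A is gone, so a later 1 -> 0 move treats A as a brand-new endpoint.
  std::cout << "A tracked: " << m.count(a) << '\n';  // prints: A tracked: 0
}
```

With the fix, that final removal becomes conditional: a priority's update only erases an entry whose tracked host still belongs to that priority, so an endpoint that has just moved down to a lower priority survives the pass.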
Without the fix: https://gist.github.com/shulin-sq/adfb4268f5f199f054e908e3fd7afae8
With the fix: https://gist.github.com/shulin-sq/7779a341e598d81cfaeca447b0f582d1
Testing