Fix destination staleness issue when adding EndpointSlices #12427
This scenario occurs when two different EndpointSlices contain the same pod. Currently, the address from the older EndpointSlice is used for that pod, and this PR changes it so that the address from the newer EndpointSlice is used instead. But it's not clear to me why one would be more correct than the other. Is such a conflict a valid state? Missing delete events will put the index into an incorrect state regardless, so I'm not sure it makes sense to prefer the newer address just in case we have missed a delete. For dual-stack, it seems that we legitimately will have multiple addresses for a single pod. Do we need to refactor this code so that a pod can have multiple addresses instead of just preferring one over the other?
When updating the portPublisher's address set when a new EndpointSlice creation event is received, its addresses were getting overwritten with stale data whenever its IDs already existed in the current pp's address set.

This bug would manifest, for example, when dealing with dual-stack services. We'd have two ES with different IPs but with the same TargetRef pointing to the same pod, so their IDs would be the same. When the creation event for the second ES is received, its addresses would get overridden with the ones from the first ES. Note however that dual-stack services are not supported yet, but we have unit tests for them in `server_test.go`, although this particular failure case wasn't covered. Upcoming dual-stack support will modify this logic a bit so there's no need to add that case.

Nevertheless, there can also be pathological cases in single-stack where this can be a problem. For example, when an ES gets recycled but the deletion event is not caught for some reason, its Address data will be overwritten by the old stale entry when the addition event is received.

## Other Changes

- Remove overriding `newAddressSet.LocalTrafficPolicy` as that is already taken care of inside `pp.endpointSliceToAddresses(slice)`.
- When there are no Add events to send, return early without updating state or metrics.
- Finally, the IPv6 address for the "name1-ipv6" ES test fixture was updated to match the associated pod.
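To make the overwrite concrete, here is a minimal sketch of the two merge orders. The `ID` and `Address` types and the `mergeStale`/`mergeFresh` helpers are hypothetical stand-ins invented for illustration; they are not the actual code in the destination controller.

```go
package main

import "fmt"

// ID and Address are simplified, hypothetical stand-ins for the
// controller's types; the real definitions carry more fields.
type ID struct {
	Namespace string
	Name      string
}

type Address struct {
	IP string
}

// mergeStale mimics the behavior described above: entries already present
// in the current set are copied last, so they replace the fresh ones.
func mergeStale(current, added map[ID]Address) map[ID]Address {
	merged := make(map[ID]Address, len(current)+len(added))
	for id, addr := range added {
		merged[id] = addr
	}
	for id, addr := range current {
		merged[id] = addr // stale entry wins on an ID collision
	}
	return merged
}

// mergeFresh copies the current set first, so the addresses from the newly
// added slice win on an ID collision.
func mergeFresh(current, added map[ID]Address) map[ID]Address {
	merged := make(map[ID]Address, len(current)+len(added))
	for id, addr := range current {
		merged[id] = addr
	}
	for id, addr := range added {
		merged[id] = addr // fresh entry wins on an ID collision
	}
	return merged
}

func main() {
	pod := ID{Namespace: "ns", Name: "name1-1"}
	current := map[ID]Address{pod: {IP: "172.17.0.12"}}
	added := map[ID]Address{pod: {IP: "172.17.0.99"}}

	fmt.Println("stale merge:", mergeStale(current, added)[pod].IP)
	fmt.Println("fresh merge:", mergeFresh(current, added)[pod].IP)
}
```

Neither order helps when both addresses are legitimately valid for the same pod, which is the dual-stack concern raised in review.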
f2a70f9 to dfe3dc6
I have rebased onto the latest main, which now includes the IP family in the Addresses index key, so the first failure case is no longer relevant here.
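As a rough illustration of why an IP family in the index key avoids the dual-stack collision, consider the sketch below. The `ID` struct, its `IPFamily` field, and the `familyOf` and `index` helpers are assumptions made for this example; the actual key layout on main may differ.

```go
package main

import (
	"fmt"
	"net/netip"
)

// ID is a hypothetical index key. Including the IP family means the same
// pod can appear once per family instead of colliding on a single entry.
type ID struct {
	Namespace string
	Name      string
	IPFamily  string // "IPv4" or "IPv6"
}

// familyOf classifies an IP string using the standard library parser.
func familyOf(ip string) (string, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return "", err
	}
	if addr.Is4() || addr.Is4In6() {
		return "IPv4", nil
	}
	return "IPv6", nil
}

// index builds a key-to-IP map for one pod that has several addresses.
func index(namespace, pod string, ips []string) (map[ID]string, error) {
	out := make(map[ID]string, len(ips))
	for _, ip := range ips {
		family, err := familyOf(ip)
		if err != nil {
			return nil, err
		}
		out[ID{Namespace: namespace, Name: pod, IPFamily: family}] = ip
	}
	return out, nil
}

func main() {
	idx, err := index("ns", "name1-1", []string{"172.17.0.12", "2001:db8::12"})
	if err != nil {
		panic(err)
	}
	for id, ip := range idx {
		fmt.Println(id.Name, id.IPFamily, ip)
	}
}
```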
…12427)" This reverts commit 4fccf3e. The early return was causing `pp.addresses = newAddressSet` to not be run when the list of addresses is empty; but setting that is still necessary so that labels are tracked correctly. This was caught by the tap (viz) integration test run in the release workflow.
…12427)" (#12589) This reverts commit 4fccf3e. The early return was causing `pp.addresses = newAddressSet` to not be run when the list of addresses is empty; but setting that is still necessary so that labels are tracked correctly. This was caught by the tap (viz) integration test run in the release workflow.
…inkerd#12427)" (linkerd#12589) This reverts commit 4fccf3e. The early return was causing `pp.addresses = newAddressSet` to not be run when the list of addresses is empty; but setting that is still necessary so that labels are tracked correctly. This was caught by the tap (viz) integration test run in the release workflow. (cherry picked from commit 9bd8c00)
* origin/policy-feat-grpcroute-status-support:
  chore(ci): merge fixes from origin/main
  build(deps): bump tj-actions/changed-files from 44.3.0 to 44.4.0 (linkerd#12588)
  build(deps): bump github.com/fatih/color from 1.16.0 to 1.17.0 (linkerd#12590)
  chore(ci): Remove conditional integration testing (linkerd#12591)
  build(deps-dev): bump sinon from 17.0.1 to 17.0.2 in /web/app (linkerd#12587)
  build(deps): bump github.com/prometheus/client_golang (linkerd#12586)
  build(deps): bump thiserror from 1.0.59 to 1.0.60 (linkerd#12585)
  Revert "Fix destination staleness issue when adding EndpointSlices (linkerd#12427)" (linkerd#12589)
  Add outbound index metrics to the policy controller (linkerd#12429)
  build(deps): bump tj-actions/changed-files from 44.3.0 to 44.4.0 (linkerd#12588)
  build(deps): bump github.com/fatih/color from 1.16.0 to 1.17.0 (linkerd#12590)
  chore(ci): Remove conditional integration testing (linkerd#12591)
  build(deps-dev): bump sinon from 17.0.1 to 17.0.2 in /web/app (linkerd#12587)
  build(deps): bump github.com/prometheus/client_golang (linkerd#12586)
  build(deps): bump thiserror from 1.0.59 to 1.0.60 (linkerd#12585)
  Revert "Fix destination staleness issue when adding EndpointSlices (linkerd#12427)" (linkerd#12589)
  Set backend_not_found route status when any backends are not found (linkerd#12565)
  Add outbound index metrics to the policy controller (linkerd#12429)
Signed-off-by: Mark S <the@wondersmith.dev>
This is a second take on #12427, which avoided a theoretical/correctness issue around overwriting new ES addresses with stale data. We had to revert that in #12589 because the change introduced a bug: by returning early when the ES had no addresses, it failed to properly initialize `addresses` for the portPublisher. This just removes the early return.
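The early-return problem can be sketched as follows. The `AddressSet` and `publisher` types and both methods are hypothetical simplifications written for this example, assuming only what is stated above: the stored set must be assigned even when the slice has no addresses, so that labels are still tracked.

```go
package main

import "fmt"

// AddressSet is a hypothetical, simplified address set carrying labels.
type AddressSet struct {
	Addresses map[string]string // ID -> IP
	Labels    map[string]string
}

// publisher is a hypothetical stand-in for the portPublisher.
type publisher struct {
	addresses AddressSet
	updates   int // number of Add notifications sent
}

// addWithEarlyReturn models the reverted change: when there is nothing to
// send, it returns before storing the new set, so the labels are lost.
func (p *publisher) addWithEarlyReturn(newSet AddressSet) {
	if len(newSet.Addresses) == 0 {
		return
	}
	p.addresses = newSet
	p.updates++
}

// add models the second take: the new set is always stored, and only the
// notification is skipped when there are no addresses.
func (p *publisher) add(newSet AddressSet) {
	p.addresses = newSet
	if len(newSet.Addresses) > 0 {
		p.updates++
	}
}

func main() {
	empty := AddressSet{
		Addresses: map[string]string{},
		Labels:    map[string]string{"service": "name1"},
	}

	var early, fixed publisher
	early.addWithEarlyReturn(empty)
	fixed.add(empty)

	fmt.Println("early return labels:", early.addresses.Labels)
	fmt.Println("always assign labels:", fixed.addresses.Labels)
}
```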
When updating the portPublisher's address set when a new EndpointSlice creation event is received, its addresses were getting overwritten with stale data whenever its IDs already existed in the current pp's address set.
This bug would manifest, for example, when dealing with dual-stack services. We'd have two ES with different IPs but with the same TargetRef pointing to the same pod, so their IDs would be the same. When the creation event for the second ES is received, its addresses would get overridden with the ones from the first ES. Note however that dual-stack services are not supported yet, but we have unit tests for them in `server_test.go` (addressed in #12428), although this particular failure case wasn't covered. Upcoming dual-stack support will modify this logic a bit so there's no need to add that case.
Nevertheless, there can also be pathological cases in single-stack where this can be a problem. For example, when an ES gets recycled but the deletion event is not caught for some reason, its Address data will be overwritten by the old stale entry when the addition event is received.
## Other Changes
- Remove overriding `newAddressSet.LocalTrafficPolicy` as that is already taken care of inside `pp.endpointSliceToAddresses(slice)`.