Fix destination staleness issue when adding EndpointSlices #12427

alpeb · 2024-04-12T19:48:40Z

When updating the portPublisher's address set upon receiving a new EndpointSlice creation event, its addresses were getting overwritten with stale data whenever their IDs already existed in the current pp's address set.

This bug would be manifested for example when dealing with dual-stack services. We'd have two ES with different IPs but with the same TargetRef pointing to the same pod, so their IDs would be the same. When the creation event for the second ES is received, its addresses would get overridden with the ones from the first ES.
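The overwrite described above can be sketched as follows. This is a minimal, simplified model rather than the actual watcher code: `ID`, `Address`, `mergeOldWins` and `mergeNewWins` are hypothetical names, and the IPs are placeholders. It only shows how two slices that resolve to the same ID collide in a map, and how the merge order decides which address survives.

```go
package main

import "fmt"

// ID and Address are simplified, hypothetical stand-ins for the watcher's types.
type ID struct {
	Namespace string
	Name      string
}

type Address struct {
	IP    string
	Slice string // name of the EndpointSlice the address came from
}

// mergeOldWins models the behavior described above: entries already present
// in the current set replace the ones derived from the new slice.
func mergeOldWins(current, fromSlice map[ID]Address) map[ID]Address {
	merged := make(map[ID]Address, len(current)+len(fromSlice))
	for id, addr := range fromSlice {
		merged[id] = addr
	}
	for id, addr := range current {
		merged[id] = addr // existing (possibly stale) entry overwrites the new one
	}
	return merged
}

// mergeNewWins keeps an existing entry only when the new slice does not
// provide an address for the same ID.
func mergeNewWins(current, fromSlice map[ID]Address) map[ID]Address {
	merged := make(map[ID]Address, len(current)+len(fromSlice))
	for id, addr := range current {
		merged[id] = addr
	}
	for id, addr := range fromSlice {
		merged[id] = addr // address from the newly created slice wins
	}
	return merged
}

func main() {
	// Both slices reference the same pod through their TargetRef, so the ID collides.
	pod := ID{Namespace: "ns", Name: "name1-1"}
	current := map[ID]Address{pod: {IP: "10.0.0.1", Slice: "name1-ipv4"}}
	fromSlice := map[ID]Address{pod: {IP: "2001:db8::1", Slice: "name1-ipv6"}}

	fmt.Println(mergeOldWins(current, fromSlice)[pod].IP) // 10.0.0.1
	fmt.Println(mergeNewWins(current, fromSlice)[pod].IP) // 2001:db8::1
}
```

Either way only one address per ID survives, which is the limitation raised in the review below.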

Note however that dual-stack services are not supported yet, but we have unit tests for them in server_test.go although this particular failure case wasn't covered. Upcoming dual-stack support will modify this logic a bit so there's no need to add that case. (addressed in #12428)

Nevertheless, there can also be pathological cases in single-stack where this can be a problem. For example, when an ES gets recycled but its deletion event is missed for some reason, the Address data from the subsequent addition event will be overwritten by the old stale entry.

Other Changes

- Remove overriding `newAddressSet.LocalTrafficPolicy`, as that is already taken care of inside `pp.endpointSliceToAddresses(slice)`.
- When there are no Add events to send, return early without updating state or metrics.

adleong · 2024-04-13T00:31:57Z

This scenario occurs when two different EndpointSlices contain the same pod. Currently, the address from the older EndpointSlice is used for that pod, and this PR changes it so that the address from the newer EndpointSlice is used instead. But it's not clear to me why one would be more correct than the other. Is such a conflict a valid state? Missing delete events will put the index into an incorrect state regardless, so I'm not sure it makes sense to prefer the newer address just in case we have missed a delete.

For dual-stack, it seems that we legitimately will have multiple addresses for a single pod. Do we need to refactor this code so that a pod can have multiple addresses instead of just preferring one over the other?

alpeb · 2024-04-15T15:33:22Z

Thanks for the review @adleong. I think I'm gonna put this one into draft and let's instead focus on #12428 first so that we have a clearer picture of the dual-stack handling before discussing the current issue.

When updating the portPublisher's address set upon receiving a new EndpointSlice creation event, its addresses were getting overwritten with stale data whenever their IDs already existed in the current pp's address set.

This bug would be manifested for example when dealing with dual-stack services. We'd have two ES with different IPs but with the same TargetRef pointing to the same pod, so their IDs would be the same. When the creation event for the second ES is received, its addresses would get overridden with the ones from the first ES.

Note however that dual-stack services are not supported yet, but we have unit tests for them in `server_test.go` although this particular failure case wasn't covered. Upcoming dual-stack support will modify this logic a bit so there's no need to add that case.

Nevertheless, there can also be pathological cases in single-stack where this can be a problem. For example, when an ES gets recycled but its deletion event is missed for some reason, the Address data from the subsequent addition event will be overwritten by the old stale entry.

## Other Changes

- Remove overriding `newAddressSet.LocalTrafficPolicy`, as that is already taken care of inside `pp.endpointSliceToAddresses(slice)`.
- When there are no Add events to send, return early without updating state or metrics.
- Finally, the IPv6 address for the "name1-ipv6" ES test fixture was updated to match the associated pod.

alpeb · 2024-05-06T23:51:01Z

I have rebased with the latest main, which now includes the IP family in the Addresses index key so the first failure case is no longer relevant here.
I haven't been able to come up with a reproducible scenario where the overwriting I described would cause staleness. I still don't see though the reasoning for this overwriting...
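As an illustration of why including the IP family in the index key removes the first failure case, here is a minimal sketch. The `addrKey` type and the `familyOf` and `indexPodIPs` helpers are hypothetical and do not mirror the actual key layout on main; the point is only that two addresses for the same pod no longer share a key once the family is part of it.

```go
package main

import (
	"fmt"
	"net/netip"
)

// addrKey is a hypothetical index key that includes the IP family.
type addrKey struct {
	Namespace string
	Pod       string
	Family    string
}

// familyOf reports "IPv4" or "IPv6" for a textual IP address.
func familyOf(ip string) (string, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return "", err
	}
	if addr.Is4() {
		return "IPv4", nil
	}
	return "IPv6", nil
}

// indexPodIPs stores every IP of a pod under a key that includes its family,
// so an IPv4 and an IPv6 address for the same pod occupy separate entries.
func indexPodIPs(namespace, pod string, ips ...string) (map[addrKey]string, error) {
	index := make(map[addrKey]string, len(ips))
	for _, ip := range ips {
		family, err := familyOf(ip)
		if err != nil {
			return nil, fmt.Errorf("parsing %q: %w", ip, err)
		}
		index[addrKey{Namespace: namespace, Pod: pod, Family: family}] = ip
	}
	return index, nil
}

func main() {
	index, err := indexPodIPs("ns", "name1-1", "10.0.0.1", "2001:db8::1")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(index)) // 2
}
```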

…12427)" This reverts commit 4fccf3e. The early return was causing `pp.addresses = newAddressSet` to not be run when the list of addresses is empty; but setting that is still necessary so that labels are tracked correctly. This was caught by the tap (viz) integration test run in the release workflow.


…inkerd#12427)" (linkerd#12589) This reverts commit 4fccf3e. The early return was causing `pp.addresses = newAddressSet` to not be run when the list of addresses is empty; but setting that is still necessary so that labels are tracked correctly. This was caught by the tap (viz) integration test run in the release workflow. (cherry picked from commit 9bd8c00)

* origin/policy-feat-grpcroute-status-support:
  chore(ci): merge fixes from origin/main
  build(deps): bump tj-actions/changed-files from 44.3.0 to 44.4.0 (linkerd#12588)
  build(deps): bump github.com/fatih/color from 1.16.0 to 1.17.0 (linkerd#12590)
  chore(ci): Remove conditional integration testing (linkerd#12591)
  build(deps-dev): bump sinon from 17.0.1 to 17.0.2 in /web/app (linkerd#12587)
  build(deps): bump github.com/prometheus/client_golang (linkerd#12586)
  build(deps): bump thiserror from 1.0.59 to 1.0.60 (linkerd#12585)
  Revert "Fix destination staleness issue when adding EndpointSlices (linkerd#12427)" (linkerd#12589)
  Add outbound index metrics to the policy controller (linkerd#12429)
  build(deps): bump tj-actions/changed-files from 44.3.0 to 44.4.0 (linkerd#12588)
  build(deps): bump github.com/fatih/color from 1.16.0 to 1.17.0 (linkerd#12590)
  chore(ci): Remove conditional integration testing (linkerd#12591)
  build(deps-dev): bump sinon from 17.0.1 to 17.0.2 in /web/app (linkerd#12587)
  build(deps): bump github.com/prometheus/client_golang (linkerd#12586)
  build(deps): bump thiserror from 1.0.59 to 1.0.60 (linkerd#12585)
  Revert "Fix destination staleness issue when adding EndpointSlices (linkerd#12427)" (linkerd#12589)
  Set backend_not_found route status when any backends are not found (linkerd#12565)
  Add outbound index metrics to the policy controller (linkerd#12429)

Signed-off-by: Mark S <the@wondersmith.dev>

This is a second take on #12427, which avoided a theoretical correctness issue around overwriting new ES addresses with stale data. We had to revert that in #12589 because the change introduced a bug: by returning early when the ES had no addresses, it failed to properly initialize `addresses` for the portPublisher. This just removes the early return.
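A minimal sketch of the early-return problem, under simplifying assumptions: `addressSet`, `publisher`, `countNew` and both methods are hypothetical stand-ins rather than the real portPublisher code. It models a set that carries labels next to its addresses and shows that returning before the assignment leaves those labels untracked when the slice has no addresses.

```go
package main

import "fmt"

// addressSet is a simplified, hypothetical stand-in for the publisher's state.
type addressSet struct {
	Addresses map[string]string // ID -> IP
	Labels    map[string]string // labels tracked alongside the addresses
}

type publisher struct {
	addresses addressSet
	addsSent  int // stands in for the Add events and metrics updates
}

// countNew returns how many IDs in next are absent from current.
func countNew(current, next map[string]string) int {
	n := 0
	for id := range next {
		if _, ok := current[id]; !ok {
			n++
		}
	}
	return n
}

// addWithEarlyReturn models the reverted behavior: with nothing to add,
// the stored state is never replaced.
func (p *publisher) addWithEarlyReturn(next addressSet) {
	added := countNew(p.addresses.Addresses, next.Addresses)
	if added == 0 {
		return // next.Labels is dropped here
	}
	p.addsSent += added
	p.addresses = next
}

// addAlwaysStoring skips the notification when nothing was added,
// but still stores the new set so its labels are kept.
func (p *publisher) addAlwaysStoring(next addressSet) {
	if added := countNew(p.addresses.Addresses, next.Addresses); added > 0 {
		p.addsSent += added
	}
	p.addresses = next
}

func main() {
	// A slice with no addresses still yields a set that carries labels.
	empty := addressSet{Labels: map[string]string{"service": "name1"}}

	early := &publisher{}
	early.addWithEarlyReturn(empty)
	fmt.Println(len(early.addresses.Labels)) // 0

	always := &publisher{}
	always.addAlwaysStoring(empty)
	fmt.Println(always.addresses.Labels["service"]) // name1
}
```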

alpeb requested a review from a team as a code owner April 12, 2024 19:48

alpeb requested review from adleong and mateiidavid April 12, 2024 20:12

alpeb marked this pull request as draft April 15, 2024 15:33

alpeb force-pushed the alpeb/avoid-overwrite-es-addr branch from f2a70f9 to dfe3dc6 Compare May 6, 2024 22:49

adleong approved these changes May 7, 2024


alpeb marked this pull request as ready for review May 8, 2024 14:11

alpeb merged commit 4fccf3e into main May 8, 2024
36 checks passed

alpeb deleted the alpeb/avoid-overwrite-es-addr branch May 8, 2024 14:12

alpeb mentioned this pull request May 20, 2024

Refactor ES addition logic in Destination #12625

Merged
