Fix destination staleness issue when adding EndpointSlices #12427

Merged: alpeb merged 1 commit into main from alpeb/avoid-overwrite-es-addr on May 8, 2024

@alpeb alpeb commented Apr 12, 2024

When the portPublisher's address set was updated on receiving a new EndpointSlice creation event, the new slice's addresses were getting overwritten with stale data whenever their IDs already existed in the current pp's address set.

This bug would be manifested for example when dealing with dual-stack services. We'd have two ES with different IPs but with the same TargetRef pointing to the same pod, so their IDs would be the same. When the creation event for the second ES is received, its addresses would get overridden with the ones from the first ES.

Note, however, that dual-stack services are not supported yet. We do have unit tests for them in server_test.go, although this particular failure case wasn't covered. Upcoming dual-stack support will modify this logic a bit, so there's no need to add that case (addressed in #12428).

Nevertheless, there can also be pathological cases in single-stack where this can be a problem. For example, when an ES gets recycled but the deletion event is not caught for some reason, the Address data in the subsequent addition event will be overwritten by the old stale entry.
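
To make the overwrite concrete, here is a minimal sketch of the two merge strategies. The types `ID` and `Address` and the functions `mergeOverwriting` and `mergePreferFresh` are hypothetical simplifications for illustration, not the actual destination controller code.

```go
package main

import "fmt"

// ID and Address are simplified, hypothetical stand-ins for the
// controller's types; only the merge logic matters here.
type ID string

type Address struct {
	IP string
}

// mergeOverwriting starts from the fresh addresses and then copies every
// existing entry on top, so an existing (possibly stale) entry replaces a
// fresh one that shares its ID.
func mergeOverwriting(existing, fresh map[ID]Address) map[ID]Address {
	merged := make(map[ID]Address, len(existing)+len(fresh))
	for id, addr := range fresh {
		merged[id] = addr
	}
	for id, addr := range existing {
		merged[id] = addr
	}
	return merged
}

// mergePreferFresh copies an existing entry only when the fresh set has
// no entry for that ID, so data from the new slice wins on conflict.
func mergePreferFresh(existing, fresh map[ID]Address) map[ID]Address {
	merged := make(map[ID]Address, len(existing)+len(fresh))
	for id, addr := range fresh {
		merged[id] = addr
	}
	for id, addr := range existing {
		if _, ok := merged[id]; !ok {
			merged[id] = addr
		}
	}
	return merged
}

func main() {
	existing := map[ID]Address{"ns/pod-1": {IP: "10.0.0.1"}}
	fresh := map[ID]Address{"ns/pod-1": {IP: "10.0.0.2"}}

	fmt.Println(mergeOverwriting(existing, fresh)["ns/pod-1"].IP) // stale: 10.0.0.1
	fmt.Println(mergePreferFresh(existing, fresh)["ns/pod-1"].IP) // fresh: 10.0.0.2
}
```

With the first strategy the address from the older slice survives; with the second, the address from the newly added slice does.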

Other Changes

  • Remove overriding newAddressSet.LocalTrafficPolicy as that is already taken care of inside pp.endpointSliceToAddresses(slice).
  • When there are no Add events to send, return early without updating state or metrics.

@alpeb alpeb requested a review from a team as a code owner April 12, 2024 19:48
@alpeb alpeb requested review from adleong and mateiidavid April 12, 2024 20:12

adleong commented Apr 13, 2024

This scenario occurs when two different EndpointSlices contain the same pod. Currently, the address from the older EndpointSlice is used for that pod, and this PR changes it so that the address from the newer EndpointSlice is used instead. But it's not clear to me why one would be more correct than the other. Is such a conflict a valid state? Missing delete events will put the index into an incorrect state regardless, so I'm not sure it makes sense to prefer the newer address just in case we have missed a delete.

For dual-stack, it seems that we legitimately will have multiple addresses for a single pod. Do we need to refactor this code so that a pod can have multiple addresses instead of just preferring one over the other?


alpeb commented Apr 15, 2024

Thanks for the review @adleong. I think I'm gonna put this one into draft and let's instead focus on #12428 first so that we have a clearer picture of the dual-stack handling before discussing the current issue.

@alpeb alpeb marked this pull request as draft April 15, 2024 15:33
When the portPublisher's address set was updated on receiving a new
EndpointSlice creation event, the new slice's addresses were getting
overwritten with stale data whenever their IDs already existed in the
current pp's address set.

This bug would be manifested for example when dealing with dual-stack
services. We'd have two ES with different IPs but with the same
TargetRef pointing to the same pod, so their IDs would be the same. When
the creation event for the second ES is received, its addresses would
get overridden with the ones from the first ES.

Note however that dual-stack services are not supported yet, but we have
unit tests for them in `server_test.go` although this particular failure
case wasn't covered. Upcoming dual-stack support will modify this logic
a bit so there's no need to add that case.

Nevertheless, there can also be pathological cases in single-stack where
this can be a problem. For example, when an ES gets recycled but the
deletion event is not caught for some reason, the Address data in the
subsequent addition event will be overwritten by the old stale entry.

## Other Changes

- Remove overriding `newAddressSet.LocalTrafficPolicy` as that is
  already taken care of inside `pp.endpointSliceToAddresses(slice)`.
- When there are no Add events to send, return early without updating
  state or metrics.
- Finally, the IPv6 address for the "name1-ipv6" ES test fixture was
  updated to match the associated pod.
@alpeb alpeb force-pushed the alpeb/avoid-overwrite-es-addr branch from f2a70f9 to dfe3dc6 Compare May 6, 2024 22:49

alpeb commented May 6, 2024

I have rebased with the latest main, which now includes the IP family in the Addresses index key so the first failure case is no longer relevant here.
I haven't been able to come up with a reproducible scenario where the overwriting I described would cause staleness. I still don't see the reasoning for this overwriting, though...
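
As a rough illustration of why including the IP family in the index key removes the dual-stack collision, consider the sketch below. The names `IPFamily`, `AddressKey`, `Endpoint`, and `buildIndex` are hypothetical and chosen for this example; the real key type in main may look different.

```go
package main

import "fmt"

// IPFamily distinguishes IPv4 from IPv6 entries (hypothetical type).
type IPFamily string

const (
	IPv4 IPFamily = "IPv4"
	IPv6 IPFamily = "IPv6"
)

// AddressKey is a hypothetical index key. Because the family is part of
// the key, one pod can hold one entry per family without collisions.
type AddressKey struct {
	Namespace string
	Pod       string
	Family    IPFamily
}

// Endpoint is a simplified view of one address taken from an EndpointSlice.
type Endpoint struct {
	Namespace string
	Pod       string
	Family    IPFamily
	IP        string
}

// buildIndex stores each endpoint under its (namespace, pod, family) key.
func buildIndex(endpoints []Endpoint) map[AddressKey]string {
	index := make(map[AddressKey]string, len(endpoints))
	for _, ep := range endpoints {
		key := AddressKey{Namespace: ep.Namespace, Pod: ep.Pod, Family: ep.Family}
		index[key] = ep.IP
	}
	return index
}

func main() {
	index := buildIndex([]Endpoint{
		{Namespace: "ns", Pod: "pod-1", Family: IPv4, IP: "10.0.0.1"},
		{Namespace: "ns", Pod: "pod-1", Family: IPv6, IP: "2001:db8::1"},
	})
	// Both addresses of pod-1 are kept because their keys differ by family.
	fmt.Println(len(index))
}
```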

@alpeb alpeb marked this pull request as ready for review May 8, 2024 14:11
@alpeb alpeb merged commit 4fccf3e into main May 8, 2024
36 checks passed
@alpeb alpeb deleted the alpeb/avoid-overwrite-es-addr branch May 8, 2024 14:12
alpeb added a commit that referenced this pull request May 13, 2024
…12427)"

This reverts commit 4fccf3e.

The early return was causing `pp.addresses = newAddressSet` to not be run when the list of addresses is empty; but setting that is still necessary so that labels are tracked correctly.

This was caught by the tap (viz) integration test run in the release workflow.
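
The failure mode described above can be sketched as follows. This is a simplified illustration under assumed types (`AddressSet`, `publisher`, and both method names are hypothetical), not the reverted code itself, and the exact early-return condition in that code is not reproduced here.

```go
package main

import "fmt"

// AddressSet is a simplified, hypothetical stand-in: addresses plus the
// labels that must be tracked even when there are no addresses.
type AddressSet struct {
	Addresses map[string]string // ID -> IP
	Labels    map[string]string
}

type publisher struct {
	addresses AddressSet
	notified  int // number of Add notifications sent
}

// addWithEarlyReturn returns before recording state when there is nothing
// to announce, so the labels carried by newSet are lost.
func (p *publisher) addWithEarlyReturn(newSet AddressSet) {
	if len(newSet.Addresses) == 0 {
		return
	}
	p.addresses = newSet
	p.notified++
}

// add always records the new state and only skips the notification when
// there are no addresses to announce.
func (p *publisher) add(newSet AddressSet) {
	p.addresses = newSet
	if len(newSet.Addresses) == 0 {
		return
	}
	p.notified++
}

func main() {
	empty := AddressSet{
		Addresses: map[string]string{},
		Labels:    map[string]string{"app": "web"},
	}

	buggy, fixed := &publisher{}, &publisher{}
	buggy.addWithEarlyReturn(empty)
	fixed.add(empty)

	fmt.Printf("early return labels:  %v\n", buggy.addresses.Labels)
	fmt.Printf("always-assign labels: %v\n", fixed.addresses.Labels)
}
```

In this sketch neither variant sends a notification for the empty set, but only the second one keeps the labels.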
adleong pushed a commit that referenced this pull request May 13, 2024
…12427)" (#12589)

the-wondersmith pushed a commit to the-wondersmith/linkerd2 that referenced this pull request May 15, 2024
…inkerd#12427)" (linkerd#12589)

(cherry picked from commit 9bd8c00)
the-wondersmith added a commit to the-wondersmith/linkerd2 that referenced this pull request May 15, 2024
* origin/policy-feat-grpcroute-status-support:
  chore(ci): merge fixes from origin/main
  build(deps): bump tj-actions/changed-files from 44.3.0 to 44.4.0 (linkerd#12588)
  build(deps): bump github.com/fatih/color from 1.16.0 to 1.17.0 (linkerd#12590)
  chore(ci): Remove conditional integration testing (linkerd#12591)
  build(deps-dev): bump sinon from 17.0.1 to 17.0.2 in /web/app (linkerd#12587)
  build(deps): bump github.com/prometheus/client_golang (linkerd#12586)
  build(deps): bump thiserror from 1.0.59 to 1.0.60 (linkerd#12585)
  Revert "Fix destination staleness issue when adding EndpointSlices (linkerd#12427)" (linkerd#12589)
  Add outbound index metrics to the policy controller (linkerd#12429)
  Set backend_not_found route status when any backends are not found (linkerd#12565)

Signed-off-by: Mark S <the@wondersmith.dev>
alpeb added a commit that referenced this pull request May 20, 2024
This is a second take on #12427, which avoided a theoretical/correctness
issue around overwriting new ES addresses with stale data.

We had to revert that in #12589 because the change introduced a bug: it
returned early when the ES had no addresses and failed to properly
initialize `addresses` for the portPublisher.

This just removes the early return.
alpeb added a commit that referenced this pull request May 22, 2024
This is a second take on #12427, which avoided a theoretical/correctness
issue around overwriting new ES addresses with stale data.

We had to revert that in #12589 because the change introduced a bug: it
returned early when the ES had no addresses and failed to properly
initialize `addresses` for the portPublisher.

This just removes the early return.