Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 graceful reset (etcd leave) doesn't work with KubeSpan #8057

Closed
Tracked by #7561
smira opened this issue Dec 12, 2023 · 0 comments · Fixed by #8060
Closed
Tracked by #7561

🐛 graceful reset (etcd leave) doesn't work with KubeSpan #8057

smira opened this issue Dec 12, 2023 · 0 comments · Fixed by #8060
Assignees

Comments

@smira
Copy link
Member

smira commented Dec 12, 2023

The machine leaves Discovery Service earlier than leaving etcd, so by the time etcd leave happens, links to the controlplane nodes are already broken, and etcd member can't communicate with other peers.

The fix is to defer leaving Discovery Service on reset to some later phase, when network connectivity via KubeSpan is no longer needed.

@smira smira self-assigned this Dec 12, 2023
smira added a commit to smira/talos that referenced this issue Dec 13, 2023
Fixes siderolabs#8057

I went back and forth on the way to fix it exactly, and ended up with a
pretty simple version of a fix.

The problem was that discovery service was removing the member at the
initial phase of reset, which actually still requires KubeSpan to be up:

* leaving `etcd` (need to talk to other members)
* stopping pods (might need to talk to Kubernetes API with some CNIs)

Now leaving discovery service happens way later, when network
interactions are no longer required.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
smira added a commit to smira/talos that referenced this issue Dec 13, 2023
Fixes siderolabs#8057

I went back and forth on the way to fix it exactly, and ended up with a
pretty simple version of a fix.

The problem was that discovery service was removing the member at the
initial phase of reset, which actually still requires KubeSpan to be up:

* leaving `etcd` (need to talk to other members)
* stopping pods (might need to talk to Kubernetes API with some CNIs)

Now leaving discovery service happens way later, when network
interactions are no longer required.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
smira added a commit to smira/talos that referenced this issue Dec 14, 2023
Fixes siderolabs#8057

I went back and forth on the way to fix it exactly, and ended up with a
pretty simple version of a fix.

The problem was that discovery service was removing the member at the
initial phase of reset, which actually still requires KubeSpan to be up:

* leaving `etcd` (need to talk to other members)
* stopping pods (might need to talk to Kubernetes API with some CNIs)

Now leaving discovery service happens way later, when network
interactions are no longer required.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 10c59a6)
smira added a commit to smira/talos that referenced this issue Dec 14, 2023
Fixes siderolabs#8057

I went back and forth on the way to fix it exactly, and ended up with a
pretty simple version of a fix.

The problem was that discovery service was removing the member at the
initial phase of reset, which actually still requires KubeSpan to be up:

* leaving `etcd` (need to talk to other members)
* stopping pods (might need to talk to Kubernetes API with some CNIs)

Now leaving discovery service happens way later, when network
interactions are no longer required.

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 10c59a6)
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 7, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant