Namespace migration - Fix potential namespace migration problem with one node cluster #3188

Merged

merged 2 commits into tigera:master from fix-migration-one-node-cluster on Apr 8, 2024

Conversation

@mihivagyok (Contributor) commented Feb 21, 2024

Description

kind/bug

The following can happen on a single-node cluster:

  • The operator updates kube-system/calico-node with the node selector; this succeeds:
{"level":"info","ts":"2024-02-15T12:37:24Z","logger":"controller_windows","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"kube-system","Request.Name":"kube-dns"}
{"level":"info","ts":"2024-02-15T12:37:24Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:37:29Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:37:34Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:37:39Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:37:44Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:37:49Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:37:54Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:37:59Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:38:04Z","logger":"controller_installation","msg":"All kube-system calico/node pods are now ready after nodeSelector update","Request.Namespace":"","Request.Name":"default"}
  • Once calico-node is ready, the operator starts making room for typha, which means it scales the kube-system/calico-typha deployment down to 0 replicas:
{"level":"info","ts":"2024-02-15T12:38:05Z","logger":"controller_installation","msg":"Scaling kube-system/calico-typha deployment to 0 replicas to make room for migration","Request.Namespace":"","Request.Name":"default"}
  • The next log shows that the migration starts after 3 minutes. By that time, however, kube-system/calico-node has already gone into CrashLoopBackOff and the operator gets stuck in this state: it tries to calculate the ready and desired calico-node counts, but the math never works out, so it does not continue; the calculation in waitUntilNodeCanBeMigrated() (ksR, ksD, csR, csD) never resolves. A sketch of this check follows the logs below.

> kubectl get pods --all-namespaces -o wide
NAMESPACE                     NAME                                             READY   STATUS             RESTARTS         AGE   IP               NODE          NOMINATED NODE   READINESS GATES
calico-system                 calico-kube-controllers-56797fd6cc-q4lvg         1/1     Running            0                52m   172.30.234.155   10.39.81.71   <none>           <none>
calico-system                 calico-typha-59f74df8d6-krwhz                    1/1     Running            0                52m   10.39.81.71      10.39.81.71   <none>           <none>
kube-system                   calico-node-dvg9v                                0/1     CrashLoopBackOff   17 (3m16s ago)   52m   10.39.81.71      10.39.81.71   <none>           <none>

{"level":"info","ts":"2024-02-15T12:41:05Z","logger":"controller_installation","msg":"nodes to migrate","Request.Namespace":"","Request.Name":"default","count":1}
{"level":"info","ts":"2024-02-15T12:41:05Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:06Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:07Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:08Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:09Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:10Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:11Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:12Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:13Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:14Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}

Then, once we add the projectcalico.org/operator-node-migration=migrated label to the node, the migration can continue:

➜ kubectl label node 10.171.199.85 projectcalico.org/operator-node-migration=migrated --overwrite
node/10.171.199.85 labeled

{"level":"info","ts":"2024-02-15T12:46:39Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:46:40Z","logger":"controller_installation","msg":"Migrated 1 out of 1 nodes","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:46:40Z","logger":"controller_installation","msg":"waiting for calico-node to 1 replicas, ready at 1, up-to-date at 1, available at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:46:45Z","logger":"controller_installation","msg":"waiting for calico-node to 1 replicas, ready at 1, up-to-date at 1, available at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:46:50Z","logger":"controller_installation","msg":"waiting for calico-node to 1 replicas, ready at 1, up-to-date at 1, available at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:46:52Z","logger":"controller_windows","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"periodic-5m0s-reconcile-event"}
{"level":"info","ts":"2024-02-15T12:46:55Z","logger":"controller_installation","msg":"waiting for calico-node to 1 replicas, ready at 1, up-to-date at 1, available at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:47:00Z","logger":"controller_installation","msg":"waiting for calico-node to 1 replicas, ready at 1, up-to-date at 1, available at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:47:05Z","logger":"controller_installation","msg":"waiting for calico-node to 1 replicas, ready at 1, up-to-date at 1, available at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:47:10Z","logger":"controller_installation","msg":"Namespace migration complete","Request.Namespace":"","Request.Name":"default"}

This behaviour can be triggered if minReadySeconds: 15 is set for typhaDeployment in the Installation CR.
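As a minimal illustration (field path per my reading of the operator.tigera.io/v1 typhaDeployment override, trimmed to the relevant setting), the trigger looks roughly like this in the Installation CR:

```yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  typhaDeployment:
    spec:
      # Reported above to trigger the stuck migration on a one-node cluster.
      minReadySeconds: 15
```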

Please note that sometimes this can happen faster:

{"level":"info","ts":"2024-02-15T13:51:18Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T13:51:23Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T13:51:28Z","logger":"controller_installation","msg":"All kube-system calico/node pods are now ready after nodeSelector update","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T13:51:28Z","logger":"controller_installation","msg":"Scaling kube-system/calico-typha deployment to 0 replicas to make room for migration","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T13:51:48Z","logger":"controller_installation","msg":"nodes to migrate","Request.Namespace":"","Request.Name":"default","count":1}
{"level":"info","ts":"2024-02-15T13:51:48Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T13:51:49Z","logger":"controller_installation","msg":"Migrated 1 out of 1 nodes","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T13:51:49Z","logger":"controller_installation","msg":"waiting for calico-node to 1 replicas, ready at 0, up-to-date at 1, available at 0","Request.Namespace":"","Request.Name":"default"}

That's why this PR adds a simple piece of logic that only takes effect on single-node clusters (a sketch follows below).
The change is well tested: namespace migration is built into our CI process, so dozens of tests are executed.
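The idea is roughly the following (an illustrative sketch extending the earlier one under the same assumptions, not the actual diff; the function name is hypothetical): when the cluster has exactly one node, skip the ready-count gating so the single node can always be migrated.

```go
package migration

// nodeCanBeMigratedOneNodeAware extends the earlier sketch with the
// single-node special case described in this PR (illustrative only; the
// actual change to waitUntilNodeCanBeMigrated() may differ).
func nodeCanBeMigratedOneNodeAware(totalNodes, ksR, ksD, csR, csD, maxUnavailable int) bool {
	// On a one-node cluster the old kube-system calico-node pod can never
	// become ready again once typha has been scaled to 0, so waiting on the
	// usual ready-count math would deadlock; allow the migration instead.
	if totalNodes == 1 {
		return true
	}
	desired := ksD + csD
	ready := ksR + csR
	return desired-ready < maxUnavailable
}
```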

I would like to cherry-pick this change to 1.32 and 1.33 if possible. Thanks!

For PR author

  • Tests for change.
  • If changing pkg/apis/, run make gen-files
  • If changing versions, run make gen-versions

For PR reviewers

A note for code reviewers - all pull requests must have the following:

  • Milestone set according to targeted release.
  • Appropriate labels:
    • kind/bug if this is a bugfix.
    • kind/enhancement if this is a new feature.
    • enterprise if this PR applies to Calico Enterprise only.

@mihivagyok mihivagyok requested a review from a team as a code owner February 21, 2024 15:42
@marvin-tigera marvin-tigera added this to the v1.34.0 milestone Feb 21, 2024
@mihivagyok mihivagyok changed the title from "Namespace migrations - Fix potential namespace migration problem with one node cluster" to "Namespace migration - Fix potential namespace migration problem with one node cluster" Feb 21, 2024
@caseydavenport (Member) commented:

I think this checks out, but @tmjd knows the logic here a bit better than I do.

@caseydavenport (Member) commented:

/sem-approve

@mihivagyok (Contributor, Author) commented:

@tmjd Hi! Could you please take a look? Thank you!

@mihivagyok mihivagyok force-pushed the fix-migration-one-node-cluster branch from 104891e to 220227a on April 8, 2024 11:42
@mihivagyok mihivagyok force-pushed the fix-migration-one-node-cluster branch from 220227a to 5fe6538 on April 8, 2024 11:45
@mihivagyok mihivagyok requested a review from tmjd April 8, 2024 11:46
@tmjd (Member) commented Apr 8, 2024

/sem-approve

@tmjd (Member) left a review comment:

LGTM

@tmjd (Member) commented Apr 8, 2024

Thank you for this fix @mihivagyok
I'll be mentioning the nice work you've done on this and other PRs in the Calico community meeting on Wednesday April 10. You're welcome to join if you're available and interested.

@tmjd tmjd merged commit 788f56f into tigera:master Apr 8, 2024
5 checks passed