Namespace migration - Fix potential namespace migration problem with one node cluster #3188

Merged

merged 2 commits into tigera:master from fix-migration-one-node-cluster on Apr 8, 2024

Conversation

@mihivagyok (Contributor) commented Feb 21, 2024

Description

kind/bug

The following can happen on a single-node cluster:

  • The operator updates kube-system/calico-node with the node selector; this succeeds:
{"level":"info","ts":"2024-02-15T12:37:24Z","logger":"controller_windows","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"kube-system","Request.Name":"kube-dns"}
{"level":"info","ts":"2024-02-15T12:37:24Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:37:29Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:37:34Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:37:39Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:37:44Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:37:49Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:37:54Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:37:59Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:38:04Z","logger":"controller_installation","msg":"All kube-system calico/node pods are now ready after nodeSelector update","Request.Namespace":"","Request.Name":"default"}
  • Once calico-node is ready, the operator starts making room for typha, which means it scales the kube-system/calico-typha deployment down to 0 replicas:
{"level":"info","ts":"2024-02-15T12:38:05Z","logger":"controller_installation","msg":"Scaling kube-system/calico-typha deployment to 0 replicas to make room for migration","Request.Namespace":"","Request.Name":"default"}
  • The next log shows that the migration starts after 3 minutes. By that time, however, kube-system/calico-node has already gone into CrashLoopBackOff and the operator gets stuck in this state: it tries to calculate the ready and desired calico-node counts, but the math never works out, so it does not continue; the calculation in waitUntilNodeCanBeMigrated() (ksR, ksD, csR, csD) never resolves. A sketch of this check follows the logs below.

> kubectl get pods --all-namespaces -o wide
NAMESPACE                     NAME                                             READY   STATUS             RESTARTS         AGE   IP               NODE          NOMINATED NODE   READINESS GATES
calico-system                 calico-kube-controllers-56797fd6cc-q4lvg         1/1     Running            0                52m   172.30.234.155   10.39.81.71   <none>           <none>
calico-system                 calico-typha-59f74df8d6-krwhz                    1/1     Running            0                52m   10.39.81.71      10.39.81.71   <none>           <none>
kube-system                   calico-node-dvg9v                                0/1     CrashLoopBackOff   17 (3m16s ago)   52m   10.39.81.71      10.39.81.71   <none>           <none>

{"level":"info","ts":"2024-02-15T12:41:05Z","logger":"controller_installation","msg":"nodes to migrate","Request.Namespace":"","Request.Name":"default","count":1}
{"level":"info","ts":"2024-02-15T12:41:05Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:06Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:07Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:08Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:09Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:10Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:11Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:12Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:13Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:41:14Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}

Then, once we add the projectcalico.org/operator-node-migration=migrated label to the node, the migration can continue:

➜ kubectl label node 10.171.199.85 projectcalico.org/operator-node-migration=migrated --overwrite
node/10.171.199.85 labeled

{"level":"info","ts":"2024-02-15T12:46:39Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:46:40Z","logger":"controller_installation","msg":"Migrated 1 out of 1 nodes","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:46:40Z","logger":"controller_installation","msg":"waiting for calico-node to 1 replicas, ready at 1, up-to-date at 1, available at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:46:45Z","logger":"controller_installation","msg":"waiting for calico-node to 1 replicas, ready at 1, up-to-date at 1, available at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:46:50Z","logger":"controller_installation","msg":"waiting for calico-node to 1 replicas, ready at 1, up-to-date at 1, available at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:46:52Z","logger":"controller_windows","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"periodic-5m0s-reconcile-event"}
{"level":"info","ts":"2024-02-15T12:46:55Z","logger":"controller_installation","msg":"waiting for calico-node to 1 replicas, ready at 1, up-to-date at 1, available at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:47:00Z","logger":"controller_installation","msg":"waiting for calico-node to 1 replicas, ready at 1, up-to-date at 1, available at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:47:05Z","logger":"controller_installation","msg":"waiting for calico-node to 1 replicas, ready at 1, up-to-date at 1, available at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T12:47:10Z","logger":"controller_installation","msg":"Namespace migration complete","Request.Namespace":"","Request.Name":"default"}

This behaviour can be triggered if minReadySeconds: 15 is set for typhaDeployment in the Installation CR.
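As a minimal illustration (field path per my reading of the operator.tigera.io/v1 typhaDeployment override, trimmed to the relevant setting), the trigger looks roughly like this in the Installation CR:

```yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  typhaDeployment:
    spec:
      # Reported above to trigger the stuck migration on a one-node cluster.
      minReadySeconds: 15
```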

Please note that sometimes this can happen faster:

{"level":"info","ts":"2024-02-15T13:51:18Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T13:51:23Z","logger":"controller_installation","msg":"waiting for kube-system/calico-node to have 1 replicas, currently at 0","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T13:51:28Z","logger":"controller_installation","msg":"All kube-system calico/node pods are now ready after nodeSelector update","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T13:51:28Z","logger":"controller_installation","msg":"Scaling kube-system/calico-typha deployment to 0 replicas to make room for migration","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T13:51:48Z","logger":"controller_installation","msg":"nodes to migrate","Request.Namespace":"","Request.Name":"default","count":1}
{"level":"info","ts":"2024-02-15T13:51:48Z","logger":"controller_installation","msg":"Max unavailble nodes calculation resolved to 0, defaulting back to 1 to allow upgrades to continue","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T13:51:49Z","logger":"controller_installation","msg":"Migrated 1 out of 1 nodes","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-02-15T13:51:49Z","logger":"controller_installation","msg":"waiting for calico-node to 1 replicas, ready at 0, up-to-date at 1, available at 0","Request.Namespace":"","Request.Name":"default"}

That's why this PR adds a simple piece of logic that only takes effect on single-node clusters (a sketch follows below).
The change is well tested: namespace migration is built into our CI process, so dozens of tests are executed.
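The idea is roughly the following (an illustrative sketch extending the earlier one under the same assumptions, not the actual diff; the function name is hypothetical): when the cluster has exactly one node, skip the ready-count gating so the single node can always be migrated.

```go
package migration

// nodeCanBeMigratedOneNodeAware extends the earlier sketch with the
// single-node special case described in this PR (illustrative only; the
// actual change to waitUntilNodeCanBeMigrated() may differ).
func nodeCanBeMigratedOneNodeAware(totalNodes, ksR, ksD, csR, csD, maxUnavailable int) bool {
	// On a one-node cluster the old kube-system calico-node pod can never
	// become ready again once typha has been scaled to 0, so waiting on the
	// usual ready-count math would deadlock; allow the migration instead.
	if totalNodes == 1 {
		return true
	}
	desired := ksD + csD
	ready := ksR + csR
	return desired-ready < maxUnavailable
}
```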

I would like to cherry-pick this change to 1.32 and 1.33 if possible. Thanks!

For PR author

  • Tests for change.
  • If changing pkg/apis/, run make gen-files
  • If changing versions, run make gen-versions

For PR reviewers

A note for code reviewers - all pull requests must have the following:

  • Milestone set according to targeted release.
  • Appropriate labels:
    • kind/bug if this is a bugfix.
    • kind/enhancement if this is a new feature.
    • enterprise if this PR applies to Calico Enterprise only.

@mihivagyok mihivagyok requested a review from a team as a code owner February 21, 2024 15:42
@marvin-tigera marvin-tigera added this to the v1.34.0 milestone Feb 21, 2024
@mihivagyok mihivagyok changed the title from "Namespace migrations - Fix potential namespace migration problem with one node cluster" to "Namespace migration - Fix potential namespace migration problem with one node cluster" Feb 21, 2024
@caseydavenport (Member) commented:

I think this checks out, but @tmjd knows the logic here a bit better than I do.

@caseydavenport (Member) commented:

/sem-approve

@mihivagyok (Contributor, Author) commented:

@tmjd Hi! Could you please take a look? Thank you!

@mihivagyok mihivagyok force-pushed the fix-migration-one-node-cluster branch from 104891e to 220227a on April 8, 2024 11:42
@mihivagyok mihivagyok force-pushed the fix-migration-one-node-cluster branch from 220227a to 5fe6538 on April 8, 2024 11:45
@mihivagyok mihivagyok requested a review from tmjd April 8, 2024 11:46
@tmjd (Member) commented Apr 8, 2024

/sem-approve

@tmjd (Member) left a review comment:

LGTM

@tmjd (Member) commented Apr 8, 2024

Thank you for this fix @mihivagyok
I'll be mentioning the nice work you've done on this and other PRs in the Calico community meeting on Wednesday April 10. You're welcome to join if you're available and interested.

@tmjd tmjd merged commit 788f56f into tigera:master Apr 8, 2024
5 checks passed