Fixing yet another race condition during cluster/kernel upgrade #917

Conversation

yevgeny-shnaidman
Contributor

The following scenario causes worker pods to be stuck during cluster upgrade:

  1. kernel module is loaded into the node. NMC contains both spec and status using the current kernel version
  2. cluster upgrade starts. As part of the upgrade the node becomes Unschedulable
  3. Module-NMC removes the Spec from NMC, since the node is Unschedulable
  4. NMC controller creates an unload worker pod, since the timestamp of the node's Ready condition has not yet been updated to a new value (the node is still not ready); see the sketch after this list
  5. once the node becomes ready, the unload worker pod is scheduled with the old kernel version and therefore constantly fails
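A purely hypothetical sketch of the kind of Ready-condition timestamp check behind step 4 (the actual NMC controller code may differ): while the node is still rebooting, the Ready condition still carries its pre-upgrade transition time, so the controller concludes the old kernel is still current and goes ahead with the unload worker pod.

```go
// Hypothetical illustration only; nodeBecameReadyAfter is an assumed helper,
// not a function from this repository.
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodeBecameReadyAfter reports whether the node transitioned to Ready after
// the given reference time. During the upgrade window the Ready condition
// keeps its pre-reboot LastTransitionTime, so a check like this returns false
// and the controller still believes the old kernel version is running.
func nodeBecameReadyAfter(node *corev1.Node, ref metav1.Time) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue && cond.LastTransitionTime.After(ref.Time)
		}
	}
	return false
}
```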

Solution:
We will skip handling specs, statuses, garbage collection and labelling for nodes that are not ready (unschedulable). We will only collect the completed worker pods and update NMC statuses accordingly. Once the node becomes ready, the next reconciliation loop will kick in.
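A minimal sketch of this gating, assuming a controller-runtime based reconciler; the helper names processCompletedWorkerPods and handleSpecsStatusesGCAndLabels are hypothetical stand-ins for the project's actual logic.

```go
// Sketch only: illustrates the node-readiness gating described above,
// not the code merged in this PR.
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type NMCReconciler struct {
	client.Client
}

// nodeIsReadyAndSchedulable returns true only when the node reports Ready and
// is not cordoned, i.e. it is safe to act on specs with the current kernel.
func nodeIsReadyAndSchedulable(node *corev1.Node) bool {
	if node.Spec.Unschedulable {
		return false
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

func (r *NMCReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var node corev1.Node
	if err := r.Get(ctx, req.NamespacedName, &node); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Always harvest completed worker pods and reflect them in NMC statuses,
	// even while the node is being upgraded.
	if err := r.processCompletedWorkerPods(ctx, &node); err != nil {
		return ctrl.Result{}, err
	}

	// While the node is not ready / unschedulable, skip spec handling, garbage
	// collection and labelling; the reconciliation triggered once the node
	// becomes ready again handles them against the new kernel version.
	if !nodeIsReadyAndSchedulable(&node) {
		return ctrl.Result{}, nil
	}

	return ctrl.Result{}, r.handleSpecsStatusesGCAndLabels(ctx, &node)
}

// Stubs standing in for the real logic, included only so the sketch compiles.
func (r *NMCReconciler) processCompletedWorkerPods(ctx context.Context, node *corev1.Node) error {
	return nil
}

func (r *NMCReconciler) handleSpecsStatusesGCAndLabels(ctx context.Context, node *corev1.Node) error {
	return nil
}
```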

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 7, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yevgeny-shnaidman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 7, 2024
@yevgeny-shnaidman
Contributor Author

/assign @ybettan

@ybettan
Contributor

ybettan commented Oct 7, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 7, 2024
@k8s-ci-robot k8s-ci-robot merged commit 4a5518e into kubernetes-sigs:release-2.2 Oct 7, 2024
14 of 15 checks passed