
update the capacity to zero on shutdown/reset #502

Open · wants to merge 1 commit into base: master

Conversation

@SchSeba (Collaborator) commented Oct 2, 2023

When the device plugin is restarted, kubelet marks the resource as unhealthy but still reports the resource as existing for a grace period (5 minutes). If a pod is scheduled before the device plugin comes back up, pod creation fails without a retry loop, with the error message "Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices <DEVICE_NAME>", which is unexpected.

This commit allows the device plugin to send an empty list of devices before a reset or shutdown.
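
For illustration, a minimal sketch of the idea (not the exact diff in this PR): on termination, ListAndWatch() sends one final response with an empty device list so kubelet drops the advertised capacity to zero before the plugin goes away. The updateSignal channel and devices() helper are assumptions made for this sketch; termSignal and terminatedSignal mirror the names in this PR.

```go
// pluginapi is k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1.
// Sketch only: a simplified ListAndWatch loop that reports an empty device list
// right before terminating, so kubelet drops the advertised capacity to zero.
func (rs *resourceServer) ListAndWatch(_ *pluginapi.Empty, stream pluginapi.DevicePlugin_ListAndWatchServer) error {
	for {
		select {
		case <-rs.updateSignal: // hypothetical "devices changed" trigger, not a field from this PR
			if err := stream.Send(&pluginapi.ListAndWatchResponse{Devices: rs.devices()}); err != nil {
				return err
			}
		case <-rs.termSignal:
			// Report zero devices before the shutdown/restart completes.
			if err := stream.Send(&pluginapi.ListAndWatchResponse{Devices: []*pluginapi.Device{}}); err != nil {
				return err
			}
			rs.terminatedSignal <- true // unblock Stop()/restart()
			return nil
		}
	}
}
```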

@@ -309,6 +322,14 @@ func (rs *resourceServer) restart() error {
// Send terminate signal to ListAndWatch()
rs.termSignal <- true

// wait for the terminated signal or 5 second
@zeeke (Member) commented Oct 2, 2023

In resourceServer.Stop() we do the termSignal-terminatedSignal handshake before stopping the grpcServer. In restart(), we do it after rs.grpcServer.Stop(). Is it intentional?

It's not strictly related to this PR, but with these new changes I think it can raise an error on L189:
stream.Send(resp)
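
A minimal sketch of the two orderings being discussed, using the channel names from this PR; the function bodies are simplified illustrations, not the actual server.go code:

```go
// Stop(): handshake first, then stop the gRPC server (sketch).
func (rs *resourceServer) Stop() error {
	rs.termSignal <- true // ask ListAndWatch() to exit
	<-rs.terminatedSignal // wait for its final (empty) device-list send to complete
	rs.grpcServer.Stop()  // only now tear down the stream
	return nil
}

// restart(): gRPC server stopped first, handshake after (sketch).
// If the stream is already closed, the final stream.Send(resp) in ListAndWatch() can fail.
func (rs *resourceServer) restart() error {
	rs.grpcServer.Stop()
	rs.termSignal <- true
	<-rs.terminatedSignal
	// ... re-create the gRPC server and start serving again
	return nil
}
```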

@SchSeba (Collaborator, Author)

You are right! Sorry.

@SchSeba (Collaborator, Author) commented Oct 2, 2023

@zeeke please give it another look :)

@adrianchiris (Contributor)

Hmm, I'm not sure that's what kubelet expects (being suddenly told there are no devices when the device plugin shuts down or restarts).

Will it remove entries from the checkpoint file?
What if there are pods already using resources?

@@ -41,6 +41,7 @@ type resourceServer struct {
resourceNamePrefix string
grpcServer *grpc.Server
termSignal chan bool
terminatedSignal chan bool
@zeeke (Member)

nit: termSignal and terminatedSignal relate only to the ListAndWatch function; they don't signal that the resourceServer struct itself should terminate or has terminated.

Renaming those two channels to listAndWatchStopSignal and listAndWatchFinishedSignal (or something similar) might improve the readability of this file a little.
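
For illustration, the suggested rename would look roughly like this (a sketch of the affected struct fields only, not the full type):

```go
// Sketch of the suggested rename: the channel names make it explicit that
// they only coordinate the ListAndWatch() goroutine, not the whole server.
type resourceServer struct {
	resourceNamePrefix         string
	grpcServer                 *grpc.Server
	listAndWatchStopSignal     chan bool // was: termSignal
	listAndWatchFinishedSignal chan bool // was: terminatedSignal
	// ... other fields unchanged
}
```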

@SchSeba (Collaborator, Author) commented Oct 21, 2023

> Hmm, I'm not sure that's what kubelet expects (being suddenly told there are no devices when the device plugin shuts down or restarts).
> Will it remove entries from the checkpoint file?
> What if there are pods already using resources?

From my tests it doesn't remove the checkpoint file, and running pods continue to run (I will try to leave the plugin down for a longer period to be sure there is no reconcile or anything else that will kill the running pods).

Without this change we have an issue: when we take down the device plugin and a pod is allocated to that node, the pod will not be able to start.

When the device plugin is restarted, kubelet marks the resource as unhealthy but still reports the resource as existing for a grace period (5 minutes). If a pod is scheduled before the device plugin comes back up, pod creation fails without a retry loop, with the error message "Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices <DEVICE_NAME>", which is unexpected.

This commit allows the device plugin to send an empty list of devices before a reset or shutdown.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
@SchSeba (Collaborator, Author) commented Oct 21, 2023

Related to openshift/sriov-network-operator#812

@adrianchiris (Contributor) commented Nov 23, 2023

@SchSeba I've been digging a bit into the kubelet code.

kubelet will report updated node status to the API server every 10 seconds [1] (every NodeStatusUpdateFrequency [2]).

Once the device plugin exits (its endpoint is no longer valid), all of its devices are deemed unhealthy [3].

[1] https://github.com/kubernetes/kubernetes/blob/d61cbac69aae97db1839bd2e0e86d68f26b353a7/pkg/kubelet/kubelet.go#L1637
[2] https://github.com/kubernetes/kubernetes/blob/d61cbac69aae97db1839bd2e0e86d68f26b353a7/staging/src/k8s.io/kubelet/config/v1beta1/types.go#L267
[3] https://github.com/kubernetes/kubernetes/blob/d61cbac69aae97db1839bd2e0e86d68f26b353a7/pkg/kubelet/nodestatus/setters.go#L342

So after at most 10 seconds the node will report zero allocatable resources if the plugin has exited.
If any pod is scheduled to the node before that, it will hit an admission error.
Setting devices explicitly as unhealthy will not help, as unhealthy devices are only reported to the kube API during that sync loop.

What I suggest is: in sriov-network-operator, after we remove the device plugin, spin up a goroutine that keeps deleting pods in admission error if they consume device plugin resources, until the new device plugin is up. Alternatively (an even better option IMO), once the device plugin is up again, clean up any pods in admission error that consume device plugin resources, once.

LMK what you think.
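
A minimal sketch of the second option, assuming client-go; cleanupAdmissionErrorPods and podRequestsResource are hypothetical helpers invented for illustration, not existing sriov-network-operator code:

```go
package cleanup

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupAdmissionErrorPods deletes pods on the given node that failed admission
// and request the device plugin resource, so their controllers
// (Deployment/ReplicaSet) recreate them now that the plugin is back.
func cleanupAdmissionErrorPods(ctx context.Context, c kubernetes.Interface, nodeName, resourceName string) error {
	pods, err := c.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		// Pods rejected at admission end up Failed with Reason "UnexpectedAdmissionError".
		if pod.Status.Phase != corev1.PodFailed || pod.Status.Reason != "UnexpectedAdmissionError" {
			continue
		}
		if !podRequestsResource(pod, resourceName) {
			continue
		}
		if err := c.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}

// podRequestsResource reports whether any container in the pod requests the given
// extended resource (e.g. a device plugin resource such as "vendor.example/sriov_netdevice").
func podRequestsResource(pod *corev1.Pod, resourceName string) bool {
	for _, ctr := range pod.Spec.Containers {
		if _, ok := ctr.Resources.Requests[corev1.ResourceName(resourceName)]; ok {
			return true
		}
	}
	return false
}
```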

@SchSeba (Collaborator, Author) commented Dec 5, 2023

Interesting. I will try to implement a POC in the operator, but before that, a question: do you see any issue with changing the number to 0 when we reboot? Can that affect something?

@adrianchiris (Contributor)

  1. I do not know what the original authors of the API (gRPC) intended.
  2. Technically, we don't achieve more reliability by doing it, IMO.
  3. When the device plugin exits, kubelet will mark the reported resources as invalid (internally) IIRC, which is the same outcome as explicitly reporting zero devices.
    In both cases it only affects the scheduler when the node status is updated in the apiserver.

@zeeke (Member) commented Dec 6, 2023

IIUC, if the device plugin exits and leaves the resources unhealthy (for a while, I guess), any newly deployed Pod will get an admission error. If the Pod comes from a Deployment or a ReplicaSet, the kube controller spawns a lot of subsequent Pods, creating a bit of junk:

NAME                                                 READY   STATUS                        RESTARTS   AGE    IP               NODE       NOMINATED NODE   READINESS GATES
deploy1-5686945dcc-2x9j5                              0/6     UnexpectedAdmissionError      0          43h    <none>           worker-2   <none>           <none>
deploy1-5686945dcc-4bdpz                              0/6     UnexpectedAdmissionError      0          43h    <none>           worker-2   <none>           <none>
deploy1-5686945dcc-4ht8f                              0/6     Init:ContainerStatusUnknown   0          43h    <none>           worker-2   <none>           <none>
deploy1-5686945dcc-4xv28                              0/6     UnexpectedAdmissionError      0          43h    <none>           worker-2   <none>           <none>

Setting the device count to 0 will prevent the scheduler from selecting the node, making it retry the same Pod in an exponential backoff fashion.

@coveralls (Collaborator) commented May 24, 2024

Pull Request Test Coverage Report for Build 6377485447

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 31 of 39 (79.49%) changed or added relevant lines in 1 file are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.2%) to 75.876%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
pkg/resources/server.go | 31 | 39 | 79.49%

Files with Coverage Reduction | New Missed Lines | %
pkg/resources/server.go | 2 | 78.52%
Totals Coverage Status
Change from base Build 6371847942: -0.2%
Covered Lines: 2013
Relevant Lines: 2653

💛 - Coveralls

@coveralls (Collaborator)

Pull Request Test Coverage Report for Build 6377734797

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 31 of 40 (77.5%) changed or added relevant lines in 1 file are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.2%) to 75.867%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
pkg/resources/server.go | 31 | 40 | 77.5%

Files with Coverage Reduction | New Missed Lines | %
pkg/resources/server.go | 1 | 78.45%
Totals Coverage Status
Change from base Build 6371847942: -0.2%
Covered Lines: 2012
Relevant Lines: 2652

💛 - Coveralls

@coveralls (Collaborator)

Pull Request Test Coverage Report for Build 6377419645

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 31 of 41 (75.61%) changed or added relevant lines in 1 file are covered.
  • 19 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.2%) to 75.876%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
pkg/resources/server.go | 31 | 41 | 75.61%

Files with Coverage Reduction | New Missed Lines | %
pkg/resources/server.go | 19 | 78.52%
Totals Coverage Status
Change from base Build 6371847942: -0.2%
Covered Lines: 2013
Relevant Lines: 2653

💛 - Coveralls

