Provisioner does not allow rescheduling if a Node is deleted after a pod is scheduled #121
@k8s-triage-robot: The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
@k8s-triage-robot: The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
@k8s-triage-robot: The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
/remove-lifecycle rotten
/reopen
@amacaskill: Reopened this issue.
Repro using VolumeSnapshot to delay provisioning: https://gist.github.com/pwschuurman/fd9c8c50889ce2382bcdca259c51d3e4
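For orientation, a minimal client-go sketch of the race being reproduced (the namespace and PVC name are placeholders; the gist's VolumeSnapshot dataSource just makes provisioning slow enough to hit the window):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// The scheduler has already stamped the PVC (placeholder name) with its chosen node.
	pvc, err := client.CoreV1().PersistentVolumeClaims("default").Get(ctx, "repro-pvc", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	node := pvc.Annotations["volume.kubernetes.io/selected-node"]

	// Delete that node before external-provisioner finishes provisioning.
	if err := client.CoreV1().Nodes().Delete(ctx, node, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	// The PVC and pod now stay Pending: the provisioner keeps returning
	// ProvisioningNoChange and the annotation is never removed.
	fmt.Println("deleted node:", node)
}
```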
Some ideas on how to handle this:
@k8s-triage-robot: The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
I think this is the same issue as kubernetes/kubernetes#100485.
Another option we discussed is to remove the annotation when the provisioner tries to access a Node that doesn't exist, detected via `errors.NewNotFound`.
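A minimal sketch of how that option could look against the sig-storage-lib-external-provisioner `Provisioner` interface. The wrapper type and elided plumbing here are hypothetical; `ProvisioningState`, `ProvisionOptions.SelectedNode`, and the return-value behavior come from the library, though the import version may differ:

```go
package provisioner

import (
	"context"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"sigs.k8s.io/sig-storage-lib-external-provisioner/v8/controller"
)

// nodeAwareProvisioner is a hypothetical wrapper around a real CSI provisioner.
type nodeAwareProvisioner struct {
	client kubernetes.Interface
	inner  controller.Provisioner
}

func (p *nodeAwareProvisioner) Provision(ctx context.Context, opts controller.ProvisionOptions) (*v1.PersistentVolume, controller.ProvisioningState, error) {
	if opts.SelectedNode != nil {
		_, err := p.client.CoreV1().Nodes().Get(ctx, opts.SelectedNode.Name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			// The selected node is gone for good. Reporting ProvisioningFinished
			// (instead of ProvisioningNoChange) lets the library treat the error
			// as final, strip the volume.kubernetes.io/selected-node annotation,
			// and hand the pod back to the scheduler.
			return nil, controller.ProvisioningFinished, err
		}
	}
	return p.inner.Provision(ctx, opts)
}

func (p *nodeAwareProvisioner) Delete(ctx context.Context, pv *v1.PersistentVolume) error {
	return p.inner.Delete(ctx, pv)
}
```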
/assign @sunnylovestiramisu
Reproduced the error by the following steps:
Manually tested with the fix #139.
We should cherry-pick this to external-provisioner 3.2, 3.3, and 3.4.
/close
@sunnylovestiramisu: Closing this issue.
If a Node is deleted after a Pod has been scheduled onto it, but before the Pod's claim is provisioned, the Pod can become stuck in a Pending state indefinitely.

Typically, when a provisioning failure occurs, the provisioner relinquishes control back to the Scheduler so the Pod can be rescheduled somewhere else. It does this by removing the `volume.kubernetes.io/selected-node` annotation from the PVC; the controller returns `ProvisioningFinished` in `provisionClaimOperation`. This can happen when storage cannot be scheduled on the selected node: https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/master/controller/controller.go#L1420

However, if a Node becomes unavailable after it has been selected by the Scheduler, the provisioner does not remove the annotation, because it returns `ProvisioningNoChange` in `provisionClaimOperation`. That is potentially useful where there is eventual consistency and a selected Node may still become available. But when the Node has been deleted, the condition is unrecoverable and requires the user to intervene: add the exact Node back (infeasible for dynamically provisioned node names), delete and re-create the Pod and let the Scheduler reschedule it, or manually remove the `selected-node` annotation from the PVC.
annotation on the PVC).The text was updated successfully, but these errors were encountered: