Postpone Deletion of a Persistent Volume Claim in case It Is Used by a Pod #1174

Conversation

@pospispa (Author)

Proposal for postponing deletion of a Persistent Volume Claim (PVC) in case it is used by a pod.

The implementation will fix issue kubernetes/kubernetes#45143.

/sig storage
/cc @kubernetes/sig-storage-proposals

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/design Categorizes issue or PR as related to design. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 11, 2017
- In case the PVC is used by a pod: sets condition Terminating to true for the PVC.

### Scheduler
Scheduler will check if a pod uses a PVC and if any of the PVCs has condition Terminating set to true. If so, scheduling of the pod fails with the error: "PVC (%pvcName) is in Terminating state".
Member

Can you just check if the PVC has the deletion timestamp set?

Member

anything that reacts to absence of a PV or PVC (scheduler, kubelet) should also react to the presence of a deletionTimestamp on a PV or PVC

Author

Let me first think aloud about the deletionTimestamp. The description of the deletionTimestamp states: "The resource will be deleted (no longer visible from resource lists, and not reachable by name) after the time in this field." So in case the deletionTimestamp is set for a PVC, the PVC must be deleted after the time in the deletionTimestamp. However, in case the PVC is used by a pod, it is not guaranteed that the pod won't be using the PVC after the time in the deletionTimestamp. That's why I prefer not to set the deletionTimestamp in PVCs, but to use the condition Terminating for storing the information that the PVC is going to be deleted.

On the other hand, the scheduler must be able to decide whether a PVC is going to be deleted or not in order to fail scheduling of a pod that uses a PVC that is going to be deleted.
So it would be better if the check for whether a PVC is going to be deleted were: deletionTimestampIsSet(PVC) || terminatingConditionIsTrue(PVC).
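To make the intended check concrete, here is a minimal Go sketch of the combined predicate; the helper name is illustrative, and the Terminating condition type is this proposal's suggestion, not an existing API constant:

```go
import (
	corev1 "k8s.io/api/core/v1"
)

// pvcIsBeingDeleted mirrors deletionTimestampIsSet(PVC) || terminatingConditionIsTrue(PVC).
func pvcIsBeingDeleted(pvc *corev1.PersistentVolumeClaim) bool {
	if pvc.DeletionTimestamp != nil {
		return true
	}
	for _, c := range pvc.Status.Conditions {
		// "Terminating" is the condition type proposed in this design, not an existing constant.
		if string(c.Type) == "Terminating" && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```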

@liggitt (Member), Oct 12, 2017

An object with a deletionTimestamp and non-empty finalizers blocks until finalizers are removed. The description of deletionTimestamp should be updated. The scheduler and kubelet should not schedule or mount based on PVs or PVCs with deletionTimestamps set

Author

I assume that after kubectl delete pvc pvc-name the PVC may or may not have the deletionTimestamp set. So it's not possible to rely solely on deletionTimestamp for finding out whether a PVC is being deleted or not. Am I right?

Member

I assume that after kubectl delete pvc pvc-name the PVC may or may not have the deletionTimestamp set

Not sure what you mean. An effective DELETE API call either sets the deletionTimestamp or completely removes the object from etcd.

Author

So in case of a PVC with a finalizer set, the DELETE API call will always set the deletionTimestamp, as the deletion has to be postponed because of the finalizer.
Thus, there can't be a PVC that is being deleted with deletionTimestamp == nil.
Therefore, there is no need to use the condition Terminating to mark a PVC that is being deleted.

When a PVC is created, a finalizer is added to its metadata.
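A minimal sketch of what adding the finalizer could look like; the finalizer name here is an assumption for illustration, as the proposal does not fix one at this point:

```go
// pvcFinalizerName is an illustrative name; the proposal does not fix one here.
const pvcFinalizerName = "kubernetes.io/pvc-protection"

// addPVCFinalizer adds the finalizer to a PVC's metadata if it is not already present.
func addPVCFinalizer(pvc *corev1.PersistentVolumeClaim) {
	for _, f := range pvc.Finalizers {
		if f == pvcFinalizerName {
			return // already present, nothing to do
		}
	}
	pvc.Finalizers = append(pvc.Finalizers, pvcFinalizerName)
}
```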

### Finalizer
Finalizer can list and watch pods. It caches the status of pods that use a PVC. It also watches for PVC deletes.
Member

Is this a new controller to handle PVC finalizer?

Author

Yes, it's a new internal controller.

Finalizer can list and watch pods. It caches the status of pods that use a PVC. It also watches for PVC deletes.

Finalizer reacts to PVC delete in one of the below ways:
- In case the PVC is not used by a pod: removes the finalizer information from the PVC metadata.
Member

Could you have a race condition between adding and deleting pods?

  1. Pod A is currently using PVC.
  2. Pod A is deleted.
  3. Soon after, Pod B is created and scheduled that uses PVC.
  4. PVC is deleted.
  5. Informer gets Pod delete for Pod A, sees no other Pod is using the PVC in its cache, and removes finalizer.
  6. Informer gets Pod add for Pod B, but finalizer is already deleted.

@liggitt (Member), Oct 11, 2017

Presence of a PV or PVC deletion timestamp should fail setting up a PVC volume for a pod. That would prevent new pods from starting once the PV or PVC had started deletion

Member

What if Pod B got setup with the volume before the informer receives the updates?

Author

I agree with @msau42 that the informer may get the PVC delete before Pod B's setup, so in case the finalizer is removed immediately after the PVC delete there is a chance of a race condition.

Currently, I can't think of another solution for the race condition than to postpone the finalizer removal from the PVC for at least deletionPostponePeriod.

Contributor

That's not really safe either. The controller could be down.

Isn't this best effort? How strong of a guarantee is required? The description above implies a strong guarantee isn't required, and is vague about what the consequences are.

Author

That's not really safe either.

I agree.

The controller could be down.

In case the controller is down, the finalizer is not being removed from PVCs, so PVCs are not being deleted and PVCs marked for deletion pile up.

In addition, the controller is not setting the finalizer for newly created PVCs.

Isn't this best effort? How strong of a guarantee is required? The description above implies a strong guarantee isn't required, and is vague about what the consequences are.

The worst possible consequence is data loss, in scenarios that always depend on the storage type being used:

  • in case dynamic provisioning is used and the reclaim policy is Delete, the PVC deletion triggers deletion of the associated storage asset and PV.
  • the same applies for static provisioning with the Delete reclaim policy.

So, IMHO, it makes sense to try to find a solution for the above race condition.
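For context, whether deleting a PVC can cascade to the underlying storage asset is determined by the bound PV's reclaim policy; a hedged one-liner of that check:

```go
// deletionCanLoseData reports whether deleting the PVC can cascade to the
// storage asset: this is the case when the bound PV's reclaim policy is Delete.
func deletionCanLoseData(pv *corev1.PersistentVolume) bool {
	return pv.Spec.PersistentVolumeReclaimPolicy == corev1.PersistentVolumeReclaimDelete
}
```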

k8s-github-robot pushed a commit that referenced this pull request Oct 12, 2017
…finalizer-info

Automatic merge from submit-queue.

Update deletionTimestamp with information about finalizer effect.

Updating description of `deletionTimestamp` as advised [here](#1174 (comment)).
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 13, 2017
@pospispa (Author)

I addressed some of the comments in the update.
I added an alternative solution: "Scheduler In the Role of PVC Deletion Controller". I'd like to get feedback on whether the alternative has fewer disadvantages than the proposed implementation.


In case a pod is being scheduled and is using PVCs that do not have the condition PVCUsedByPod set, the scheduler will set this condition for these PVCs.

In case a pod is terminated and was using PVCs, the scheduler will update the PVCUsedByPod condition for these PVCs accordingly.
Member

The scheduler does not handle pod termination today.

Author

So it would mean extending the scheduler to handle pod termination. I assume that makes this a no-go alternative.


Note: the scheduler is the source of truth for pods that are being started. The information on active pods may be slightly outdated, which can cause the deletion of a PVC to be postponed (the pod status in the scheduler is still active while the pod is already terminated in the API server), but this does not cause any harm.

The disadvantage is that the scheduler would become responsible for postponing PVC deletion, which would make the scheduler bigger.
Member

Yeah I don't think it's in the scheduler's scope to handle PVC deletion.

Author

I agree. So this is a no-go alternative.

2. Pod A is deleted.
3. Soon after, Pod B is created and scheduled that uses PVC.
4. PVC is deleted.
5. PVC deletion controller gets Pod delete for Pod A, sees no other Pod is using the PVC in its cache, and removes finalizer.
Member

Could the race be solved if instead of removing the finalizer immediately, you resync all the Pods first?

@pospispa (Author), Oct 17, 2017

Having the PVC deletion controller resync all the pods before removing the finalizer is not enough. There must also be a PVC resync done in the scheduler or in kubelet. IMHO, these two resyncs will solve the race. I'll update the proposal accordingly.

@liggitt (Member), Oct 17, 2017

There must be a PVC resync done in scheduler or in kubelet. IMHO, these two resyncs will solve the race. I'll update the proposal accordingly.

Not sure what you mean by resync in the kubelet... the kubelet does a live lookup of the PVC when setting up the pod's volume. If the PVC has a deletionTimestamp, the kubelet should not continue setting up the volume.

Author

Not sure what you mean by resync in the kubelet... the kubelet does a live lookup of the PVC when setting up the pod's volume. If the PVC has a deletionTimestamp, the kubelet should not continue setting up the volume.

I didn't know that kubelet does a live lookup of the PVC(s) used by the pod so I wanted to add it.

If the PVC has a deletionTimestamp, kubelet must not start the pod.
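A minimal sketch of that kubelet-side guard, assuming a client-go clientset of that era; the real volume-manager code path is more involved, so this only illustrates the live lookup plus the deletionTimestamp check:

```go
import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// checkPVCUsable does a live lookup of the PVC and refuses it when a
// deletionTimestamp is present, following the same path as a missing PVC.
func checkPVCUsable(client kubernetes.Interface, namespace, name string) error {
	pvc, err := client.CoreV1().PersistentVolumeClaims(namespace).Get(name, metav1.GetOptions{})
	if err != nil {
		return err // covers the "PVC not found" case
	}
	if pvc.DeletionTimestamp != nil {
		return fmt.Errorf("can't use PVC %s/%s: it is being deleted", namespace, name)
	}
	return nil
}
```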


## Implementation

### API Sever, PVC Admission Controller, PVC Create

Typo : Sever -> Server

Author

Thanks for spotting it.

@pospispa (Author)

@childsb @msau42 I believe I've addressed all essential comments including the race. Please, would you review the design proposal once again and let me know what has to be done to have the design proposal merged?

@ianchakeres (Contributor)

Should we consider using the same process for postponing deletion of PVs that are actively used by pods?

@ianchakeres (Contributor)

Should we consider having the PVC Deletion Controller add a finalizer to pods that have PV/PVCs? These finalizers would be used to indicate to the controller that the related PV/PVCs are in use by the related pod. The controller would then be called on pod deletion, and the finalizer would be removed.

### Kubelet
Kubelet currently does a live lookup of the PVC(s) that are used by a pod.

In case any of the PVC(s) used by the pod has the `deletionTimestamp` set, kubelet won't start the pod but will report an error: "can't start pod (%pod) because it's using PVC (%pvcName) that is being deleted"
Member

Can log a specific message, but should follow the same code path as if the PVC was not found (fails in volume mount creation stage, retries N times, etc)


In case any of the PVC(s) used by the pod has the `deletionTimestamp` set, kubelet won't start the pod but will report an error: "can't start pod (%pod) because it's using PVC (%pvcName) that is being deleted"

Note: as soon as all the PVC(s) with `deletionTimestamp` set are deleted (in case none of the PVCs has finalizer information set), the pod that now uses non-existent PVC(s) will be started and will be in pending state (waiting for the non-existent PVC(s) to be created). Otherwise, in case at least one PVC has finalizer information set, i.e. the PVC deletion controller is waiting for the scheduled pod to be terminated, there will be a deadlock: kubelet waits for the PVC deletion while the PVC deletion controller waits for pod termination. The deadlock can be resolved only by deleting the scheduled pod.
Member

Kubelets do not retry volume mounts forever. When it fails, the pod should no longer block deletion of the PVC

### PVC Deletion Controller
PVC deletion controller is a new internal controller.

PVC deletion controller can list and watch pods. It caches the status of pods that use a PVC. It also watches for PVC deletes.
@liggitt (Member), Oct 22, 2017

Typically, the pattern used here is informers on both resources feeding a single workqueue. In this case, the controller should watch both pods and PVCs:

For PVC add/update/delete events:

  • if deletionTimestamp is nil and finalizer is missing, add to queue
  • if deletionTimestamp is non-nil and finalizer is present and first, add to queue

For pod add events:

  • if pod is terminated, add all referenced PVCs to queue

For pod update events:

  • if pod is changing from non-terminated to terminated, add all referenced PVCs to queue

For pod delete events:

  • add all referenced PVCs to queue

When processing the PVC queue:

  1. If PVC is not found in cache, skip
  2. If PVC is in cache with nil deletionTimestamp and missing finalizer, attempt to add finalizer, requeue on error
  3. If PVC is in cache with non-nil deletionTimestamp and present finalizer at first position, do live pod list in namespace, and if all pods referencing the PVC are not yet bound to a node or are terminated, attempt to remove the finalizer, requeue on error
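A condensed Go sketch of the two-informer/one-workqueue wiring just described; hasFinalizer and podIsTerminated are illustrative helpers, and pvcFinalizerName is from the earlier sketch:

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

func registerHandlers(pvcInformer, podInformer cache.SharedIndexInformer, queue workqueue.RateLimitingInterface) {
	// PVC events: enqueue only when the finalizer needs adding or removing.
	enqueuePVC := func(obj interface{}) {
		pvc, ok := obj.(*corev1.PersistentVolumeClaim)
		if !ok {
			return
		}
		needsAdd := pvc.DeletionTimestamp == nil && !hasFinalizer(pvc)
		needsRemove := pvc.DeletionTimestamp != nil && hasFinalizer(pvc)
		if needsAdd || needsRemove {
			if key, err := cache.MetaNamespaceKeyFunc(pvc); err == nil {
				queue.Add(key)
			}
		}
	}
	pvcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    enqueuePVC,
		UpdateFunc: func(_, newObj interface{}) { enqueuePVC(newObj) },
		DeleteFunc: enqueuePVC,
	})

	// Pod events: enqueue all PVCs referenced by terminated or deleted pods.
	enqueuePodPVCs := func(pod *corev1.Pod) {
		for _, v := range pod.Spec.Volumes {
			if v.PersistentVolumeClaim != nil {
				queue.Add(pod.Namespace + "/" + v.PersistentVolumeClaim.ClaimName)
			}
		}
	}
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if pod, ok := obj.(*corev1.Pod); ok && podIsTerminated(pod) {
				enqueuePodPVCs(pod)
			}
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldPod, ok1 := oldObj.(*corev1.Pod)
			newPod, ok2 := newObj.(*corev1.Pod)
			if ok1 && ok2 && !podIsTerminated(oldPod) && podIsTerminated(newPod) {
				enqueuePodPVCs(newPod)
			}
		},
		DeleteFunc: func(obj interface{}) {
			if pod, ok := obj.(*corev1.Pod); ok {
				enqueuePodPVCs(pod)
			}
		},
	})
}

// podIsTerminated treats Succeeded and Failed pods as terminated.
func podIsTerminated(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
}

// hasFinalizer reports whether the PVC carries the protection finalizer.
func hasFinalizer(pvc *corev1.PersistentVolumeClaim) bool {
	for _, f := range pvc.Finalizers {
		if f == pvcFinalizerName {
			return true
		}
	}
	return false
}
```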

Member

That removes the need to track PVCs in use, removes the need for special logic at startup, unifies retry cases in case of conflicts on PVC updates (or other errors)

Member

See https://github.com/kubernetes/community/blob/master/contributors/devel/controllers.md for more details about the queuing approach, or https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/serviceaccount/serviceaccounts_controller.go for an example of a controller feeding a single workqueue from informers on two separate resources

@liggitt (Member), Oct 22, 2017

Should we consider having the PVC Deletion Controller add a finalizer to pods that have PV/PVCs?

The PVC is the object we want to prevent getting deleted prematurely, so that is where the finalizer should go, not on the pod.

A new plugin for the PVC admission controller will be created. The plugin will automatically add finalizer information into the metadata of newly created PVCs.

### Scheduler
Scheduler will check if a pod uses a PVC and if any of the PVCs has `deletionTimestamp` set. If so, scheduling of the pod fails with the error: "PVC (%pvcName) is scheduled for deletion".
Member

this should just involve an additional check for a non-nil deletionTimestamp on the PVC, and following the same code path as if the PVC was not found
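A minimal sketch of that additional scheduler check; the surrounding predicate plumbing is elided:

```go
// checkPVCsNotBeingDeleted fails scheduling when any PVC used by the pod
// has a deletionTimestamp, treating it the same as a missing PVC.
func checkPVCsNotBeingDeleted(pvcs []*corev1.PersistentVolumeClaim) error {
	for _, pvc := range pvcs {
		if pvc.DeletionTimestamp != nil {
			return fmt.Errorf("persistentvolumeclaim %q is being deleted", pvc.Name)
		}
	}
	return nil
}
```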

@ianchakeres (Contributor)

@liggitt - Just clarifying... My comment about adding a finalizer on a pod would be in addition to a finalizer on the PVC.

@liggitt (Member), Oct 22, 2017

@liggitt - Just clarifying... My comment about adding a finalizer on a pod would be in addition to a finalizer on the PVC.

The additional finalizer doesn't get you anything. We don't actually care about preventing deletion of pods using PVCs, so there's no reason to block it with an additional finalizer on the Pod object.

@ianchakeres (Contributor)

A pod PVC delete finalizer could be used to handle a pod with a PVC/PV being deleted, instead of watching all pods for deletion. This finalizer could reduce the number of events that need to be handled by the controller, to only those pod events that could be related to PVC deletion.

@liggitt (Member), Oct 22, 2017

A pod PVC delete finalizer could be used to handle a pod with a PVC/PV being deleted, instead of watching all pods for deletion.

A finalizing controller for pods referencing PVCs would still have to watch all pods (there's no filter to say "just watch pods containing finalizer X")

This finalizer could reduce the number of events that need to be handled by the controller, to only those pod events that could be related to PVC deletion.

As described above, that can be determined by inspection of the pod spec on deletion/termination, without blocking the pod delete, requiring an additional update to remove the finalizer from the pod, and increasing latency of pod cleanup.

The finalizer on the PVC is what we want.

@pospispa (Author)

@ianchakeres

Should we consider using the same process for postponing deletion of PVs that are actively used by pods?

I prefer to approach the PVC and PV deletion problems per partes.
I prefer to first design and implement a solution only for PVC deletion, because I consider the PVC deletion a security bug and because it affects users. The PV deletion bug affects only admins, and admins should avoid deleting bound PVs.

However, I expect that the solution for the PV deletion problem will capitalize on the experience with the PVC deletion bug solution and will probably be similar to it.

Should we consider having the PVC Deletion Controller add a finalizer to pods that have PV/PVCs?

In case a finalizer is present in an object, it means the object's deletion is postponed until the finalizer is removed from the object by a controller.

I agree with @liggitt that postponing pod deletion (adding a finalizer to a pod that has a PV/PVC) doesn't bring any benefit, as the pod can be deleted immediately and the PVC deletion controller will be informed about the pod deletion anyway.

@pospispa (Author)

@liggitt thank you very much for your comments and proposals. I basically updated the design according to your proposals.

PVC deletion controller watches for both PVC and pod events that are processed as described below:
1. PVC add/update/delete events:
- If `deletionTimestamp` is `nil` and finalizer is missing, the PVC is added to PVC queue.
- If `deletionTimestamp` is `non-nil` and finalizer is present and first, the PVC is added to PVC queue.
Member

Why does the finalizer position matter?

Member

you're right, I was thinking finalizers were handled in order, but that is not the case. @pospispa you can strike " and first"

In case any of the PVC(s) used by the pod has the `deletionTimestamp` set, kubelet won't start the pod but will report an error: "can't start pod (%pod) because it's using PVC (%pvcName) that is being deleted". Kubelet will follow the same code path as if the PVC(s) did not exist.

### PVC Deletion Controller
PVC deletion controller is a new internal controller.
Member

s/deletion/finalizing/ might make the intent clearer

4. Pod delete events:
- All referenced PVCs are added to PVC queue.

PVC and pod information are kept in a cache.
Member

this is done inherently when using an informer (the cache should be the one feeding the informer)

The PVC queue holds PVCs that need to be processed according to the below rules:
- If PVC is not found in cache, the PVC is skipped.
- If PVC is in cache with `nil` `deletionTimestamp` and missing finalizer, finalizer is added to the PVC. In case the adding finalizer operation fails, the PVC is re-queued into the PVC queue.
- If PVC is in cache with `non-nil` `deletionTimestamp` and present finalizer at first position, live pod list is done for the PVC namespace. If all pods referencing the PVC are not yet bound to a node or are terminated, the finalizer removal is attempted. In case the finalizer removal operation fails the PVC is re-queued.
Member

strike " at first position", finalizers are processed in any order

@pospispa (Author)

@liggitt and @jsafrane thank you for your comments. I updated the design proposal accordingly.

@jsafrane (Member)

Looks good enough to me. @msau42 @childsb, WDYT?

@msau42 (Member) left a comment

lgtm with a nit

PVC finalizing controller won't watch pods, because the information on whether a PVC is used by a pod or not is now maintained by the scheduler.
PVC finalizing controller will remove the finalizer information from a PVC in case the PVC has `deletionTimestamp` set and does not have the PVCUsedByPod condition set to true.

In this case the known race condition discussed in the implementation section won't happen on two different objects (PVC and pod), but on the same object, i.e. the PVC. Therefore, the probability of the race condition will be smaller.
Member

This race is no longer discussed :-)

Author

Removed the information about the race.

@pospispa (Author)

Should I already squash all the commits?

@msau42 (Member), Oct 26, 2017

yes thanks

…a Pod

Proposal for postponing deletion of Persistent Volume Claim in case it's used by a pod.

It will fix issue kubernetes/kubernetes#45143
@pospispa pospispa force-pushed the postpone-pvc-deletion-if-used-in-a-pod branch from 76d182c to b378919 Compare October 26, 2017 18:12
@pospispa (Author)

Done.

@msau42 (Member), Oct 26, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 26, 2017
@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue.
