
Support automated failover for StatefulSets (or workloads that do not delete their PVCs) #558

Open
ShyamsundarR opened this issue Sep 28, 2022 · 5 comments

@ShyamsundarR
Member

StatefulSets (STS) by default do not delete their PVCs. This is being fixed upstream in Kubernetes via a KEP that will potentially graduate to beta in an upcoming release (https://github.com/kubernetes/kubernetes/pull/111300).
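
For reference, the KEP adds a `persistentVolumeClaimRetentionPolicy` field to the StatefulSet spec. A minimal sketch of what that looks like through the Go API, assuming a k8s.io/api release that already carries the field (older releases do not have it, and the `StatefulSetAutoDeletePVC` feature gate must be enabled on the cluster):

```go
// Sketch only: the StatefulSet PVC auto-deletion field added by the upstream KEP.
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func stsWithPVCAutoDeletion() *appsv1.StatefulSet {
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "busybox"},
		Spec: appsv1.StatefulSetSpec{
			// Delete the PVCs when the StatefulSet itself is deleted,
			// retain them when it is only scaled down.
			PersistentVolumeClaimRetentionPolicy: &appsv1.StatefulSetPersistentVolumeClaimRetentionPolicy{
				WhenDeleted: appsv1.DeletePersistentVolumeClaimRetentionPolicyType,
				WhenScaled:  appsv1.RetainPersistentVolumeClaimRetentionPolicyType,
			},
		},
	}
}
```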

This issue is to deal with the following problems until the above KEP is beta/GA.

Problem definition:

  • An application using an STS will not delete its PVCs
  • This halts reconciliation of the PVC as Secondary, where it is expected that the PVC be deleted (and unused) before we proceed with moving it to Secondary
    • In the case of Failover, this prevents PeerReady from ever becoming true
    • In the case of Relocate, this prevents moving the VRs to Secondary and hence fails to relocate the application
  • In a broader sense, any operator that does not delete its PVCs on deletion (for any reason) would suffer the same orchestration pitfalls

Proposed solution:

  1. For the failover use-case, we will not require the PVC to be in a deletion state as a check (see the sketch after this list)
  • We already process VolumeAttachments (VAs) and ensure there are no pods running with the PVC
  • On a failover we let the VR resync with the current primary, so any stray IO that may be caused by a future pod mounting the PVC before we demote it is not a concern, or is lost anyway
  • This also means that for Failover the PVCs need not be deleted to be cleaned up
  • With this, multiple failovers across the DR peers can be performed without any user intervention to delete the PVCs
  2. For the relocate use-case, we will still ensure the PVC is deleted, to prevent any use of the PVC after it has been checked and demoted
  • This hence still requires user intervention to delete the PVCs for relocate use-cases, until the required KEP lands in a future k8s release
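
A rough sketch of the intended check (not the actual Ramen code; `pvcReadyForSecondary` and the `failover` flag are illustrative names): on failover the deletion requirement is dropped, while pods and VolumeAttachments referencing the PVC must still be gone.

```go
// Rough sketch of the check described above; not the actual Ramen implementation.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func pvcReadyForSecondary(ctx context.Context, c kubernetes.Interface,
	pvc *corev1.PersistentVolumeClaim, failover bool) (bool, error) {
	// For relocate the PVC must be in a deleting state before demotion; for
	// failover this requirement is dropped, as the VR is resynced from the
	// current primary anyway.
	if !failover && pvc.GetDeletionTimestamp() == nil {
		return false, nil
	}

	// No pod in the PVC's namespace may still reference it.
	pods, err := c.CoreV1().Pods(pvc.Namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, pod := range pods.Items {
		for _, vol := range pod.Spec.Volumes {
			if vol.PersistentVolumeClaim != nil &&
				vol.PersistentVolumeClaim.ClaimName == pvc.Name {
				return false, nil
			}
		}
	}

	// No VolumeAttachment may still reference the bound PV.
	vas, err := c.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, va := range vas.Items {
		if va.Spec.Source.PersistentVolumeName != nil &&
			*va.Spec.Source.PersistentVolumeName == pvc.Spec.VolumeName {
			return false, nil
		}
	}

	return true, nil
}
```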

Other fixes and gotchas:

  • (fix) Currently a failover attempts to restore PVs and hence conflicts with the existing PVCs that were left behind by a previous failover; this needs to be handled (see the sketch after this list)
    • The test would be to fail over multiple times across peers and ensure the workload comes up
  • (gotcha) When an application is deleted, the backing mirror image would be garbage collected by the primary cluster where the application is currently running
    • The PVCs on the clusters would need to be cleaned up by the user (which is no different from using an STS workload pattern at present)
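
One way to make the restore tolerant of resources left behind by a previous failover is to treat an already-existing object as already restored. A sketch under that assumption (`restorePV` is a hypothetical helper, not Ramen's actual restore path):

```go
// Illustrative sketch of "skip restore if the resource already exists".
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func restorePV(ctx context.Context, c kubernetes.Interface, pv *corev1.PersistentVolume) error {
	_, err := c.CoreV1().PersistentVolumes().Create(ctx, pv, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		// A PV left behind by an earlier failover is treated as already
		// restored rather than as an error.
		return nil
	}
	return err
}
```
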
@ShyamsundarR self-assigned this Sep 28, 2022
@ShyamsundarR
Member Author

Further notes following discussions with @BenamarMk:

Currently for volsync we have the following PVC rules:

  • PVC should be defined in the declarative gitops source
  • Or, operator created PVCs should NOT be deleted on operator resource deletion (applies to STS as well)
  • The above is to ensure we can do a final sync of the PVC in the relocate case

For volrep the PVC rules are:

  • PVC should be defined in the declarative gitops source
  • Or, operator created PVCs should be deleted on operator resource deletion (applies to STS as well)
  • The above is to ensure that there are no PVC consumers when demoting the PVC to secondary for a final sync

For failover cases, the PVC can be deleted in either volsync or volrep cases, as there is no final sync operation that is required.

We would need a common set of rules across volrep and volsync, such that the actual replication mechanism is abstracted away from the DRPC user. Hence we should mandate one behavior for both: operator/STS created PVCs are either always deleted on resource deletion, or never.

With volsync, if the operator/STS created PVCs are deleted, there is no way to perform a final sync. Hence the rules we would adopt are:

  • PVC should be defined in the declarative gitops source
  • Or, operator created PVCs should NOT be deleted on operator resource deletion (applies to STS as well)

With the above, the STS auto-deletion of PVCs on STS deletion is not required, and we need to handle relocate for operator/STS created PVCs that are not deleted in the volrep case.

As a first step, allowing failover as described in this issue is a way forward, while retaining the same PVC rules as before.

Subsequently, the PVC rules for volrep would need to change as well, to align with volsync.

The one caveat for relocate, with volsync or volrep, if the PVC is not deleted, is as follows:

  • A PVC is checked to ensure no pods are referencing it
  • A PVC is checked to ensure no VolumeAttachments are present
  • But if a PVC consumer appears after these checks, the relocate may sync stale data
    • From a VR perspective, one option is to move the volume to Secondary during relocate and let the storage plugin ensure there are no consumers for the volume as it is demoted
    • The other is to ensure that all workloads, as reported on the OCM hub, are stopped before issuing the move to Secondary for the VRG (or the final sync for volsync)

@ShyamsundarR
Member Author

NOTE: We need to evaluate this with Placement and ApplicationSets, instead of the Subscriptions in use at present.

@ShyamsundarR
Member Author

> But if a PVC consumer appears after these checks, the relocate may sync stale data

We could technically change the PVC to read-only, to prevent further data modification in this case (requires experimentation).

@ShyamsundarR
Member Author

> We could technically change the PVC to read-only, to prevent further data modification in this case (requires experimentation).

The above is disallowed:

The PersistentVolumeClaim "busybox-pvc" is invalid: spec: Forbidden: spec is immutable after creation except resources.requests for bound claims
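
For completeness, a sketch of the kind of update that hits this error, assuming the read-only attempt is made by mutating spec.accessModes on a bound claim (the helper name is illustrative):

```go
// Sketch of the rejected experiment: flipping a bound PVC to read-only by
// mutating spec.accessModes. The apiserver rejects the update with the
// "spec is immutable after creation" error quoted above.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func makePVCReadOnly(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	pvc, err := c.CoreV1().PersistentVolumeClaims(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}

	pvc.Spec.AccessModes = []corev1.PersistentVolumeAccessMode{corev1.ReadOnlyMany}

	// Fails: PVC spec is immutable after creation except resources.requests
	// for bound claims.
	_, err = c.CoreV1().PersistentVolumeClaims(ns).Update(ctx, pvc, metav1.UpdateOptions{})
	return err
}
```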

@ShyamsundarR
Member Author

> With volsync, if the operator/STS created PVCs are deleted, there is no way to perform a final sync.

An alternative to overcome this issue would be as follows:

  • Let the workload deletion delete the PVCs
  • The PV is already held back with a Retain reclaim policy
  • Create a new PVC (with a DR-specific name) bound to the PV for a final sync
  • Sync the PVC (as no other consumers are possible at this stage)
  • Delete the created PVC and continue as before (a sketch of this flow is at the end of this comment)

The above would help address a few other items as well:

  • No need to take ownership of PVCs
  • We hence do not need the prepare and run final sync phases; these happen once we move to Secondary (and sync the DR-specific new PVC over)
  • This would support operators that generate PVCs outside the ownership context of the gitops controller (as those cannot be owned by Ramen, and hence may miss the final sync phase)
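
A minimal sketch of the final-sync flow above, assuming the PV was kept by a Retain reclaim policy and is re-bound by clearing its claimRef and pre-binding a new claim via spec.volumeName. The helper and the claim-name suffix are illustrative, not Ramen code:

```go
// Sketch of the alternative final-sync flow. PVC Resources uses the
// ResourceRequirements type found in k8s.io/api releases contemporary with
// this issue.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createFinalSyncPVC(ctx context.Context, c kubernetes.Interface,
	pvName, namespace, claimName string) (*corev1.PersistentVolumeClaim, error) {
	// Make the Released PV claimable again.
	pv, err := c.CoreV1().PersistentVolumes().Get(ctx, pvName, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	pv.Spec.ClaimRef = nil
	if _, err := c.CoreV1().PersistentVolumes().Update(ctx, pv, metav1.UpdateOptions{}); err != nil {
		return nil, err
	}

	// DR-specific claim, pre-bound to the retained PV, used only for the
	// final sync and deleted afterwards.
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name:      claimName + "-final-sync", // illustrative naming
			Namespace: namespace,
		},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: pv.Spec.AccessModes,
			VolumeName:  pvName,
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: pv.Spec.Capacity[corev1.ResourceStorage],
				},
			},
		},
	}

	return c.CoreV1().PersistentVolumeClaims(namespace).Create(ctx, pvc, metav1.CreateOptions{})
}
```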
