Add more detail to alternatives section
pwschuurman committed Oct 6, 2022
1 parent b7ee691 commit 7ba997e
Showing 1 changed file with 55 additions and 11 deletions.
66 changes: 55 additions & 11 deletions keps/sig-multicluster/3335-statefulset-slice/README.md
@@ -128,14 +128,14 @@ checklist items _must_ be updated for the enhancement to be released.

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [X] (R) Design details are appropriately documented
- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [ ] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
- [X] (R) Graduation criteria is in place
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
@@ -236,11 +236,12 @@ What is out of scope for this KEP? Listing non-goals helps to focus discussion
and make progress.
-->

* Updating a PDB to safeguard more than one StatefulSet slice
* As StatefulSet slices are scaled up or down, corresponding PDBs can also be adjusted. For example, a PDB corresponding to a slice of `k` replicas could be adjusted to `MinAvailable: k-1` on scale up or down events. Providing guidance and functionality to adjust these PDBs is outside the scope of this KEP.
* Orchestrating pod movement from one StatefulSet slice to another
* Managing network connectivity between pods in different StatefulSet slices
* Orchestrating storage lifecycle of PVCs and PVs across different StatefulSet slices
* Referenced PV/PVCs will need to be migrated in order for a new StatefulSet to reference data that was used by an existing StatefulSet. Orchestration complexity will depend on how volumes are used (RWO with `.spec.volumeClaimTemplates` on a StatefulSet, RWX with pod `.spec.volumes`).
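
The `MinAvailable: k-1` adjustment mentioned above can be sketched as follows. This is only an illustration, not tooling proposed by this KEP: the helper name `slice_pdb`, the PDB name, and the label selector are hypothetical, and the sketch assumes each slice tolerates exactly one disrupted pod.

```python
def slice_pdb(name: str, replicas: int, match_labels: dict) -> dict:
    """Build a PodDisruptionBudget manifest for one StatefulSet slice.

    Assumption: a slice of k replicas is guarded with minAvailable = k - 1.
    The name and labels are placeholders chosen by the caller.
    """
    if replicas < 1:
        raise ValueError("a slice needs at least one replica")
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # Recompute this value whenever the slice is scaled up or down.
            "minAvailable": replicas - 1,
            "selector": {"matchLabels": dict(match_labels)},
        },
    }


# Hypothetical slice of 3 replicas selected by an illustrative label.
pdb = slice_pdb("web-slice-a", 3, {"app": "web"})
```

An orchestrator would have to regenerate and apply such a manifest on every scale event, which is the work this KEP leaves out of scope.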

## Proposal

@@ -940,9 +941,52 @@ not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->

Users can orphan pods from a StatefulSet, migrate pods across a namespace or cluster, and create a new StatefulSet to manage pods upon migration. In the case of pod eviction or failure, pods will need to be manually restarted, requiring manual intervention and constant monitoring.

Users can back up and restore a StatefulSet (and underlying storage) in a new namespace or cluster. Doing so requires the existing StatefulSet to be deleted and the underlying storage to be backed up and restored, resulting in downtime for the stateful application.
### Alternative API changes

**ReverseOrderedReady**: A new PodManagementPolicy value called
`ReverseOrderedReady` could be added. This would allow a StatefulSet to be
started and actuated from the highest ordinal (the current default starts from
the lowest ordinal). For the cross-cluster migration use case, this would allow
a source StatefulSet to be scaled down and a target StatefulSet to be scaled in.
The downside of this API is that the pod management policy is not a mutable
field. If an orchestrator uses this behavior to scale in a StatefulSet in a
destination cluster and then wants to revert the PodManagementPolicy to the
default, the StatefulSet would need to be deleted and re-created.

**KEP-3521**: [KEP-3521](https://github.com/kubernetes/enhancements/issues/3521)
proposes a Pod `.spec` level API that enables a pod to be paused at the initial
scheduling phase of pod lifecycle. This provides granular control of which pods
should be started and running (active) and which pods shouldn't be scheduled
(standby). An orchestrator can leverage control over specific pod scheduling,
without making changes to the StatefulSet controller, as the StatefulSet
controller is in control of creating pods.

If the StatefulSet controller is using `OrderedReady` pod management, pausing
scheduling can result in a pod being marked as not Ready. This will prevent
the StatefulSet controller from actuating updates to higher ordinal pods (e.g.,
pod `m` will not be created if pod `n` is unhealthy, where `m` > `n`). This
may increase orchestrator complexity by requiring the orchestrator of a
migration to use `Parallel` pod management during the migration, and then
re-create the StatefulSet (using `--cascade=orphan`) to revert to
`OrderedReady` if desired.
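
The ordering constraint described above can be illustrated with a simplified model. This is a sketch only, not the StatefulSet controller's code: `blocked_ordinals` is a hypothetical helper, and it assumes the controller stops at the first ordinal that is not Ready.

```python
def blocked_ordinals(replicas: int, ready: set) -> list:
    """Return the ordinals the controller will not act on under OrderedReady.

    Simplified model: the controller walks ordinals 0..replicas-1 and stops
    at the first pod that is not Ready, so every higher ordinal is blocked.
    """
    for n in range(replicas):
        if n not in ready:
            # Pod n is unhealthy or paused; every m > n has to wait.
            return list(range(n + 1, replicas))
    return []


# Pod 2 is paused at scheduling, so it never becomes Ready.
waiting = blocked_ordinals(5, {0, 1})
```

In this model a single paused pod at ordinal `n` stalls every pod `m` > `n`, which is why an orchestrator would need to switch to `Parallel` for the duration of a migration.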

Additionally, if modifying a StatefulSet template is undesired, a webhook must
be introduced to mark Pods as paused when they are created. This adds a layer
of complexity to an orchestrator, since it needs both an operator component
that is capable of making changes through the API server and a webhook that
reads from a consistent migration state.

### Alternatives without any API changes

**Orphan Pods**: Users can orphan pods from a StatefulSet, migrate the pods to
another namespace or cluster, and create a new StatefulSet to manage the pods
after migration. In the case of pod eviction or failure, pods will need to be
recreated manually, which requires constant monitoring.

**Backup/Restore**: Users can back up and restore a StatefulSet (and underlying
storage) in a new namespace or cluster. Doing so requires the existing
StatefulSet to be deleted and the underlying storage to be backed up and
restored, resulting in downtime for the stateful application.

## Infrastructure Needed (Optional)
