KEP-961: Bump Statefulset maxUnavailable to beta (built on top of #3997) #4474
Conversation
Signed-off-by: kerthcet <kerthcet@gmail.com>
Left several comments
Controller tests in `pkg/controller/statefulset` cover all cases. Since pod statuses
are faked in integration tests, simulating rolling update is difficult. We're opting
for e2e tests instead.
Integration tests should have links to testgrid, see the template (https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md).
not sure I follow 🤔
We need a proof that the tests are stable/not flaky. Can you go over all the added tests and confirm that? An example of this would be: https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/3335-statefulset-slice/README.md#integration-tests
Will do, thanks
So those tests (and actually, before that, the fixes) need to be merged before we merge the KEP I assume?
The KEP is just an indication of the plan for beta, not the actual transition to it. It should be enough to add only the tests that are available now and add the upcoming tests once they are merged.
But we don't currently have relevant integration tests that would have reports to add here, do we?
Thanks @knelasevero
will expect. Implementing Choice 4 using PMP would be the easiest.

#### Implementation

For the alpha release we are going with Choice 4, with support for both PMP=Parallel and PMP=OrderedReady.
For PMP=Parallel, we will use Choice 2.
For PMP=OrderedReady, we will use Choice 3 to ensure we can support ordering guarantees while also
It seems to me we are using Choice 1 with PMP=OrderedReady in our implementation.
The part described in Choice 3 does not happen today; at this time both 2 and 3 are terminating.
Can you please double check the choices and see if they are up to date?
+1
Also, please update whether we're continuing with this choice for beta or changing something.
I went over it and fixed the text. I also prefer the simpler route here, without changing expected behavior. If the behavior were to change, I would keep this in alpha for a couple more releases before coming back to promotion.
I added the fix for the newly found bug to the implementation description.
I do not see big drawbacks to using the simpler Choice 1 instead of Choice 3, but it might be a good idea to discuss this with the SIG.
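For readers following this thread, a minimal manifest sketch showing where the field under discussion sits in the API. The object name, image, and replica counts are illustrative, and the field is only honored when the MaxUnavailableStatefulSet feature gate is enabled:

```yaml
# Illustrative sketch only; name, image, and sizes are hypothetical.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 5
  podManagementPolicy: OrderedReady   # the default PMP whose interaction with maxUnavailable is discussed above
  selector:
    matchLabels:
      app: web
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2   # honored only with the MaxUnavailableStatefulSet feature gate enabled
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.k8s.io/nginx-slim:0.8
          ports:
            - containerPort: 80
              name: web
```

With a spec like this, a rolling update may take down up to two pods at a time; the thread above is about which ordering guarantees are preserved while doing so.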
out of order Terminations of pods.
2. Pods with ordinal 4 and 3 will start Terminating at the same time (because of maxUnavailable). When any of 4 or 3 are running and ready, pods with ordinal 2 will start Terminating. This could violate ordering guarantees, since if 3 is running and ready, then both 4 and 2 are terminating at the same
Ouch. I think this should be `running and available`. It seems to me that we should consider minReadySeconds in PMP=Parallel. We do not care about availability when burst-starting for the first time, but we should care about it when scaling down. I am able to bring the available replicas down to 1 for the following StatefulSet:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx-roll
spec:
  replicas: 5
  minReadySeconds: 20
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 1
      maxUnavailable: 2
  selector:
    matchLabels:
      app: nginx-roll
  template:
    metadata:
      labels:
        app: nginx-roll
    spec:
      containers:
        - name: nginx
          image: ghcr.io/nginxinc/nginx-unprivileged:latest
          ports:
            - containerPort: 80
              name: web
```
yup; there is a bug for that already: kubernetes/kubernetes#112307
blocker for beta^
Thanks for finding this. I added mentions of this in the implementation part and in the e2e section as a requirement. I changed all mentions of `running and ready` to `running and available` in the text.
/assign as PRR shadow
There are a few references to a metric that isn't introduced anywhere else and seems to have been rejected in a previous revision of this along with some other suggestions/questions from other reviewers. I also think that this would be challenging to actively monitor on large clusters and/or across fleets.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: knelasevero. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Metric Name: statefulset_unavailability_violation

Description: This metric counts the number of times a StatefulSet exceeds its maxUnavailable threshold during a rolling update. This metric increases whenever a StatefulSet is processed and it's observed that spec.replicas - status.readyReplicas > maxUnavailable. The metric is labeled with namespace, statefulset, and a reason label, where reason could be exceededMaxUnavailable. This provides a clear indication of which StatefulSets are not complying with the defined maxUnavailable constraint, allowing for quick identification and remediation.
I don't necessarily agree here, for two reasons:
(1) not everyone is using kube-state-metrics, so depending on that is not ideal
(2) I would expect changes to maxUnavailable in the middle of the rollout and things like that to be a real corner case - and from that perspective the proposed metric actually gives us visibility into whether this feature works.
As I mentioned above, we don't want name/namespace to be on the metric though - just high-level monitoring of the feature.
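Purely as an illustration of how the proposed counter could be consumed if it ships as described in the quoted text, a hypothetical Prometheus alerting rule; the alert name and window are made up, and the label set may shrink per the comment above about dropping name/namespace:

```yaml
# Hypothetical: statefulset_unavailability_violation is only proposed in this KEP and may
# change shape (or be dropped) before beta; the reason label value follows the quoted description.
groups:
  - name: statefulset-maxunavailable
    rules:
      - alert: StatefulSetMaxUnavailableViolated
        # Fires if any violation was counted during the last 15 minutes.
        expr: increase(statefulset_unavailability_violation{reason="exceededMaxUnavailable"}[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "A StatefulSet exceeded its maxUnavailable threshold during a rolling update"
```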
Hey, just so you prioritize accordingly (#961 (comment)): this in the end won't make 1.30, in case you want to go through other PRs first that will fit in.
- kube_statefulset_spec_strategy_rollingupdate_max_unavailable: This metric reflects the configured maxUnavailable value for StatefulSets. Significant deviations from expected values during updates, or an actual unavailable pod count that consistently exceeds this configuration, may indicate misconfigurations or issues with feature behavior.
- StatefulSet Update Progress: Observing the stability and speed of StatefulSet rollouts relative to the maxUnavailable setting. Unusually slow updates or increased pod unavailability beyond the configured threshold could signal problems.
- Cluster Stability Metrics: Key indicators of cluster health, such as increased error rates in control plane components or higher pod restart rates after feature adoption, can also guide the decision to roll back.
The last two are not actionable; can we mention specific metrics that should be observed? E.g. which kube-state-metrics and StatefulSet queue metrics?
✅
added the unresolved block to get back to this after the metric discussion
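One way to make the last two bullets concrete, sketched as Prometheus rules. This assumes kube-state-metrics is scraped and that kube-controller-manager exposes the standard workqueue metrics for the StatefulSet controller under name="statefulset"; the queue name, rule names, and thresholds here are assumptions:

```yaml
# Sketch only: concrete signals for "update progress" and "cluster stability".
groups:
  - name: statefulset-rollout-health
    rules:
      # Rollout progress: updated replicas should converge towards the desired replica count.
      - record: namespace_statefulset:rollout_updated:ratio
        expr: kube_statefulset_status_replicas_updated / kube_statefulset_replicas
      # Controller health: a persistently deep StatefulSet workqueue suggests the controller
      # is not keeping up after the feature is enabled.
      - alert: StatefulSetWorkqueueBacklog
        expr: workqueue_depth{name="statefulset"} > 100
        for: 15m
        labels:
          severity: warning
```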
Administrators should monitor the following metrics to assess the need for a rollback:

- kube_statefulset_spec_strategy_rollingupdate_max_unavailable: This metric reflects the configured maxUnavailable value for StatefulSets. Significant deviations from expected values during updates, or an actual unavailable pod count that consistently exceeds this configuration, may indicate misconfigurations or issues with feature behavior.
we could be more concrete and mention the available pod count metric and replicas metric here as well
do you mean metrics like kube_statefulset_status_replicas, kube_statefulset_status_replicas_available, kube_statefulset_status_replicas_updated in ksm?
Yes; also note that `kube_statefulset_replicas` is better than `kube_statefulset_status_replicas` for this.
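Putting that suggestion together, a hedged example comparing unavailable pods against the configured budget using the kube-state-metrics series named in this thread. kube_statefulset_spec_strategy_rollingupdate_max_unavailable is the series referenced by the quoted KEP text; if a given kube-state-metrics version does not export it, the configured value would have to be substituted by hand:

```yaml
# Sketch, assuming the kube-state-metrics series discussed above are available.
groups:
  - name: statefulset-maxunavailable-ksm
    rules:
      - alert: StatefulSetUnavailableAboveBudget
        # Unavailable pods = desired replicas minus available replicas, compared per
        # StatefulSet against the configured maxUnavailable.
        expr: |
          (kube_statefulset_replicas - kube_statefulset_status_replicas_available)
            > on (namespace, statefulset)
          kube_statefulset_spec_strategy_rollingupdate_max_unavailable
        for: 10m
        labels:
          severity: warning
```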
Signed-off-by: Lucas Severo Alves <lseveroa@redhat.com>
04243ba to 63eeffb
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Supersedes #3997
One-line PR description: Bump statefulset maxUnavailable to beta
Issue link: #961
Other comments:
Previous commits: kept Kante's commits here
My commits: went over review notes that had not been addressed and changed a few things related to the proposed metric and some descriptions.