Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(statefulset): fix maxUnavailable for rolling upgrades not taking … #1480

Merged
merged 1 commit into from
Jan 23, 2024

Conversation

Yesphet
Copy link
Contributor

@Yesphet Yesphet commented Dec 29, 2023

…into account pods that fail later in the updateIndexes

0. Backgroud

Advanced statefulset provides MaxUnavailable guarantee that the number of unavailable pods during the update cannot exceed this value. However, in the current implementation code, there is no consideration for unavailable pods with smaller ordinal or later in the update order. For example, for an asts with maxUnavailable= 1 and replicas=3, if the asts updates while pod-0 is in a crash, it will still trigger a rolling upgrade, and 2 pods will be unavailable because pod-2 is updated and restarted.

It can be easily reproduced in this way:

  1. create a asts
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
metadata:
  name: test
spec:
  selector:
    matchLabels:
      app: busybox-test
  replicas: 3
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: busybox-test
    spec:
      containers:
      - name: busybox
        image: busybox
        env:
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: Restart
          value: "2"
        command: ["sh","-c"]
        args:
        - |-
          [ "$MY_POD_NAME" == "test-0" ] && exit 1;tail -f /dev/null
  1. List pods, test-0 is in crash
image
  1. Update an env of the container in asts to trigger rolling update
kubectl patch asts test --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/env/1/value", "value":"3"}]'
  1. We can see that test-2 is deleted which breaking the maxUnavailable guarantee
image

I have tested k8s official statefulset with MaxUnavailable feature gate enabled, it can still guarantee that the number of unavailable pods does not exceed in this scenario.

Ⅰ. Describe what this PR does

So in this PR, I separate out the update pods logics to a new function rollingUpdateStatefulsetPods, counts all unavailable pods before update pods.

Ⅱ. Does this pull request fix one issue?

no

Ⅲ. Describe how to verify it

refer to part 0

Ⅳ. Special notes for reviews

In k8s official statefulsets controller, if the unavailable pods count reach the MaxUnavailable limit, the rolling update won't make progress even if the first pod in the update sequence is unavailable. But in my implementation, rolling updates are still process in this case because update one unavailable pod does not increase the total number of unavailable Pods.

@kruise-bot kruise-bot requested review from Fei-Guo and veophi December 29, 2023 15:26
@kruise-bot
Copy link

Welcome @Yesphet! It looks like this is your first PR to openkruise/kruise 🎉

@kruise-bot kruise-bot added the size/L size/L: 100-499 label Dec 29, 2023
@Yesphet Yesphet force-pushed the fix/asts_maxunavailable branch from 9f7882e to de224fa Compare December 30, 2023 04:02
@Yesphet Yesphet changed the title fix(statefulset): fix maxUnavailable for rolling upgrades not taking … [WIP] fix(statefulset): fix maxUnavailable for rolling upgrades not taking … Dec 30, 2023
@Yesphet Yesphet force-pushed the fix/asts_maxunavailable branch 2 times, most recently from a322891 to f5997f3 Compare January 2, 2024 14:06
Copy link

codecov bot commented Jan 2, 2024

Codecov Report

Attention: 23 lines in your changes are missing coverage. Please review.

Comparison is base (fa7a1da) 47.66% compared to head (46e4e41) 48.41%.
Report is 4 commits behind head on master.

Files Patch % Lines
pkg/controller/statefulset/stateful_set_control.go 66.66% 17 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1480      +/-   ##
==========================================
+ Coverage   47.66%   48.41%   +0.75%     
==========================================
  Files         157      157              
  Lines       22427    22460      +33     
==========================================
+ Hits        10689    10875     +186     
+ Misses      10570    10391     -179     
- Partials     1168     1194      +26     
Flag Coverage Δ
unittests 48.41% <66.66%> (+0.75%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Yesphet Yesphet changed the title [WIP] fix(statefulset): fix maxUnavailable for rolling upgrades not taking … fix(statefulset): fix maxUnavailable for rolling upgrades not taking … Jan 2, 2024
@Yesphet Yesphet force-pushed the fix/asts_maxunavailable branch from f5997f3 to d4cecdd Compare January 4, 2024 07:29
@Yesphet
Copy link
Contributor Author

Yesphet commented Jan 10, 2024

@veophi @Fei-Guo Could you please help to review this PR when you are free?

@Yesphet
Copy link
Contributor Author

Yesphet commented Jan 19, 2024

@veophi @Fei-Guo @zmberg It's been two weeks, and I still haven't received any feedback on this PR. I understand that everyone is busy, I just want to ensure this PR hasn't been overlooked. Your time and attention to it would be much appreciated ~

@furykerry
Copy link
Member

sorry, guys are fully occupied these days, i will take a look at this patch.

…into account pods that fail later in the updateIndexes.

Signed-off-by: Yesphet <mildtheorem@gmail.com>
@Yesphet Yesphet force-pushed the fix/asts_maxunavailable branch from d4cecdd to 46e4e41 Compare January 23, 2024 03:27
@furykerry
Copy link
Member

/lgtm

@zmberg
Copy link
Member

zmberg commented Jan 23, 2024

@Yesphet good.

@zmberg
Copy link
Member

zmberg commented Jan 23, 2024

/approve

@kruise-bot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zmberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kruise-bot kruise-bot merged commit 8af135d into openkruise:master Jan 23, 2024
27 checks passed
@Yesphet
Copy link
Contributor Author

Yesphet commented Jan 23, 2024

@furykerry @zmberg thanks guys~

@zmberg zmberg added this to the 1.6 milestone Mar 7, 2024
ppbits pushed a commit to ppbits/kruise that referenced this pull request Apr 4, 2024
…into account pods that fail later in the updateIndexes. (openkruise#1480)
ABNER-1 pushed a commit that referenced this pull request Sep 19, 2024
…into account pods that fail later in the updateIndexes. (#1480)

Signed-off-by: Yesphet <mildtheorem@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants