Update stateful_set_utils.go #1136
Conversation
Welcome @ivanszl! It looks like this is your first PR to openkruise/kruise 🎉
/assign @FillZpp
@ivanszl Can you describe the process in detail? I could not reproduce your problem. In addition, I re-reviewed the code and feel that the current val <= 0 judgment logic should meet expectations.
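For reference, the helper under discussion looks roughly like this (a sketch reconstructed from the diff later in this thread; the nil guard is an assumption for the case where no scale budget is passed in):

func decreaseAndCheckMaxUnavailable(maxUnavailable *int) bool {
	// assumed: a nil budget (e.g. monotonic management) can never be exhausted
	if maxUnavailable == nil {
		return false
	}
	// consume one unit of the scale-up unavailability budget ...
	val := *maxUnavailable - 1
	*maxUnavailable = val
	// ... and report whether the budget is now exhausted
	return val <= 0
}

With replicas: 1 and scaleStrategy.maxUnavailable: 100%, the budget works out to a single pod, so one unavailable pod already exhausts it; this is the corner case this PR is about.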
hi, this is an example:

kind: StatefulSet
apiVersion: apps.kruise.io/v1beta1
metadata:
  name: busybox-test
  namespace: default
  labels:
    app: busybox-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox-test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: busybox-test
    spec:
      containers:
        - name: busybox
          image: busybox:latest
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      schedulerName: default-scheduler
  podManagementPolicy: Parallel
  scaleStrategy:
    maxUnavailable: 100%
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
      maxUnavailable: 100%
  revisionHistoryLimit: 10

$ kubectl get pods busybox-test-0
NAME READY STATUS RESTARTS AGE
busybox-test-0   0/1     CrashLoopBackOff   5 (95s ago)   4m45s

modify the yaml and apply it:

kind: StatefulSet
apiVersion: apps.kruise.io/v1beta1
metadata:
  name: busybox-test
  namespace: default
  labels:
    app: busybox-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox-test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: busybox-test
        test-label: 'yes'
    spec:
      containers:
        - name: busybox
          image: busybox:latest
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      schedulerName: default-scheduler
  podManagementPolicy: Parallel
  scaleStrategy:
    maxUnavailable: 100%
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
      maxUnavailable: 100%
  revisionHistoryLimit: 10

and then look at the pod:

$ kubectl describe pods busybox-test-0
Name:             busybox-test-0
Namespace:        default
Priority:         0
Node:             minikube/192.168.49.2
Start Time:       Wed, 30 Nov 2022 15:16:28 +0800
Labels:           app=busybox-test
                  controller-revision-hash=busybox-test-f89f864cc
                  lifecycle.apps.kruise.io/state=Normal
                  statefulset.kubernetes.io/pod-name=busybox-test-0
Annotations:      lifecycle.apps.kruise.io/timestamp: 2022-11-30T07:16:28Z
Status:           Running
IP:               172.17.0.6
IPs:
  IP:  172.17.0.6
Controlled By:  StatefulSet/busybox-test
Containers:
  busybox:
    Container ID:   docker://a1227f31ef27e0d4fbf61aa29f585da912ce5dbe2fcc662a1e37a2ca3b4329b0
    Image:          busybox:latest
    Image ID:       docker-pullable://busybox@sha256:fcd85228d7a25feb59f101ac3a955d27c80df4ad824d65f5757a954831450185
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 30 Nov 2022 15:22:20 +0800
      Finished:     Wed, 30 Nov 2022 15:22:20 +0800
    Ready:          False
    Restart Count:  6
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bml2r (ro)
Readiness Gates:
  Type            Status
  KruisePodReady  True
Conditions:
  Type             Status
  KruisePodReady   True
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  kube-api-access-bml2r:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  6m25s                  default-scheduler  Successfully assigned default/busybox-test-0 to minikube
  Normal   Pulling    6m24s                  kubelet            Pulling image "busybox:latest"
  Normal   Pulled     6m17s                  kubelet            Successfully pulled image "busybox:latest" in 6.646627307s
  Normal   Created    4m41s (x5 over 6m17s)  kubelet            Created container busybox
  Normal   Started    4m41s (x5 over 6m17s)  kubelet            Started container busybox
  Normal   Pulled     4m41s (x4 over 6m16s)  kubelet            Container image "busybox:latest" already present on machine
  Warning  BackOff    83s (x26 over 6m15s)   kubelet            Back-off restarting failed container

we find the pod wasn't modified
@ivanszl I don't think it can be modified directly, because that would break the semantics of scaleStrategy. The core of this problem is the corner case of 100% maxUnavailable, so I think it can be modified as follows:
@zmberg If it is modified like your code, it can't cover the same case.
@ivanszl You can look at this document https://openkruise.io/docs/user-manuals/advancedstatefulset/#scaling-with-rate-limiting. Your modification would break the semantics described in that document.
@zmberg Ok, I see.
@ivanszl We also discussed it internally, and felt that scaleStrategy should only affect Create Pod, not Update Pod, so I think there are three places that need to be modified, for example:
@zmberg
The monotonic parameter still takes precedence, and scaleStrategy is only considered when it is false.
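For context, here is a minimal sketch of how monotonic is typically derived. It follows the upstream StatefulSet controller convention; the exact Kruise types and import paths are assumptions:

import (
	apps "k8s.io/api/apps/v1"
	appsv1beta1 "github.com/openkruise/kruise/apis/apps/v1beta1"
)

// allowsBurst reports whether pods may be created and deleted in parallel.
// The controller derives monotonic := !allowsBurst(set), so the
// scaleStrategy budget is only consulted under Parallel pod management.
func allowsBurst(set *appsv1beta1.StatefulSet) bool {
	return set.Spec.PodManagementPolicy == apps.ParallelPodManagement
}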
Codecov Report
@@ Coverage Diff @@
## master #1136 +/- ##
==========================================
+ Coverage 49.43% 49.60% +0.17%
==========================================
Files 137 137
Lines 19326 19423 +97
==========================================
+ Hits 9553 9635 +82
- Misses 8721 8735 +14
- Partials 1052 1053 +1
@zmberg
@ivanszl All the relevant places need to be modified. The above is just an example; not all of them are listed.
@@ -687,5 +687,5 @@ func decreaseAndCheckMaxUnavailable(maxUnavailable *int) bool {
 	}
 	val := *maxUnavailable - 1
 	*maxUnavailable = val
-	return val <= 0
+	return val < 0
There should be no change here; the original form should be kept.
@@ -542,7 +547,7 @@ func (ssc *defaultStatefulSetControl) updateStatefulSet(
 		status.UpdatedReplicas++
 	}
 	// if the set does not allow bursting, return immediately
-	if monotonic || decreaseAndCheckMaxUnavailable(scaleMaxUnavailable) {
+	if monotonic {
 		return &status, nil
 	}
if monotonic {
	return &status, nil
} else if decreaseAndCheckMaxUnavailable(scaleMaxUnavailable) {
	break
}
if isTerminating(replicas[i]) && monotonic {
	klog.V(4).Infof(
		"StatefulSet %s/%s is waiting for Pod %s to Terminate",
		set.Namespace,
		set.Name,
		replicas[i].Name)
	return &status, nil
} else if isTerminating(replicas[i]) && decreaseAndCheckMaxUnavailable(scaleMaxUnavailable) {
	break
}
if monotonic || scaleMaxUnavailable != nil {
	isAvailable, waitTime := isRunningAndAvailable(replicas[i], minReadySeconds)
	if !isAvailable && monotonic {
		if waitTime > 0 {
			// make sure we check later
			durationStore.Push(getStatefulSetKey(set), waitTime)
			klog.V(4).Infof(
				"StatefulSet %s/%s needs to wait %s for the Pod %s to be Running and Available after being"+
					" Ready for %d seconds",
				set.Namespace,
				set.Name,
				waitTime,
				replicas[i].Name,
				minReadySeconds)
		} else {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to be Running and Ready",
				set.Namespace,
				set.Name,
				replicas[i].Name)
		}
		return &status, nil
	} else if !isAvailable && decreaseAndCheckMaxUnavailable(scaleMaxUnavailable) {
		break
	}
}
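A note on the design choice sketched above: return preserves the strict, one-pod-at-a-time ordering that monotonic implies, while break only abandons further scale-up work in the current pass, so the controller can presumably still fall through to the rolling-update phase, which is what lets a pod that never became available pick up the new template.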
On line 545, the original code should be kept. That will not break the semantics of scaleStrategy.
If it is convenient for you, we can discuss this over Zoom at the community meeting on December 8.
Ok
	return &status, nil
} else if decreaseAndCheckMaxUnavailable(scaleMaxUnavailable) {
	break
klog.V(4).Infof(
	"StatefulSet %s/%s Pod %s is Creating, and break pods scale",
	set.Namespace,
	set.Name,
	replicas[i].Name)
	klog.V(4).Infof(
		"StatefulSet %s/%s is waiting for Pod %s to Terminate",
		set.Namespace,
		set.Name,
		replicas[i].Name)
	return &status, nil
} else if isTerminating(replicas[i]) && decreaseAndCheckMaxUnavailable(scaleMaxUnavailable) {
	break
klog.V(4).Infof(
	"StatefulSet %s/%s Pod %s is Terminating, and break pods scale",
	set.Namespace,
	set.Name,
	replicas[i].Name)
		// make sure we check later
		durationStore.Push(getStatefulSetKey(set), waitTime)
	}
	break
klog.V(4).Infof(
	"StatefulSet %s/%s Pod %s is unavailable, and break pods scale",
	set.Namespace,
	set.Name,
	replicas[i].Name)
Please git commit -s and git push again.
@ivanszl you need to resolve the issue https://github.com/openkruise/kruise/pull/1136/checks?check_run_id=10079268512, otherwise there is no way to merge the PR.
Fix all unavailable pod update new version fail

Signed-off-by: linsongzheng <linsongzheng@onething.net>
Signed-off-by: linsongzheng <linsongzheng@onething.net>
Signed-off-by: linsongzheng <linsongzheng@onething.net>
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: furykerry
* Update stateful_set_utils.go: Fix all unavailable pod update new version fail
* Fix will be create one more pod
* add scaleStrategy e2e test
* add log
* add log

Co-authored-by: linsongzheng <linsongzheng@onething.net>
Signed-off-by: ivan.lan ivan.lin.1985@gmail.com
Ⅰ. Describe what this PR does
Set the sts's spec.podManagementPolicy = Parallel
and spec.scaleStrategy.maxUnavailable = 100%
Modifying the sts template causes the Pod to fail to start and stay in Pending, and subsequent modifications get stuck.
After this change, a later modification takes effect on all Pods, so the business service can be recovered quickly.