What happened:
After running for about three days, the koord-manager pod was found to have restarted.
What you expected to happen:
The koord-manager pod should keep running stably, without any abnormal restarts.
How to reproduce it (as minimally and precisely as possible):
```
(base) [root@demo demo]# cat pod-group.yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-example
  namespace: default
spec:
  scheduleTimeoutSeconds: 100
  minMember: 10
(base) [root@demo demo]# cat setup.sh
#!/bin/bash
for i in {1..50}
do
  new_name="test-$i"
  template='apiVersion: v1
kind: Pod
metadata:
  name: pod-example1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: gang-example
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    imagePullPolicy: IfNotPresent
    name: curlimage
    resources:
      limits:
        cpu: 40m
        memory: 40Mi
      requests:
        cpu: 40m
        memory: 40Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  restartPolicy: Always'
  new_template=$(echo "$template" | sed "/^ *name: /s/:.*/: $new_name/")
  echo "$new_template" >> ./demo2/$new_name.yaml
done
(base) [root@demo demo]# kubectl apply -f demo2/
```
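To verify the load the script generates (an illustrative check, not part of the original report), the 50 gang-member pods can be listed by the label the template sets:

```bash
# List the gang-member pods created by setup.sh and their scheduling state.
kubectl get pods -n default \
  -l pod-group.scheduling.sigs.k8s.io=gang-example -o wide
```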
```
(base) [root@demo demo]# kubectl -n koordinator-system get pods -o wide
NAME                                READY   STATUS    RESTARTS     AGE     IP            NODE          NOMINATED NODE   READINESS GATES
koord-descheduler-f7fb57d46-82npl   1/1     Running   0            3d17h   10.0.10.158   10.0.10.197   <none>           <none>
koord-descheduler-f7fb57d46-z8lwv   1/1     Running   0            3d17h   10.0.10.141   10.0.10.123   <none>           <none>
koord-manager-74866c758d-h68kc      1/1     Running   1 (8h ago)   3d17h   10.0.10.252   10.0.10.123   <none>           <none>
koord-manager-74866c758d-pjjhw      1/1     Running   1 (9h ago)   3d17h   10.0.10.90    10.0.10.167   <none>           <none>
koord-scheduler-689ff4b98d-9hngg    1/1     Running   0            3d17h   10.0.10.113   10.0.10.167   <none>           <none>
koord-scheduler-689ff4b98d-zgmr6    1/1     Running   0            3d17h   10.0.10.133   10.0.10.123   <none>           <none>
koordlet-czwrr                      1/1     Running   0            3d17h   10.0.10.123   10.0.10.123   <none>           <none>
koordlet-k2pd7                      1/1     Running   0            3d17h   10.0.10.167   10.0.10.167   <none>           <none>
koordlet-rsxhs                      1/1     Running   0            3d17h   10.0.10.197   10.0.10.197   <none>           <none>
```
The description of pod koord-manager-74866c758d-pjjhw is as follows:
```
(base) [root@demo ~]# kubectl -n koordinator-system describe pod koord-manager-74866c758d-pjjhw
Name:             koord-manager-74866c758d-pjjhw
Namespace:        koordinator-system
Priority:         0
Service Account:  koord-manager
Node:             10.0.10.167/10.0.10.167
Start Time:       Fri, 07 Jun 2024 10:54:44 +0000
Labels:           koord-app=koord-manager
                  pod-template-hash=74866c758d
Annotations:      <none>
Status:           Running
IP:               10.0.10.90
IPs:
  IP:           10.0.10.90
Controlled By:  ReplicaSet/koord-manager-74866c758d
Containers:
  manager:
    Container ID:  cri-o://e90d0ba9eecc114a3bc3de1f5aa13ba7260fb12eeacf4d36a40e94261d6fb6cb
    Image:         registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0
    Image ID:      bce57a219f5cd58028e6df2968c8fa1bedc5607bb44f0f7c25ee2a03616c4403
    Ports:         9876/TCP, 8080/TCP, 8000/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Command:
      /koord-manager
    Args:
      --enable-leader-election
      --metrics-addr=:8080
      --health-probe-addr=:8000
      --logtostderr=true
      --leader-election-namespace=koordinator-system
      --v=4
      --feature-gates=
      --sync-period=0
      --config-namespace=koordinator-system
    State:          Running
      Started:      Mon, 10 Jun 2024 19:09:29 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 07 Jun 2024 10:55:08 +0000
      Finished:     Mon, 10 Jun 2024 19:09:25 +0000
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     500m
      memory:  256Mi
    Readiness:  http-get http://:8000/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAMESPACE:                              koordinator-system (v1:metadata.namespace)
      WEBHOOK_PORT:                               9876
      WEBHOOK_CONFIGURATION_FAILURE_POLICY_PODS:  Ignore
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jbhc8 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  kube-api-access-jbhc8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
```
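Since the previous container instance terminated with exit code 1 (rather than being OOM-killed), its final log lines are the key evidence; they can be retrieved with kubectl's `--previous` flag:

```bash
# Fetch the logs of the previous, terminated container instance of the pod.
kubectl -n koordinator-system logs koord-manager-74866c758d-pjjhw --previous
```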
The relevant log information is as follows:
```
# koord-manager-74866c758d-pjjhw
2024-06-10T19:09:25.669286185+00:00 stderr F E0610 19:09:25.668766       1 leaderelection.go:332] error retrieving resource lock koordinator-system/koordinator-manager: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/koordinator-system/leases/koordinator-manager": context deadline exceeded
2024-06-10T19:09:25.669286185+00:00 stderr F I0610 19:09:25.669278       1 leaderelection.go:285] failed to renew lease koordinator-system/koordinator-manager: timed out waiting for the condition
2024-06-10T19:09:25.669483039+00:00 stderr F E0610 19:09:25.669356       1 main.go:196] setup "msg"="problem running manager" "error"="leader election lost"
```
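This looks like the standard fail-fast behavior of client-go leader election: the lease shown below has `leaseDurationSeconds: 15`, and when the GET on the lease object cannot complete within the renew deadline (10s by default in controller-runtime), the manager gives up leadership and exits with code 1 so the standby replica can take over; the kubelet then restarts the container. The restart is therefore a symptom, and the root cause is the lease request to the apiserver timing out. One rough way to probe this (an illustrative check, assuming kubectl access on a similarly placed node, not part of the original report) is to time the exact lease GET that failed in the log:

```bash
# Illustrative check: time the same lease GET that the manager's leader-election
# loop performs. Sustained latencies approaching the renew deadline would
# reproduce the "context deadline exceeded" error above.
time kubectl get --raw \
  "/apis/coordination.k8s.io/v1/namespaces/koordinator-system/leases/koordinator-manager"
```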
```
# kubelet on 10.0.10.167
Jun 10 19:09:26 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 kubelet[7591]: I0610 19:09:26.475498    7591 kubelet.go:2447] "SyncLoop (PLEG): event for pod" pod="koordinator-system/koord-manager-74866c758d-pjjhw" event={"ID":"be73fb71-158b-45f2-b94d-cc60b53c6df3","Type":"ContainerDied","Data":"223e8d1e3e62e7da67674d57fcf53c768375fe9ed5b6af2ebb4fd2021e3fb6f6"}
Jun 10 19:09:26 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:26.476383199Z" level=info msg="Checking image status: registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0" id=5e764961-15f5-4cc9-a2e0-014c879ada64 name=/runtime.v1.ImageService/ImageStatus
Jun 10 19:09:26 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:26.476990674Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:bce57a219f5cd58028e6df2968c8fa1bedc5607bb44f0f7c25ee2a03616c4403,RepoTags:[registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0],RepoDigests:[registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager@sha256:136fb81deaba5c17b5cb4d5e575686a509bc5ece49e90baef182b25aaf6842c3 registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager@sha256:e15c4ca3119ad851d1ddc8705f563b51f8a6c7ed7f7637440ffcba53509f0d5e],Size_:68533133,Uid:&Int64Value{Value:0,},Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},UserSpecifiedImage:,RuntimeHandler:,},Pinned:false,},Info:map[string]string{},}" id=5e764961-15f5-4cc9-a2e0-014c879ada64 name=/runtime.v1.ImageService/ImageStatus
Jun 10 19:09:26 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:26.477546051Z" level=info msg="Pulling image: registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0" id=1bfc38a2-4e42-476f-8b94-80656c513ac1 name=/runtime.v1.ImageService/PullImage
Jun 10 19:09:26 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:26.614887179Z" level=info msg="Trying to access \"registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0\""
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.670336576Z" level=info msg="Pulled image: registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager@sha256:136fb81deaba5c17b5cb4d5e575686a509bc5ece49e90baef182b25aaf6842c3" id=1bfc38a2-4e42-476f-8b94-80656c513ac1 name=/runtime.v1.ImageService/PullImage
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.672562149Z" level=info msg="Checking image status: registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0" id=95c3ddc1-c6bd-4671-a087-82db868aa6ef name=/runtime.v1.ImageService/ImageStatus
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.672757630Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:bce57a219f5cd58028e6df2968c8fa1bedc5607bb44f0f7c25ee2a03616c4403,RepoTags:[registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0],RepoDigests:[registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager@sha256:136fb81deaba5c17b5cb4d5e575686a509bc5ece49e90baef182b25aaf6842c3 registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager@sha256:e15c4ca3119ad851d1ddc8705f563b51f8a6c7ed7f7637440ffcba53509f0d5e],Size_:68533133,Uid:&Int64Value{Value:0,},Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},UserSpecifiedImage:,RuntimeHandler:,},Pinned:false,},Info:map[string]string{},}" id=95c3ddc1-c6bd-4671-a087-82db868aa6ef name=/runtime.v1.ImageService/ImageStatus
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.673626493Z" level=info msg="Creating container: koordinator-system/koord-manager-74866c758d-pjjhw/manager" id=1415010d-3d2f-4d83-83b8-99119f3fbdff name=/runtime.v1.RuntimeService/CreateContainer
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.745286162Z" level=info msg="Created container e90d0ba9eecc114a3bc3de1f5aa13ba7260fb12eeacf4d36a40e94261d6fb6cb: koordinator-system/koord-manager-74866c758d-pjjhw/manager" id=1415010d-3d2f-4d83-83b8-99119f3fbdff name=/runtime.v1.RuntimeService/CreateContainer
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.752170566Z" level=info msg="Started container" PID=733631 containerID=e90d0ba9eecc114a3bc3de1f5aa13ba7260fb12eeacf4d36a40e94261d6fb6cb description=koordinator-system/koord-manager-74866c758d-pjjhw/manager id=73ea7af0-bcd4-4825-9ed3-23d2d83fecef name=/runtime.v1.RuntimeService/StartContainer sandboxID=95c667f01abc6b00a828364108a0bafbf07b433a24e72cdd29da7930c82ae32b
Jun 10 19:09:30 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 kubelet[7591]: I0610 19:09:30.486423    7591 kubelet.go:2447] "SyncLoop (PLEG): event for pod" pod="koordinator-system/koord-manager-74866c758d-pjjhw" event={"ID":"be73fb71-158b-45f2-b94d-cc60b53c6df3","Type":"ContainerStarted","Data":"e90d0ba9eecc114a3bc3de1f5aa13ba7260fb12eeacf4d36a40e94261d6fb6cb"}
Jun 10 19:09:30 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 kubelet[7591]: I0610 19:09:30.487997    7591 kubelet.go:2519] "SyncLoop (probe)" probe="readiness" status="" pod="koordinator-system/koord-manager-74866c758d-pjjhw"
Jun 10 19:09:30 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 kubelet[7591]: I0610 19:09:30.495249    7591 kubelet.go:2519] "SyncLoop (probe)" probe="readiness" status="ready" pod="koordinator-system/koord-manager-74866c758d-pjjhw"
```
The lease information is as follows:
```
(base) [root@demo demo]# kubectl get lease -n koordinator-system
NAME                  HOLDER                                                                    AGE
koord-descheduler     koord-descheduler-f7fb57d46-z8lwv_7674cd19-b9cd-4ab7-a03a-d860816d5953   3d17h
koord-scheduler       koord-scheduler-689ff4b98d-zgmr6_44de532f-3e2d-4903-a98a-8013f99155db    3d17h
koordinator-manager   koord-manager-74866c758d-h68kc_835791dc-866c-475b-89ad-cc48862d79b2      3d17h
(base) [root@demo demo]# kubectl get lease koordinator-manager -o yaml -n koordinator-system
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2024-06-07T10:55:09Z"
  name: koordinator-manager
  namespace: koordinator-system
  resourceVersion: "1998501"
  uid: 1e9bc283-9264-48cc-bab8-848584859e75
spec:
  acquireTime: "2024-06-10T19:11:21.606454Z"
  holderIdentity: koord-manager-74866c758d-h68kc_835791dc-866c-475b-89ad-cc48862d79b2
  leaseDurationSeconds: 15
  leaseTransitions: 2
  renewTime: "2024-06-11T04:19:37.976706Z"
```
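The lease is consistent with a failover: `leaseTransitions: 2` and the holder is now the other replica (koord-manager-74866c758d-h68kc). If the problem recurs, renewals can be watched live (an illustrative command, not from the original report); the holder normally renews roughly every 2s (the client-go default retry period), so long gaps between `renewTime` updates point at apiserver latency:

```bash
# Watch leader-lease renewals; a gap approaching leaseDurationSeconds (15s
# here) means the holder is at risk of losing the election again.
kubectl -n koordinator-system get lease koordinator-manager -w \
  -o custom-columns=HOLDER:.spec.holderIdentity,RENEWTIME:.spec.renewTime
```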
Anything else we need to know?:
Environment:
kubectl version
```
(base) [root@demo ~]# kubectl version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1
```
Install method:

```
helm install koordinator koordinator-sh/koordinator --version 1.5.0
```
Duplicate of #2096.