[BUG]koord-manager failed to renew lease #2096

Open
b43646 opened this issue Jun 11, 2024 · 5 comments
Labels
area/koord-manager · kind/bug · kind/question

Comments


b43646 commented Jun 11, 2024

What happened:

After running for about 3 days, the koord-manager pods were found to have restarted.

What you expected to happen:

The koord-manager pods keep running stably without any abnormal restarts.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy Koordinator version 1.5.0.
  2. Run the workload.
(base) [root@demo demo]# cat pod-group.yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-example
  namespace: default
spec:
  scheduleTimeoutSeconds: 100
  minMember: 10
(base) [root@demo demo]# cat setup.sh
#!/bin/bash

# Generate 50 pod manifests that all join the gang-example PodGroup, one file per pod under ./demo2/.
mkdir -p ./demo2

for i in {1..50}
do
    new_name="test-$i"
    template='apiVersion: v1
kind: Pod
metadata:
  name: pod-example1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: gang-example
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    imagePullPolicy: IfNotPresent
    name: curlimage
    resources:
      limits:
        cpu: 40m
        memory: 40Mi
      requests:
        cpu: 40m
        memory: 40Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  restartPolicy: Always'

    # Rename every "name:" field (the Pod and its container) to test-$i and write the manifest.
    new_template=$(echo "$template" | sed "/^ *name: /s/:.*/: $new_name/")

    echo "$new_template" > ./demo2/$new_name.yaml
done

(base) [root@demo demo]# kubectl  apply -f demo2/
  3. Check the running status:
(base) [root@demo demo]# kubectl -n koordinator-system get pods -o wide
NAME                                READY   STATUS    RESTARTS     AGE     IP            NODE          NOMINATED NODE   READINESS GATES
koord-descheduler-f7fb57d46-82npl   1/1     Running   0            3d17h   10.0.10.158   10.0.10.197   <none>           <none>
koord-descheduler-f7fb57d46-z8lwv   1/1     Running   0            3d17h   10.0.10.141   10.0.10.123   <none>           <none>
koord-manager-74866c758d-h68kc      1/1     Running   1 (8h ago)   3d17h   10.0.10.252   10.0.10.123   <none>           <none>
koord-manager-74866c758d-pjjhw      1/1     Running   1 (9h ago)   3d17h   10.0.10.90    10.0.10.167   <none>           <none>
koord-scheduler-689ff4b98d-9hngg    1/1     Running   0            3d17h   10.0.10.113   10.0.10.167   <none>           <none>
koord-scheduler-689ff4b98d-zgmr6    1/1     Running   0            3d17h   10.0.10.133   10.0.10.123   <none>           <none>
koordlet-czwrr                      1/1     Running   0            3d17h   10.0.10.123   10.0.10.123   <none>           <none>
koordlet-k2pd7                      1/1     Running   0            3d17h   10.0.10.167   10.0.10.167   <none>           <none>
koordlet-rsxhs                      1/1     Running   0            3d17h   10.0.10.197   10.0.10.197   <none>           <none>

The description of koord-manager-74866c758d-pjjhw is as follows:

(base) [root@demo ~]# kubectl  -n koordinator-system describe pod koord-manager-74866c758d-pjjhw
Name:             koord-manager-74866c758d-pjjhw
Namespace:        koordinator-system
Priority:         0
Service Account:  koord-manager
Node:             10.0.10.167/10.0.10.167
Start Time:       Fri, 07 Jun 2024 10:54:44 +0000
Labels:           koord-app=koord-manager
                  pod-template-hash=74866c758d
Annotations:      <none>
Status:           Running
IP:               10.0.10.90
IPs:
  IP:           10.0.10.90
Controlled By:  ReplicaSet/koord-manager-74866c758d
Containers:
  manager:
    Container ID:  cri-o://e90d0ba9eecc114a3bc3de1f5aa13ba7260fb12eeacf4d36a40e94261d6fb6cb
    Image:         registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0
    Image ID:      bce57a219f5cd58028e6df2968c8fa1bedc5607bb44f0f7c25ee2a03616c4403
    Ports:         9876/TCP, 8080/TCP, 8000/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Command:
      /koord-manager
    Args:
      --enable-leader-election
      --metrics-addr=:8080
      --health-probe-addr=:8000
      --logtostderr=true
      --leader-election-namespace=koordinator-system
      --v=4
      --feature-gates=
      --sync-period=0
      --config-namespace=koordinator-system
    State:          Running
      Started:      Mon, 10 Jun 2024 19:09:29 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 07 Jun 2024 10:55:08 +0000
      Finished:     Mon, 10 Jun 2024 19:09:25 +0000
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      500m
      memory:   256Mi
    Readiness:  http-get http://:8000/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAMESPACE:                              koordinator-system (v1:metadata.namespace)
      WEBHOOK_PORT:                               9876
      WEBHOOK_CONFIGURATION_FAILURE_POLICY_PODS:  Ignore
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jbhc8 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  kube-api-access-jbhc8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

The relevant log information is as follows:

# koord-manager-74866c758d-pjjhw
2024-06-10T19:09:25.669286185+00:00 stderr F E0610 19:09:25.668766       1 leaderelection.go:332] error retrieving resource lock koordinator-system/koordinator-manager: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/koordinator-system/leases/koordinator-manager": context deadline exceeded
2024-06-10T19:09:25.669286185+00:00 stderr F I0610 19:09:25.669278       1 leaderelection.go:285] failed to renew lease koordinator-system/koordinator-manager: timed out waiting for the condition
2024-06-10T19:09:25.669483039+00:00 stderr F E0610 19:09:25.669356       1 main.go:196] setup "msg"="problem running manager" "error"="leader election lost"
# kubelet on 10.0.10.167

Jun 10 19:09:26 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 kubelet[7591]: I0610 19:09:26.475498    7591 kubelet.go:2447] "SyncLoop (PLEG): event for pod" pod="koordinator-system/koord-manager-74866c758d-pjjhw" event={"ID":"be73fb71-158b-45f2-b94d-cc60b53c6df3","Type":"ContainerDied","Data":"223e8d1e3e62e7da67674d57fcf53c768375fe9ed5b6af2ebb4fd2021e3fb6f6"}
Jun 10 19:09:26 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:26.476383199Z" level=info msg="Checking image status: registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0" id=5e764961-15f5-4cc9-a2e0-014c879ada64 name=/runtime.v1.ImageService/ImageStatus
Jun 10 19:09:26 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:26.476990674Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:bce57a219f5cd58028e6df2968c8fa1bedc5607bb44f0f7c25ee2a03616c4403,RepoTags:[registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0],RepoDigests:[registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager@sha256:136fb81deaba5c17b5cb4d5e575686a509bc5ece49e90baef182b25aaf6842c3 registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager@sha256:e15c4ca3119ad851d1ddc8705f563b51f8a6c7ed7f7637440ffcba53509f0d5e],Size_:68533133,Uid:&Int64Value{Value:0,},Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},UserSpecifiedImage:,RuntimeHandler:,},Pinned:false,},Info:map[string]string{},}" id=5e764961-15f5-4cc9-a2e0-014c879ada64 name=/runtime.v1.ImageService/ImageStatus
Jun 10 19:09:26 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:26.477546051Z" level=info msg="Pulling image: registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0" id=1bfc38a2-4e42-476f-8b94-80656c513ac1 name=/runtime.v1.ImageService/PullImage
Jun 10 19:09:26 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:26.614887179Z" level=info msg="Trying to access \"registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0\""
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.670336576Z" level=info msg="Pulled image: registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager@sha256:136fb81deaba5c17b5cb4d5e575686a509bc5ece49e90baef182b25aaf6842c3" id=1bfc38a2-4e42-476f-8b94-80656c513ac1 name=/runtime.v1.ImageService/PullImage
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.672562149Z" level=info msg="Checking image status: registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0" id=95c3ddc1-c6bd-4671-a087-82db868aa6ef name=/runtime.v1.ImageService/ImageStatus
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.672757630Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:bce57a219f5cd58028e6df2968c8fa1bedc5607bb44f0f7c25ee2a03616c4403,RepoTags:[registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0],RepoDigests:[registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager@sha256:136fb81deaba5c17b5cb4d5e575686a509bc5ece49e90baef182b25aaf6842c3 registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager@sha256:e15c4ca3119ad851d1ddc8705f563b51f8a6c7ed7f7637440ffcba53509f0d5e],Size_:68533133,Uid:&Int64Value{Value:0,},Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},UserSpecifiedImage:,RuntimeHandler:,},Pinned:false,},Info:map[string]string{},}" id=95c3ddc1-c6bd-4671-a087-82db868aa6ef name=/runtime.v1.ImageService/ImageStatus
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.673626493Z" level=info msg="Creating container: koordinator-system/koord-manager-74866c758d-pjjhw/manager" id=1415010d-3d2f-4d83-83b8-99119f3fbdff name=/runtime.v1.RuntimeService/CreateContainer
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.745286162Z" level=info msg="Created container e90d0ba9eecc114a3bc3de1f5aa13ba7260fb12eeacf4d36a40e94261d6fb6cb: koordinator-system/koord-manager-74866c758d-pjjhw/manager" id=1415010d-3d2f-4d83-83b8-99119f3fbdff name=/runtime.v1.RuntimeService/CreateContainer
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.752170566Z" level=info msg="Started container" PID=733631 containerID=e90d0ba9eecc114a3bc3de1f5aa13ba7260fb12eeacf4d36a40e94261d6fb6cb description=koordinator-system/koord-manager-74866c758d-pjjhw/manager id=73ea7af0-bcd4-4825-9ed3-23d2d83fecef name=/runtime.v1.RuntimeService/StartContainer sandboxID=95c667f01abc6b00a828364108a0bafbf07b433a24e72cdd29da7930c82ae32b
Jun 10 19:09:30 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 kubelet[7591]: I0610 19:09:30.486423    7591 kubelet.go:2447] "SyncLoop (PLEG): event for pod" pod="koordinator-system/koord-manager-74866c758d-pjjhw" event={"ID":"be73fb71-158b-45f2-b94d-cc60b53c6df3","Type":"ContainerStarted","Data":"e90d0ba9eecc114a3bc3de1f5aa13ba7260fb12eeacf4d36a40e94261d6fb6cb"}
Jun 10 19:09:30 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 kubelet[7591]: I0610 19:09:30.487997    7591 kubelet.go:2519] "SyncLoop (probe)" probe="readiness" status="" pod="koordinator-system/koord-manager-74866c758d-pjjhw"
Jun 10 19:09:30 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 kubelet[7591]: I0610 19:09:30.495249    7591 kubelet.go:2519] "SyncLoop (probe)" probe="readiness" status="ready" pod="koordinator-system/koord-manager-74866c758d-pjjhw"
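
For context, exiting when the lease cannot be renewed is the intended behavior of client-go/controller-runtime leader election, not a crash inside koord-manager itself. Below is a minimal, illustrative client-go sketch of that behavior; it is not koord-manager's actual code, the lock name and namespace are copied from the Lease shown further down, and the 15s/10s/2s durations are controller-runtime's defaults (consistent with leaseDurationSeconds: 15 in the Lease).

// Illustrative only, not koord-manager's implementation.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname()

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "koordinator-manager", Namespace: "koordinator-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long the lease is valid once acquired
		RenewDeadline: 10 * time.Second, // the leader must renew within this window...
		RetryPeriod:   2 * time.Second,  // ...retrying every 2s against the apiserver
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// run controllers/webhooks here while holding the lease
			},
			OnStoppedLeading: func() {
				// If every retry within RenewDeadline times out (e.g. the apiserver
				// is unreachable), leadership is lost and the process exits with
				// code 1, which matches the restart seen in the kubelet log above.
				klog.Error("leader election lost")
				os.Exit(1)
			},
		},
	})
}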

The lease information is as follows:

(base) [root@demo demo]# kubectl get lease -n koordinator-system
NAME                  HOLDER                                                                   AGE
koord-descheduler     koord-descheduler-f7fb57d46-z8lwv_7674cd19-b9cd-4ab7-a03a-d860816d5953   3d17h
koord-scheduler       koord-scheduler-689ff4b98d-zgmr6_44de532f-3e2d-4903-a98a-8013f99155db    3d17h
koordinator-manager   koord-manager-74866c758d-h68kc_835791dc-866c-475b-89ad-cc48862d79b2      3d17h
(base) [root@demo demo]# kubectl get lease koordinator-manager -o yaml   -n koordinator-system
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2024-06-07T10:55:09Z"
  name: koordinator-manager
  namespace: koordinator-system
  resourceVersion: "1998501"
  uid: 1e9bc283-9264-48cc-bab8-848584859e75
spec:
  acquireTime: "2024-06-10T19:11:21.606454Z"
  holderIdentity: koord-manager-74866c758d-h68kc_835791dc-866c-475b-89ad-cc48862d79b2
  leaseDurationSeconds: 15
  leaseTransitions: 2
  renewTime: "2024-06-11T04:19:37.976706Z"

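The Lease above also shows that the election recovered after the restart: the holder switched to koord-manager-74866c758d-h68kc, leaseTransitions is 2, and renewTime keeps advancing. To watch whether renewals keep succeeding while the network fluctuates, a small hypothetical client-go helper along these lines could be used; the kubeconfig path and the 5-second per-request timeout are assumptions for illustration, not anything koord-manager does.

// Hypothetical helper for this issue, not part of Koordinator: polls the
// koordinator-manager Lease and prints its holder and renewTime so that
// renewal gaps caused by apiserver timeouts become visible.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the default kubeconfig path (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for {
		// A short per-request timeout so slow apiserver responses surface as errors,
		// similar to the "context deadline exceeded" in the koord-manager log.
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		lease, err := client.CoordinationV1().Leases("koordinator-system").
			Get(ctx, "koordinator-manager", metav1.GetOptions{})
		cancel()
		if err != nil {
			fmt.Println("lease get failed:", err)
		} else if lease.Spec.HolderIdentity != nil && lease.Spec.RenewTime != nil {
			fmt.Printf("holder=%s renewTime=%s\n", *lease.Spec.HolderIdentity, lease.Spec.RenewTime)
		}
		time.Sleep(2 * time.Second)
	}
}
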
Anything else we need to know?:

Environment:

  • App version:
  • Kubernetes version (use kubectl version):
(base) [root@demo ~]# kubectl version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1

  • Install details (e.g. helm install args):
helm install koordinator koordinator-sh/koordinator --version 1.5.0
  • Node environment (for koordlet/runtime-proxy issue):
    • Containerd/Docker version:
    • OS version:
    • Kernel version:
    • Cgroup driver: cgroupfs/systemd
  • Others:
b43646 added the kind/bug label Jun 11, 2024
saintube (Member) commented

@b43646 Do you mean the koord-manager restarts when the PodGroup is submitted?

b43646 (Author) commented Jun 12, 2024

@saintube It seems that the pod restarted three days after the podgroup was submitted because the koord-manager failed to renew the lease.

saintube (Member) commented Jun 12, 2024

@saintube It seems that the pod restarted three days after the podgroup was submitted because the koord-manager failed to renew the lease.

@b43646 It might be a common issue with the cluster environment, where the koord-manager's requests to the apiserver time out. Is there any additional clue showing that the koord-manager behaves abnormally, so we can investigate?

b43646 (Author) commented Jun 13, 2024

@saintube The network is fluctuating and unstable. Does the koord-manager have a retry mechanism when renewing the lease?

saintube added the kind/question label Jun 13, 2024
saintube (Member) commented

@saintube The network is fluctuating and unstable. Does the koord-manager have a retry mechanism when renewing the lease?

@b43646 Yes. The koord-manager uses the leader-election mechanism of controller-runtime, so you can see that the manager is still working even after losing the lease twice.
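
For reference, a rough sketch of the controller-runtime knobs that govern this retry behavior; the field names come from sigs.k8s.io/controller-runtime's manager Options, the values shown are the library defaults, and whether koord-manager overrides them is not shown here.

// Hypothetical sketch, not koord-manager's actual setup: controller-runtime
// retries lease renewal every RetryPeriod and only gives up leadership (and
// exits) once RenewDeadline is exceeded.
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
	leaseDuration := 15 * time.Second // how long non-leaders wait before trying to take over
	renewDeadline := 10 * time.Second // the leader must renew within this window...
	retryPeriod := 2 * time.Second    // ...retrying every 2s until then

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "koordinator-manager",
		LeaderElectionNamespace: "koordinator-system",
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
}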
