[BUG]koord-manager failed to renew lease #2096

Open
b43646 opened this issue Jun 11, 2024 · 5 comments
Labels
area/koord-manager · kind/bug · kind/question

Comments


b43646 commented Jun 11, 2024

What happened:

After running for about 3 days, the koord-manager pods were found to have restarted.

What you expected to happen:

The koord-manager pods keep running stably without any abnormal restarts.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy Koordinator version 1.5.0.
  2. Run the workload.
(base) [root@demo demo]# cat pod-group.yaml
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: gang-example
  namespace: default
spec:
  scheduleTimeoutSeconds: 100
  minMember: 10
(base) [root@demo demo]# cat setup.sh
#!/bin/bash

# Generate 50 pod manifests that all join the gang-example PodGroup, one file per pod under ./demo2/.
mkdir -p ./demo2

for i in {1..50}
do
    new_name="test-$i"
    template='apiVersion: v1
kind: Pod
metadata:
  name: pod-example1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: gang-example
spec:
  schedulerName: koord-scheduler
  containers:
  - command:
    - sleep
    - 365d
    image: busybox
    imagePullPolicy: IfNotPresent
    name: curlimage
    resources:
      limits:
        cpu: 40m
        memory: 40Mi
      requests:
        cpu: 40m
        memory: 40Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  restartPolicy: Always'

    # Rename every "name:" field (the Pod and its container) to test-$i and write the manifest.
    new_template=$(echo "$template" | sed "/^ *name: /s/:.*/: $new_name/")

    echo "$new_template" > ./demo2/$new_name.yaml
done

(base) [root@demo demo]# kubectl  apply -f demo2/
  3. Check the running status:
(base) [root@demo demo]# kubectl -n koordinator-system get pods -o wide
NAME                                READY   STATUS    RESTARTS     AGE     IP            NODE          NOMINATED NODE   READINESS GATES
koord-descheduler-f7fb57d46-82npl   1/1     Running   0            3d17h   10.0.10.158   10.0.10.197   <none>           <none>
koord-descheduler-f7fb57d46-z8lwv   1/1     Running   0            3d17h   10.0.10.141   10.0.10.123   <none>           <none>
koord-manager-74866c758d-h68kc      1/1     Running   1 (8h ago)   3d17h   10.0.10.252   10.0.10.123   <none>           <none>
koord-manager-74866c758d-pjjhw      1/1     Running   1 (9h ago)   3d17h   10.0.10.90    10.0.10.167   <none>           <none>
koord-scheduler-689ff4b98d-9hngg    1/1     Running   0            3d17h   10.0.10.113   10.0.10.167   <none>           <none>
koord-scheduler-689ff4b98d-zgmr6    1/1     Running   0            3d17h   10.0.10.133   10.0.10.123   <none>           <none>
koordlet-czwrr                      1/1     Running   0            3d17h   10.0.10.123   10.0.10.123   <none>           <none>
koordlet-k2pd7                      1/1     Running   0            3d17h   10.0.10.167   10.0.10.167   <none>           <none>
koordlet-rsxhs                      1/1     Running   0            3d17h   10.0.10.197   10.0.10.197   <none>           <none>

The description of koord-manager-74866c758d-pjjhw is as follows:

(base) [root@demo ~]# kubectl  -n koordinator-system describe pod koord-manager-74866c758d-pjjhw
Name:             koord-manager-74866c758d-pjjhw
Namespace:        koordinator-system
Priority:         0
Service Account:  koord-manager
Node:             10.0.10.167/10.0.10.167
Start Time:       Fri, 07 Jun 2024 10:54:44 +0000
Labels:           koord-app=koord-manager
                  pod-template-hash=74866c758d
Annotations:      <none>
Status:           Running
IP:               10.0.10.90
IPs:
  IP:           10.0.10.90
Controlled By:  ReplicaSet/koord-manager-74866c758d
Containers:
  manager:
    Container ID:  cri-o://e90d0ba9eecc114a3bc3de1f5aa13ba7260fb12eeacf4d36a40e94261d6fb6cb
    Image:         registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0
    Image ID:      bce57a219f5cd58028e6df2968c8fa1bedc5607bb44f0f7c25ee2a03616c4403
    Ports:         9876/TCP, 8080/TCP, 8000/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Command:
      /koord-manager
    Args:
      --enable-leader-election
      --metrics-addr=:8080
      --health-probe-addr=:8000
      --logtostderr=true
      --leader-election-namespace=koordinator-system
      --v=4
      --feature-gates=
      --sync-period=0
      --config-namespace=koordinator-system
    State:          Running
      Started:      Mon, 10 Jun 2024 19:09:29 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 07 Jun 2024 10:55:08 +0000
      Finished:     Mon, 10 Jun 2024 19:09:25 +0000
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      500m
      memory:   256Mi
    Readiness:  http-get http://:8000/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAMESPACE:                              koordinator-system (v1:metadata.namespace)
      WEBHOOK_PORT:                               9876
      WEBHOOK_CONFIGURATION_FAILURE_POLICY_PODS:  Ignore
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jbhc8 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  kube-api-access-jbhc8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

The relevant log information is as follows:

# koord-manager-74866c758d-pjjhw
2024-06-10T19:09:25.669286185+00:00 stderr F E0610 19:09:25.668766       1 leaderelection.go:332] error retrieving resource lock koordinator-system/koordinator-manager: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/koordinator-system/leases/koordinator-manager": context deadline exceeded
2024-06-10T19:09:25.669286185+00:00 stderr F I0610 19:09:25.669278       1 leaderelection.go:285] failed to renew lease koordinator-system/koordinator-manager: timed out waiting for the condition
2024-06-10T19:09:25.669483039+00:00 stderr F E0610 19:09:25.669356       1 main.go:196] setup "msg"="problem running manager" "error"="leader election lost"
# kubelet on 10.0.10.167

Jun 10 19:09:26 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 kubelet[7591]: I0610 19:09:26.475498    7591 kubelet.go:2447] "SyncLoop (PLEG): event for pod" pod="koordinator-system/koord-manager-74866c758d-pjjhw" event={"ID":"be73fb71-158b-45f2-b94d-cc60b53c6df3","Type":"ContainerDied","Data":"223e8d1e3e62e7da67674d57fcf53c768375fe9ed5b6af2ebb4fd2021e3fb6f6"}
Jun 10 19:09:26 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:26.476383199Z" level=info msg="Checking image status: registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0" id=5e764961-15f5-4cc9-a2e0-014c879ada64 name=/runtime.v1.ImageService/ImageStatus
Jun 10 19:09:26 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:26.476990674Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:bce57a219f5cd58028e6df2968c8fa1bedc5607bb44f0f7c25ee2a03616c4403,RepoTags:[registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0],RepoDigests:[registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager@sha256:136fb81deaba5c17b5cb4d5e575686a509bc5ece49e90baef182b25aaf6842c3 registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager@sha256:e15c4ca3119ad851d1ddc8705f563b51f8a6c7ed7f7637440ffcba53509f0d5e],Size_:68533133,Uid:&Int64Value{Value:0,},Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},UserSpecifiedImage:,RuntimeHandler:,},Pinned:false,},Info:map[string]string{},}" id=5e764961-15f5-4cc9-a2e0-014c879ada64 name=/runtime.v1.ImageService/ImageStatus
Jun 10 19:09:26 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:26.477546051Z" level=info msg="Pulling image: registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0" id=1bfc38a2-4e42-476f-8b94-80656c513ac1 name=/runtime.v1.ImageService/PullImage
Jun 10 19:09:26 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:26.614887179Z" level=info msg="Trying to access \"registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0\""
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.670336576Z" level=info msg="Pulled image: registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager@sha256:136fb81deaba5c17b5cb4d5e575686a509bc5ece49e90baef182b25aaf6842c3" id=1bfc38a2-4e42-476f-8b94-80656c513ac1 name=/runtime.v1.ImageService/PullImage
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.672562149Z" level=info msg="Checking image status: registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0" id=95c3ddc1-c6bd-4671-a087-82db868aa6ef name=/runtime.v1.ImageService/ImageStatus
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.672757630Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:bce57a219f5cd58028e6df2968c8fa1bedc5607bb44f0f7c25ee2a03616c4403,RepoTags:[registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager:v1.5.0],RepoDigests:[registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager@sha256:136fb81deaba5c17b5cb4d5e575686a509bc5ece49e90baef182b25aaf6842c3 registry.cn-beijing.aliyuncs.com/koordinator-sh/koord-manager@sha256:e15c4ca3119ad851d1ddc8705f563b51f8a6c7ed7f7637440ffcba53509f0d5e],Size_:68533133,Uid:&Int64Value{Value:0,},Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},UserSpecifiedImage:,RuntimeHandler:,},Pinned:false,},Info:map[string]string{},}" id=95c3ddc1-c6bd-4671-a087-82db868aa6ef name=/runtime.v1.ImageService/ImageStatus
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.673626493Z" level=info msg="Creating container: koordinator-system/koord-manager-74866c758d-pjjhw/manager" id=1415010d-3d2f-4d83-83b8-99119f3fbdff name=/runtime.v1.RuntimeService/CreateContainer
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.745286162Z" level=info msg="Created container e90d0ba9eecc114a3bc3de1f5aa13ba7260fb12eeacf4d36a40e94261d6fb6cb: koordinator-system/koord-manager-74866c758d-pjjhw/manager" id=1415010d-3d2f-4d83-83b8-99119f3fbdff name=/runtime.v1.RuntimeService/CreateContainer
Jun 10 19:09:29 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 crio[6426]: time="2024-06-10 19:09:29.752170566Z" level=info msg="Started container" PID=733631 containerID=e90d0ba9eecc114a3bc3de1f5aa13ba7260fb12eeacf4d36a40e94261d6fb6cb description=koordinator-system/koord-manager-74866c758d-pjjhw/manager id=73ea7af0-bcd4-4825-9ed3-23d2d83fecef name=/runtime.v1.RuntimeService/StartContainer sandboxID=95c667f01abc6b00a828364108a0bafbf07b433a24e72cdd29da7930c82ae32b
Jun 10 19:09:30 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 kubelet[7591]: I0610 19:09:30.486423    7591 kubelet.go:2447] "SyncLoop (PLEG): event for pod" pod="koordinator-system/koord-manager-74866c758d-pjjhw" event={"ID":"be73fb71-158b-45f2-b94d-cc60b53c6df3","Type":"ContainerStarted","Data":"e90d0ba9eecc114a3bc3de1f5aa13ba7260fb12eeacf4d36a40e94261d6fb6cb"}
Jun 10 19:09:30 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 kubelet[7591]: I0610 19:09:30.487997    7591 kubelet.go:2519] "SyncLoop (probe)" probe="readiness" status="" pod="koordinator-system/koord-manager-74866c758d-pjjhw"
Jun 10 19:09:30 loren-cqogpfbtd7a-ndweaqlbrza-sfwwelgwt5a-1 kubelet[7591]: I0610 19:09:30.495249    7591 kubelet.go:2519] "SyncLoop (probe)" probe="readiness" status="ready" pod="koordinator-system/koord-manager-74866c758d-pjjhw"
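
For context, exiting when the lease cannot be renewed is the intended behavior of client-go/controller-runtime leader election, not a crash inside koord-manager itself. Below is a minimal, illustrative client-go sketch of that behavior; it is not koord-manager's actual code, the lock name and namespace are copied from the Lease shown further down, and the 15s/10s/2s durations are controller-runtime's defaults (consistent with leaseDurationSeconds: 15 in the Lease).

// Illustrative only, not koord-manager's implementation.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname()

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "koordinator-manager", Namespace: "koordinator-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long the lease is valid once acquired
		RenewDeadline: 10 * time.Second, // the leader must renew within this window...
		RetryPeriod:   2 * time.Second,  // ...retrying every 2s against the apiserver
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// run controllers/webhooks here while holding the lease
			},
			OnStoppedLeading: func() {
				// If every retry within RenewDeadline times out (e.g. the apiserver
				// is unreachable), leadership is lost and the process exits with
				// code 1, which matches the restart seen in the kubelet log above.
				klog.Error("leader election lost")
				os.Exit(1)
			},
		},
	})
}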

The lease information is as follows:

(base) [root@demo demo]# kubectl get lease -n koordinator-system
NAME                  HOLDER                                                                   AGE
koord-descheduler     koord-descheduler-f7fb57d46-z8lwv_7674cd19-b9cd-4ab7-a03a-d860816d5953   3d17h
koord-scheduler       koord-scheduler-689ff4b98d-zgmr6_44de532f-3e2d-4903-a98a-8013f99155db    3d17h
koordinator-manager   koord-manager-74866c758d-h68kc_835791dc-866c-475b-89ad-cc48862d79b2      3d17h
(base) [root@demo demo]# kubectl get lease koordinator-manager -o yaml   -n koordinator-system
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2024-06-07T10:55:09Z"
  name: koordinator-manager
  namespace: koordinator-system
  resourceVersion: "1998501"
  uid: 1e9bc283-9264-48cc-bab8-848584859e75
spec:
  acquireTime: "2024-06-10T19:11:21.606454Z"
  holderIdentity: koord-manager-74866c758d-h68kc_835791dc-866c-475b-89ad-cc48862d79b2
  leaseDurationSeconds: 15
  leaseTransitions: 2
  renewTime: "2024-06-11T04:19:37.976706Z"

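The Lease above also shows that the election recovered after the restart: the holder switched to koord-manager-74866c758d-h68kc, leaseTransitions is 2, and renewTime keeps advancing. To watch whether renewals keep succeeding while the network fluctuates, a small hypothetical client-go helper along these lines could be used; the kubeconfig path and the 5-second per-request timeout are assumptions for illustration, not anything koord-manager does.

// Hypothetical helper for this issue, not part of Koordinator: polls the
// koordinator-manager Lease and prints its holder and renewTime so that
// renewal gaps caused by apiserver timeouts become visible.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the default kubeconfig path (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for {
		// A short per-request timeout so slow apiserver responses surface as errors,
		// similar to the "context deadline exceeded" in the koord-manager log.
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		lease, err := client.CoordinationV1().Leases("koordinator-system").
			Get(ctx, "koordinator-manager", metav1.GetOptions{})
		cancel()
		if err != nil {
			fmt.Println("lease get failed:", err)
		} else if lease.Spec.HolderIdentity != nil && lease.Spec.RenewTime != nil {
			fmt.Printf("holder=%s renewTime=%s\n", *lease.Spec.HolderIdentity, lease.Spec.RenewTime)
		}
		time.Sleep(2 * time.Second)
	}
}
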
Anything else we need to know?:

Environment:

  • App version:
  • Kubernetes version (use kubectl version):
(base) [root@demo ~]# kubectl version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.1

  • Install details (e.g. helm install args):
helm install koordinator koordinator-sh/koordinator --version 1.5.0
  • Node environment (for koordlet/runtime-proxy issue):
    • Containerd/Docker version:
    • OS version:
    • Kernel version:
    • Cgroup driver: cgroupfs/systemd
  • Others:
b43646 added the kind/bug label Jun 11, 2024
saintube (Member) commented

@b43646 Do you mean the koord-manager restarts when the PodGroup is submitted?

b43646 (Author) commented Jun 12, 2024

@saintube It seems that the pod restarted three days after the podgroup was submitted because the koord-manager failed to renew the lease.

saintube (Member) commented Jun 12, 2024

@saintube It seems that the pod restarted three days after the podgroup was submitted because the koord-manager failed to renew the lease.

@b43646 It might be a common issue with the cluster environment, where the koord-manager's requests to the apiserver time out. Is there any additional clue showing that the koord-manager behaves abnormally, so we can investigate?

b43646 (Author) commented Jun 13, 2024

@saintube The network is fluctuating and unstable. Does the koord-manager have a retry mechanism when renewing the lease?

saintube added the kind/question label Jun 13, 2024
saintube (Member) commented

@saintube The network is fluctuating and unstable. Does the koord-manager have a retry mechanism when renewing the lease?

@b43646 Yes. The koord-manager uses the leader-election mechanism of controller-runtime, so you can see that the manager is still working even after losing the lease twice.
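
For reference, a rough sketch of the controller-runtime knobs that govern this retry behavior; the field names come from sigs.k8s.io/controller-runtime's manager Options, the values shown are the library defaults, and whether koord-manager overrides them is not shown here.

// Hypothetical sketch, not koord-manager's actual setup: controller-runtime
// retries lease renewal every RetryPeriod and only gives up leadership (and
// exits) once RenewDeadline is exceeded.
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
	leaseDuration := 15 * time.Second // how long non-leaders wait before trying to take over
	renewDeadline := 10 * time.Second // the leader must renew within this window...
	retryPeriod := 2 * time.Second    // ...retrying every 2s until then

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "koordinator-manager",
		LeaderElectionNamespace: "koordinator-system",
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
}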
