Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] AdvancedCronJob BroadcastJob ttlSecondsAfterFinished 重复创建BroadcastJob #792

Closed
Daryljocky opened this issue Nov 3, 2021 · 4 comments
Assignees
Labels
kind/bug Something isn't working

Comments

@Daryljocky
Copy link

Daryljocky commented Nov 3, 2021

What happened:
在AdvancedCronJob中配置了BroadcastJob, 并设置了ttlSecondsAfterFinished: 120
当BroadcastJob执行完成并结束了BroadcastJob和pod的删除后, 重复创建了一个新的BroadcastJob, 然后创建新的pod进行任务
此现象以tlSecondsAfterFinished设置的时间间隔反复循环

What you expected to happen:
任务执行完成, 垃圾回收结束, 等待下一次定时任务

Anything else we need to know?:
AdvancedCronJob 配置如下

apiVersion: apps.kruise.io/v1alpha1
kind: AdvancedCronJob
metadata:
  name: cutted-log-archive
spec:
  schedule: '25 18-23,0-14 * * *''
  template:
    broadcastJobTemplate:
      metadata:
        name: cutted-log-archive
      spec:
        completionPolicy:
          type: Always
          ttlSecondsAfterFinished: 120
        template:
          spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                  - matchExpressions:
                    - key: type
                      operator: In
                      values:
                      - app
            containers:
            - image: harbor.homecredit.cn/hcc/handle-applog:cephs3-1.0
              name: cutted-log-archive
              command:
                - /usr/bin/python3
                - /usr/local/src/cutted-log-archive.py
                - ">> /tmp/backup_cutted.log 2>&1"
              volumeMounts:
                - mountPath: /usr/local/src/
                  name: archive-applog-cm
                - mountPath: /tmp/
                  name: archive-record
                - mountPath: /data/logs/
                  name: local-path
            restartPolicy: OnFailure
            volumes:
              - configMap:
                  name: archive-log
                name: archive-applog-cm
              - hostPath:
                  path: /data/log-archive/
                  type: ""
                name: archive-record
              - hostPath:
                  path: /data/logs/
                  type: ""
                name: local-path

kruise controller 重复创建BroadcastJob 日志如下:

I1103 11:53:17.550729       1 advancedcronjob_controller.go:167] Updating job cutted-log-archive status v1alpha1.AdvancedCronJobStatus{Type:"BroadcastJob", Active:[]v1.ObjectReference(nil), LastScheduleTime:(*v1.Time)(0xc0052b6c60)}
I1103 11:53:17.615841       1 advancedcronjob_broadcastjob_controller.go:290] no upcoming scheduled times, sleeping until next now 2021-11-03 11:53:17.615792441 +0000 UTC m=+165223.181144067 and next run 2021-11-03 12:25:00 +0000 UTC log-archive/cutted-log-archive
I1103 11:53:22.500891       1 webhook_controller.go:230] Starting to sync webhook certs and configurations
I1103 11:53:22.566797       1 webhook_controller.go:232] Finished to sync webhook certs and configurations
I1103 11:53:40.292338       1 reflector.go:496] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225: Watch close - *v1alpha1.DaemonSet total 0 items received
I1103 11:53:55.224994       1 reflector.go:496] k8s.io/client-go/informers/factory.go:135: Watch close - *v1beta1.MutatingWebhookConfiguration total 0 items received
I1103 11:54:17.225667       1 reflector.go:496] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:137: Watch close - *v1.Pod total 1582 items received
I1103 11:54:22.567021       1 webhook_controller.go:230] Starting to sync webhook certs and configurations
I1103 11:54:22.670612       1 webhook_controller.go:232] Finished to sync webhook certs and configurations
I1103 11:54:28.135719       1 reflector.go:496] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:137: Watch close - *v1.Job total 0 items received
I1103 11:54:46.004101       1 reflector.go:496] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225: Watch close - *v1.StatefulSet total 0 items received
I1103 11:54:46.225937       1 reflector.go:496] pkg/webhook/util/controller/webhook_controller.go:194: Watch close - *v1beta1.CustomResourceDefinition total 0 items received
I1103 11:55:17.000254       1 broadcastjob_controller.go:184] deleting the job cutted-log-archive-1635938700
I1103 11:55:17.009546       1 controller.go:270] controller "msg"="Successfully Reconciled" "controller"="broadcastjob-controller" "name"="cutted-log-archive-1635938700" "namespace"="log-archive"
I1103 11:55:17.009619       1 advancedcronjob_controller.go:131] Running AdvancedCronJob job log-archive/cutted-log-archive
I1103 11:55:17.009645       1 controller.go:270] controller "msg"="Successfully Reconciled" "controller"="broadcastjob-controller" "name"="cutted-log-archive-1635938700" "namespace"="log-archive"
I1103 11:55:17.009673       1 advancedcronjob_broadcastjob_controller.go:134] advancedCronJob count  active advancedCronJob 0 successful advancedCronJob 0 failed advancedCronJob 0 log-archive/cutted-log-archive
I1103 11:55:17.009687       1 advancedcronjob_controller.go:167] Updating job cutted-log-archive status v1alpha1.AdvancedCronJobStatus{Type:"BroadcastJob", Active:[]v1.ObjectReference(nil), LastScheduleTime:(*v1.Time)(nil)}
I1103 11:55:17.037224       1 http.go:84] controller-runtime/webhook/webhooks "msg"="received request" "webhook"="/validate-apps-kruise-io-v1alpha1-broadcastjob" "UID"="b9f5e753-b75d-41c7-8bba-99ab6ed0534d" "kind"={"group":"apps.kruise.io","version":"v1alpha1","kind":"BroadcastJob"} "resource"={"group":"apps.kruise.io","version":"v1alpha1","resource":"broadcastjobs"}
I1103 11:55:17.037470       1 http.go:106] controller-runtime/webhook/webhooks "msg"="wrote response" "code"=200 "reason"="allowed to be admitted" "webhook"="/validate-apps-kruise-io-v1alpha1-broadcastjob" "UID"="b9f5e753-b75d-41c7-8bba-99ab6ed0534d" "allowed"=true
I1103 11:55:17.046298       1 advancedcronjob_broadcastjob_controller.go:379] created BroadcastJob for CronJob runjob&{{ } {cutted-log-archive-1635938700  log-archive  80d87f53-fc07-4210-99d8-07336b244b00 29742170 1 2021-11-03 11:55:17 +0000 UTC <nil> <nil> map[] map[apps.kruise.io/scheduled-at:2021-11-03T11:25:00Z] [{apps.kruise.io/v1alpha1 AdvancedCronJob cutted-log-archive 64165edf-06ce-4a78-b2fd-49e398a3b4f4 0xc0012189ed 0xc0012189ec}] []  [{kruise-manager Update apps.kruise.io/v1alpha1 2021-11-03 11:55:17 +0000 UTC FieldsV1 &FieldsV1{Raw:*[123 34 102 58 109 101 116 97 100 97 116 97 34 58 123 34 102 58 97 110 110 111 116 97 116 105 111 110 115 34 58 123 34 46 34 58 123 125 44 34 102 58 97 112 112 115 46 107 114 117 105 115 101 46 105 111 47 115 99 104 101 100 117 108 101 100 45 97 116 34 58 123 125 125 44 34 102 58 111 119 110 101 114 82 101 102 101 114 101 110 99 101 115 34 58 123 125 125 44 34 102 58 115 112 101 99 34 58 123 34 46 34 58 123 125 44 34 102 58 99 111 109 112 108 101 116 105 111 110 80 111 108 105 99 121 34 58 123 34 46 34 58 123 125 44 34 102 58 116 116 108 83 101 99 111 110 100 115 65 102 116 101 114 70 105 110 105 115 104 101 100 34 58 123 125 44 34 102 58 116 121 112 101 34 58 123 125 125 44 34 102 58 102 97 105 108 117 114 101 80 111 108 105 99 121 34 58 123 125 44 34 102 58 116 101 109 112 108 97 116 101 34 58 123 34 46 34 58 123 125 44 34 102 58 109 101 116 97 100 97 116 97 34 58 123 34 46 34 58 123 125 44 34 102 58 99 114 101 97 116 105 111 110 84 105 109 101 115 116 97 109 112 34 58 123 125 125 44 34 102 58 115 112 101 99 34 58 123 34 46 34 58 123 125 44 34 102 58 97 102 102 105 110 105 116 121 34 58 123 34 46 34 58 123 125 44 34 102 58 110 111 100 101 65 102 102 105 110 105 116 121 34 58 123 34 46 34 58 123 125 44 34 102 58 114 101 113 117 105 114 101 100 68 117 114 105 110 103 83 99 104 101 100 117 108 105 110 103 73 103 110 111 114 101 100 68 117 114 105 110 103 69 120 101 99 117 116 105 111 110 34 58 123 34 46 34 58 123 125 44 34 102 58 110 111 100 101 83 101 108 101 99 116 111 114 84 101 114 109 115 34 58 123 125 125 125 125 44 34 102 58 99 111 110 116 97 105 110 101 114 115 34 58 123 125 44 34 102 58 100 110 115 80 111 108 105 99 121 34 58 123 125 44 34 102 58 114 101 115 116 97 114 116 80 111 108 105 99 121 34 58 123 125 44 34 102 58 115 99 104 101 100 117 108 101 114 78 97 109 101 34 58 123 125 44 34 102 58 115 101 99 117 114 105 116 121 67 111 110 116 101 120 116 34 58 123 125 44 34 102 58 116 101 114 109 105 110 97 116 105 111 110 71 114 97 99 101 80 101 114 105 111 100 83 101 99 111 110 100 115 34 58 123 125 44 34 102 58 118 111 108 117 109 101 115 34 58 123 125 125 125 125 44 34 102 58 115 116 97 116 117 115 34 58 123 34 46 34 58 123 125 44 34 102 58 97 99 116 105 118 101 34 58 123 125 44 34 102 58 100 101 115 105 114 101 100 34 58 123 125 44 34 102 58 102 97 105 108 101 100 34 58 123 125 44 34 102 58 112 104 97 115 101 34 58 123 125 44 34 102 58 115 117 99 99 101 101 100 101 100 34 58 123 125 125 125],}}]} {2147483647 {{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []} {[{archive-applog-cm {nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil &ConfigMapVolumeSource{LocalObjectReference:LocalObjectReference{Name:archive-log,},Items:[]KeyToPath{},DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil}} {archive-record {&HostPathVolumeSource{Path:/data/log-archive/,Type:*,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}} {local-path {&HostPathVolumeSource{Path:/data/logs/,Type:*,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}] [] [{cutted-log-archive harbor.homecredit.cn/hcc/handle-applog:cephs3-1.0 [/usr/bin/python3 /usr/local/src/cutted-log-archive.py >> /tmp/backup_cutted.log 2>&1] []  [] [] [] {map[] map[]} [{archive-applog-cm false /usr/local/src/  <nil> } {archive-record false /tmp/  <nil> } {local-path false /data/logs/  <nil> }] [] nil nil nil nil /dev/termination-log File Always nil false false false}] [] OnFailure 0xc0012189e0 <nil> ClusterFirst map[]   <nil>  false false false <nil> &PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:nil,FSGroupChangePolicy:nil,} []   &Affinity{NodeAffinity:&NodeAffinity{RequiredDuringSchedulingIgnoredDuringExecution:&NodeSelector{NodeSelectorTerms:[]NodeSelectorTerm{NodeSelectorTerm{MatchExpressions:[]NodeSelectorRequirement{NodeSelectorRequirement{Key:type,Operator:In,Values:[app],},},MatchFields:[]NodeSelectorRequirement{},},},},PreferredDuringSchedulingIgnoredDuringExecution:[]PreferredSchedulingTerm{},},PodAffinity:nil,PodAntiAffinity:nil,} default-scheduler [] []  <nil> nil [] <nil> <nil> <nil> map[] []}} {Always <nil> 0xc0012189e8} false {FailFast 0}} {[] <nil> <nil> 0 0 0 0 }} log-archive/cutted-log-archive
I1103 11:55:17.046670       1 advancedcronjob_controller.go:131] Running AdvancedCronJob job log-archive/cutted-log-archive
I1103 11:55:17.046741       1 advancedcronjob_broadcastjob_controller.go:134] advancedCronJob count  active advancedCronJob 1 successful advancedCronJob 0 failed advancedCronJob 0 log-archive/cutted-log-archive
I1103 11:55:17.046766       1 advancedcronjob_controller.go:167] Updating job cutted-log-archive status v1alpha1.AdvancedCronJobStatus{Type:"BroadcastJob", Active:[]v1.ObjectReference{v1.ObjectReference{Kind:"BroadcastJob", Namespace:"log-archive", Name:"cutted-log-archive-1635938700", UID:"80d87f53-fc07-4210-99d8-07336b244b00", APIVersion:"apps.kruise.io/v1alpha1", ResourceVersion:"29742170", FieldPath:""}}, LastScheduleTime:(*v1.Time)(0xc0034e6b60)}
E1103 11:55:17.047757       1 broadcastjob_controller.go:646] Failed predicate on node whdck8s001.cn.prod : node(s) didn't match node selector

截图
image

Environment:

  • Kruise version: 0.10.0
  • Kubernetes version (use kubectl version): 1.20.8
  • OS (e.g: cat /etc/os-release): oracle-linux 8.4
  • Kernel (e.g. uname -a): 5.4.17-2102.203.6.el8uek.x86_64
@Daryljocky Daryljocky added the kind/bug Something isn't working label Nov 3, 2021
@FillZpp
Copy link
Member

FillZpp commented Nov 3, 2021

Let me reproduce it later.

@FillZpp
Copy link
Member

FillZpp commented Nov 5, 2021

任务执行完成, 垃圾回收结束, 等待下一次定时任务

@Daryljocky 简单来说,AdvancedCronJob 有一个机制叫补偿运行(原生 CronJob 也有)。如果 AdvancedCronJob 能看到前一次运行的 BroadcastJob 完成(比如是 11:25 运行的),那它会等下一次 12:25 再触发执行。但是如果前一个 BroadcastJob 被删掉了,AdvancedCronJob 就会认为 11:25 这次没有执行,它会先尝试去补偿运行 11:25 这次的缺漏。(其实很好理解,controller 本身是异步执行的,所以即使用户 schedule 写的是明确 11:25,controller 在实际执行的时候也可能会比这个时间晚个几毫秒到几秒钟都有可能,但这个时候 controller 不能因为超过了一点时间就不执行这次任务)。

因此,你需要配置 AdvancedCronJob 里的 startingDeadlineSeconds 时间,这个时间的含义是如果当前的时间减去缺漏的前一次预期运行时间已经超过了 startingDeadlineSeconds 定义的时间,则 controller 就不再去补偿运行前一次的任务。这个 startingDeadlineSeconds 时间你需要定义为小于 BroadcastJob 里的 TTL 清理时间(这样就保证 BroadcastJob 被清理的时候,其实已经超过 AdvancedCronJob 可补偿的时段),应该就能解决你的问题了。

apiVersion: apps.kruise.io/v1alpha1
kind: AdvancedCronJob
metadata:
  name: cutted-log-archive
spec:
  schedule: '25 18-23,0-14 * * *'
  startingDeadlineSeconds: 60
  template:
    ...

(可参考原生 CronJob 里对 startingDeadlineSeconds 的介绍 https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-job-limitations

@Daryljocky
Copy link
Author

明白了, 感谢!

@FillZpp
Copy link
Member

FillZpp commented Nov 5, 2021

不客气,方便的话可以帮社区登记一下 #289

@FillZpp FillZpp closed this as completed Nov 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants