
PodGroup is constantly created and deleted after TFJob succeeds or fails #1426

Closed
qiankunli opened this issue Sep 30, 2021 · 5 comments

qiankunli commented Sep 30, 2021

tf-operator version: v1.2.1

Name:         v1-tensorflow-0930032344277
Namespace:    xdl-system
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         TFJob
Metadata:
  Creation Timestamp:  2021-09-29T19:23:53Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:runPolicy:
          .:
          f:cleanPodPolicy:
          f:ttlSecondsAfterFinished:
        f:tfReplicaSpecs:
          .:
          f:Worker:
            .:
            f:replicas:
            f:restartPolicy:
            f:template:
              .:
              f:metadata:
                .:
                f:labels:
                  .:
                  f:volcano.sh/queue-name:
              f:spec:
                .:
                f:affinity:
                  .:
                  f:nodeAffinity:
                    .:
                    f:requiredDuringSchedulingIgnoredDuringExecution:
                      .:
                      f:nodeSelectorTerms:
                f:volumes:
    Manager:      OpenAPI-Generator
    Operation:    Update
    Time:         2021-09-29T19:23:53Z
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:successPolicy:
        f:tfReplicaSpecs:
          f:Worker:
            f:template:
              f:metadata:
                f:creationTimestamp:
              f:spec:
                f:containers:
      f:status:
        .:
        f:completionTime:
        f:conditions:
        f:replicaStatuses:
          .:
          f:Worker:
            .:
            f:succeeded:
        f:startTime:
    Manager:         tf-operator.v1
    Operation:       Update
    Time:            2021-09-29T19:25:26Z
  Resource Version:  11170043394
  Self Link:         /apis/kubeflow.org/v1/namespaces/xdl-system/tfjobs/v1-tensorflow-0930032344277
  UID:               a0c9da24-0930-43c8-9f4d-7a0f9ccfb364
Spec:
  Run Policy:
    Clean Pod Policy:            Running
    Ttl Seconds After Finished:  259200
  Tf Replica Specs:
    Worker:
      Replicas:        1
      Restart Policy:  Never
      Template:
        Metadata:
          Labels:
            volcano.sh/queue-name:  default
        Spec:
          Affinity:
            Node Affinity:
              Required During Scheduling Ignored During Execution:
                Node Selector Terms:
                  Match Expressions:
                    Key:       queue
                    Operator:  In
                    Values:
                      product-ad
          Containers:
            Command:
              python
              run.py
            Env:xx
            Env From:
              Config Map Ref:
                Name:           sync-properties
            Image:              xx
            Image Pull Policy:  Always
            Name:               tensorflow
            Resources:
              Limits:
                Cpu:     6
                Memory:  15Gi
            Volume Mounts:
              Mount Path:  /dashboard
              Name:        dashboard-volume
              Read Only:   false
          Volumes:
            Host Path:
              Path:  /mnt/cephfs/xdl/dashboard
              Type:  DirectoryOrCreate
            Name:    dashboard-volume
Status:
  Completion Time:  2021-09-29T19:25:26Z
  Conditions:
    Last Transition Time:  2021-09-29T19:23:53Z
    Last Update Time:      2021-09-29T19:23:53Z
    Message:               TFJob v1-tensorflow-0930032344277 is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-09-29T19:25:26Z
    Last Update Time:      2021-09-29T19:25:26Z
    Message:               TFJob xdl-system/v1-tensorflow-0930032344277 successfully completed.
    Reason:                TFJobSucceeded
    Status:                True
    Type:                  Succeeded
  Replica Statuses:
    Worker:
      Succeeded:  1
  Start Time:     2021-09-29T19:24:40Z
Events:
  Type    Reason                    Age                  From         Message
  ----    ------                    ----                 ----         -------
  Normal  JobTerminated             18m (x759 over 12h)  tf-operator  Job has been terminated. Deleting PodGroup
  Normal  JobTerminated             16m                  tf-operator  Job has been terminated. Deleting PodGroup
  Normal  SuccessfulDeletePodGroup  11m (x2 over 16m)    tf-operator  Deleted PodGroup: v1-tensorflow-0930032344277
  Normal  JobTerminated             66s (x7 over 6m33s)  tf-operator  Job has been terminated. Deleting PodGroup
  Normal  SuccessfulDeletePodGroup  66s (x7 over 6m33s)  tf-operator  Deleted PodGroup: v1-tensorflow-0930032344277

The TFJob completed successfully on 2021/09/29, but the operator is constantly creating and deleting its PodGroup on 2021/09/30.
In particular, we set ttlSecondsAfterFinished to 3 days (259200 seconds).

Deleting the PodGroup by going to the API server directly can take some time, which leads to long scheduling latency for other jobs.
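To make the expectation explicit (a simplified sketch with illustrative names, not the actual operator code): with ttlSecondsAfterFinished=259200, cleanup should not be due until three days after completion, yet the PodGroup is already being deleted repeatedly the next day.

```go
package main

import (
	"fmt"
	"time"
)

// cleanupDue reports whether a finished job is past its
// ttlSecondsAfterFinished deadline. Illustrative only; the real
// operator works on the TFJob's status fields.
func cleanupDue(completionTime time.Time, ttlSeconds int64, now time.Time) bool {
	deadline := completionTime.Add(time.Duration(ttlSeconds) * time.Second)
	return !now.Before(deadline)
}

func main() {
	completed, _ := time.Parse(time.RFC3339, "2021-09-29T19:25:26Z")
	now, _ := time.Parse(time.RFC3339, "2021-09-30T12:00:00Z")
	// With ttlSecondsAfterFinished=259200 (3 days), cleanup should not be
	// due until 2021-10-02T19:25:26Z.
	fmt.Println(cleanupDue(completed, 259200, now)) // prints: false
}
```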

@gaocegege (Member) commented:
The PodGroup is created and deleted many times. It is weird.

/cc @kubeflow/wg-training-leads

@gaocegege (Member) commented:
Maybe it is related to Expectations.
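For context, roughly what Expectations do (a minimal sketch with illustrative names, not the actual kubeflow/common implementation): they count the creations/deletions a sync has issued and wait for the informer to confirm them before trusting the cache again.

```go
package main

import (
	"fmt"
	"sync"
)

// expectations is a minimal sketch of the controller Expectations
// pattern: per-job counters of deletions the controller still expects
// to observe from its informers. Illustrative only.
type expectations struct {
	mu   sync.Mutex
	dels map[string]int
}

// ExpectDeletions records that a sync just issued n delete calls for key.
func (e *expectations) ExpectDeletions(key string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.dels[key] += n
}

// DeletionObserved is called from the informer's delete event handler.
func (e *expectations) DeletionObserved(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.dels[key] > 0 {
		e.dels[key]--
	}
}

// Satisfied reports whether everything expected for key has been seen.
// If the delete event never arrives (for example because the object was
// already gone), the key stays unsatisfied.
func (e *expectations) Satisfied(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.dels[key] == 0
}

func main() {
	exp := &expectations{dels: map[string]int{}}
	key := "xdl-system/v1-tensorflow-0930032344277"
	exp.ExpectDeletions(key, 1)
	fmt.Println(exp.Satisfied(key)) // false until DeletionObserved(key) runs
}
```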

Jeffwan commented Oct 2, 2021

If the job status is succeeded or failed, we should actually skip reconciling. Does this problem happen only after a long time, like 24 hours? I am wondering why it starts reconciling the PodGroup again that long after the job finished.
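Roughly the kind of guard I mean (a simplified sketch with illustrative names, not the actual reconciler code): check for a terminal condition early in the sync handler and return before touching pods, services, or the PodGroup.

```go
package main

import "fmt"

// Illustrative condition type; the real reconciler checks the
// JobCondition list in the TFJob's status.
type jobCondition struct {
	Type   string // e.g. "Succeeded", "Failed"
	Status string // "True" or "False"
}

// isFinished reports whether the job has reached a terminal condition.
func isFinished(conds []jobCondition) bool {
	for _, c := range conds {
		if c.Status == "True" && (c.Type == "Succeeded" || c.Type == "Failed") {
			return true
		}
	}
	return false
}

func main() {
	// A guard like this near the top of the sync handler would return
	// early for jobs that already succeeded or failed.
	conds := []jobCondition{
		{Type: "Created", Status: "True"},
		{Type: "Succeeded", Status: "True"},
	}
	if isFinished(conds) {
		fmt.Println("job already finished, skip reconcile")
		return
	}
	fmt.Println("reconcile")
}
```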

qiankunli commented Oct 8, 2021

> If the job status is succeeded or failed, we should actually skip reconciling. Does this problem happen only after a long time, like 24 hours? I am wondering why it starts reconciling the PodGroup again that long after the job finished.

There is a job that completed on 2021-10-06T08:05:58Z, but it is still being reconciled now (2021-10-08):

Name:         v1-tensorflow-1006160359409
Namespace:    xdl-system
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         TFJob
Metadata:
  Creation Timestamp:  2021-10-06T08:04:01Z
  Resource Version:  11629460633
  Self Link:         /apis/kubeflow.org/v1/namespaces/xdl-system/tfjobs/v1-tensorflow-1006160359409
  UID:               f80466ce-f93d-4e01-8f4b-56fe855f6798
Spec:
  Run Policy:
    Clean Pod Policy:            Running
    Ttl Seconds After Finished:  259200
  Tf Replica Specs:
    Worker:
      Replicas:        1
      Restart Policy:  Never
      Template:
        Metadata:
          Labels:
            volcano.sh/queue-name:  default
        Spec:
          Containers:
            Command:
              python
              run.py
            Image:              xx
            Image Pull Policy:  Always
            Name:               tensorflow
            Resources:
              Limits:
                Cpu:     6
                Memory:  15Gi
            Volume Mounts:
              Mount Path:  /dashboard
              Name:        dashboard-volume
              Read Only:   false
          Volumes:
            Host Path:
              Path:  /mnt/cephfs/xdl/dashboard
              Type:  DirectoryOrCreate
            Name:    dashboard-volume
Status:
  Completion Time:  2021-10-06T08:05:58Z
  Conditions:
    Last Transition Time:  2021-10-06T08:04:01Z
    Last Update Time:      2021-10-06T08:04:01Z
    Message:               TFJob v1-tensorflow-1006160359409 is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-10-06T08:04:05Z
    Last Update Time:      2021-10-06T08:04:05Z
    Message:               TFJob xdl-system/v1-tensorflow-1006160359409 is running.
    Reason:                TFJobRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2021-10-06T08:05:58Z
    Last Update Time:      2021-10-06T08:05:58Z
    Message:               TFJob xdl-system/v1-tensorflow-1006160359409 successfully completed.
    Reason:                TFJobSucceeded
    Status:                True
    Type:                  Succeeded
  Replica Statuses:
    Worker:
      Succeeded:  1
  Start Time:     2021-10-06T08:04:01Z
Events:
  Type    Reason         Age                     From         Message
  ----    ------         ----                    ----         -------
  Normal  JobTerminated  2m12s (x2078 over 46h)  tf-operator  Job has been terminated. Deleting PodGroup

tf-operator log:

{"filename":"common/job.go:144","level":"info","msg":"Reconciling for job v1-tensorflow-1006160359409","time":"2021-10-08T05:43:17Z"}
{"filename":"record/event.go:274","level":"info","msg":"Event(v1.ObjectReference{Kind:\"TFJob\", Namespace:\"xdl-system\", Name:\"v1-tensorflow-1006160359409\", UID:\"f80466ce-f93d-4e01-8f4b-56fe855f6798\", APIVersion:\"kubeflow.org/v1\", ResourceVersion:\"11629460633\", FieldPath:\"\"}): type: 'Normal' reason: 'JobTerminated' Job has been terminated. Deleting PodGroup","time":"2021-10-08T05:43:17Z"}
{"filename":"tensorflow/controller.go:308","job":"xdl-system.v1-tensorflow-1006160359409","level":"info","msg":"Finished syncing tfjob \"xdl-system/v1-tensorflow-1006160359409\" (33.814446ms)","time":"2021-10-08T05:43:17Z"}
{"filename":"record/event.go:274","level":"info","msg":"Event(v1.ObjectReference{Kind:\"TFJob\", Namespace:\"xdl-system\", Name:\"v1-tensorflow-1006160359409\", UID:\"f80466ce-f93d-4e01-8f4b-56fe855f6798\", APIVersion:\"kubeflow.org/v1\", ResourceVersion:\"11629460633\", FieldPath:\"\"}): type: 'Normal' reason: 'SuccessfulDeletePodGroup' Deleted PodGroup: v1-tensorflow-1006160359409","time":"2021-10-08T05:43:17Z"}

pod/service/podgroup of TFJob v1-tensorflow-1006160359409:

[dev@VM-90-5-centos ~]$ kubectl get pod -n xdl-system | grep v1-tensorflow-1006160359409
v1-tensorflow-1006160359409-worker-0      0/1     Completed   0          2d
[dev@VM-90-5-centos ~]$ kubectl get service -n xdl-system | grep v1-tensorflow-1006160359409
v1-tensorflow-1006160359409-worker-0   ClusterIP   None             <none>        2222/TCP   2d
[dev@VM-90-5-centos ~]$ kubectl get podgroup-v1beta1 -n xdl-system | grep v1-tensorflow-1006160359409

@qiankunli (Author) commented:
I upgraded tf-operator v1.2.1 to training-operator v1.3.0, and the issue is resolved.
