
PodGroup is constantly created and deleted after TFJob succeeds or fails #1426

Closed
qiankunli opened this issue Sep 30, 2021 · 5 comments

qiankunli commented Sep 30, 2021

tf-operator version: v1.2.1

Name:         v1-tensorflow-0930032344277
Namespace:    xdl-system
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         TFJob
Metadata:
  Creation Timestamp:  2021-09-29T19:23:53Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:runPolicy:
          .:
          f:cleanPodPolicy:
          f:ttlSecondsAfterFinished:
        f:tfReplicaSpecs:
          .:
          f:Worker:
            .:
            f:replicas:
            f:restartPolicy:
            f:template:
              .:
              f:metadata:
                .:
                f:labels:
                  .:
                  f:volcano.sh/queue-name:
              f:spec:
                .:
                f:affinity:
                  .:
                  f:nodeAffinity:
                    .:
                    f:requiredDuringSchedulingIgnoredDuringExecution:
                      .:
                      f:nodeSelectorTerms:
                f:volumes:
    Manager:      OpenAPI-Generator
    Operation:    Update
    Time:         2021-09-29T19:23:53Z
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:successPolicy:
        f:tfReplicaSpecs:
          f:Worker:
            f:template:
              f:metadata:
                f:creationTimestamp:
              f:spec:
                f:containers:
      f:status:
        .:
        f:completionTime:
        f:conditions:
        f:replicaStatuses:
          .:
          f:Worker:
            .:
            f:succeeded:
        f:startTime:
    Manager:         tf-operator.v1
    Operation:       Update
    Time:            2021-09-29T19:25:26Z
  Resource Version:  11170043394
  Self Link:         /apis/kubeflow.org/v1/namespaces/xdl-system/tfjobs/v1-tensorflow-0930032344277
  UID:               a0c9da24-0930-43c8-9f4d-7a0f9ccfb364
Spec:
  Run Policy:
    Clean Pod Policy:            Running
    Ttl Seconds After Finished:  259200
  Tf Replica Specs:
    Worker:
      Replicas:        1
      Restart Policy:  Never
      Template:
        Metadata:
          Labels:
            volcano.sh/queue-name:  default
        Spec:
          Affinity:
            Node Affinity:
              Required During Scheduling Ignored During Execution:
                Node Selector Terms:
                  Match Expressions:
                    Key:       queue
                    Operator:  In
                    Values:
                      product-ad
          Containers:
            Command:
              python
              run.py
            Env:xx
            Env From:
              Config Map Ref:
                Name:           sync-properties
            Image:              xx
            Image Pull Policy:  Always
            Name:               tensorflow
            Resources:
              Limits:
                Cpu:     6
                Memory:  15Gi
            Volume Mounts:
              Mount Path:  /dashboard
              Name:        dashboard-volume
              Read Only:   false
          Volumes:
            Host Path:
              Path:  /mnt/cephfs/xdl/dashboard
              Type:  DirectoryOrCreate
            Name:    dashboard-volume
Status:
  Completion Time:  2021-09-29T19:25:26Z
  Conditions:
    Last Transition Time:  2021-09-29T19:23:53Z
    Last Update Time:      2021-09-29T19:23:53Z
    Message:               TFJob v1-tensorflow-0930032344277 is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-09-29T19:25:26Z
    Last Update Time:      2021-09-29T19:25:26Z
    Message:               TFJob xdl-system/v1-tensorflow-0930032344277 successfully completed.
    Reason:                TFJobSucceeded
    Status:                True
    Type:                  Succeeded
  Replica Statuses:
    Worker:
      Succeeded:  1
  Start Time:     2021-09-29T19:24:40Z
Events:
  Type    Reason                    Age                  From         Message
  ----    ------                    ----                 ----         -------
  Normal  JobTerminated             18m (x759 over 12h)  tf-operator  Job has been terminated. Deleting PodGroup
  Normal  JobTerminated             16m                  tf-operator  Job has been terminated. Deleting PodGroup
  Normal  SuccessfulDeletePodGroup  11m (x2 over 16m)    tf-operator  Deleted PodGroup: v1-tensorflow-0930032344277
  Normal  JobTerminated             66s (x7 over 6m33s)  tf-operator  Job has been terminated. Deleting PodGroup
  Normal  SuccessfulDeletePodGroup  66s (x7 over 6m33s)  tf-operator  Deleted PodGroup: v1-tensorflow-0930032344277

The TFJob completed successfully on 2021/09/29, but the operator is constantly creating and deleting its PodGroup on 2021/09/30.
In particular, we set ttlSecondsAfterFinished to 3 days (259200 seconds).

Deleting the PodGroup by going to the API server directly can take some time, which leads to long scheduling latency for other jobs.
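To make the expectation explicit (a simplified sketch with illustrative names, not the actual operator code): with ttlSecondsAfterFinished=259200, cleanup should not be due until three days after completion, yet the PodGroup is already being deleted repeatedly the next day.

```go
package main

import (
	"fmt"
	"time"
)

// cleanupDue reports whether a finished job is past its
// ttlSecondsAfterFinished deadline. Illustrative only; the real
// operator works on the TFJob's status fields.
func cleanupDue(completionTime time.Time, ttlSeconds int64, now time.Time) bool {
	deadline := completionTime.Add(time.Duration(ttlSeconds) * time.Second)
	return !now.Before(deadline)
}

func main() {
	completed, _ := time.Parse(time.RFC3339, "2021-09-29T19:25:26Z")
	now, _ := time.Parse(time.RFC3339, "2021-09-30T12:00:00Z")
	// With ttlSecondsAfterFinished=259200 (3 days), cleanup should not be
	// due until 2021-10-02T19:25:26Z.
	fmt.Println(cleanupDue(completed, 259200, now)) // prints: false
}
```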

@gaocegege (Member) commented:
The PodGroup is created and deleted many times. It is weird.

/cc @kubeflow/wg-training-leads

@gaocegege (Member) commented:
Maybe it is related to Expectations.
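For context, roughly what Expectations do (a minimal sketch with illustrative names, not the actual kubeflow/common implementation): they count the creations/deletions a sync has issued and wait for the informer to confirm them before trusting the cache again.

```go
package main

import (
	"fmt"
	"sync"
)

// expectations is a minimal sketch of the controller Expectations
// pattern: per-job counters of deletions the controller still expects
// to observe from its informers. Illustrative only.
type expectations struct {
	mu   sync.Mutex
	dels map[string]int
}

// ExpectDeletions records that a sync just issued n delete calls for key.
func (e *expectations) ExpectDeletions(key string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.dels[key] += n
}

// DeletionObserved is called from the informer's delete event handler.
func (e *expectations) DeletionObserved(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.dels[key] > 0 {
		e.dels[key]--
	}
}

// Satisfied reports whether everything expected for key has been seen.
// If the delete event never arrives (for example because the object was
// already gone), the key stays unsatisfied.
func (e *expectations) Satisfied(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.dels[key] == 0
}

func main() {
	exp := &expectations{dels: map[string]int{}}
	key := "xdl-system/v1-tensorflow-0930032344277"
	exp.ExpectDeletions(key, 1)
	fmt.Println(exp.Satisfied(key)) // false until DeletionObserved(key) runs
}
```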

Jeffwan commented Oct 2, 2021

If the job status is succeeded or failed, we should actually skip reconciling. Does this problem happen only after a long time, like 24 hours? I am wondering why it starts reconciling the PodGroup again that long after the job finished.
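Roughly the kind of guard I mean (a simplified sketch with illustrative names, not the actual reconciler code): check for a terminal condition early in the sync handler and return before touching pods, services, or the PodGroup.

```go
package main

import "fmt"

// Illustrative condition type; the real reconciler checks the
// JobCondition list in the TFJob's status.
type jobCondition struct {
	Type   string // e.g. "Succeeded", "Failed"
	Status string // "True" or "False"
}

// isFinished reports whether the job has reached a terminal condition.
func isFinished(conds []jobCondition) bool {
	for _, c := range conds {
		if c.Status == "True" && (c.Type == "Succeeded" || c.Type == "Failed") {
			return true
		}
	}
	return false
}

func main() {
	// A guard like this near the top of the sync handler would return
	// early for jobs that already succeeded or failed.
	conds := []jobCondition{
		{Type: "Created", Status: "True"},
		{Type: "Succeeded", Status: "True"},
	}
	if isFinished(conds) {
		fmt.Println("job already finished, skip reconcile")
		return
	}
	fmt.Println("reconcile")
}
```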

qiankunli commented Oct 8, 2021

> If the job status is succeeded or failed, we should actually skip reconciling. Does this problem happen only after a long time, like 24 hours? I am wondering why it starts reconciling the PodGroup again that long after the job finished.

There is a job that completed on 2021-10-06T08:05:58Z, but it is still being reconciled now (2021-10-08):

Name:         v1-tensorflow-1006160359409
Namespace:    xdl-system
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         TFJob
Metadata:
  Creation Timestamp:  2021-10-06T08:04:01Z
  Resource Version:  11629460633
  Self Link:         /apis/kubeflow.org/v1/namespaces/xdl-system/tfjobs/v1-tensorflow-1006160359409
  UID:               f80466ce-f93d-4e01-8f4b-56fe855f6798
Spec:
  Run Policy:
    Clean Pod Policy:            Running
    Ttl Seconds After Finished:  259200
  Tf Replica Specs:
    Worker:
      Replicas:        1
      Restart Policy:  Never
      Template:
        Metadata:
          Labels:
            volcano.sh/queue-name:  default
        Spec:
          Containers:
            Command:
              python
              run.py
            Image:              xx
            Image Pull Policy:  Always
            Name:               tensorflow
            Resources:
              Limits:
                Cpu:     6
                Memory:  15Gi
            Volume Mounts:
              Mount Path:  /dashboard
              Name:        dashboard-volume
              Read Only:   false
          Volumes:
            Host Path:
              Path:  /mnt/cephfs/xdl/dashboard
              Type:  DirectoryOrCreate
            Name:    dashboard-volume
Status:
  Completion Time:  2021-10-06T08:05:58Z
  Conditions:
    Last Transition Time:  2021-10-06T08:04:01Z
    Last Update Time:      2021-10-06T08:04:01Z
    Message:               TFJob v1-tensorflow-1006160359409 is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-10-06T08:04:05Z
    Last Update Time:      2021-10-06T08:04:05Z
    Message:               TFJob xdl-system/v1-tensorflow-1006160359409 is running.
    Reason:                TFJobRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2021-10-06T08:05:58Z
    Last Update Time:      2021-10-06T08:05:58Z
    Message:               TFJob xdl-system/v1-tensorflow-1006160359409 successfully completed.
    Reason:                TFJobSucceeded
    Status:                True
    Type:                  Succeeded
  Replica Statuses:
    Worker:
      Succeeded:  1
  Start Time:     2021-10-06T08:04:01Z
Events:
  Type    Reason         Age                     From         Message
  ----    ------         ----                    ----         -------
  Normal  JobTerminated  2m12s (x2078 over 46h)  tf-operator  Job has been terminated. Deleting PodGroup

tf-operator log:

{"filename":"common/job.go:144","level":"info","msg":"Reconciling for job v1-tensorflow-1006160359409","time":"2021-10-08T05:43:17Z"}
{"filename":"record/event.go:274","level":"info","msg":"Event(v1.ObjectReference{Kind:\"TFJob\", Namespace:\"xdl-system\", Name:\"v1-tensorflow-1006160359409\", UID:\"f80466ce-f93d-4e01-8f4b-56fe855f6798\", APIVersion:\"kubeflow.org/v1\", ResourceVersion:\"11629460633\", FieldPath:\"\"}): type: 'Normal' reason: 'JobTerminated' Job has been terminated. Deleting PodGroup","time":"2021-10-08T05:43:17Z"}
{"filename":"tensorflow/controller.go:308","job":"xdl-system.v1-tensorflow-1006160359409","level":"info","msg":"Finished syncing tfjob \"xdl-system/v1-tensorflow-1006160359409\" (33.814446ms)","time":"2021-10-08T05:43:17Z"}
{"filename":"record/event.go:274","level":"info","msg":"Event(v1.ObjectReference{Kind:\"TFJob\", Namespace:\"xdl-system\", Name:\"v1-tensorflow-1006160359409\", UID:\"f80466ce-f93d-4e01-8f4b-56fe855f6798\", APIVersion:\"kubeflow.org/v1\", ResourceVersion:\"11629460633\", FieldPath:\"\"}): type: 'Normal' reason: 'SuccessfulDeletePodGroup' Deleted PodGroup: v1-tensorflow-1006160359409","time":"2021-10-08T05:43:17Z"}

pod/service/podgroup of TFJob v1-tensorflow-1006160359409:

[dev@VM-90-5-centos ~]$ kubectl get pod -n xdl-system | grep v1-tensorflow-1006160359409
v1-tensorflow-1006160359409-worker-0      0/1     Completed   0          2d
[dev@VM-90-5-centos ~]$ kubectl get service -n xdl-system | grep v1-tensorflow-1006160359409
v1-tensorflow-1006160359409-worker-0   ClusterIP   None             <none>        2222/TCP   2d
[dev@VM-90-5-centos ~]$ kubectl get podgroup-v1beta1 -n xdl-system | grep v1-tensorflow-1006160359409

@qiankunli (Author) commented:
I upgraded tf-operator v1.2.1 to training-operator v1.3.0, and the issue is resolved.
