
Training operator fails to create HPA for TorchElastic jobs #1626

Closed
zhypku opened this issue Jun 30, 2022 · 3 comments

Comments


zhypku commented Jun 30, 2022

Hi,

We were testing a PyTorchJob with an ElasticPolicy and HPA configuration on Kubeflow, but it seems that the training-operator cannot create the HPA for the job.

Environment: K8s 1.20.11, Kubeflow 1.5, training-operator 1.4

I used the ImageNet elastic training example from https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/imagenet/imagenet.yaml and added a metrics configuration to enable HPA. The full YAML:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: elastic-example-imagenet
spec:
  elasticPolicy:
    rdzvBackend: c10d
    minReplicas: 1
    maxReplicas: 2
    maxRestarts: 100
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 60
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-elastic-example-imagenet:1.0.0-sigterm
              imagePullPolicy: IfNotPresent
              env:
              - name: LOGLEVEL
                value: DEBUG
              command:
                - python
                - -m
                - torch.distributed.run
                - /workspace/examples/imagenet.py
                - "--arch=resnet18"
                - "--epochs=20"
                - "--batch-size=32"
                - "--workers=0"
                - "/workspace/data/tiny-imagenet-200"

It turned out that the created HPA was not configured with a correct object reference: the ScaleTargetRef had only the Kind PyTorchJob but no APIVersion, which should be kubeflow.org/v1. The HPA's conditions showed:

autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"False","lastTransitionTime":"2022-06-30T11:55:04Z","reason":"FailedGetScale","message":"the HPA controller was unable to get the target''s current scale: no matches for kind \"PyTorchJob\" in group \"\""}]'

I took a look at the code and found that the operator does not set the APIVersion on the ScaleTargetRef: https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/pytorch/hpa.go#L76


The autoscalingv2beta2.CrossVersionObjectReference struct does have an APIVersion field.
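A minimal, self-contained Go sketch of what the fix would look like. The struct below mirrors only the relevant fields of autoscalingv2beta2.CrossVersionObjectReference so the example compiles on its own, and buildScaleTargetRef is a hypothetical helper, not the operator's actual code:

```go
package main

import "fmt"

// CrossVersionObjectReference mirrors the relevant fields of
// k8s.io/api/autoscaling/v2beta2.CrossVersionObjectReference,
// inlined here so the sketch is self-contained.
type CrossVersionObjectReference struct {
	Kind       string
	Name       string
	APIVersion string
}

// buildScaleTargetRef shows the fix: set APIVersion explicitly so the
// HPA controller resolves the scale subresource in the kubeflow.org
// API group. With APIVersion left empty, the lookup falls back to the
// core ("") group and fails with FailedGetScale:
// `no matches for kind "PyTorchJob" in group ""`.
func buildScaleTargetRef(jobName string) CrossVersionObjectReference {
	return CrossVersionObjectReference{
		APIVersion: "kubeflow.org/v1",
		Kind:       "PyTorchJob",
		Name:       jobName,
	}
}

func main() {
	ref := buildScaleTargetRef("elastic-example-imagenet")
	// prints: kubeflow.org/v1 PyTorchJob elastic-example-imagenet
	fmt.Printf("%s %s %s\n", ref.APIVersion, ref.Kind, ref.Name)
}
```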

I'm not sure if this was due to my configuration errors or a bug.
Please let me know if you need further information.

Thanks,
Hanyu

@gaocegege
Member

/cc @kubeflow/wg-training-leads

@johnugeorge
Member

@zhypku Feel free to try this patch for the fix - #1701

@johnugeorge
Member

Fixed by #1701
