
Training operator fails to create HPA for TorchElastic jobs #1626

Closed
zhypku opened this issue Jun 30, 2022 · 3 comments

Comments


zhypku commented Jun 30, 2022

Hi,

We were testing a PyTorchJob with an ElasticPolicy and HPA configuration on Kubeflow, but it seems that the training-operator cannot create the HPA for the job.

Environment: K8s 1.20.11, Kubeflow 1.5, training-operator 1.4

I used the ImageNet elastic training example from https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/imagenet/imagenet.yaml and added a metrics configuration to enable HPA. The full YAML:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: elastic-example-imagenet
spec:
  elasticPolicy:
    rdzvBackend: c10d
    minReplicas: 1
    maxReplicas: 2
    maxRestarts: 100
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 60
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-elastic-example-imagenet:1.0.0-sigterm
              imagePullPolicy: IfNotPresent
              env:
              - name: LOGLEVEL
                value: DEBUG
              command:
                - python
                - -m
                - torch.distributed.run
                - /workspace/examples/imagenet.py
                - "--arch=resnet18"
                - "--epochs=20"
                - "--batch-size=32"
                - "--workers=0"
                - "/workspace/data/tiny-imagenet-200"

It turned out that the created HPA was not configured with a correct object reference: the ScaleTargetRef had only the Kind PyTorchJob but no APIVersion, which should be kubeflow.org/v1. The HPA's conditions showed:

autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"False","lastTransitionTime":"2022-06-30T11:55:04Z","reason":"FailedGetScale","message":"the HPA controller was unable to get the target''s current scale: no matches for kind \"PyTorchJob\" in group \"\""}]'

I took a look at the code and found that the operator does not set the APIVersion on the ScaleTargetRef: https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/pytorch/hpa.go#L76


The autoscalingv2beta2.CrossVersionObjectReference struct does have an APIVersion field.
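A minimal, self-contained Go sketch of what the fix would look like. The struct below mirrors only the relevant fields of autoscalingv2beta2.CrossVersionObjectReference so the example compiles on its own, and buildScaleTargetRef is a hypothetical helper, not the operator's actual code:

```go
package main

import "fmt"

// CrossVersionObjectReference mirrors the relevant fields of
// k8s.io/api/autoscaling/v2beta2.CrossVersionObjectReference,
// inlined here so the sketch is self-contained.
type CrossVersionObjectReference struct {
	Kind       string
	Name       string
	APIVersion string
}

// buildScaleTargetRef shows the fix: set APIVersion explicitly so the
// HPA controller resolves the scale subresource in the kubeflow.org
// API group. With APIVersion left empty, the lookup falls back to the
// core ("") group and fails with FailedGetScale:
// `no matches for kind "PyTorchJob" in group ""`.
func buildScaleTargetRef(jobName string) CrossVersionObjectReference {
	return CrossVersionObjectReference{
		APIVersion: "kubeflow.org/v1",
		Kind:       "PyTorchJob",
		Name:       jobName,
	}
}

func main() {
	ref := buildScaleTargetRef("elastic-example-imagenet")
	// prints: kubeflow.org/v1 PyTorchJob elastic-example-imagenet
	fmt.Printf("%s %s %s\n", ref.APIVersion, ref.Kind, ref.Name)
}
```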

I'm not sure if this was due to my configuration errors or a bug.
Please let me know if you need further information.

Thanks,
Hanyu

@gaocegege
Member

/cc @kubeflow/wg-training-leads

@johnugeorge
Member

@zhypku Feel free to try this patch for the fix - #1701

@johnugeorge
Member

Fixed by #1701
