
Kubernetes HPA doesn't work with elastic PytorchJob #1645

Closed
zclyne opened this issue Aug 4, 2022 · 4 comments · Fixed by #1701
zclyne commented Aug 4, 2022

Background

Hi! I'm trying to launch an elastic PyTorchJob on training-operator with Horizontal Pod Autoscaling (HPA), but I cannot get it to work.

Here is the manifest of the job I'm trying to launch:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: elastic-example-imagenet-c10d-tcp
  namespace: default
spec:
  elasticPolicy:
    rdzvBackend: c10d
    minReplicas: 1
    maxReplicas: 5
    maxRestarts: 100
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 50
  pytorchReplicaSpecs:
    Worker:
      replicas: 5
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: artprod.dev.bloomberg.com/ds/yzhang2343/elastic-imagenet:latest
              imagePullPolicy: Always
              resources:
                limits:
                  cpu: 1
                  memory: 8Gi
                requests:
                  cpu: 1
                  memory: 8Gi
              env:
              - name: LOGLEVEL
                value: DEBUG
              command:
                - python
                - -m
                - torch.distributed.run
                - --module
                - imagenet
                - "--arch=resnet18"
                - "--epochs=20"
                - "--batch-size=32"
                - "--workers=0"
                - "/workspace/data/tiny-imagenet-200"

The image contains the ImageNet example training code from the examples in this repo.

Cannot Find PyTorchJob Kind

The first problem I hit is that the HPA cannot find the PyTorchJob kind.

yzhang2343@C02WC268HTDD ds-repos % kubectl get horizontalpodautoscaler/elastic-example-imagenet-c10d-tcp -o yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"False","lastTransitionTime":"2022-08-01T16:16:44Z","reason":"FailedGetScale","message":"the
      HPA controller was unable to get the target''s current scale: no matches for kind \"PyTorchJob\" in group \"\""}]'

According to this link, the HPA must specify .spec.scaleTargetRef.apiVersion so that it can find the target resource, but training-operator does not set this field when creating the HPA:

func desiredHPA(pytorchJob *kubeflowv1.PyTorchJob, scheme *runtime.Scheme) (
    *autoscalingv2beta2.HorizontalPodAutoscaler, error) {
    hpa := &autoscalingv2beta2.HorizontalPodAutoscaler{
        ObjectMeta: metav1.ObjectMeta{
            Name:      pytorchJob.Name,
            Namespace: pytorchJob.Namespace,
        },
        Spec: autoscalingv2beta2.HorizontalPodAutoscalerSpec{
            ScaleTargetRef: autoscalingv2beta2.CrossVersionObjectReference{
                // apiVersion: kubeflow.org/v1 should be set here
                Kind: pytorchJob.Kind,
                Name: pytorchJob.Name,
            },
            MinReplicas: pytorchJob.Spec.ElasticPolicy.MinReplicas,
            MaxReplicas: *pytorchJob.Spec.ElasticPolicy.MaxReplicas,
            Metrics:     pytorchJob.Spec.ElasticPolicy.Metrics,
        },
    }
    if err := controllerruntime.SetControllerReference(pytorchJob, hpa, scheme); err != nil {
        return nil, err
    }
    return hpa, nil
}
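
A minimal sketch of the missing piece (hard-coding the group/version string here is my assumption; the actual fix might derive it from the scheme instead):

ScaleTargetRef: autoscalingv2beta2.CrossVersionObjectReference{
    // APIVersion lets the HPA controller resolve the target's scale
    // subresource; "kubeflow.org/v1" is the PyTorchJob group/version.
    APIVersion: "kubeflow.org/v1",
    Kind:       pytorchJob.Kind,
    Name:       pytorchJob.Name,
},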

Missing a Selector

After I manually added the apiVersion to the HPA (the patched scaleTargetRef is shown after the error output below), it could find the PyTorchJob kind, but autoscaling still failed with a new error:

yzhang2343@C02WC268HTDD test % kubectl get horizontalpodautoscaler/elastic-example-imagenet-c10d-tcp -o yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"True","lastTransitionTime":"2022-08-01T16:51:48Z","reason":"SucceededGetScale","message":"the
      HPA controller was able to get the target''s current scale"},{"type":"ScalingActive","status":"False","lastTransitionTime":"2022-08-01T16:51:48Z","reason":"InvalidSelector","message":"the
      HPA target''s scale is missing a selector"}]'
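
For reference, the manual workaround was roughly the following addition to the HPA spec (a sketch; the values mirror the job above):

spec:
  scaleTargetRef:
    apiVersion: kubeflow.org/v1 # added by hand; missing in the generated HPA
    kind: PyTorchJob
    name: elastic-example-imagenet-c10d-tcp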

According to the Kubernetes HPA doc, the target's scale must expose a selector so that the HPA can find the pods to measure. However, the PyTorchJob struct does not have a selector field at all:

type PyTorchJobSpec struct {
    // RunPolicy encapsulates various runtime policies of the distributed training
    // job, for example how to clean up resources and how long the job can stay
    // active.
    //+kubebuilder:validation:Optional
    RunPolicy commonv1.RunPolicy `json:"runPolicy"`

    ElasticPolicy *ElasticPolicy `json:"elasticPolicy,omitempty"`

    // A map of PyTorchReplicaType (type) to ReplicaSpec (value). Specifies the PyTorch cluster configuration.
    // For example,
    //   {
    //     "Master": PyTorchReplicaSpec,
    //     "Worker": PyTorchReplicaSpec,
    //   }
    PyTorchReplicaSpecs map[commonv1.ReplicaType]*commonv1.ReplicaSpec `json:"pytorchReplicaSpecs"`
}
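
For a custom resource, the selector the HPA needs is served through the CRD's scale subresource via labelSelectorPath, so one possible shape of a fix is to publish a selector in the job status and point the subresource at it. A hypothetical sketch (the JSON paths and the status selector field are my assumptions, not necessarily the actual patch):

subresources:
  scale:
    specReplicasPath: .spec.pytorchReplicaSpecs.Worker.replicas
    statusReplicasPath: .status.replicaStatuses.Worker.active
    labelSelectorPath: .status.replicaStatuses.Worker.selector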

The expected behavior is that the HorizontalPodAutoscaler is created successfully and the number of worker pods is dynamically and automatically adjusted by the HPA according to resource utilization.

It seems to me that the training-operator's HPA support is incomplete. Does anybody know how to launch elastic PyTorch jobs with HPA? Thanks!

johnugeorge commented Aug 4, 2022

/cc @gaocegege
/cc @Jeffwan

gaocegege (Member) commented

We defined the scale resource at https://github.com/kubeflow/training-operator/blob/master/manifests/base/crds/kubeflow.org_pytorchjobs.yaml#L8455

I will take a deep dive to see what happened here.

zclyne commented Aug 12, 2022

Thank you so much!

johnugeorge (Member) commented

@zclyne Feel free to try this patch for the fix - #1701
