The WORLD_SIZE environment variable for elastic policy is not getting set correctly #1947

deepanker13 · 2023-11-17T07:54:15Z

According to this PR, nProcPerNode in the elastic policy is deprecated, and nprocPerNode in the spec should be used instead.
However, after switching to it, the world size in elastic mode is set to the number of replicas rather than replicas * nproc_per_node.
The following YAML does not work as is; it works once the commented-out line is uncommented.

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "torchrun-test"
  namespace: training 
spec:
  nprocPerNode: "2"
  elasticPolicy:
    rdzvBackend: c10d
    # nProcPerNode: 2
  pytorchReplicaSpecs:
    worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            kind: "deepanker2"
          annotations:
            test_vesrion: "1"
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: quay.io/deepanker_gupta/kubeflow_training:torchrun
              imagePullPolicy: Always
              args: ["torchrun","/workspace/exp/training.py"]
              resources: 
                limits:
                  nvidia.com/gpu: 2
                  cpu: 10
                  memory: '10Gi'
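
A minimal Python sketch of the arithmetic behind this report: with replicas: 2 and nprocPerNode: "2" as in the manifest above, the expected world size is 2 * 2 = 4, while the reported value is 2. The helper names expected_world_size and check_world_size are hypothetical and not part of the training operator or PyTorch. The sketch also assumes that WORLD_SIZE is readable from the environment of the process that runs it.

```python
import os


def expected_world_size(replicas: int, nproc_per_node: int) -> int:
    """Total number of workers: one group of nproc_per_node processes per replica."""
    return replicas * nproc_per_node


def check_world_size(env, replicas: int, nproc_per_node: int) -> str:
    """Compare WORLD_SIZE found in env (any mapping) with the expected value.

    Hypothetical helper for illustration; returns a short status string.
    """
    raw = env.get("WORLD_SIZE")
    if raw is None:
        return "WORLD_SIZE is not set"
    actual = int(raw)
    expected = expected_world_size(replicas, nproc_per_node)
    if actual == expected:
        return f"ok: WORLD_SIZE={actual}"
    return f"mismatch: WORLD_SIZE={actual}, expected {expected}"


if __name__ == "__main__":
    # Values taken from the manifest above: replicas: 2, nprocPerNode: "2".
    print(check_world_size(os.environ, replicas=2, nproc_per_node=2))
```

Running this inside the pytorch container (for example at the top of training.py) would show whether the environment carries 4 or only the replica count of 2.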

deepanker13 · 2023-11-17T08:06:22Z

@kuizhiqing @tenzen-y @johnugeorge

terrytangyuan · 2023-11-20T20:13:58Z

We can close this? @kuizhiqing

kuizhiqing · 2023-11-21T03:29:03Z

> We can close this? @kuizhiqing

Yes, I think so. @deepanker13, can you confirm with the fix?

deepanker13 · 2023-11-21T13:48:06Z

@kuizhiqing I am getting the same error: the world size is set to 2 and not 4. I deleted and redeployed the training operator on my Kubernetes cluster.
Also, I am not able to exec into the training operator pod's container to verify whether I have the latest code, using
kubectl exec -it training-operator-69575765df-snkgk -c training-operator -n kubeflow /bin/bash (or /bin/sh, /bash, or /sh)

kuizhiqing · 2023-11-21T15:02:49Z

If the YAML works with that line uncommented, as you said, it should also work with my fix. I've confirmed that the two versions are equivalent.

I suggest verifying it in either of the following ways. First, pull the latest code with my fix, then

rebuild an image to update the training-operator deployment
or delete the deployment (setting the replicas to 0 is also OK), build the binary locally with go build -o bin/manager cmd/training-operator.v1/main.go, and run bin/manager locally, provided kubectl works on your machine (otherwise set KUBECONFIG)

Finally, restart your job.

deepanker13 · 2023-11-22T09:42:23Z

Yes, it is working as expected. Thanks @kuizhiqing, we can close this issue.
PS: Building the image with golang 1.17+ on an M1 Mac fails when the target platform is set to linux/amd64. I had to build the image on a Linux VM.

kuizhiqing mentioned this issue Nov 19, 2023

fix nproc env in elastic mode for pytorchjob #1948

Merged

1 task

deepanker13 closed this as completed Nov 22, 2023
