The WORLD_SIZE environment variable for elastic policy is not getting set correctly #1947

Closed
deepanker13 opened this issue Nov 17, 2023 · 6 comments

@deepanker13
Contributor

deepanker13 commented Nov 17, 2023

According to this PR, nProcPerNode in the elastic policy is deprecated, and nprocPerNode in the spec should be used instead.
However, after switching to it, the world size in elastic mode is set to the number of replicas rather than replicas * nprocPerNode.
The following YAML does not work as written; it works only when the commented-out line is uncommented.

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "torchrun-test"
  namespace: training 
spec:
  nprocPerNode: "2"
  elasticPolicy:
    rdzvBackend: c10d
    # nProcPerNode: 2
  pytorchReplicaSpecs:
    worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            kind: "deepanker2"
          annotations:
            test_vesrion: "1"
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: quay.io/deepanker_gupta/kubeflow_training:torchrun
              imagePullPolicy: Always
              args: ["torchrun","/workspace/exp/training.py"]
              resources: 
                limits:
                  nvidia.com/gpu: 2
                  cpu: 10
                  memory: '10Gi'
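
For reference, the expected value can be worked out from the YAML above: 2 replicas with nprocPerNode "2" should give a world size of 4. The snippet below is a minimal sketch of that arithmetic plus a check that could be dropped into a training script. The helper names `expected_world_size` and `check_world_size` are hypothetical, and it assumes the launcher exposes the result to each process as the `WORLD_SIZE` environment variable.

```python
import os
from typing import Mapping, Optional


def expected_world_size(replicas: int, nproc_per_node: int) -> int:
    """Total number of worker processes across all replicas (pods)."""
    return replicas * nproc_per_node


def check_world_size(env: Mapping[str, str], replicas: int, nproc_per_node: int) -> bool:
    """Return True if WORLD_SIZE in `env` equals replicas * nproc_per_node."""
    actual: Optional[str] = env.get("WORLD_SIZE")
    if actual is None:
        return False
    return int(actual) == expected_world_size(replicas, nproc_per_node)


if __name__ == "__main__":
    # Values taken from the YAML above: replicas: 2, nprocPerNode: "2".
    print("expected world size:", expected_world_size(2, 2))
    print("WORLD_SIZE seen by this process:", os.environ.get("WORLD_SIZE"))
    print("match:", check_world_size(os.environ, 2, 2))
```

With the behaviour reported here, the check would return False, because WORLD_SIZE is 2 instead of 4.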
@deepanker13
Contributor Author

deepanker13 commented Nov 17, 2023

@terrytangyuan
Member

terrytangyuan commented Nov 20, 2023

We can close this? @kuizhiqing

@kuizhiqing
Member

> We can close this? @kuizhiqing

Yes, I think so. @deepanker13, can you confirm that the fix works for you?

@deepanker13
Contributor Author

deepanker13 commented Nov 21, 2023

@kuizhiqing I am still getting the same error: the world size is set to 2, not 4, even after I deleted and redeployed the training operator on my Kubernetes cluster.
Also, I cannot exec into the training operator pod's container to check whether it is running the latest code. I tried the following command with /bin/bash, /bin/sh, /bash, and /sh in turn:
kubectl exec -it training-operator-69575765df-snkgk -c training-operator -n kubeflow /bin/bash
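
One way to see which operator build is deployed without exec'ing into the pod is to read the image reference from the deployment. The sketch below only builds and prints the `kubectl` command. The deployment name `training-operator` is an assumption inferred from the pod name above, and the helper `image_query_argv` is hypothetical.

```python
import shlex
from typing import List


def image_query_argv(deployment: str = "training-operator",
                     namespace: str = "kubeflow") -> List[str]:
    """Build a kubectl command that prints the container image(s) of a deployment."""
    return [
        "kubectl", "get", "deployment", deployment,
        "-n", namespace,
        # jsonpath expression selecting every container image in the pod template
        "-o", "jsonpath={.spec.template.spec.containers[*].image}",
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(image_query_argv()))
```

The printed image tag or digest can then be compared with the image that contains the fix.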

@kuizhiqing
Member

kuizhiqing commented Nov 21, 2023

If the YAML with the line uncommented works as you said, it should also work with my fix. I've confirmed that the two versions are equivalent.

I suggest verifying it in one of the following ways. First, pull the latest code with my fix, then either

  • rebuild an image and update the training-operator deployment with it,
  • or delete the deployment (setting its replicas to 0 is also fine), build the binary locally with go build -o bin/manager cmd/training-operator.v1/main.go, and run bin/manager locally. This requires a working kubectl setup; otherwise set KUBECONFIG.

Finally, restart your job.
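
The second option above can be sketched as a dry-run plan that prints the commands in order rather than executing them. The deployment name `training-operator`, the namespace `kubeflow`, and the helpers `local_operator_plan` and `render_step` are assumptions for illustration; only the `go build` command is taken from this comment.

```python
import shlex
from typing import Dict, List, Optional, Tuple

# A step is (extra environment variables, argv).
Step = Tuple[Dict[str, str], List[str]]


def local_operator_plan(namespace: str = "kubeflow",
                        deployment: str = "training-operator",
                        kubeconfig: Optional[str] = None) -> List[Step]:
    """Return the steps for running the operator binary locally."""
    env = {"KUBECONFIG": kubeconfig} if kubeconfig else {}
    return [
        # Stop the in-cluster operator so it does not compete with the local one.
        (env, ["kubectl", "scale", "deployment", deployment,
               "-n", namespace, "--replicas=0"]),
        # Build the manager binary from the checked-out source.
        ({}, ["go", "build", "-o", "bin/manager",
              "cmd/training-operator.v1/main.go"]),
        # Run the locally built operator against the cluster.
        (env, ["./bin/manager"]),
    ]


def render_step(step: Step) -> str:
    """Format a step as a single shell command line."""
    env, argv = step
    prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())
    cmd = shlex.join(argv)
    return f"{prefix} {cmd}" if prefix else cmd


if __name__ == "__main__":
    for step in local_operator_plan():
        print(render_step(step))
```

Once the local operator is running, delete and recreate the PyTorchJob so that its pods are rebuilt by the new code.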

@deepanker13
Contributor Author

Yes, it is working as expected. Thanks @kuizhiqing, we can close this issue.
P.S. Building the image with Go 1.17+ on an M1 Mac fails when the target platform is set to linux/amd64, so I had to build the image on a Linux VM.
