
[SDK] Fix Worker and Master templates for PyTorchJob #1988

Merged

Conversation

andreyvelich
Member

Currently, the user has to set both the Master and Worker replicas, since we assign RANK and WORLD_SIZE only if the Master is set: https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/pytorch/envvar.go#L67.
Thus, I modified the SDK to always set the Master replica.
The Worker replica count is equal to num_workers - 1, since we assign RANK=0 to the Master and it behaves like the first Worker.

I added a TODO in the SDK to track whether we could create a PyTorchJob without the Master, but that would require Controller changes.

Also, I renamed num_worker_replicas to num_workers to keep it consistent with the train API.
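
To make the replica arithmetic above concrete, here is a minimal sketch (a hypothetical helper, not the SDK code itself) of how num_workers maps to one Master plus num_workers - 1 Workers:

# Hypothetical sketch of the replica arithmetic described above: one Master
# that acts as the RANK=0 worker, plus num_workers - 1 additional Workers.
def split_replicas(num_workers: int) -> dict:
    if num_workers < 1:
        raise ValueError("num_workers must be >= 1")
    replicas = {"Master": 1}
    if num_workers > 1:
        replicas["Worker"] = num_workers - 1
    return replicas

print(split_replicas(4))  # {'Master': 1, 'Worker': 3}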

/assign @tenzen-y @droctothorpe @johnugeorge @deepanker13


@andreyvelich: GitHub didn't allow me to assign the following users: droctothorpe, deepanker13.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.



[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

Pull Request Test Coverage Report for Build 7505127919

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on sdk-fix-create-replicas at 42.828%

Totals (Coverage Status)
Change from base Build 7505106731: 42.8%
Covered Lines: 3750
Relevant Lines: 8756

💛 - Coveralls

if master_pod_template_spec:
    pytorchjob.spec.pytorch_replica_specs[
        constants.REPLICA_TYPE_MASTER
    ] = models.KubeflowOrgV1ReplicaSpec(
        replicas=1,
        template=master_pod_template_spec,
    )
# If we don't define Master template, use the Worker template.
else:
Member

This is a bit confusing to me. Typically, the Master is explicitly set when it is required by the user. The user sets num_workers, but we create a Master replica with workers = num_workers - 1. This fix is a bit counter-intuitive.

Member

If the Master is mandatory, we can add a needsMaster flag in train() that defaults to True, and add a validation check to prevent the user from overriding it (for now).

Member Author

@andreyvelich andreyvelich Jan 13, 2024

Typically, the Master is explicitly set when it is required by the user.

@johnugeorge As I said, currently our controller doesn't set the appropriate environment variables if the Master is not set in the PyTorchJob. Without RANK and WORLD_SIZE we just can't run PyTorch DDP properly.

The user sets num_workers, but we create a Master replica with workers = num_workers - 1. This fix is a bit counter-intuitive.

That is because the Master is the Worker with RANK=0. It just acts as a coordinator across all the Workers.

If the Master is mandatory, we can add a needsMaster flag in train() that defaults to True.

Why do we need to add this flag to the train() func if it is not necessary for the user? Users just need to define how many workers they want, and we will create the PyTorchJob with the appropriate spec.

In the future, if we support creating a PyTorchJob without a Master replica, we can modify the SDK.
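
As a minimal illustration of the point above (a hypothetical sketch, not the controller logic), the Master is simply the replica that receives RANK=0, while each additional Worker receives the next rank and every replica sees the same WORLD_SIZE:

# Hypothetical sketch of the rank layout: Master gets RANK=0, Worker i gets
# RANK=i+1, and WORLD_SIZE equals the total number of replicas (num_workers).
def rank_layout(num_workers: int) -> list:
    pods = [{"replica": "Master", "RANK": 0, "WORLD_SIZE": num_workers}]
    for i in range(num_workers - 1):
        pods.append({"replica": f"Worker-{i}", "RANK": i + 1, "WORLD_SIZE": num_workers})
    return pods

for pod in rank_layout(3):
    print(pod)
# {'replica': 'Master', 'RANK': 0, 'WORLD_SIZE': 3}
# {'replica': 'Worker-0', 'RANK': 1, 'WORLD_SIZE': 3}
# {'replica': 'Worker-1', 'RANK': 2, 'WORLD_SIZE': 3}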

Contributor

@deepanker13 deepanker13 Jan 16, 2024

@andreyvelich
I am able to run the following spec with only Workers when using an elastic policy.

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "torchrun-test"
  namespace: training
spec:
  nprocPerNode: "1"
  elasticPolicy:
    rdzvBackend: c10d
  pytorchReplicaSpecs:
    worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            kind: "deepanker2"
          annotations:
            test_vesrion: "1"
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: quay.io/deepanker_gupta/kubeflow_training:torchrun
              imagePullPolicy: Always
              args: ["torchrun","/workspace/exp/training.py"]
              resources: 
                limits:
                  nvidia.com/gpu: 1
                  cpu: 10
                  memory: '10Gi'

You can try running it on your cluster, as the image is public.

Contributor

@deepanker13 deepanker13 Jan 16, 2024

The following environment variables were set when I ran the above; RANK and WORLD_SIZE are being set:

environ({'PATH': '/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'HOSTNAME': 'torchrun-test-worker-1', 'NVIDIA_VISIBLE_DEVICES': 'GPU-28067a3b-25cd-58ed-216b-d315c6c9aeff', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib:/usr/local/nvidia/lib64', 'PYTORCH_VERSION': '2.0.1', 'PYTHONUNBUFFERED': '1', 'PET_RDZV_ENDPOINT': 'torchrun-test-worker-0:23456', 'PET_RDZV_BACKEND': 'c10d', 'PET_NNODES': '2:2', 'KUBERNETES_PORT': 'tcp://172.19.0.1:443', 'KUBERNETES_PORT_443_TCP': 'tcp://172.19.0.1:443', 'KUBERNETES_PORT_443_TCP_PROTO': 'tcp', 'KUBERNETES_PORT_443_TCP_PORT': '443', 'KUBERNETES_PORT_443_TCP_ADDR': '172.19.0.1', 'KUBERNETES_SERVICE_HOST': '172.19.0.1', 'KUBERNETES_SERVICE_PORT': '443', 'KUBERNETES_SERVICE_PORT_HTTPS': '443', 'HOME': '/root', 'LC_CTYPE': 'C.UTF-8', 'LOCAL_RANK': '0', 'RANK': '1', 'GROUP_RANK': '1', 'ROLE_RANK': '1', 'ROLE_NAME': 'default', 'LOCAL_WORLD_SIZE': '1', 'WORLD_SIZE': '2', 'GROUP_WORLD_SIZE': '2', 'ROLE_WORLD_SIZE': '2', 'MASTER_ADDR': 'torchrun-test-worker-0', 'MASTER_PORT': '45614', 'TORCHELASTIC_RESTART_COUNT': '0', 'TORCHELASTIC_MAX_RESTARTS': '0', 'TORCHELASTIC_RUN_ID': 'none', 'TORCHELASTIC_USE_AGENT_STORE': 'False', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'TORCHELASTIC_ERROR_FILE': '/tmp/torchelastic_i2rkd3p1/none_1p31jibu/attempt_0/0/error.json'})
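
For reference, a quick way to reproduce such a check inside any replica pod is a short snippet like the following (hypothetical, not part of the SDK or the image above):

# Print the distributed-training variables that torchrun (or the controller)
# is expected to set in each pod.
import os

for key in ("RANK", "WORLD_SIZE", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT"):
    print(key, "=", os.environ.get(key))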

Member Author

@andreyvelich andreyvelich Jan 16, 2024

@deepanker13 That's right. That is because when you run PyTorch with torchrun, it automatically sets the RANK and WORLD_SIZE envs: https://pytorch.org/docs/stable/elastic/run.html#module-torch.distributed.run.
But if the user just uses the nn.parallel.DistributedDataParallel package to set up PyTorch DDP (as in this example), we have to set RANK and WORLD_SIZE for each pod.

In the future, we can configure our PyTorchJob to only run with torchrun, but that requires changes in the SDK and Controller.

Currently, when the user runs create_job via a training function, we start the training script with a python3 start command.
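
To illustrate why those variables matter when the script is launched with plain python3, here is a minimal DDP bootstrap sketch (assuming the training script uses the standard env:// init method; this is not the example script itself):

# With init_method="env://", torch.distributed reads MASTER_ADDR, MASTER_PORT,
# RANK, and WORLD_SIZE from the environment, which is why the controller has
# to inject them when torchrun is not used.
import os
import torch.distributed as dist

def init_ddp():
    dist.init_process_group(
        backend="gloo",  # "nccl" would typically be used on GPU nodes
        init_method="env://",
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )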

Contributor

@andreyvelich You are correct that the job fails when using only Workers with a spec that has no elastic policy. For example, the spec below fails:

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "torchrun-test"
  namespace: training
spec:
  nprocPerNode: "1"
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            kind: "deepanker2"
          annotations:
            test_vesrion: "1"
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: quay.io/deepanker_gupta/kubeflow_training:torchrun
              imagePullPolicy: Always
              args: ["torchrun","/workspace/exp/training.py"]
              resources: 
                limits:
                  nvidia.com/gpu: 1
                  cpu: 10
                  memory: '10Gi'

@johnugeorge
Member

johnugeorge commented Jan 12, 2024

The example doesn't have a Master: https://github.com/kubeflow/training-operator/blob/master/examples/sdk/train_api.ipynb

@deepanker13 How does it work?

@andreyvelich
Member Author

The example doesn't have a Master: https://github.com/kubeflow/training-operator/blob/master/examples/sdk/train_api.ipynb

@deepanker13 How does it work?

It has a Master. If you take a look at the train API, we always create the Master template and pass it to the get_pytorchjob_template func.

@johnugeorge
Member

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Jan 16, 2024
@google-oss-prow google-oss-prow bot merged commit 0b6a30c into kubeflow:master Jan 16, 2024
35 checks passed
@andreyvelich andreyvelich deleted the sdk-fix-create-replicas branch January 16, 2024 20:06
johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024