
[SDK] Fix Worker and Master templates for PyTorchJob #1988

Merged

Conversation

andreyvelich
Member

Currently, the user has to set both the Master and Worker replicas, since we assign RANK and WORLD_SIZE only if the Master is set: https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/pytorch/envvar.go#L67.
Thus, I modified the SDK to always set the Master replica.
The Worker replica count is equal to num_workers - 1, since we assign RANK=0 to the Master and it behaves like the first Worker.

I added a TODO in the SDK to track whether we could create a PyTorchJob without the Master, but that would require Controller changes.

Also, I renamed num_worker_replicas to num_workers to keep it consistent with the train API.
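
To make the replica arithmetic above concrete, here is a minimal sketch (a hypothetical helper, not the SDK code itself) of how num_workers maps to one Master plus num_workers - 1 Workers:

# Hypothetical sketch of the replica arithmetic described above: one Master
# that acts as the RANK=0 worker, plus num_workers - 1 additional Workers.
def split_replicas(num_workers: int) -> dict:
    if num_workers < 1:
        raise ValueError("num_workers must be >= 1")
    replicas = {"Master": 1}
    if num_workers > 1:
        replicas["Worker"] = num_workers - 1
    return replicas

print(split_replicas(4))  # {'Master': 1, 'Worker': 3}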

/assign @tenzen-y @droctothorpe @johnugeorge @deepanker13


@andreyvelich: GitHub didn't allow me to assign the following users: droctothorpe, deepanker13.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.



[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

Pull Request Test Coverage Report for Build 7505127919

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on sdk-fix-create-replicas at 42.828%

Totals (Coverage Status)
Change from base Build 7505106731: 42.8%
Covered Lines: 3750
Relevant Lines: 8756

💛 - Coveralls

if master_pod_template_spec:
    pytorchjob.spec.pytorch_replica_specs[
        constants.REPLICA_TYPE_MASTER
    ] = models.KubeflowOrgV1ReplicaSpec(
        replicas=1,
        template=master_pod_template_spec,
    )
# If we don't define Master template, use the Worker template.
else:
Member

This is a bit confusing to me. Typically, the Master is explicitly set when it is required by the user. The user sets num_workers, but we create a Master replica with workers = num_workers - 1. This fix is a bit counter-intuitive.

Member

If the Master is mandatory, we can add a needsMaster flag in train() that defaults to True, and add a validation check to prevent the user from overriding it (for now).

Member Author

@andreyvelich andreyvelich Jan 13, 2024

Typically, the Master is explicitly set when it is required by the user.

@johnugeorge As I said, currently our controller doesn't set the appropriate environment variables if the Master is not set in the PyTorchJob. Without RANK and WORLD_SIZE we just can't run PyTorch DDP properly.

The user sets num_workers, but we create a Master replica with workers = num_workers - 1. This fix is a bit counter-intuitive.

That is because the Master is the Worker with RANK=0. It just acts as a coordinator across all the Workers.

If the Master is mandatory, we can add a needsMaster flag in train() that defaults to True.

Why do we need to add this flag to the train() func if it is not necessary for the user? Users just need to define how many workers they want, and we will create the PyTorchJob with the appropriate spec.

In the future, if we support creating a PyTorchJob without a Master replica, we can modify the SDK.
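
As a minimal illustration of the point above (a hypothetical sketch, not the controller logic), the Master is simply the replica that receives RANK=0, while each additional Worker receives the next rank and every replica sees the same WORLD_SIZE:

# Hypothetical sketch of the rank layout: Master gets RANK=0, Worker i gets
# RANK=i+1, and WORLD_SIZE equals the total number of replicas (num_workers).
def rank_layout(num_workers: int) -> list:
    pods = [{"replica": "Master", "RANK": 0, "WORLD_SIZE": num_workers}]
    for i in range(num_workers - 1):
        pods.append({"replica": f"Worker-{i}", "RANK": i + 1, "WORLD_SIZE": num_workers})
    return pods

for pod in rank_layout(3):
    print(pod)
# {'replica': 'Master', 'RANK': 0, 'WORLD_SIZE': 3}
# {'replica': 'Worker-0', 'RANK': 1, 'WORLD_SIZE': 3}
# {'replica': 'Worker-1', 'RANK': 2, 'WORLD_SIZE': 3}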

Contributor

@deepanker13 deepanker13 Jan 16, 2024

@andreyvelich
I am able to run the following spec with only Workers when using an elastic policy.

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "torchrun-test"
  namespace: training
spec:
  nprocPerNode: "1"
  elasticPolicy:
    rdzvBackend: c10d
  pytorchReplicaSpecs:
    worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            kind: "deepanker2"
          annotations:
            test_vesrion: "1"
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: quay.io/deepanker_gupta/kubeflow_training:torchrun
              imagePullPolicy: Always
              args: ["torchrun","/workspace/exp/training.py"]
              resources: 
                limits:
                  nvidia.com/gpu: 1
                  cpu: 10
                  memory: '10Gi'

You can try running it on your cluster, as the image is public.

Contributor

@deepanker13 deepanker13 Jan 16, 2024

The following environment variables were set when I ran the above; RANK and WORLD_SIZE are being set:

environ({'PATH': '/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'HOSTNAME': 'torchrun-test-worker-1', 'NVIDIA_VISIBLE_DEVICES': 'GPU-28067a3b-25cd-58ed-216b-d315c6c9aeff', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib:/usr/local/nvidia/lib64', 'PYTORCH_VERSION': '2.0.1', 'PYTHONUNBUFFERED': '1', 'PET_RDZV_ENDPOINT': 'torchrun-test-worker-0:23456', 'PET_RDZV_BACKEND': 'c10d', 'PET_NNODES': '2:2', 'KUBERNETES_PORT': 'tcp://172.19.0.1:443', 'KUBERNETES_PORT_443_TCP': 'tcp://172.19.0.1:443', 'KUBERNETES_PORT_443_TCP_PROTO': 'tcp', 'KUBERNETES_PORT_443_TCP_PORT': '443', 'KUBERNETES_PORT_443_TCP_ADDR': '172.19.0.1', 'KUBERNETES_SERVICE_HOST': '172.19.0.1', 'KUBERNETES_SERVICE_PORT': '443', 'KUBERNETES_SERVICE_PORT_HTTPS': '443', 'HOME': '/root', 'LC_CTYPE': 'C.UTF-8', 'LOCAL_RANK': '0', 'RANK': '1', 'GROUP_RANK': '1', 'ROLE_RANK': '1', 'ROLE_NAME': 'default', 'LOCAL_WORLD_SIZE': '1', 'WORLD_SIZE': '2', 'GROUP_WORLD_SIZE': '2', 'ROLE_WORLD_SIZE': '2', 'MASTER_ADDR': 'torchrun-test-worker-0', 'MASTER_PORT': '45614', 'TORCHELASTIC_RESTART_COUNT': '0', 'TORCHELASTIC_MAX_RESTARTS': '0', 'TORCHELASTIC_RUN_ID': 'none', 'TORCHELASTIC_USE_AGENT_STORE': 'False', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'TORCHELASTIC_ERROR_FILE': '/tmp/torchelastic_i2rkd3p1/none_1p31jibu/attempt_0/0/error.json'})
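
For reference, a quick way to reproduce such a check inside any replica pod is a short snippet like the following (hypothetical, not part of the SDK or the image above):

# Print the distributed-training variables that torchrun (or the controller)
# is expected to set in each pod.
import os

for key in ("RANK", "WORLD_SIZE", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT"):
    print(key, "=", os.environ.get(key))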

Member Author

@andreyvelich andreyvelich Jan 16, 2024

@deepanker13 That's right. That is because when you run PyTorch with torchrun, it automatically sets the RANK and WORLD_SIZE envs: https://pytorch.org/docs/stable/elastic/run.html#module-torch.distributed.run.
But if the user just uses the nn.parallel.DistributedDataParallel package to set up PyTorch DDP (as in this example), we have to set RANK and WORLD_SIZE for each pod.

In the future, we can configure our PyTorchJob to only run with torchrun, but that requires changes in the SDK and Controller.

Currently, when the user runs create_job via a training function, we start the training script with a python3 start command.
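
To illustrate why those variables matter when the script is launched with plain python3, here is a minimal DDP bootstrap sketch (assuming the training script uses the standard env:// init method; this is not the example script itself):

# With init_method="env://", torch.distributed reads MASTER_ADDR, MASTER_PORT,
# RANK, and WORLD_SIZE from the environment, which is why the controller has
# to inject them when torchrun is not used.
import os
import torch.distributed as dist

def init_ddp():
    dist.init_process_group(
        backend="gloo",  # "nccl" would typically be used on GPU nodes
        init_method="env://",
        rank=int(os.environ["RANK"]),
        world_size=int(os.environ["WORLD_SIZE"]),
    )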

Contributor

@andreyvelich You are correct that the job fails when using only Workers with a spec that has no elastic policy. For example, the spec below fails:

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "torchrun-test"
  namespace: training
spec:
  nprocPerNode: "1"
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            kind: "deepanker2"
          annotations:
            test_vesrion: "1"
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: quay.io/deepanker_gupta/kubeflow_training:torchrun
              imagePullPolicy: Always
              args: ["torchrun","/workspace/exp/training.py"]
              resources: 
                limits:
                  nvidia.com/gpu: 1
                  cpu: 10
                  memory: '10Gi'

@johnugeorge
Member

johnugeorge commented Jan 12, 2024

The example doesn't have a Master: https://github.com/kubeflow/training-operator/blob/master/examples/sdk/train_api.ipynb

@deepanker13 How does it work?

@andreyvelich
Member Author

The example doesn't have a Master: https://github.com/kubeflow/training-operator/blob/master/examples/sdk/train_api.ipynb

@deepanker13 How does it work?

It has a Master. If you take a look at the train API, we always create the Master template and pass it to the get_pytorchjob_template func.

@johnugeorge
Member

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Jan 16, 2024
@google-oss-prow google-oss-prow bot merged commit 0b6a30c into kubeflow:master Jan 16, 2024
35 checks passed
@andreyvelich andreyvelich deleted the sdk-fix-create-replicas branch January 16, 2024 20:06
johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024