[SDK] Fix Worker and Master templates for PyTorchJob #1988
Conversation
@andreyvelich: GitHub didn't allow me to assign the following users: droctothorpe, deepanker13. Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: andreyvelich. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Pull Request Test Coverage Report for Build 7505127919
💛 - Coveralls
```python
if master_pod_template_spec:
    pytorchjob.spec.pytorch_replica_specs[
        constants.REPLICA_TYPE_MASTER
    ] = models.KubeflowOrgV1ReplicaSpec(
        replicas=1,
        template=master_pod_template_spec,
    )
# If we don't define Master template, use the Worker template.
else:
```
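For context, the `else` branch in the diff above is cut off; a minimal sketch of how the fallback to the Worker template could look is shown below. The variable name `worker_pod_template_spec` is an assumption, not taken from the actual PR code.

```python
else:
    # Assumed completion: reuse the Worker pod template for the Master
    # replica so the controller can still inject RANK=0 and WORLD_SIZE.
    pytorchjob.spec.pytorch_replica_specs[
        constants.REPLICA_TYPE_MASTER
    ] = models.KubeflowOrgV1ReplicaSpec(
        replicas=1,
        template=worker_pod_template_spec,  # assumed variable name
    )
```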
This is a bit confusing to me. Typically, Master is explicitly set when the user requires it. The user sets num_workers, but we use a Master replica with workers = num_workers - 1. This fix is a bit counter-intuitive.
If Master is mandatory, we can add a needsMaster flag in train() that defaults to True, and add a validation check to prevent the user from overriding it (for now).
> Typically, Master is explicitly set when the user requires it.

@johnugeorge As I said, currently our controller doesn't set the appropriate ENV variables if Master is not set in the PyTorchJob. Without RANK and WORLD_SIZE we simply can't run PyTorch DDP properly.

> The user sets num_workers, but we use a Master replica with workers = num_workers - 1. This fix is a bit counter-intuitive.

That is because the Master is the Worker with RANK=0. It just acts as a coordinator across all workers.

> If Master is mandatory, we can add a needsMaster flag in train() that defaults to True.

Why do we need to add this flag to the train() function if it is not necessary for the user? Users just need to define how many workers they want to use, and we create the PyTorchJob with the appropriate spec.
In the future, if we decide to support creating a PyTorchJob without the Master replica, we can modify the SDK.
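To illustrate the point above, here is a purely illustrative Python sketch of the resulting rank layout; the real assignment happens in the Go controller's envvar.go and may differ in detail:

```python
def rank_layout(num_workers: int) -> dict:
    """Illustrative RANK layout for a PyTorchJob created with num_workers.

    The SDK creates 1 Master replica plus num_workers - 1 Worker replicas;
    the Master acts as the worker with RANK=0, and WORLD_SIZE == num_workers.
    """
    layout = {"master-0": 0}
    for i in range(num_workers - 1):
        layout[f"worker-{i}"] = i + 1
    return layout


# Example: num_workers=3 -> 1 Master + 2 Workers, WORLD_SIZE=3.
print(rank_layout(3))  # {'master-0': 0, 'worker-0': 1, 'worker-1': 2}
```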
@andreyvelich I am able to run the following spec with only Workers when using an elastic policy:
```yaml
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "torchrun-test"
  namespace: training
spec:
  nprocPerNode: "1"
  elasticPolicy:
    rdzvBackend: c10d
  pytorchReplicaSpecs:
    worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            kind: "deepanker2"
          annotations:
            test_vesrion: "1"
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: quay.io/deepanker_gupta/kubeflow_training:torchrun
              imagePullPolicy: Always
              args: ["torchrun", "/workspace/exp/training.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
                  cpu: 10
                  memory: '10Gi'
```
You can try running it on your cluster, as the image is public.
The following environment variables were set when I ran the above; RANK and WORLD_SIZE are getting set:
environ({'PATH': '/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'HOSTNAME': 'torchrun-test-worker-1', 'NVIDIA_VISIBLE_DEVICES': 'GPU-28067a3b-25cd-58ed-216b-d315c6c9aeff', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib:/usr/local/nvidia/lib64', 'PYTORCH_VERSION': '2.0.1', 'PYTHONUNBUFFERED': '1', 'PET_RDZV_ENDPOINT': 'torchrun-test-worker-0:23456', 'PET_RDZV_BACKEND': 'c10d', 'PET_NNODES': '2:2', 'KUBERNETES_PORT': 'tcp://172.19.0.1:443', 'KUBERNETES_PORT_443_TCP': 'tcp://172.19.0.1:443', 'KUBERNETES_PORT_443_TCP_PROTO': 'tcp', 'KUBERNETES_PORT_443_TCP_PORT': '443', 'KUBERNETES_PORT_443_TCP_ADDR': '172.19.0.1', 'KUBERNETES_SERVICE_HOST': '172.19.0.1', 'KUBERNETES_SERVICE_PORT': '443', 'KUBERNETES_SERVICE_PORT_HTTPS': '443', 'HOME': '/root', 'LC_CTYPE': 'C.UTF-8', 'LOCAL_RANK': '0', 'RANK': '1', 'GROUP_RANK': '1', 'ROLE_RANK': '1', 'ROLE_NAME': 'default', 'LOCAL_WORLD_SIZE': '1', 'WORLD_SIZE': '2', 'GROUP_WORLD_SIZE': '2', 'ROLE_WORLD_SIZE': '2', 'MASTER_ADDR': 'torchrun-test-worker-0', 'MASTER_PORT': '45614', 'TORCHELASTIC_RESTART_COUNT': '0', 'TORCHELASTIC_MAX_RESTARTS': '0', 'TORCHELASTIC_RUN_ID': 'none', 'TORCHELASTIC_USE_AGENT_STORE': 'False', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'TORCHELASTIC_ERROR_FILE': '/tmp/torchelastic_i2rkd3p1/none_1p31jibu/attempt_0/0/error.json'})
@deepanker13 That's right. That is because when you run PyTorch with torchrun, it automatically sets the RANK and WORLD_SIZE envs: https://pytorch.org/docs/stable/elastic/run.html#module-torch.distributed.run.
But if the user just uses the nn.parallel.DistributedDataParallel package to set up PyTorch DDP (as in this example), we have to set RANK and WORLD_SIZE for each pod.
In the future, we can configure our PyTorchJob to run only with torchrun, but that requires changes in the SDK and Controller.
Currently, when the user runs create_job via a training function, we start the training script with the python3 start command.
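For reference, here is a minimal sketch of a DDP entrypoint started with plain python3 (not torchrun); it only works when RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are already present in the environment. This is not the script from the linked example, just an illustration of why the controller has to inject those variables:

```python
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# With the "env://" init method, torch.distributed reads MASTER_ADDR,
# MASTER_PORT, RANK, and WORLD_SIZE from the environment, which is why
# the training operator must set them for every pod.
dist.init_process_group(backend="gloo", init_method="env://")

model = DDP(nn.Linear(10, 10))
print(f"rank={dist.get_rank()}, world_size={dist.get_world_size()}")

dist.destroy_process_group()
```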
@andreyvelich You are correct that the job fails when using only Workers with a spec that has no elastic policy. For example, the spec below is failing:
```yaml
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "torchrun-test"
  namespace: training
spec:
  nprocPerNode: "1"
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            kind: "deepanker2"
          annotations:
            test_vesrion: "1"
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: quay.io/deepanker_gupta/kubeflow_training:torchrun
              imagePullPolicy: Always
              args: ["torchrun", "/workspace/exp/training.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
                  cpu: 10
                  memory: '10Gi'
```
The example doesn't have a Master: https://github.com/kubeflow/training-operator/blob/master/examples/sdk/train_api.ipynb
@deepanker13 How does it work?
It has a Master. If you take a look at the
/lgtm
Currently, the user has to set both Master and Worker replicas, since we assign RANK and WORLD_SIZE only if Master is set: https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/pytorch/envvar.go#L67.
Thus, I modified the SDK to always set the Master replica.
The Worker replica count is equal to num_workers - 1, since we assign RANK=0 to the Master and it behaves like the first Worker. I added a TODO in the SDK to track whether we could create a PyTorchJob without the Master, but that will require Controller changes.
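As a rough usage sketch of what this means for the train API: the call below is hypothetical, and parameter names such as func and the job name are assumptions that may differ between SDK versions.

```python
from kubeflow.training import TrainingClient


def train_func():
    # User-defined training code; it relies on RANK / WORLD_SIZE being
    # injected by the training operator into every pod.
    ...


# Hypothetical call: with this fix, num_workers=3 should produce a
# PyTorchJob with 1 Master replica (RANK=0) and 2 Worker replicas.
TrainingClient().train(
    name="pytorch-ddp-example",  # hypothetical job name
    func=train_func,             # assumed parameter name
    num_workers=3,
)
```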
Also, I renamed num_worker_replicas to num_workers to keep it consistent with the train API.
/assign @tenzen-y @droctothorpe @johnugeorge @deepanker13