The WORLD_SIZE environment variable in PyTorch is different from its definition #1790
> WORLD_SIZE env value is different from the definition

This is the total number of participating nodes.
I didn't understand. What I meant was that the term 'WORLD_SIZE' in PyTorch and in the PyTorch training operator has different meanings. Thus, when we use the 'torchrun' command, it automatically overwrites the value of 'WORLD_SIZE' with 'nnodes * nprocs_per_node'. This is quite confusing, isn't it? Please let me know if there is anything I am misunderstanding.
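As an illustration, here is a minimal sketch of what each worker process sees when a job is launched with `torchrun --nnodes=2 --nproc-per-node=8 train.py`, assuming the 2-node, 8-GPUs-per-node setup discussed in this thread; the variable names are the standard ones torchrun exports:

```python
import os

# Inside train.py, after torchrun has spawned this worker process.
# torchrun overwrites these variables for every worker it starts,
# regardless of what the pod-level environment said.
world_size = int(os.environ["WORLD_SIZE"])   # nnodes * nproc_per_node = 16
rank = int(os.environ["RANK"])               # globally unique, 0..15
local_rank = int(os.environ["LOCAL_RANK"])   # 0..7 within this pod/node

print(f"global rank {rank} of {world_size}, local rank {local_rank}")
```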
In your case, can you take a dump where you see 'WORLD_SIZE' set to 2 in the training pods?
Actually,
@kuizhiqing What do you refer to as a node here in the operator context?
@johnugeorge I mean a pod in the operator. Given that we have 2 nodes/machines and each node has 8 GPUs, we want to run a job with the operator that launches 2 pods, with each pod requiring 8 GPUs.
@kuizhiqing Does that mean we launch multiple ranks in a pod in this case?
If so, we may be able to add something like training-operator/pkg/apis/kubeflow.org/v1/mpi_types.go (lines 61 to 64 in 1e32c2f).
Yes, since @isuyyy is using
@kuizhiqing Agree. It might be worth mentioning in the docs that the training operator doesn't support launching multiple ranks in one pod.
@tenzen-y IMO, it would be better to support and encourage users to use
@tenzen-y Let me try to understand your statement "training operator doesn't support launch multiple ranks in one pod". In a cluster of 2 nodes with 8 GPUs each, I see a couple of options.

Option 1: run one rank per node on baremetal. With the operator, this is equivalent to two worker replicas with one rank each. WORLD_SIZE is 2 in both cases.

Option 2: in the baremetal case, run one rank per GPU. With the operator, we can simulate a similar behaviour. WORLD_SIZE is 16 in both cases.

Are we talking about the same use case or something different? I am not sure if we want to bring
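A small sketch of the arithmetic behind the two options above; the per-node process counts are inferred from the WORLD_SIZE values stated in the comment, since the exact launch commands are not shown:

```python
# 2 nodes/pods with 8 GPUs each, as in the cluster described above.
nnodes = 2

# Option 1: one process per node/pod (nproc_per_node = 1).
print(nnodes * 1)   # WORLD_SIZE == 2

# Option 2: one process per GPU (nproc_per_node = 8).
print(nnodes * 8)   # WORLD_SIZE == 16
```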
@johnugeorge I wanted to say about the following case:

```yaml
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "torchrun --nnodes=2 --nproc-per-node=8 python train.py"
              resources:
                limits:
                  example.com/gpu: 8
```

IIUC, the training operator doesn't support nproc-per-node values other than 1.
@tenzen-y Is there a specific reason that you want to run the command explicitly
@johnugeorge No, I don't have use cases. I just wanted to say that mentioning "the training operator doesn't support launching multiple ranks in one pod" in the docs might be better. This means we should suggest users set an appropriate number of replicas to

However, @kuizhiqing has another opinion.
In fact, I have several rationales to substantiate my perspective:
In summary, based on my points, using 8 GPUs per pod and specifying the ranks explicitly for hybrid parallelism are important for performance in PyTorch distributed training of large language models. Let me know if I should clarify any part of the arguments.
@kuizhiqing Thanks for providing this information. I agree, usually running multiple GPUs per pod is more efficient, so the Training Operator should properly set env variables for the different use cases. All of our existing PyTorch examples don't use

@kuizhiqing @johnugeorge @tenzen-y I think we should discuss this at the upcoming AutoML + Training WG Community Meeting on May 17th.
@kuizhiqing As you say, attaching multiple GPUs to a Pod is worth it. But I'm not sure why you prefer setting

```diff
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
+               - "torchrun --nnodes=2 --nproc-per-node=1 python train.py"
-               - "torchrun --nnodes=2 --nproc-per-node=8 python train.py"
              resources:
                limits:
                  example.com/gpu: 8
```

Is there any reason you prefer
@andreyvelich However, the multiple-GPUs-in-one-pod problem exists in the launch version. I would like to attend the meeting, but I cannot guarantee my availability since 5:00 PM UTC is not Asian-timezone friendly.
@tenzen-y I think we should use --nproc-per-node=8 in this case, since I assume we are using 1 process per GPU. This is consistent with the torchrun design. Typically, one GPU will be bound to 1 process or thread. It is possible to do it all in one process, but that may not be efficient. So I think 1 GPU per process is the default setup in production. Please let me know if you have any other use cases.

More specific examples of using

For the operator, we could implement it using environment variables; please refer to #1573
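A minimal sketch of the 1-process-per-GPU pattern described above, assuming the script is started by torchrun so that LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are already in the environment:

```python
import os

import torch
import torch.distributed as dist


def main():
    # Each worker process is bound to exactly one GPU, selected by its local rank.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # With no init_method given, this uses the env:// rendezvous and reads
    # RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT from the environment.
    dist.init_process_group(backend="nccl")

    # ... build the model, wrap it in DistributedDataParallel, train ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```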
Sounds good to me. In fact, we use torch.distributed.run for Elastic PyTorch Training. Note:

https://pytorch.org/docs/stable/distributed.html#launch-utility
@kuizhiqing Thank you for clarifying :) That information makes a lot of sense and is great to see!
I agree with @kuizhiqing's suggestions.

Any concerns, @johnugeorge?
@andreyvelich Sorry for the late response, and thank you for suggesting that. I had missed your message.
@kuizhiqing Sorry for the late reply. I agree with your points on performance and the related strategies for baremetal runs. But in the container world, what is the design change that you propose for "8 GPUs per pod and specifying the ranks"? How do you define "LOCAL_RANK" in this case?
@tenzen-y @johnugeorge Sorry for the silence recently, I was quite occupied. I'd like to make a proposal about PyTorch operator adaptation that takes PyTorch and its applications, like Megatron and DeepSpeed, into consideration when I'm available.
Thanks! @kuizhiqing

Thanks @kuizhiqing, looking forward to it.
This also implies RANK is set to a value different from its definition, right? I believe RANK should be in the range of 0 to (num_nodes * num_procs_per_node - 1). From the torchrun docs, RANK is defined as "The global rank." In reality, I think we're getting RANK values referring to the node index, not the global worker rank.
Was this solved by #1840?
Looks like it fixes WORLD_SIZE but doesn't make a corresponding change to RANK.
Any known workarounds at the moment? I suppose a user-defined worker process could overwrite the
@brannondorsey @nairbv What's your launch mode? As I mentioned in the PR and discussion, we support 3 modes to launch, and RANK may actually be overwritten.
@kuizhiqing I'm looking to perform multi-node distributed training with a

I'm looking to basically do what's described in #1872, or your proposal in #1836.
@kuizhiqing Same here, running training jobs with torchrun. The current workaround is to just use whatever values are available, based on whatever they mean when passed to us, even if they don't match the documentation. I don't know much about kubeflow/training operators, but for example I just see we have a helm chart with a bunch of commands doing things like

Definitions and env variables expected by torchrun:

Also, the fix will be BC-breaking for the workaround, since we'll still need to pass values that mean "number of nodes" (etc.) for each of these commands.
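One possible shape of the workaround described above, as a sketch only: it assumes the pod environment still carries the pre-fix semantics (WORLD_SIZE = number of pods, RANK = pod index) plus the MASTER_ADDR/MASTER_PORT set by the operator; the NPROC_PER_NODE variable and train.py script name are hypothetical placeholders, not anything the operator provides.

```python
import os

# Hypothetical wrapper used as the container command instead of invoking
# torchrun directly. It reinterprets the operator-provided variables and
# passes explicit flags, so the values torchrun computes for its workers
# are correct regardless of what the pod-level env vars mean.
nnodes = os.environ["WORLD_SIZE"]       # number of pods under the old semantics
node_rank = os.environ["RANK"]          # pod index under the old semantics
nproc_per_node = os.environ.get("NPROC_PER_NODE", "8")  # assumed: GPUs per pod

os.execvp("torchrun", [
    "torchrun",
    "--nnodes", nnodes,
    "--nproc_per_node", nproc_per_node,
    "--node_rank", node_rank,
    "--master_addr", os.environ["MASTER_ADDR"],
    "--master_port", os.environ["MASTER_PORT"],
    "train.py",
])
```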
First of all, we do not need to pass env as args to the command any more; PyTorch will handle all env vars with the prefix PET_, ref.

Well, nnodes means the number of nodes, i.e. pods in the Kubernetes context, so nnodes equals the total replicas defined in the spec, see ref.

For RANK, it will be overwritten, as it should be. Let's try to understand it as follows: we have 8 GPUs in one pod, so we have 8 processes in it; they share the same env but with a different RANK, while torchrun is the MAIN process in the pod that manages the 8 processes, so it will and should handle the env for its subprocesses.

As a short summary, the env is assigned to the pod, for the main process, not for the worker processes bound to GPUs.

Hope it helps, feel free to continue the discussion if anything remains unclear.
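To make the pod-level vs. worker-level distinction concrete, here is a simplified sketch of the bookkeeping the main process (torchrun) performs for its subprocesses; it only illustrates the relationship between the values and is not torchrun's actual elastic-agent code:

```python
import os

# Values visible to the main process in the pod (one pod per "node").
nnodes = 2            # total pods/replicas in the job
node_rank = 1         # this pod's index, e.g. taken from the pod-level env
nproc_per_node = 8    # one worker process per GPU in this pod

for local_rank in range(nproc_per_node):
    # Each spawned worker gets its own copy of the env with per-process values.
    worker_env = dict(os.environ)
    worker_env["WORLD_SIZE"] = str(nnodes * nproc_per_node)            # 16
    worker_env["RANK"] = str(node_rank * nproc_per_node + local_rank)  # 8..15
    worker_env["LOCAL_RANK"] = str(local_rank)                         # 0..7
    # torchrun would now start the training script with worker_env.
```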
Right, that's clear, and that's what nnodes means to torchrun. After the referenced PR, WORLD_SIZE here will also mean the same thing that it means in torchrun (num procs across all nodes). If RANK (in the pod env) doesn't change, it means that RANK will (still) be in the range of

The title of this issue seems to suggest that the env variables should match what's expected by PyTorch, but maybe that's not the goal. Personally, I might prefer if RANK wasn't set, and instead the rank of the node/pod was called something like NODE_RANK or PET_NODE_RANK or POD_RANK.
PyTorch Lightning's

Unfortunately,

Right now, I'm basically here...
You are right, maybe not. The goal of this operator is to allow users to run PyTorch jobs in different ways; there are many methods indeed. Well, we take compatibility as the priority: we just set almost all env vars for every possible way and ensure it works. If you find it does not work for a conventional setting, please let me know.

The architecture of distributed process management is quite different for different launch methods, and the situation is even worse if we take elastic mode into consideration. It would be easier for the operator to fix the entry mode, but that's not the case for now. Technically, the operator process runs before the user process, so we do not know the launch method. It's too tricky to evaluate the user command, and we cannot do that if it is wrapped by a shell script.
As a user, one of the things I found confusing about RANK was that, since it was set, I assumed the operation must be happening at the level of individual processes (and I was wondering how that could interact with torchrun, which would usually start the processes). In PyTorch, RANK (and LOCAL_RANK) are set at the process level, so for a valid value to exist, the (e.g. 8) processes per node would already have to have been kicked off. It just made it hard for me to get my head around what the architecture was, what was going on, or how to properly set the parameters to torchrun.

Without actually creating the per-node processes, I don't think it's possible to set correct RANK or LOCAL_RANK vars that conform to the torchrun definition (which doesn't matter, because torchrun can set them itself), but something like NODE_RANK would be clear and meaningful in this context.

I agree it's challenging and requires coordination, though, since the change isn't backwards compatible. E.g. if RANK is removed and replaced with NODE_RANK, then users need to change any code referencing RANK. Then again, the definition of WORLD_SIZE is already changing from nnodes to the global number of procs (to conform to torchrun), so it might be a good time to change both.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
In the PyTorch documentation, it is mentioned that:
However, in the code below, the WORLD_SIZE environment variable is set to the number of replicas.
training-operator/pkg/controller.v1/pytorch/envvar.go (lines 73 to 76 in aae672f)
Why is this the case?