I have noticed that `group_size` is set to `world_size` in the examples, but as far as I understand, `group_size` can be set to other values.
I have also found that `get_world_size()` returns the number of all processes (https://github.com/KaiyuYue/torchshard/blob/main/torchshard/distributed/core.py#L18).
These two findings confuse me in a multi-node setting, say 2 nodes with 2 processes each.
If `group_size` is 2, then there are 2 distinct groups besides the (overlapping) default group. However, calling `get_world_size()` without specifying a group can make a layer be split into 4 parts, whereas 2 parts would be expected in this case.
Correct me if I am wrong.
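To make the scenario concrete, here is a minimal pure-Python sketch of the partitioning described above (`shard_groups` is a hypothetical helper, not torchshard code; with `torch.distributed`, each resulting rank list could be passed to `dist.new_group()`):

```python
def shard_groups(world_size, group_size):
    # Partition ranks [0, world_size) into contiguous groups of group_size.
    # Each sublist corresponds to one process group.
    assert world_size % group_size == 0
    return [list(range(i, i + group_size))
            for i in range(0, world_size, group_size)]

# 2 nodes with 2 processes each: world_size = 4, group_size = 2
print(shard_groups(4, 2))  # → [[0, 1], [2, 3]]
```

With `group_size = 2` there are indeed 2 distinct groups, so a layer sharded per-group would be split into 2 parts, not 4.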
Sorry for the late reply. Yes, you are right: `group_size` can be other values. I'm glad to talk about this.
I didn't complete that part of the code (you found it!) because
`group_size = world_size` has the simplest implementation. Say, given `world_size = 8`, if we want to slice the weights into 4 shards, we need extra code to handle:
[1] keeping the same gradients on every 2 GPUs of the 4 groups,
[2] gathering different gradients from the 4 groups.
If we slice the weights into 8 shards, only [2] remains, which simply uses PyTorch's original dist APIs.
Another thought was that if we want to distribute weights onto multiple GPUs to reduce memory usage, why not reduce it to the minimal case, i.e., `group_size = world_size`? So I did it this way.
If you would like to set a different `group_size` to handle cases I don't know about, you may need to add code on top of DDP and the other distributed APIs by setting group ids.
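Steps [1] and [2] above can be sketched in plain Python (`shard_layout` is a hypothetical helper assuming a contiguous rank-to-shard assignment; real code would pass each rank list to `dist.new_group()` and run `all_reduce` over the step-[1] groups and `all_gather` over the step-[2] groups):

```python
def shard_layout(world_size, num_shards):
    # With num_shards < world_size, replicas = world_size // num_shards
    # GPUs hold the same shard. Step [1] averages gradients inside each
    # same-shard group; step [2] gathers the different shards across one
    # rank per shard.
    replicas = world_size // num_shards
    same_shard = [list(range(s * replicas, (s + 1) * replicas))
                  for s in range(num_shards)]     # groups for step [1]
    cross_shard = [list(range(r, world_size, replicas))
                   for r in range(replicas)]      # groups for step [2]
    return same_shard, cross_shard

same, cross = shard_layout(8, 4)
print(same)   # → [[0, 1], [2, 3], [4, 5], [6, 7]]
print(cross)  # → [[0, 2, 4, 6], [1, 3, 5, 7]]
```

With `num_shards = world_size`, every same-shard group collapses to a single rank, which is why only step [2] is needed in that case.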
https://github.com/KaiyuYue/torchshard/blob/89e21def180bf6063ceb2e312a61631173abc7e7/projects/minGPT/main.py#L150
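As a rough sketch of what "setting group ids" could look like (`group_index_for_rank` is a hypothetical helper; the real wiring would create the groups with `dist.new_group()` and hand the chosen one to DDP via its `process_group` argument):

```python
def group_index_for_rank(rank, groups):
    # Return the index of the process group this rank belongs to.
    # In torch.distributed, every rank must call dist.new_group() for
    # every group but keeps only its own, e.g.:
    #   pgs = [dist.new_group(g) for g in groups]
    #   ddp = DistributedDataParallel(model, process_group=pgs[idx])
    for idx, ranks in enumerate(groups):
        if rank in ranks:
            return idx
    raise ValueError(f"rank {rank} is not in any group")

print(group_index_for_rank(3, [[0, 1], [2, 3]]))  # → 1
```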