I have noticed that `group_size` is set to `world_size` in the examples, but as far as I understand, `group_size` can be set to other values.
I have also found that `get_world_size()` returns the number of all processes (https://github.com/KaiyuYue/torchshard/blob/main/torchshard/distributed/core.py#L18).
These two findings confuse me in a multi-node setting, say 2 nodes with 2 processes each.
If `group_size` is 2, then there are 2 distinct groups besides the (overlapping) default group. However, calling `get_world_size()` without specifying a group can make a layer be split into 4 parts, whereas 2 parts would be expected in this case.
Correct me if I am wrong.
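To make the scenario concrete, here is a minimal pure-Python sketch of the partitioning described above (`shard_groups` is a hypothetical helper, not torchshard code; with `torch.distributed`, each resulting rank list could be passed to `dist.new_group()`):

```python
def shard_groups(world_size, group_size):
    # Partition ranks [0, world_size) into contiguous groups of group_size.
    # Each sublist corresponds to one process group.
    assert world_size % group_size == 0
    return [list(range(i, i + group_size))
            for i in range(0, world_size, group_size)]

# 2 nodes with 2 processes each: world_size = 4, group_size = 2
print(shard_groups(4, 2))  # → [[0, 1], [2, 3]]
```

With `group_size = 2` there are indeed 2 distinct groups, so a layer sharded per-group would be split into 2 parts, not 4.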
Sorry for the late reply. Yes, you are right: `group_size` can be other values. I'm glad to talk about this.
I didn't complete that part of the code (you found it!) because
`group_size = world_size` has the simplest implementation. Say, given `world_size = 8`, if we want to slice the weights into 4 shards, we need extra code to handle:
[1] keeping the same gradients on every 2 GPUs of the 4 groups,
[2] gathering different gradients from the 4 groups.
If we slice the weights into 8 shards, only [2] remains, which simply uses PyTorch's original dist APIs.
Another thought was that if we want to distribute weights onto multiple GPUs to reduce memory usage, why not reduce it to the minimal case, i.e., `group_size = world_size`? So I did it this way.
If you would like to set a different `group_size` to handle cases I don't know about, you may need to add code on top of DDP and the other distributed APIs by setting group ids.
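Steps [1] and [2] above can be sketched in plain Python (`shard_layout` is a hypothetical helper assuming a contiguous rank-to-shard assignment; real code would pass each rank list to `dist.new_group()` and run `all_reduce` over the step-[1] groups and `all_gather` over the step-[2] groups):

```python
def shard_layout(world_size, num_shards):
    # With num_shards < world_size, replicas = world_size // num_shards
    # GPUs hold the same shard. Step [1] averages gradients inside each
    # same-shard group; step [2] gathers the different shards across one
    # rank per shard.
    replicas = world_size // num_shards
    same_shard = [list(range(s * replicas, (s + 1) * replicas))
                  for s in range(num_shards)]     # groups for step [1]
    cross_shard = [list(range(r, world_size, replicas))
                   for r in range(replicas)]      # groups for step [2]
    return same_shard, cross_shard

same, cross = shard_layout(8, 4)
print(same)   # → [[0, 1], [2, 3], [4, 5], [6, 7]]
print(cross)  # → [[0, 2, 4, 6], [1, 3, 5, 7]]
```

With `num_shards = world_size`, every same-shard group collapses to a single rank, which is why only step [2] is needed in that case.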
https://github.com/KaiyuYue/torchshard/blob/89e21def180bf6063ceb2e312a61631173abc7e7/projects/minGPT/main.py#L150
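As a rough sketch of what "setting group ids" could look like (`group_index_for_rank` is a hypothetical helper; the real wiring would create the groups with `dist.new_group()` and hand the chosen one to DDP via its `process_group` argument):

```python
def group_index_for_rank(rank, groups):
    # Return the index of the process group this rank belongs to.
    # In torch.distributed, every rank must call dist.new_group() for
    # every group but keeps only its own, e.g.:
    #   pgs = [dist.new_group(g) for g in groups]
    #   ddp = DistributedDataParallel(model, process_group=pgs[idx])
    for idx, ranks in enumerate(groups):
        if rank in ranks:
            return idx
    raise ValueError(f"rank {rank} is not in any group")

print(group_index_for_rank(3, [[0, 1], [2, 3]]))  # → 1
```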