
Multi-node setting? #8

Open
GeneZC opened this issue Apr 26, 2022 · 2 comments
Labels
Good Issue (Good reference for newcomers)

Comments

@GeneZC

GeneZC commented Apr 26, 2022

https://github.com/KaiyuYue/torchshard/blob/89e21def180bf6063ceb2e312a61631173abc7e7/projects/minGPT/main.py#L150

I have noticed that group_size is set to world_size in the examples, but as far as I understand, group_size can actually be set to other values.

https://github.com/KaiyuYue/torchshard/blob/main/torchshard/distributed/core.py#L18

I have also found that get_world_size() returns the total number of processes.

These two findings confuse me in a multi-node setting, say 2 nodes with 2 processes each.

If group_size is 2, then there are 2 distinct groups besides the default group (which they overlap with). However, since get_world_size() is called without specifying a group, a layer can be split into 4 parts, whereas it should be 2 in this case.

Correct me if I am wrong.
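
To make the mismatch concrete, here is a minimal sketch (not torchshard's code) assuming 2 nodes × 2 processes, i.e. a global world size of 4, with hypothetical subgroups group_a and group_b of size 2:

```python
# Minimal sketch: 4 processes total, e.g. launched with
#   torchrun --nnodes=2 --nproc_per_node=2 demo.py
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")  # "nccl" on GPU
    rank = dist.get_rank()

    # Two distinct groups of size 2 besides the default (global) group.
    # new_group must be called by all ranks, in the same order.
    group_a = dist.new_group(ranks=[0, 1])
    group_b = dist.new_group(ranks=[2, 3])
    my_group = group_a if rank < 2 else group_b

    # Without a group argument this reports the global world size (4),
    # so a layer would be sliced into 4 shards ...
    print(rank, dist.get_world_size())

    # ... while the intended shard count here is the group size (2).
    print(rank, dist.get_world_size(group=my_group))

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```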

@kaiyuyue
Owner

Hi,

Sorry for the late reply. Yes, you are right: group_size can be other numbers. I'm glad to talk about this.

I didn't complete that part of the code (you found it!) because

  • group_size = world_size has the simplest implementation. For example, given world_size = 8, if we want to slice the weights into 4 shards, we need to write extra code to handle:

    • [1] keeping the gradients identical on the 2 GPUs within each of the 4 groups,
    • [2] gathering the different gradients from the 4 groups.

    If we slice the weights into 8 shards, only [2] remains, which can be done directly with PyTorch's original dist APIs.

  • Another thought was: if we want to distribute the weights onto multiple GPUs to reduce memory usage, why not push it to the minimal per-GPU case, i.e., group_size = world_size? So I implemented it this way.

If you would like to set a different number for group_size to handle cases I haven't considered, you would probably need to add code on top of DDP and the other distributed APIs by setting group ids (see the sketch below).
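
For reference, here is a rough sketch of what [1] and [2] could involve for world_size = 8 and group_size = 4. This is an assumption-based illustration, not torchshard's implementation: the group layout and the helpers build_groups, sync_replica_grads, and gather_shard_grads are hypothetical.

```python
# Assumed layout: shard groups [0-3] and [4-7] each hold one full copy of the
# layer split into 4 shards; ranks (i, i + 4) hold the same shard ("replicas").
import torch
import torch.distributed as dist

def build_groups(group_size):
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    n_shard_groups = world_size // group_size

    # All ranks must create every group, in the same order.
    shard_groups = [
        dist.new_group(ranks=list(range(g * group_size, (g + 1) * group_size)))
        for g in range(n_shard_groups)
    ]
    replica_groups = [
        dist.new_group(ranks=[i + g * group_size for g in range(n_shard_groups)])
        for i in range(group_size)
    ]
    return shard_groups[rank // group_size], replica_groups[rank % group_size]

def sync_replica_grads(model, replica_group):
    # [1] keep gradients identical on the GPUs that hold the same shard,
    # by averaging across the replica group.
    n = dist.get_world_size(group=replica_group)
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=replica_group)
            p.grad.div_(n)

def gather_shard_grads(local_grad, shard_group):
    # [2] gather the different per-shard gradients inside one shard group.
    parts = [torch.empty_like(local_grad)
             for _ in range(dist.get_world_size(group=shard_group))]
    dist.all_gather(parts, local_grad, group=shard_group)
    return torch.cat(parts, dim=-1)
```

With group_size = world_size, every rank belongs to the single default group, so the replica bookkeeping in [1] disappears and only the gather in [2] is needed.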

@kaiyuyue added the Good Issue (Good reference for newcomers) label on May 10, 2022
@GeneZC
Author

GeneZC commented Feb 23, 2023

The case is actually that a node can hold at most 8 GPUs. If we have 16 GPUs, we have to place them on 2 nodes, with 8 GPUs on each node.

At the same time, the synchronization involved in sharding is known to be slow across nodes, so we would want to set the group size to 8 rather than 16. A rough sketch of node-local groups is below.
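
A minimal sketch of what I mean, assuming a single 16-process job over 2 nodes × 8 GPUs (the helper node_local_group is hypothetical, not part of torchshard):

```python
# Keep sharding collectives node-local by giving each node its own group of 8.
import torch.distributed as dist

def node_local_group(gpus_per_node=8):
    rank = dist.get_rank()
    world_size = dist.get_world_size()  # 16 in this case
    # All ranks must create every group, in the same order.
    groups = [
        dist.new_group(ranks=list(range(n * gpus_per_node, (n + 1) * gpus_per_node)))
        for n in range(world_size // gpus_per_node)
    ]
    return groups[rank // gpus_per_node]

# Sharded layers would then issue their collectives with group=node_local_group(),
# so they never cross the inter-node link, while gradient synchronization across
# the 2 nodes is handled separately.
```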
