
Is it safe or recommended to use multiple communicators for real distributed training #1520

Open
ZhiyiHu1999 opened this issue Nov 19, 2024 · 0 comments

@ZhiyiHu1999

Hello! In my parallelism strategy, say with 4 nodes and 2 GPUs per node, I would like to create comm_0 spanning all 8 GPUs, comm_1 for the 4 GPUs on node_0 and node_1, and comm_2 for the remaining 4 GPUs on node_2 and node_3.

If, in my design, no collectives are ever concurrent on communicators that share a GPU (here, collectives on comm_0 and comm_1 are never concurrent, but collectives on comm_1 and comm_2 may be concurrent since they involve disjoint GPUs), is this a safe use of NCCL?

Also, does each communicator have its own Ring/Tree channels, and does each need its own rank identifiers from 0 to (nRanks_in_the_communicator - 1)? Thanks a lot!
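
For concreteness, here is a minimal sketch of the layout described above. It is not code from this issue; it assumes one process per GPU, MPI used only to broadcast the NCCL unique ID, NCCL >= 2.18 for ncclCommSplit, and error checking omitted for brevity:

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;                     // 0..7, 8 in this example
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Assumes ranks are ordered node by node, so local rank = world_rank % 2.
    cudaSetDevice(world_rank % 2);

    // comm_0: one communicator spanning all 8 GPUs.
    ncclUniqueId id;
    if (world_rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
    ncclComm_t comm0;
    ncclCommInitRank(&comm0, world_size, id, world_rank);

    // comm_1 / comm_2: split comm_0 by node pair. Ranks 0-3 (node_0, node_1)
    // get color 0, ranks 4-7 (node_2, node_3) get color 1. Each process ends
    // up with exactly one sub-communicator.
    int color = world_rank / 4;
    ncclComm_t sub_comm;
    ncclCommSplit(comm0, color, /*key=*/world_rank, &sub_comm, NULL);

    int sub_rank, sub_size;
    ncclCommUserRank(sub_comm, &sub_rank);          // 0..3, independent of comm_0's ranks
    ncclCommCount(sub_comm, &sub_size);             // 4

    // ... issue collectives here, never concurrently on comm_0 and sub_comm
    // from the same GPU; comm_1 and comm_2 are disjoint, so they may overlap.

    ncclCommDestroy(sub_comm);
    ncclCommDestroy(comm0);
    MPI_Finalize();
    return 0;
}
```

In this sketch every GPU belongs to exactly two communicators (comm_0 plus one of comm_1/comm_2), and the rank reported by ncclCommUserRank on the sub-communicator is its own 0..3 numbering, separate from the 0..7 ranks of comm_0.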
