Hello! In my parallelism strategy, for example with 4 nodes and 2 GPUs per node, I would like to create comm_0 spanning all 8 GPUs, comm_1 for the 4 GPUs on node_0 and node_1, and comm_2 for the other 4 GPUs on node_2 and node_3.
If, by design, no collectives are ever concurrent on communicators that share a GPU (here, collectives on comm_0 and comm_1 are never concurrent, but collectives on comm_1 and comm_2 may run concurrently since they involve disjoint GPUs), is this a safe use of NCCL?
Also, does each communicator have its own Ring/Tree channels, and does each need its own rank identifiers from 0 to (nRanks_in_the_communicator - 1)? Thanks a lot!
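For reference, here is a minimal sketch of the setup described above, assuming MPI for bootstrapping, world ranks 0-3 on node_0/node_1 and ranks 4-7 on node_2/node_3, and NCCL >= 2.18 for `ncclCommSplit`. The device mapping, buffer handling, and error checking are illustrative, not a definitive implementation.

```c
// Sketch: build comm_0 over all 8 ranks, then split it into comm_1 (ranks 0-3)
// and comm_2 (ranks 4-7). Rank layout and device mapping are assumptions.
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int worldRank, worldSize;                   // expected: worldSize == 8
    MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);
    MPI_Comm_size(MPI_COMM_WORLD, &worldSize);

    cudaSetDevice(worldRank % 2);               // 2 GPUs per node (assumption)

    // comm_0: one communicator over all 8 GPUs.
    ncclUniqueId id;
    if (worldRank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
    ncclComm_t comm0;
    ncclCommInitRank(&comm0, worldSize, id, worldRank);

    // comm_1 / comm_2: split comm_0 by node pair. Ranks with the same color
    // land in the same sub-communicator; within it they are renumbered 0..3,
    // ordered by the key (here the world rank).
    int color = (worldRank < 4) ? 0 : 1;        // 0 -> comm_1, 1 -> comm_2
    ncclComm_t subComm;                         // this rank's comm_1 or comm_2
    ncclCommSplit(comm0, color, worldRank, &subComm, NULL);

    int subRank, subSize;
    ncclCommUserRank(subComm, &subRank);        // 0..3 inside the sub-communicator
    ncclCommCount(subComm, &subSize);           // 4

    // ... issue collectives on comm0 and subComm here, making sure operations
    //     on comm0 and this rank's subComm are never concurrent ...

    ncclCommDestroy(subComm);
    ncclCommDestroy(comm0);
    MPI_Finalize();
    return 0;
}
```

Each communicator maintains its own internal topology (Ring/Tree channels) and its own rank numbering, as the `ncclCommUserRank`/`ncclCommCount` calls above illustrate.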