
Is it safe or recommended to use multiple communicators for real distributed training #1520

Open
ZhiyiHu1999 opened this issue Nov 19, 2024 · 0 comments

@ZhiyiHu1999

Hello! In my parallelism strategy, say with 4 nodes and 2 GPUs per node, I would like to create comm_0 spanning all 8 GPUs, comm_1 for the 4 GPUs on node_0 and node_1, and comm_2 for the remaining 4 GPUs on node_2 and node_3.

If, in my design, no collectives are ever concurrent on communicators that share a GPU (here, collectives on comm_0 and comm_1 are never concurrent, but collectives on comm_1 and comm_2 may be concurrent since they involve disjoint GPUs), is this a safe use of NCCL?

Also, does each communicator have its own Ring/Tree channels, and does each need its own rank identifiers from 0 to (nRanks_in_the_communicator - 1)? Thanks a lot!
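
For concreteness, here is a minimal sketch of the layout described above. It is not code from this issue; it assumes one process per GPU, MPI used only to broadcast the NCCL unique ID, NCCL >= 2.18 for ncclCommSplit, and error checking omitted for brevity:

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;                     // 0..7, 8 in this example
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Assumes ranks are ordered node by node, so local rank = world_rank % 2.
    cudaSetDevice(world_rank % 2);

    // comm_0: one communicator spanning all 8 GPUs.
    ncclUniqueId id;
    if (world_rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
    ncclComm_t comm0;
    ncclCommInitRank(&comm0, world_size, id, world_rank);

    // comm_1 / comm_2: split comm_0 by node pair. Ranks 0-3 (node_0, node_1)
    // get color 0, ranks 4-7 (node_2, node_3) get color 1. Each process ends
    // up with exactly one sub-communicator.
    int color = world_rank / 4;
    ncclComm_t sub_comm;
    ncclCommSplit(comm0, color, /*key=*/world_rank, &sub_comm, NULL);

    int sub_rank, sub_size;
    ncclCommUserRank(sub_comm, &sub_rank);          // 0..3, independent of comm_0's ranks
    ncclCommCount(sub_comm, &sub_size);             // 4

    // ... issue collectives here, never concurrently on comm_0 and sub_comm
    // from the same GPU; comm_1 and comm_2 are disjoint, so they may overlap.

    ncclCommDestroy(sub_comm);
    ncclCommDestroy(comm0);
    MPI_Finalize();
    return 0;
}
```

In this sketch every GPU belongs to exactly two communicators (comm_0 plus one of comm_1/comm_2), and the rank reported by ncclCommUserRank on the sub-communicator is its own 0..3 numbering, separate from the 0..7 ranks of comm_0.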
