Test environment & conditions:
Two hosts, each with 8 Hopper GPUs. One of the hosts has limited NVLink bandwidth, which causes its graph search to return crossNic=1.
CUDA_VISIBLE_DEVICES=0,2,4,6 or CUDA_VISIBLE_DEVICES=1,3,5,7 on each host, i.e., 8 ranks in total.
Set NCCL_ALGO=Ring NCCL_PROTO=Simple and run all_reduce_perf, all_gather, or reduce_scatter.
Both machines have the same intra-host topology:
      NIC-A  NIC-B  NIC-C  NIC-D
G0    pix    node   sys    sys     // same for G4
G1    node   pix    sys    sys     // same for G5
G2    sys    sys    pix    node    // same for G6
G3    sys    sys    node   pix     // same for G7
Test 1: The first node has crossNic=0 search result, while the second node has crossNic=1 search result.
+----------------------+-------- crossNic=0 -------+-------- crossNic=1 -------+
| Pattern 4 channel0 | A -> G0 G3 G2 G1 -> A | A -> G4 G7 G6 G5 -> B |
| Pattern 4 channel1 | B -> G1 G0 G3 G2 -> B | B -> G5 G6 G7 G4 -> A |
+----------------------+---------------------------+---------------------------+
Without alternating rings (by removing the "Exchange rings" code in ncclTopoPostset): good performance (network not congested)
global ring 0: -> A -> G0 G3 G2 G1 -> A -> A -> G4 G7 G6 G5 -> B ->
global ring 1: -> A -> G1 G0 G3 G2 -> A -> B -> G4 G7 G6 G5 -> A ->
with alternating rings (ideally & actually): good performance (network is still not congested)
global ring 0: -> A -> G0 G3 G2 G1 -> A -> B -> G5 G7 G6 G4 -> A ->
global ring 1: -> A -> G1 G0 G3 G2 -> A -> A -> G4 G6 G7 G5 -> B ->
Test 2: The first node has crossNic=1 search result, while the second node has crossNic=0 search result.
+----------------------+-------- crossNic=1 -------+-------- crossNic=0 -------+
| Pattern 4 channel0 | A -> G0 G3 G2 G1 -> B | A -> G4 G7 G6 G5 -> A |
| Pattern 4 channel1 | B -> G1 G2 G3 G0 -> A | B -> G5 G4 G7 G6 -> B |
+----------------------+---------------------------+---------------------------+
without alternating rings (by removing the "Exchange rings" code in ncclTopoPostset): good performance (network not congested)
global ring 0: -> A -> G0 G3 G2 G1 -> B -> A -> G4 G7 G6 G5 -> B ->
global ring 1: -> B -> G1 G2 G3 G0 -> A -> B -> G5 G4 G7 G6 -> A ->
with alternating rings (ideally):
global ring 0: -> A -> G0 G3 G2 G1 -> B -> B -> G5 G4 G7 G6 -> B ->
global ring 1: -> B -> G1 G2 G3 G0 -> A -> A -> G4 G7 G6 G5 -> A ->
with alternating rings (actually): bad performance (the NIC sends PFC frames to the network), because on the second host each ring enters through a NIC that is at node distance from the first GPU instead of pix
global ring 0: -> A -> G0 G3 G2 G1 -> B -> A -> G5 G4 G7 G6 -> A ->
global ring 1: -> B -> G1 G2 G3 G0 -> A -> B -> G4 G7 G6 G5 -> B ->
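The difference between the ideal and the actual rings on the second host can be checked against the topology table. The following is a minimal sketch, assuming that the table row for G(i) also applies to G(i+4) as annotated, and that only the distance between the ingress NIC and the first GPU of each segment matters here. The helper names `distance` and `ingress_distance` are hypothetical and exist only for this illustration.

```python
# Distance labels copied from the intra-host topology table in this issue.
NICS = ["A", "B", "C", "D"]
ROWS = [
    ["pix", "node", "sys", "sys"],   # G0, G4
    ["node", "pix", "sys", "sys"],   # G1, G5
    ["sys", "sys", "pix", "node"],   # G2, G6
    ["sys", "sys", "node", "pix"],   # G3, G7
]


def distance(gpu, nic):
    """Look up the GPU-to-NIC distance label; gpu is a label such as 'G5'."""
    return ROWS[int(gpu[1:]) % 4][NICS.index(nic)]


def ingress_distance(segment):
    """Distance between the entry NIC and the first GPU of a ring segment.

    segment looks like 'A -> G5 G4 G7 G6 -> A'.
    """
    nic_in, gpus, _nic_out = [part.strip() for part in segment.split("->")]
    return distance(gpus.split()[0], nic_in)


# Second-host segments of Test 2, copied from the rings listed above.
ideal = ["B -> G5 G4 G7 G6 -> B", "A -> G4 G7 G6 G5 -> A"]
actual = ["A -> G5 G4 G7 G6 -> A", "B -> G4 G7 G6 G5 -> B"]

ideal_distances = [ingress_distance(s) for s in ideal]
actual_distances = [ingress_distance(s) for s in actual]
print(ideal_distances)    # ['pix', 'pix']
print(actual_distances)   # ['node', 'node']
```

Under these assumptions, both ideal segments enter through a pix NIC, while both actual segments enter through a node NIC, which matches the observation above.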
Reason for using the wrong NICs:
NCCL's bootstrapAllGather(allGather3Data) collects crossNic from all ranks and takes the maximum value, so crossNic is 1 on every rank from then on.
In ncclTopoPostset(), each rank then treats half of the hosts (possibly including its own) as exchanging ranks between ring 0 and ring 1.
The exchange only affects comm→channels[xx]→ring; it does not update graph→inter or graph→intra.
Each GPU picks its NIC in ncclTopoGetNetDev() based on graph→inter and graph→intra, so that decision is made from outdated data and is wrong.
It seems to be a coincidence that the crossNic=0 + crossNic=1 combination of Test 1 works without problems: on the crossNic=1 host, any given GPU's closest NIC is the same in ring 0 and ring 1.
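The steps above can be illustrated with a toy model. This is a sketch under stated assumptions, not NCCL code: the `Channel` class, `build_host_rings`, and `fmt` are hypothetical stand-ins, and the model assumes that the second host is the one whose rings get exchanged. It only shows how swapping the GPU orders while still reading the NICs from the unswapped search result reproduces the "actually" rings of Test 2.

```python
# Toy model only: these structures are hypothetical stand-ins, not NCCL's.
from dataclasses import dataclass


@dataclass
class Channel:
    nic_in: str    # NIC through which the ring enters the host
    gpus: list     # GPU order inside the host
    nic_out: str   # NIC through which the ring leaves the host


def build_host_rings(graph, cross_nic_all, swap_this_host):
    """Return the ring segments a host ends up using.

    graph: per-channel search result of this host (never modified here).
    cross_nic_all: crossNic value of every host, as gathered from all ranks.
    swap_this_host: whether this host is in the half whose rings are exchanged.
    """
    cross_nic = max(cross_nic_all)            # max over all hosts
    orders = [ch.gpus for ch in graph]        # stand-in for the channel ring state
    if cross_nic and swap_this_host:          # exchange ring 0 and ring 1
        orders[0], orders[1] = orders[1], orders[0]
    # The NICs are still read from the unswapped search result.
    return [Channel(graph[c].nic_in, orders[c], graph[c].nic_out)
            for c in range(len(graph))]


def fmt(ch):
    return f"{ch.nic_in} -> {' '.join(ch.gpus)} -> {ch.nic_out}"


# Second host of Test 2 (crossNic=0 search result), copied from the table.
host2 = [
    Channel("A", ["G4", "G7", "G6", "G5"], "A"),
    Channel("B", ["G5", "G4", "G7", "G6"], "B"),
]

actual = [fmt(ch) for ch in build_host_rings(host2, [1, 0], swap_this_host=True)]
untouched = [fmt(ch) for ch in build_host_rings(host2, [0, 0], swap_this_host=True)]
print(actual)      # ['A -> G5 G4 G7 G6 -> A', 'B -> G4 G7 G6 G5 -> B']
print(untouched)   # ['A -> G4 G7 G6 G5 -> A', 'B -> G5 G4 G7 G6 -> B']
```

In this model, keeping NIC selection consistent with the exchanged rings would require swapping the per-channel NIC information together with the GPU order, which corresponds to the "ideally" rings of Test 2.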
@sjeaugey, I hope you can look into this issue. We would appreciate it.
huzhiwen93 changed the title from "Alternating rings cause bad performance (NIC sending PFC) on task with mixed crossNic=0/1 search results" to "Alternating rings cause bad performance (NIC sending PFC) in a cluster with mixed crossNic=0/1 nodes" on Oct 25, 2024
NCCL version: 2.21.5 (the same for 2.20 ~ 2.23)