Alternating rings cause bad performance (NIC sending PFC) in a cluster with mixed crossNic=0/1 nodes #1494

huzhiwen93 · 2024-10-25T11:12:15Z

NCCL version: 2.21.5 (the same for 2.20 ~ 2.23)

Test environment & conditions:

Two hosts, each with 8 Hopper GPUs. One of the hosts has limited NVLink bandwidth, which makes its graph search return crossNic=1.
CUDA_VISIBLE_DEVICES=0,2,4,6 or CUDA_VISIBLE_DEVICES=1,3,5,7, i.e., 4 GPUs per host and 8 ranks in total.
Set NCCL_ALGO=Ring NCCL_PROTO=Simple and run all_reduce_perf (or the all_gather or reduce_scatter test).

Both machines have the same intra-node topology:

     NIC-A  NIC-B  NIC-C  NIC-D  
G0    pix    node   sys    sys    // same for G4
G1    node   pix    sys    sys    // same for G5
G2    sys    sys    pix    node   // same for G6
G3    sys    sys    node   pix    // same for G7
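
To make the table easier to reason about, here is a minimal Python sketch that encodes it and looks up the closest NIC for a GPU. The names `TOPO`, `distance` and `closest_nic` are made up for illustration and are not NCCL APIs. It assumes the usual `nvidia-smi topo -m` ordering (pix closer than node, node closer than sys) and that G4..G7 mirror G0..G3 as noted in the table comments.

```python
# Distances copied from the topology table above (rows G0..G3).
NICS = ["NIC-A", "NIC-B", "NIC-C", "NIC-D"]
TOPO = {
    "G0": ["pix", "node", "sys", "sys"],
    "G1": ["node", "pix", "sys", "sys"],
    "G2": ["sys", "sys", "pix", "node"],
    "G3": ["sys", "sys", "node", "pix"],
}
# Assumption: smaller value means a shorter path (pix < node < sys).
CLOSENESS = {"pix": 0, "node": 1, "sys": 2}


def table_row(gpu: str) -> str:
    """Map G4..G7 onto the G0..G3 rows they mirror."""
    return f"G{int(gpu[1:]) % 4}"


def distance(gpu: str, nic: str) -> str:
    """Return the path type between a GPU and a NIC."""
    return TOPO[table_row(gpu)][NICS.index(nic)]


def closest_nic(gpu: str) -> str:
    """Return the NIC with the shortest path to the given GPU."""
    return min(NICS, key=lambda nic: CLOSENESS[distance(gpu, nic)])


if __name__ == "__main__":
    for gpu in ("G0", "G1", "G4", "G5"):
        print(gpu, "->", closest_nic(gpu))
```

With this table, G0/G4 are closest to NIC-A and G1/G5 to NIC-B, which is why a ring that enters G5 through NIC-A crosses a node-distance path.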

Test 1: The first node's graph search result is crossNic=0, while the second node's is crossNic=1.

+----------------------+-------- crossNic=0 -------+-------- crossNic=1 -------+
|  Pattern 4 channel0  |   A -> G0 G3 G2 G1 -> A   |   A -> G4 G7 G6 G5 -> B   |
|  Pattern 4 channel1  |   B -> G1 G0 G3 G2 -> B   |   B -> G5 G6 G7 G4 -> A   |
+----------------------+---------------------------+---------------------------+

Without alternating rings (by removing the "Exchange rings" code in ncclTopoPostset): good performance (network not congested)
- global ring 0: -> A -> G0 G3 G2 G1 -> A -> A -> G4 G7 G6 G5 -> B ->
- global ring 1: -> B -> G1 G0 G3 G2 -> B -> B -> G5 G6 G7 G4 -> A ->
With alternating rings (ideal and actual): good performance (network still not congested)
- global ring 0: -> A -> G0 G3 G2 G1 -> A -> B -> G5 G7 G6 G4 -> A ->
- global ring 1: -> A -> G1 G0 G3 G2 -> A -> A -> G4 G6 G7 G5 -> B ->

Test 2: The first node's graph search result is crossNic=1, while the second node's is crossNic=0.

+----------------------+-------- crossNic=1 -------+-------- crossNic=0 -------+
|  Pattern 4 channel0  |   A -> G0 G3 G2 G1 -> B   |   A -> G4 G7 G6 G5 -> A   |
|  Pattern 4 channel1  |   B -> G1 G2 G3 G0 -> A   |   B -> G5 G4 G7 G6 -> B   |
+----------------------+---------------------------+---------------------------+

Without alternating rings (by removing the "Exchange rings" code in ncclTopoPostset): good performance (network not congested)
- global ring 0: -> A -> G0 G3 G2 G1 -> B -> A -> G4 G7 G6 G5 -> B ->
- global ring 1: -> B -> G1 G2 G3 G0 -> A -> B -> G5 G4 G7 G6 -> A ->
With alternating rings (ideal):
- global ring 0: -> A -> G0 G3 G2 G1 -> B -> B -> G5 G4 G7 G6 -> B ->
- global ring 1: -> B -> G1 G2 G3 G0 -> A -> A -> G4 G7 G6 G5 -> A ->
With alternating rings (actual): bad performance (the NIC sends PFC to the network), because on the second node the input NIC is at node distance from the GPU it feeds
- global ring 0: -> A -> G0 G3 G2 G1 -> B -> A -> G5 G4 G7 G6 -> A ->
- global ring 1: -> B -> G1 G2 G3 G0 -> A -> B -> G4 G7 G6 G5 -> B ->

Reason for using the wrong NICs:

NCCL's bootstrapAllGather(allGather3Data) collects crossNic from all ranks and takes the max value, so crossNic becomes 1 for every rank from then on.
In ncclTopoPostset(), each rank treats half of the hosts (possibly including its own) as exchanging their ranks between ring 0 and ring 1.
The exchange only affects comm->channels[xx]->ring; it does not update graph->inter or graph->intra.
Each GPU then picks its NIC in ncclTopoGetNetDev() based on graph->inter and graph->intra, so that decision is outdated and wrong.
It seems to be a coincidence that the crossNic=0 + crossNic=1 case (Test 1) works without problems: on the crossNic=1 host, any given GPU's closest NICs in rings 0 and 1 are the same.
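
The steps above can be reproduced with a toy model. This is a sketch of the described behaviour, not NCCL code: the `Channel` class and both exchange functions are hypothetical, `in_nic`/`out_nic`/`intra` merely stand in for the graph's inter and intra data, and `ring` stands in for the per-channel ring in comm. It uses the second node's crossNic=0 patterns from Test 2.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    in_nic: str    # stand-in for the graph's inter data (input side)
    out_nic: str   # stand-in for the graph's inter data (output side)
    intra: list    # stand-in for the graph's intra GPU order
    ring: list = field(default_factory=list)  # stand-in for the comm channel ring


def make_second_node_channels():
    """Second node of Test 2 (crossNic=0 search result)."""
    channels = [
        Channel("A", "A", [4, 7, 6, 5]),
        Channel("B", "B", [5, 4, 7, 6]),
    ]
    for ch in channels:
        ch.ring = list(ch.intra)
    return channels


def exchange_rings_only(channels):
    """Swap only the ring order; the NIC choice stays with the old channel."""
    channels[0].ring, channels[1].ring = channels[1].ring, channels[0].ring


def exchange_rings_and_graph(channels):
    """Swap the whole channel, so the NICs follow the GPU order."""
    channels[0], channels[1] = channels[1], channels[0]


def describe(ch):
    gpus = " ".join(f"G{g}" for g in ch.ring)
    return f"{ch.in_nic} -> {gpus} -> {ch.out_nic}"


if __name__ == "__main__":
    # Step 1: every rank ends up with the max crossNic across nodes.
    agreed_cross_nic = max([1, 0])  # first node 1, second node 0
    print("agreed crossNic:", agreed_cross_nic)

    actual = make_second_node_channels()
    exchange_rings_only(actual)
    ideal = make_second_node_channels()
    exchange_rings_and_graph(ideal)
    for i in (0, 1):
        print(f"ring {i} actual: {describe(actual[i])} | ideal: {describe(ideal[i])}")
```

Swapping only the ring order yields the "actual" second-node segments of Test 2 (A -> G5 G4 G7 G6 -> A and B -> G4 G7 G6 G5 -> B), whereas swapping the NIC choice together with the ring yields the "ideal" ones.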

@sjeaugey Could you please take a look at this issue? We would appreciate it.

huzhiwen93 changed the title ~~Alternating rings cause bad performance (NIC sending PFC) on task with mixed crossNic=0/1 search results~~ Alternating rings cause bad performance (NIC sending PFC) in a cluster with mixed crossNic=0/1 nodes Oct 25, 2024
