Alternating rings cause bad performance (NIC sending PFC) in a cluster with mixed crossNic=0/1 nodes #1494

Open
huzhiwen93 opened this issue Oct 25, 2024 · 0 comments

huzhiwen93 commented Oct 25, 2024

NCCL version: 2.21.5 (same behavior on 2.20 through 2.23)

Test environment & conditions:

  • Two hosts, each with 8 Hopper GPUs. One of the hosts has limited NVLink bandwidth, which makes its graph search return crossNic=1.
  • CUDA_VISIBLE_DEVICES=0,2,4,6 or CUDA_VISIBLE_DEVICES=1,3,5,7, i.e., 8 ranks in total.
  • Set NCCL_ALGO=Ring NCCL_PROTO=Simple and run all_reduce_perf (or the all_gather or reduce_scatter tests).

Both machines have the same intra-node topology:

     NIC-A  NIC-B  NIC-C  NIC-D  
G0    pix    node   sys    sys    // same for G4
G1    node   pix    sys    sys    // same for G5
G2    sys    sys    pix    node   // same for G6
G3    sys    sys    node   pix    // same for G7

Test 1: The first node has crossNic=0 search result, while the second node has crossNic=1 search result.

+----------------------+-------- crossNic=0 -------+-------- crossNic=1 -------+
|  Pattern 4 channel0  |   A -> G0 G3 G2 G1 -> A   |   A -> G4 G7 G6 G5 -> B   |
|  Pattern 4 channel1  |   B -> G1 G0 G3 G2 -> B   |   B -> G5 G6 G7 G4 -> A   |
+----------------------+---------------------------+---------------------------+
  • without alternating rings (by removing the ring exchange code in ncclTopoPostset): good performance (network not congested)
    • global ring 0: -> A -> G0 G3 G2 G1 -> A -> A -> G4 G7 G6 G5 -> B ->
    • global ring 1: -> A -> G1 G0 G3 G2 -> A -> B -> G4 G7 G6 G5 -> A ->
  • with alternating rings (ideal and actual results are the same): good performance (network still not congested)
    • global ring 0: -> A -> G0 G3 G2 G1 -> A -> B -> G5 G7 G6 G4 -> A ->
    • global ring 1: -> A -> G1 G0 G3 G2 -> A -> A -> G4 G6 G7 G5 -> B ->

Test 2: The first node has crossNic=1 search result, while the second node has crossNic=0 search result.

+----------------------+-------- crossNic=1 -------+-------- crossNic=0 -------+
|  Pattern 4 channel0  |   A -> G0 G3 G2 G1 -> B   |   A -> G4 G7 G6 G5 -> A   |
|  Pattern 4 channel1  |   B -> G1 G2 G3 G0 -> A   |   B -> G5 G4 G7 G6 -> B   |
+----------------------+---------------------------+---------------------------+
  • without alternating rings (by removing the ring exchange code in ncclTopoPostset): good performance (network not congested)
    • global ring 0: -> A -> G0 G3 G2 G1 -> B -> A -> G4 G7 G6 G5 -> B ->
    • global ring 1: -> B -> G1 G2 G3 G0 -> A -> B -> G5 G4 G7 G6 -> A ->
  • with alternating rings (ideally):
    • global ring 0: -> A -> G0 G3 G2 G1 -> B -> B -> G5 G4 G7 G6 -> B ->
    • global ring 1: -> B -> G1 G2 G3 G0 -> A -> A -> G4 G7 G6 G5 -> A ->
  • with alternating rings (actually): bad performance (the NIC sends PFC frames to the network) because the receiving NIC is at NODE distance from the GPU it feeds
    • global ring 0: -> A -> G0 G3 G2 G1 -> B -> A -> G5 G4 G7 G6 -> A ->
    • global ring 1: -> B -> G1 G2 G3 G0 -> A -> B -> G4 G7 G6 G5 -> B ->
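
The NODE-distance claim can be checked mechanically against the topology table. The following is only a rough sketch: parse_ring, receive_paths, and the PATH dictionary are hypothetical helpers written for this issue, not NCCL code. PATH copies the NIC-A and NIC-B columns of the table, and G4..G7 are assumed to map to the rows of G0..G3 as the table comments say.

```python
def parse_ring(text):
    """Split the ring notation used above into (recv NIC, GPUs, send NIC) per node."""
    parts = [p.strip() for p in text.split("->") if p.strip()]
    if len(parts) % 3 != 0:
        raise ValueError("expected NIC, GPU list, NIC for every node")
    segments = []
    for i in range(0, len(parts), 3):
        recv, gpus, send = parts[i:i + 3]
        segments.append((recv, [int(g[1:]) for g in gpus.split()], send))
    return segments

# Columns NIC-A and NIC-B of the topology table; G4..G7 repeat G0..G3.
PATH = {
    0: {"A": "pix", "B": "node"},
    1: {"A": "node", "B": "pix"},
    2: {"A": "sys", "B": "sys"},
    3: {"A": "sys", "B": "sys"},
}

def receive_paths(text):
    """Path type between each node's receiving NIC and the first GPU it feeds."""
    return [PATH[gpus[0] % 4][recv] for recv, gpus, _ in parse_ring(text)]

ideal_ring0 = "-> A -> G0 G3 G2 G1 -> B -> B -> G5 G4 G7 G6 -> B ->"
actual_ring0 = "-> A -> G0 G3 G2 G1 -> B -> A -> G5 G4 G7 G6 -> A ->"
print(receive_paths(ideal_ring0))   # ['pix', 'pix']
print(receive_paths(actual_ring0))  # ['pix', 'node']
```

In the actual ring 0 the second host receives on NIC A into G5, which the table lists as NODE, whereas the ideal ring would receive on NIC B at PIX distance.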

Reason the wrong NICs are used:

  • NCCL's bootstrapAllGather(allGather3Data) collects crossNic from all ranks and takes the maximum, so crossNic is 1 on every rank afterwards.
  • In ncclTopoPostset(), each rank treats half of the hosts (possibly including its own) as having exchanged their rank orders between ring 0 and ring 1.
  • The exchange only affects comm→channels[xx]→ring. It does not update graph→inter or graph→intra.
  • Each GPU picks its NIC from graph→inter and graph→intra in ncclTopoGetNetDev(), so after the exchange that choice is stale and wrong.
  • It seems to be a coincidence that the crossNic=0 + crossNic=1 combination in Test 1 works without problems: on the crossNic=1 host, any given GPU's closest NIC is the same in rings 0 and 1.
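
To illustrate the steps above, here is a simplified model in Python. It is not NCCL code: GraphChannel, agreed_cross_nic, net_dev, and ring_segment are hypothetical names, two channels are assumed, and the lookup rule in net_dev (use the first NIC if the GPU heads the channel's graph intra order, otherwise the second) is my assumption about how the NIC is chosen from the graph data.

```python
from dataclasses import dataclass

@dataclass
class GraphChannel:
    inter: tuple  # two NICs recorded by the graph search for this channel
    intra: list   # GPU order recorded by the graph search for this channel

def agreed_cross_nic(values):
    # After the all-gather, every rank uses the maximum crossNic it saw.
    return max(values)

def net_dev(graph, channel, gpu):
    # Assumed lookup rule: first NIC if `gpu` heads graph.intra, else the second.
    ch = graph[channel]
    return ch.inter[0] if ch.intra[0] == gpu else ch.inter[1]

def ring_segment(graph, channel, exchanged):
    # The exchange swaps which GPU order the ring uses (two channels assumed)...
    order = graph[1 - channel].intra if exchanged else graph[channel].intra
    # ...but the NICs are still looked up in the un-exchanged graph data.
    return net_dev(graph, channel, order[0]), order, net_dev(graph, channel, order[-1])

# Second host in Test 2 (crossNic=0 search result).
test2_host = [GraphChannel(("A", "A"), [4, 7, 6, 5]),
              GraphChannel(("B", "B"), [5, 4, 7, 6])]
# Second host in Test 1 (crossNic=1 search result).
test1_host = [GraphChannel(("A", "B"), [4, 7, 6, 5]),
              GraphChannel(("B", "A"), [5, 6, 7, 4])]

for name, host in (("Test 2", test2_host), ("Test 1", test1_host)):
    for channel in (0, 1):
        recv, order, send = ring_segment(host, channel, exchanged=True)
        gpus = " ".join(f"G{g}" for g in order)
        print(f"{name} ring {channel}: -> {recv} -> {gpus} -> {send} ->")
```

Under these assumptions the Test 2 host ends up with A -> G5 G4 G7 G6 -> A and B -> G4 G7 G6 G5 -> B, the same NICs as in the "actually" rings. The Test 1 host ends up with the NICs a full exchange would give, which is consistent with that case working only by coincidence.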

@sjeaugey Could you please take a look at this issue? We would appreciate it.

@huzhiwen93 huzhiwen93 changed the title Alternating rings cause bad performance (NIC sending PFC) on task with mixed crossNic=0/1 search results Alternating rings cause bad performance (NIC sending PFC) in a cluster with mixed crossNic=0/1 nodes Oct 25, 2024