Cannot reproduce performance from the paper for 50B and 100B models #1776

Closed
anijmlt opened this issue Feb 15, 2022 · 8 comments
@anijmlt

anijmlt commented Feb 15, 2022

I am using one node of DGX2 to reproduce performance shown in the paper for 50B and 100B models.
[screenshot of the performance results from the paper]

I am using commit dac9056, which includes the latest goodness from #1453. For the 10B model as described in the paper I am able to see ~63 TF/GPU; however, for the 50B and 100B models it is much lower, at ~35 TF/GPU.

Attaching the scripts and config I used as well as output logs from the runs (paths edited). What settings am I missing that would give me the 40+ TF/GPU that is shown in the paper?
DGX2_ZeroInf_repro.zip

@tjruwase
Contributor

@anijmlt, thanks for your question and for sharing your log. Here are some suggestions:

  1. Enable the --split-transformers and --memory-centric-tiling options, as discussed here and defined here.

  2. Your nvme swap seems slower than in my run (see my log snippet below), so configure aio as follows (a sketch of where this block sits in a full config follows the log snippet):

     "aio": {
       "block_size": 1048576,
       "overlap_events": true,
       "queue_depth": 16,
       "single_submit": false,
       "thread_count": 2
     }

  3. Increase the micro batch size to 26.

  4. Below is the log snippet from the run that produced 44 TFLOPs/GPU, showing the critical perf counters. Hope this helps your debugging.

worker-4: [2021-04-04 22:32:12,030] [INFO] [utils.py:559:see_memory_usage] before forward 9
worker-4: [2021-04-04 22:32:12,031] [INFO] [utils.py:564:see_memory_usage] MA 1.55 GB         Max_MA 1.56 GB         CA 25.07 GB         Max_CA 25 GB
worker-4: [2021-04-04 22:32:12,031] [INFO] [utils.py:569:see_memory_usage] CPU Virtual Memory:  used = 416.83 GB, percent = 27.6%
worker-4: iteration 9 lm loss = 32.01115036010742 loss_reduced = {'lm loss': tensor(32.0044, device='cuda:0')}
worker-4: [2021-04-04 22:33:12,142] [INFO] [utils.py:559:see_memory_usage] before backward 9
worker-4: [2021-04-04 22:33:12,142] [INFO] [utils.py:564:see_memory_usage] MA 9.07 GB         Max_MA 11.57 GB         CA 25.07 GB         Max_CA 25 GB
worker-4: [2021-04-04 22:33:12,143] [INFO] [utils.py:569:see_memory_usage] CPU Virtual Memory:  used = 819.81 GB, percent = 54.3%
worker-4: [2021-04-04 22:35:16,997] [INFO] [utils.py:559:see_memory_usage] before optimizer 9
worker-4: [2021-04-04 22:35:16,998] [INFO] [utils.py:564:see_memory_usage] MA 1.54 GB         Max_MA 21.27 GB         CA 25.07 GB         Max_CA 25 GB
worker-4: [2021-04-04 22:35:16,998] [INFO] [utils.py:569:see_memory_usage] CPU Virtual Memory:  used = 416.88 GB, percent = 27.6%
worker-4: [2021-04-04 22:35:17,014] [INFO] [async_swapper.py:121:_report_statistics] Swapped out[Before flush] num_elems = 1610612736,  6.00 GB
worker-4: [2021-04-04 22:35:17,746] [INFO] [async_swapper.py:121:_report_statistics] Swapped out[After flush] num_elems = 3146842112, 11.72 GB
worker-4: [2021-04-04 22:36:15,733] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | swap_out_param: 24816.14 | swap_in_param: 15424.98 | swap_in_gradient: 7132.62 | async_swap_gradient_wait: 1613.43 | swap_out_gradient: 3775.40
worker-4: [2021-04-04 22:36:15,734] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 58447.52 | optimizer_swap_out_state: 24817.33 | optimizer_swap_in_state: 23293.92
worker-4: [2021-04-04 22:36:15,736] [INFO] [logging.py:60:log_dist] [Rank 0] step=10, skipped=0, lr=[0.0001470220264257707, 0.0001470220264257707], mom=[(0.9, 0.999), (0.9, 0.999)]
worker-4: [2021-04-04 22:36:15,736] [INFO] [timer.py:157:stop] 0/10, SamplesPerSec=0.5213996461948567
worker-4: [2021-04-04 22:36:15,736] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 60036.95 | backward_microstep: 124798.40 | backward_inner_microstep: 124748.05 | backward_allreduce_microstep: 50.30 | step_microstep: 58737.44
worker-4: [2021-04-04 22:36:15,736] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | forward: 60036.92 | backward: 124798.38 | backward_inner: 124748.03 | backward_allreduce: 50.28 | step: 58737.42
worker-4:  iteration       10/      10 | elapsed time per iteration (ms): 243776.5 | learning rate: 1.470E-04 | lm loss: 3.200439E+01 | loss scale: 1.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
worker-4: time (ms) | forward: 60052.21 | backward: 124798.50 | backward-backward: 124798.46 | backward-allreduce: 0.00 | optimizer: 58737.61 | batch generator: 1.54
worker-4: Effective Tera Flops per GPU: 44.0 and total parameters 50.356 B
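
For context, here is a minimal sketch of where the aio block above and the ZeRO-Infinity NVMe offload settings sit in a full DeepSpeed config. The nvme_path and pin_memory values are placeholders/assumptions for illustration, not the exact settings behind the numbers above:

  {
    "train_micro_batch_size_per_gpu": 26,
    "fp16": { "enabled": true },
    "zero_optimization": {
      "stage": 3,
      "offload_optimizer": {
        "device": "nvme",
        "nvme_path": "/local_nvme",
        "pin_memory": true
      },
      "offload_param": {
        "device": "nvme",
        "nvme_path": "/local_nvme",
        "pin_memory": true
      }
    },
    "aio": {
      "block_size": 1048576,
      "queue_depth": 16,
      "single_submit": false,
      "overlap_events": true,
      "thread_count": 2
    }
  }

The aio block configures the asynchronous I/O engine that swaps parameters and optimizer state to NVMe; block_size and queue_depth are the main knobs to tune against your drives.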

@tjruwase tjruwase self-assigned this Feb 16, 2022
@anijmlt
Author

anijmlt commented Feb 23, 2022

@tjruwase Thanks so much for your quick response! I tried out the parameters you mentioned, and I actually get worse performance than before. zeroinf_50B_bs26_raw_repro_log.txt

Before:

iteration        4/       4 | elapsed time per iteration (ms): 298910.1 | learning rate: 1.451E-04 | lm loss: 4.533051E+01 | loss scale: 1024.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 53565.19 | backward: 152702.14 | backward-backward: 152702.08 | backward-allreduce: 0.00 | optimizer: 92637.16 | batch generator: 14.53
Effective Tera Flops per GPU: 35.88 and total parameters 50.356 B

After:

iteration        4/       4 | elapsed time per iteration (ms): 338502.3 | learning rate: 1.451E-04 | lm loss: 4.373392E+01 | loss scale: 1024.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 79386.24 | backward: 169619.15 | backward-backward: 169619.07 | backward-allreduce: 0.00 | optimizer: 89484.05 | batch generator: 1.23
Effective Tera Flops per GPU: 31.68 and total parameters 50.356 B

I notice that the optimizer and batch generator got slightly better, but forward and backward got much worse. Using --split-transformers always gives me worse performance because there are twice as many D2H/H2D transfers. Under what circumstances do you see better perf with --split-transformers?
Otherwise, it looks like each of my components is slower than yours after the new parameters. I am still looking into this.
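
For reference, the "Effective Tera Flops per GPU" numbers in these logs can be roughly reproduced from the per-iteration time with the standard parameters-times-tokens approximation for transformer training. A minimal sketch in Python, assuming a micro batch of 26, sequence length 1024, 16 GPUs, and activation checkpointing; these are assumptions, and the exact formula used by the training script may differ:

  # Rough effective TFLOPs/GPU estimated from iteration time.
  # Assumptions (not taken from the attached configs): micro batch 26,
  # sequence length 1024, 16 GPUs, activation checkpointing
  # (factor 8 = 6 for forward+backward + 2 for recomputing the forward).
  def effective_tflops_per_gpu(params, iter_time_s, micro_batch=26,
                               seq_len=1024, num_gpus=16, factor=8):
      tokens = micro_batch * num_gpus * seq_len
      flops = factor * params * tokens      # FLOPs for one iteration
      return flops / num_gpus / iter_time_s / 1e12

  params = 50.356e9
  print(effective_tflops_per_gpu(params, 243.7765))  # ~44.0 (snippet above)
  print(effective_tflops_per_gpu(params, 298.9101))  # ~35.9 ("Before" run)
  print(effective_tflops_per_gpu(params, 338.5023))  # ~31.7 ("After" run)

Under these assumptions the computed values line up with the logged 44.0, 35.88, and 31.68 TFLOPs/GPU, so the gap comes from the longer forward/backward/optimizer times rather than from a different amount of work per iteration.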

@tjruwase
Contributor

Hmm, @anijmlt, this is quite mysterious. Although --split-transformers increases D2H/H2D transfers, it allows a larger micro-batch size, and with transfers overlapped with compute it should improve throughput.

My log is from 2021, as seen in the timestamps, so I may need to run with the latest master to investigate further. Unfortunately, I may not have much time until the end of next week. Please share updates on your findings.

@tjruwase
Contributor

By the way, can you sanity-check your nvme performance by running the benchmark here? You might find the entire discussion useful.
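
If the linked benchmark is not convenient, a crude sequential-write check in Python can at least flag a badly underperforming swap volume. This is only a rough sketch (the path is a placeholder, and page-cache/filesystem effects are not controlled), not a substitute for the aio benchmark above:

  import os, time

  path = "/local_nvme/throughput_test.bin"   # placeholder: your nvme swap path
  chunk = b"\0" * (64 * 1024 * 1024)         # 64 MB per write
  total_gb = 8                               # total data written

  fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
  start = time.time()
  for _ in range(total_gb * 1024 // 64):
      os.write(fd, chunk)
  os.fsync(fd)                               # ensure data reaches the device
  elapsed = time.time() - start
  os.close(fd)
  os.remove(path)

  print(f"sequential write: {total_gb / elapsed:.2f} GB/s")

Compare the result against the aggregate rated bandwidth of your NVMe setup; if it is far below that, the swap timers in the log (swap_in_param, optimizer_swap_in_state, etc.) will dominate each iteration.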

@tjruwase
Contributor

@anijmlt, do you have any updates on your nvme performance tuning?

@anijmlt
Author

anijmlt commented May 19, 2022

@tjruwase Unfortunately I haven't been able to do this. I will close this issue for now and reopen it when I have more data.

@anijmlt anijmlt closed this as completed May 19, 2022
@luckyq

luckyq commented Jun 21, 2022

> [quotes @tjruwase's suggestions and 44 TFLOPs/GPU log snippet from above]

May I ask how many NVMe SSDs were used?

@anijmlt
Author

anijmlt commented Jun 21, 2022

@luckyq it was 8 SSDs
