Cannot reproduce performance from the paper for 50B and 100B models #1776

Closed
anijmlt opened this issue Feb 15, 2022 · 8 comments
@anijmlt

anijmlt commented Feb 15, 2022

I am using one node of DGX2 to reproduce performance shown in the paper for 50B and 100B models.
[screenshot of the performance results from the paper]

I am using commit dac9056, which includes the latest goodness from #1453. For the 10B model as described in the paper I am able to see ~63 TF/GPU; however, for the 50B and 100B models it is much lower, at ~35 TF/GPU.

Attaching the scripts and config I used as well as output logs from the runs (paths edited). What settings am I missing that would give me the 40+ TF/GPU that is shown in the paper?
DGX2_ZeroInf_repro.zip

@tjruwase
Contributor

@anijmlt, thanks for your question and for sharing your log. Here are some suggestions:

  1. Enable the --split-transformers and --memory-centric-tiling options, as discussed here and defined here.

  2. Your nvme swap seems slower than in my run (see my log snippet below), so configure aio as follows (a sketch of where this block sits in a full config follows the log snippet):

     "aio": {
       "block_size": 1048576,
       "overlap_events": true,
       "queue_depth": 16,
       "single_submit": false,
       "thread_count": 2
     }

  3. Increase the micro batch size to 26.

  4. Below is the log snippet from the run that produced 44 TFLOPs/GPU, showing the critical perf counters. Hope this helps your debugging.

worker-4: [2021-04-04 22:32:12,030] [INFO] [utils.py:559:see_memory_usage] before forward 9
worker-4: [2021-04-04 22:32:12,031] [INFO] [utils.py:564:see_memory_usage] MA 1.55 GB         Max_MA 1.56 GB         CA 25.07 GB         Max_CA 25 GB
worker-4: [2021-04-04 22:32:12,031] [INFO] [utils.py:569:see_memory_usage] CPU Virtual Memory:  used = 416.83 GB, percent = 27.6%
worker-4: iteration 9 lm loss = 32.01115036010742 loss_reduced = {'lm loss': tensor(32.0044, device='cuda:0')}
worker-4: [2021-04-04 22:33:12,142] [INFO] [utils.py:559:see_memory_usage] before backward 9
worker-4: [2021-04-04 22:33:12,142] [INFO] [utils.py:564:see_memory_usage] MA 9.07 GB         Max_MA 11.57 GB         CA 25.07 GB         Max_CA 25 GB
worker-4: [2021-04-04 22:33:12,143] [INFO] [utils.py:569:see_memory_usage] CPU Virtual Memory:  used = 819.81 GB, percent = 54.3%
worker-4: [2021-04-04 22:35:16,997] [INFO] [utils.py:559:see_memory_usage] before optimizer 9
worker-4: [2021-04-04 22:35:16,998] [INFO] [utils.py:564:see_memory_usage] MA 1.54 GB         Max_MA 21.27 GB         CA 25.07 GB         Max_CA 25 GB
worker-4: [2021-04-04 22:35:16,998] [INFO] [utils.py:569:see_memory_usage] CPU Virtual Memory:  used = 416.88 GB, percent = 27.6%
worker-4: [2021-04-04 22:35:17,014] [INFO] [async_swapper.py:121:_report_statistics] Swapped out[Before flush] num_elems = 1610612736,  6.00 GB
worker-4: [2021-04-04 22:35:17,746] [INFO] [async_swapper.py:121:_report_statistics] Swapped out[After flush] num_elems = 3146842112, 11.72 GB
worker-4: [2021-04-04 22:36:15,733] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | swap_out_param: 24816.14 | swap_in_param: 15424.98 | swap_in_gradient: 7132.62 | async_swap_gradient_wait: 1613.43 | swap_out_gradient: 3775.40
worker-4: [2021-04-04 22:36:15,734] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 58447.52 | optimizer_swap_out_state: 24817.33 | optimizer_swap_in_state: 23293.92
worker-4: [2021-04-04 22:36:15,736] [INFO] [logging.py:60:log_dist] [Rank 0] step=10, skipped=0, lr=[0.0001470220264257707, 0.0001470220264257707], mom=[(0.9, 0.999), (0.9, 0.999)]
worker-4: [2021-04-04 22:36:15,736] [INFO] [timer.py:157:stop] 0/10, SamplesPerSec=0.5213996461948567
worker-4: [2021-04-04 22:36:15,736] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 60036.95 | backward_microstep: 124798.40 | backward_inner_microstep: 124748.05 | backward_allreduce_microstep: 50.30 | step_microstep: 58737.44
worker-4: [2021-04-04 22:36:15,736] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | forward: 60036.92 | backward: 124798.38 | backward_inner: 124748.03 | backward_allreduce: 50.28 | step: 58737.42
worker-4:  iteration       10/      10 | elapsed time per iteration (ms): 243776.5 | learning rate: 1.470E-04 | lm loss: 3.200439E+01 | loss scale: 1.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
worker-4: time (ms) | forward: 60052.21 | backward: 124798.50 | backward-backward: 124798.46 | backward-allreduce: 0.00 | optimizer: 58737.61 | batch generator: 1.54
worker-4: Effective Tera Flops per GPU: 44.0 and total parameters 50.356 B
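
For context, here is a minimal sketch of where the aio block above and the ZeRO-Infinity NVMe offload settings sit in a full DeepSpeed config. The nvme_path and pin_memory values are placeholders/assumptions for illustration, not the exact settings behind the numbers above:

  {
    "train_micro_batch_size_per_gpu": 26,
    "fp16": { "enabled": true },
    "zero_optimization": {
      "stage": 3,
      "offload_optimizer": {
        "device": "nvme",
        "nvme_path": "/local_nvme",
        "pin_memory": true
      },
      "offload_param": {
        "device": "nvme",
        "nvme_path": "/local_nvme",
        "pin_memory": true
      }
    },
    "aio": {
      "block_size": 1048576,
      "queue_depth": 16,
      "single_submit": false,
      "overlap_events": true,
      "thread_count": 2
    }
  }

The aio block configures the asynchronous I/O engine that swaps parameters and optimizer state to NVMe; block_size and queue_depth are the main knobs to tune against your drives.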

@tjruwase tjruwase self-assigned this Feb 16, 2022
@anijmlt
Author

anijmlt commented Feb 23, 2022

@tjruwase Thanks so much for your quick response! I tried out the parameters you mentioned, and I actually get worse performance than before. zeroinf_50B_bs26_raw_repro_log.txt

Before:

iteration        4/       4 | elapsed time per iteration (ms): 298910.1 | learning rate: 1.451E-04 | lm loss: 4.533051E+01 | loss scale: 1024.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 53565.19 | backward: 152702.14 | backward-backward: 152702.08 | backward-allreduce: 0.00 | optimizer: 92637.16 | batch generator: 14.53
Effective Tera Flops per GPU: 35.88 and total parameters 50.356 B

After:

iteration        4/       4 | elapsed time per iteration (ms): 338502.3 | learning rate: 1.451E-04 | lm loss: 4.373392E+01 | loss scale: 1024.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
time (ms) | forward: 79386.24 | backward: 169619.15 | backward-backward: 169619.07 | backward-allreduce: 0.00 | optimizer: 89484.05 | batch generator: 1.23
Effective Tera Flops per GPU: 31.68 and total parameters 50.356 B

I notice that the optimizer and batch generator got slightly better, but forward and backward got much worse. Using --split-transformers always gives me worse performance because there are twice as many D2H/H2D transfers. Under what circumstances do you see better perf with --split-transformers?
Otherwise, it looks like each of my components is slower than yours after the new parameters. I am still looking into this.
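
For reference, the "Effective Tera Flops per GPU" numbers in these logs can be roughly reproduced from the per-iteration time with the standard parameters-times-tokens approximation for transformer training. A minimal sketch in Python, assuming a micro batch of 26, sequence length 1024, 16 GPUs, and activation checkpointing; these are assumptions, and the exact formula used by the training script may differ:

  # Rough effective TFLOPs/GPU estimated from iteration time.
  # Assumptions (not taken from the attached configs): micro batch 26,
  # sequence length 1024, 16 GPUs, activation checkpointing
  # (factor 8 = 6 for forward+backward + 2 for recomputing the forward).
  def effective_tflops_per_gpu(params, iter_time_s, micro_batch=26,
                               seq_len=1024, num_gpus=16, factor=8):
      tokens = micro_batch * num_gpus * seq_len
      flops = factor * params * tokens      # FLOPs for one iteration
      return flops / num_gpus / iter_time_s / 1e12

  params = 50.356e9
  print(effective_tflops_per_gpu(params, 243.7765))  # ~44.0 (snippet above)
  print(effective_tflops_per_gpu(params, 298.9101))  # ~35.9 ("Before" run)
  print(effective_tflops_per_gpu(params, 338.5023))  # ~31.7 ("After" run)

Under these assumptions the computed values line up with the logged 44.0, 35.88, and 31.68 TFLOPs/GPU, so the gap comes from the longer forward/backward/optimizer times rather than from a different amount of work per iteration.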

@tjruwase
Contributor

Hmm, @anijmlt, this is quite mysterious. Although --split-transformers increases D2H/H2D transfers, it allows a larger micro-batch size, and with transfers overlapped with compute it should improve throughput.

My log is from 2021, as seen in the timestamps, so I may need to run with the latest master to investigate further. Unfortunately, I may not have much time until the end of next week. Please share updates on your findings.

@tjruwase
Contributor

By the way, can you sanity-check your nvme performance by running the benchmark here? You might find the entire discussion useful.
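
If the linked benchmark is not convenient, a crude sequential-write check in Python can at least flag a badly underperforming swap volume. This is only a rough sketch (the path is a placeholder, and page-cache/filesystem effects are not controlled), not a substitute for the aio benchmark above:

  import os, time

  path = "/local_nvme/throughput_test.bin"   # placeholder: your nvme swap path
  chunk = b"\0" * (64 * 1024 * 1024)         # 64 MB per write
  total_gb = 8                               # total data written

  fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
  start = time.time()
  for _ in range(total_gb * 1024 // 64):
      os.write(fd, chunk)
  os.fsync(fd)                               # ensure data reaches the device
  elapsed = time.time() - start
  os.close(fd)
  os.remove(path)

  print(f"sequential write: {total_gb / elapsed:.2f} GB/s")

Compare the result against the aggregate rated bandwidth of your NVMe setup; if it is far below that, the swap timers in the log (swap_in_param, optimizer_swap_in_state, etc.) will dominate each iteration.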

@tjruwase
Contributor

@anijmlt, do you have any updates on your nvme performance tuning?

@anijmlt
Author

anijmlt commented May 19, 2022

@tjruwase Unfortunately I haven't been able to do this. I will close this issue for now and reopen it when I have more data.

@anijmlt anijmlt closed this as completed May 19, 2022
@luckyq

luckyq commented Jun 21, 2022

> [quotes @tjruwase's suggestions and 44 TFLOPs/GPU log snippet from above]

May I ask how many NVMe SSDs were used?

@anijmlt
Author

anijmlt commented Jun 21, 2022

@luckyq it was 8 SSDs
