Cannot reproduce performance from the paper for 50B and 100B models #1776
Comments
@anijmlt, thanks for your question and for sharing your log. Here are some suggestions:
"aio":{
"block_size":1048576,
"overlap_events":true,
"queue_depth":16,
"single_submit":false,
"thread_count":2
}
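For context, here is a minimal sketch of how an aio block like this might be embedded in a full ZeRO-Infinity (stage 3, NVMe offload) config and handed to deepspeed.initialize. The NVMe mount path, micro-batch size, and the commented-out initialize call are illustrative assumptions, not the exact settings used in this issue.

import deepspeed  # requires a DeepSpeed install with the async_io op built

# Illustrative config dict; only the "aio" block mirrors the values suggested above.
ds_config = {
    "train_micro_batch_size_per_gpu": 26,  # placeholder, echoing the bs26 run name
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # placeholder NVMe mount point
            "pin_memory": True,
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # placeholder NVMe mount point
            "pin_memory": True,
        },
    },
    "aio": {
        "block_size": 1048576,
        "queue_depth": 16,
        "single_submit": False,
        "overlap_events": True,
        "thread_count": 2,
    },
}

# model comes from the training script (e.g. Megatron-DeepSpeed); depending on
# the DeepSpeed version the dict is passed as `config` or `config_params`.
# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)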
worker-4: [2021-04-04 22:32:12,030] [INFO] [utils.py:559:see_memory_usage] before forward 9
worker-4: [2021-04-04 22:32:12,031] [INFO] [utils.py:564:see_memory_usage] MA 1.55 GB Max_MA 1.56 GB CA 25.07 GB Max_CA 25 GB
worker-4: [2021-04-04 22:32:12,031] [INFO] [utils.py:569:see_memory_usage] CPU Virtual Memory: used = 416.83 GB, percent = 27.6%
worker-4: iteration 9 lm loss = 32.01115036010742 loss_reduced = {'lm loss': tensor(32.0044, device='cuda:0')}
worker-4: [2021-04-04 22:33:12,142] [INFO] [utils.py:559:see_memory_usage] before backward 9
worker-4: [2021-04-04 22:33:12,142] [INFO] [utils.py:564:see_memory_usage] MA 9.07 GB Max_MA 11.57 GB CA 25.07 GB Max_CA 25 GB
worker-4: [2021-04-04 22:33:12,143] [INFO] [utils.py:569:see_memory_usage] CPU Virtual Memory: used = 819.81 GB, percent = 54.3%
worker-4: [2021-04-04 22:35:16,997] [INFO] [utils.py:559:see_memory_usage] before optimizer 9
worker-4: [2021-04-04 22:35:16,998] [INFO] [utils.py:564:see_memory_usage] MA 1.54 GB Max_MA 21.27 GB CA 25.07 GB Max_CA 25 GB
worker-4: [2021-04-04 22:35:16,998] [INFO] [utils.py:569:see_memory_usage] CPU Virtual Memory: used = 416.88 GB, percent = 27.6%
worker-4: [2021-04-04 22:35:17,014] [INFO] [async_swapper.py:121:_report_statistics] Swapped out[Before flush] num_elems = 1610612736, 6.00 GB
worker-4: [2021-04-04 22:35:17,746] [INFO] [async_swapper.py:121:_report_statistics] Swapped out[After flush] num_elems = 3146842112, 11.72 GB
worker-4: [2021-04-04 22:36:15,733] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | swap_out_param: 24816.14 | swap_in_param: 15424.98 | swap_in_gradient: 7132.62 | async_swap_gradient_wait: 1613.43 | swap_out_gradient: 3775.40
worker-4: [2021-04-04 22:36:15,734] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 58447.52 | optimizer_swap_out_state: 24817.33 | optimizer_swap_in_state: 23293.92
worker-4: [2021-04-04 22:36:15,736] [INFO] [logging.py:60:log_dist] [Rank 0] step=10, skipped=0, lr=[0.0001470220264257707, 0.0001470220264257707], mom=[(0.9, 0.999), (0.9, 0.999)]
worker-4: [2021-04-04 22:36:15,736] [INFO] [timer.py:157:stop] 0/10, SamplesPerSec=0.5213996461948567
worker-4: [2021-04-04 22:36:15,736] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 60036.95 | backward_microstep: 124798.40 | backward_inner_microstep: 124748.05 | backward_allreduce_microstep: 50.30 | step_microstep: 58737.44
worker-4: [2021-04-04 22:36:15,736] [INFO] [logging.py:60:log_dist] [Rank 0] rank=0 time (ms) | forward: 60036.92 | backward: 124798.38 | backward_inner: 124748.03 | backward_allreduce: 50.28 | step: 58737.42
worker-4: iteration 10/ 10 | elapsed time per iteration (ms): 243776.5 | learning rate: 1.470E-04 | lm loss: 3.200439E+01 | loss scale: 1.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
worker-4: time (ms) | forward: 60052.21 | backward: 124798.50 | backward-backward: 124798.46 | backward-allreduce: 0.00 | optimizer: 58737.61 | batch generator: 1.54
worker-4: Effective Tera Flops per GPU: 44.0 and total parameters 50.356 B
@tjruwase Thanks so much for your quick response! I tried out the parameters you mentioned, and I actually get worse performance than before (log attached: zeroinf_50B_bs26_raw_repro_log.txt). Before:
After:
I notice that the optimizer and batch generator got slightly better, but fwd and bwd got much worse. Using
Hmm, @anijmlt, this is quite mysterious. My log is from 2021, as the timestamps show, so I may need to run with the latest master to investigate further. Unfortunately, I may not have much time until toward the end of next week. Please share any updates on your findings.
By the way, can you sanity check your NVMe performance by running the benchmark here? You might find the entire discussion useful.
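Since ZeRO-Infinity swap times are largely determined by raw NVMe bandwidth, a quick generic read-throughput check can also help separate a hardware or filesystem problem from aio tuning. The sketch below is not the DeepSpeed benchmark linked above, just a crude sequential-read timer; the file path is a placeholder, and the file should be larger than system RAM (or the page cache dropped first) so the number reflects the SSDs rather than memory.

import time

def sequential_read_gbps(path, chunk_bytes=1024 * 1024):
    # Time how long it takes to stream the whole file in 1 MB chunks.
    total = 0
    start = time.time()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.time() - start
    return total / elapsed / 1e9  # GB/s

# Example usage against a large scratch file on the NVMe volume (hypothetical path):
# print(sequential_read_gbps("/local_nvme/scratch.bin"))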
@anijmlt, do you have any updates on your NVMe performance tuning?
@tjruwase Unfortunately I haven't been able to do this. I will close this issue for now and reopen it when I have more data. |
May I ask how many NVMe SSDs were used?
@luckyq It was 8 SSDs.
I am using one node of a DGX-2 to reproduce the performance shown in the paper for the 50B and 100B models.
I am using commit dac9056, which includes the latest goodness from #1453. For the 10B model as described in the paper I am able to see ~63 TF/GPU; however, for the 50B and 100B models it is much lower, at ~35 TF/GPU.
I am attaching the scripts and config I used, as well as output logs from the runs (paths edited). What settings am I missing that would give me the 40+ TF/GPU shown in the paper?
DGX2_ZeroInf_repro.zip