During DDP training, eight GPUs are temporarily locked at 100% usage. #165

Open
LuletterSoul opened this issue Mar 21, 2024 · 3 comments

LuletterSoul commented Mar 21, 2024

I am reproducing YOLO-World v2 using 8 GPUs on 1 node. The GPUs are intermittently locked at 100% utilization, which seems to result in slower training overall.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:07:00.0 Off |                    0 |
| N/A   41C    P0              80W / 400W |  16261MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          Off | 00000000:0A:00.0 Off |                    0 |
| N/A   36C    P0              79W / 400W |  18193MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          Off | 00000000:47:00.0 Off |                    0 |
| N/A   37C    P0              79W / 400W |  14219MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          Off | 00000000:4D:00.0 Off |                    0 |
| N/A   41C    P0              83W / 400W |  19767MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          Off | 00000000:87:00.0 Off |                    0 |
| N/A   42C    P0              85W / 400W |  14965MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          Off | 00000000:8D:00.0 Off |                    0 |
| N/A   38C    P0              87W / 400W |  16079MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          Off | 00000000:C7:00.0 Off |                    0 |
| N/A   37C    P0              79W / 400W |  19733MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          Off | 00000000:CA:00.0 Off |                    0 |
| N/A   42C    P0              83W / 400W |  12913MiB / 40960MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
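
A side note from my digging (an assumption on my part, not verified against this repo): nvidia-smi counts a GPU as 100% utilized whenever any kernel is resident, and NCCL collectives busy-poll while waiting for slower ranks, so GPUs blocked on an all-reduce can report 100% while doing no useful work. A short profiler capture can tell real compute apart from communication waits. Below is a minimal sketch assuming a hypothetical no-arg `train_step` callable that runs one iteration; it is not YOLO-World's actual code (the real loop is driven by mmengine's Runner):

```python
# Minimal torch.profiler capture to separate compute from NCCL waits.
# `train_step` is a hypothetical no-arg callable running one
# forward/backward/optimizer iteration.
import torch
from torch.profiler import ProfilerActivity, profile, schedule

def profile_steps(train_step, num_steps=8):
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=2, active=5),
        on_trace_ready=lambda p: p.export_chrome_trace("ddp_trace.json"),
    ) as prof:
        for _ in range(num_steps):
            train_step()
            prof.step()  # advance the profiler schedule each iteration
# Long AllReduce/NCCL kernel entries in the trace would mean the "100%"
# is mostly busy-waiting on stragglers, not real compute.
```

The exported trace can be opened in chrome://tracing or Perfetto to inspect the per-stream timeline.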

In addition, I also found that the time consumed by each forward pass is unstable: sometimes an iteration takes 0.6x s, sometimes 1.x s. Which step in the training process could cause this cost instability?

For example, one forward takes about 0.6 s in epoch 83:

2024/01/24 02:46:39 - mmengine - INFO - Checkpoints will be saved to /group/30042/adriancheng/FastDet/outputs/pretrain_yolow-v8_l_clipv2_frozen_te_noprompt_t2i_bn_2e-3adamw_scale_lr_wd_32xb16-100e_obj365v1_goldg_train_lviseval_second.
2024/01/24 02:47:35 - mmengine - INFO - Epoch(train)  [83][  50/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 14:57:00  time: 1.1114  data_time: 0.0995  memory: 27201  grad_norm: 1298.9029  loss: 1917.3984  loss_cls: 708.9749  loss_bbox: 572.2576  loss_dfl: 636.1659
2024/01/24 02:48:03 - mmengine - INFO - Epoch(train)  [83][ 100/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 11:11:22  time: 0.5540  data_time: 0.0036  memory: 12221  grad_norm: 1304.1145  loss: 1953.1028  loss_cls: 731.8189  loss_bbox: 583.1529  loss_dfl: 638.1310
2024/01/24 02:48:32 - mmengine - INFO - Epoch(train)  [83][ 150/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 10:05:56  time: 0.5916  data_time: 0.0035  memory: 11257  grad_norm: 1224.8096  loss: 1944.7829  loss_cls: 730.5754  loss_bbox: 574.0319  loss_dfl: 640.1756
2024/01/24 02:48:46 - mmengine - INFO - Exp name: yolow-v8_l_clipv2_frozen_te_noprompt_t2i_bn_2e-3adamw_scale_lr_wd_32xb16-100e_obj365v1_goldg_train_lviseval_second_20240124_024126
2024/01/24 02:49:02 - mmengine - INFO - Epoch(train)  [83][ 200/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 9:30:52  time: 0.5811  data_time: 0.0040  memory: 11350  grad_norm: 1441.6634  loss: 1931.3984  loss_cls: 730.8749  loss_bbox: 567.2145  loss_dfl: 633.3090
2024/01/24 02:49:31 - mmengine - INFO - Epoch(train)  [83][ 250/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 9:11:46  time: 0.5944  data_time: 0.0051  memory: 12043  grad_norm: 1251.3882  loss: 1923.6585  loss_cls: 720.5028  loss_bbox: 567.4859  loss_dfl: 635.6698
2024/01/24 02:50:03 - mmengine - INFO - Epoch(train)  [83][ 300/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 9:04:54  time: 0.6394  data_time: 0.0035  memory: 11883  grad_norm: inf  loss: 1960.0928  loss_cls: 730.5000  loss_bbox: 587.6611  loss_dfl: 641.9317
2024/01/24 02:50:32 - mmengine - INFO - Epoch(train)  [83][ 350/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:51:35  time: 0.5674  data_time: 0.0034  memory: 11750  grad_norm: 1426.9864  loss: 1913.7995  loss_cls: 713.1300  loss_bbox: 567.7602  loss_dfl: 632.9093
2024/01/24 02:51:06 - mmengine - INFO - Epoch(train)  [83][ 400/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:53:03  time: 0.6830  data_time: 0.0038  memory: 12217  grad_norm: 1230.4055  loss: 1938.0739  loss_cls: 724.0408  loss_bbox: 577.7115  loss_dfl: 636.3216
2024/01/24 02:51:36 - mmengine - INFO - Epoch(train)  [83][ 450/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:45:57  time: 0.5917  data_time: 0.0037  memory: 12657  grad_norm: 1262.3597  loss: 1928.3339  loss_cls: 722.3070  loss_bbox: 571.3169  loss_dfl: 634.7099
2024/01/24 02:52:03 - mmengine - INFO - Epoch(train)  [83][ 500/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:37:19  time: 0.5559  data_time: 0.0037  memory: 12430  grad_norm: 1171.2148  loss: 1938.4618  loss_cls: 738.1616  loss_bbox: 569.1982  loss_dfl: 631.1020
2024/01/24 02:52:35 - mmengine - INFO - Epoch(train)  [83][ 550/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:36:06  time: 0.6378  data_time: 0.0037  memory: 12230  grad_norm: 1159.3171  loss: 1948.1200  loss_cls: 723.3123  loss_bbox: 581.8724  loss_dfl: 642.9354
2024/01/24 02:53:04 - mmengine - INFO - Epoch(train)  [83][ 600/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:30:35  time: 0.5712  data_time: 0.0035  memory: 11524  grad_norm: 1171.0054  loss: 1920.4894  loss_cls: 722.6454  loss_bbox: 565.6277  loss_dfl: 632.2163
2024/01/24 02:53:34 - mmengine - INFO - Epoch(train)  [83][ 650/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:27:18  time: 0.5951  data_time: 0.0036  memory: 11457  grad_norm: 1214.5159  loss: 1935.7862  loss_cls: 722.0093  loss_bbox: 579.0891  loss_dfl: 634.6878
2024/01/24 02:54:03 - mmengine - INFO - Epoch(train)  [83][ 700/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:24:02  time: 0.5884  data_time: 0.0036  memory: 12364  grad_norm: 1319.2560  loss: 1901.4851  loss_cls: 704.9847  loss_bbox: 571.3287  loss_dfl: 625.1716
2024/01/24 02:54:33 - mmengine - INFO - Epoch(train)  [83][ 750/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:20:57  time: 0.5847  data_time: 0.0035  memory: 11550  grad_norm: 1424.9852  loss: 1915.1698  loss_cls: 715.9729  loss_bbox: 566.6617  loss_dfl: 632.5353
2024/01/24 02:55:04 - mmengine - INFO - Epoch(train)  [83][ 800/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:19:54  time: 0.6193  data_time: 0.0035  memory: 11564  grad_norm: 1166.3062  loss: 1927.2455  loss_cls: 722.3663  loss_bbox: 574.1043  loss_dfl: 630.7749
2024/01/24 02:55:32 - mmengine - INFO - Epoch(train)  [83][ 850/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:16:37  time: 0.5700  data_time: 0.0035  memory: 12790  grad_norm: 1290.3414  loss: 1898.5270  loss_cls: 705.6964  loss_bbox: 565.3750  loss_dfl: 627.4557
2024/01/24 02:56:04 - mmengine - INFO - Epoch(train)  [83][ 900/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:16:07  time: 0.6262  data_time: 0.0035  memory: 11670  grad_norm: 1328.3285  loss: 1938.6448  loss_cls: 729.0949  loss_bbox: 574.5170  loss_dfl: 635.0329
2024/01/24 02:56:32 - mmengine - INFO - Epoch(train)  [83][ 950/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:12:46  time: 0.5579  data_time: 0.0037  memory: 11670  grad_norm: 1246.5969  loss: 1942.6802  loss_cls: 735.6486  loss_bbox: 573.2955  loss_dfl: 633.7362
2024/01/24 02:57:02 - mmengine - INFO - Epoch(train)  [83][1000/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:11:54  time: 0.6132  data_time: 0.0037  memory: 11804  grad_norm: 1097.4520  loss: 1934.1192  loss_cls: 723.8205  loss_bbox: 574.6126  loss_dfl: 635.6860
2024/01/24 02:57:35 - mmengine - INFO - Epoch(train)  [83][1050/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:12:29  time: 0.6513  data_time: 0.0035  memory: 12590  grad_norm: 1282.2389  loss: 1940.2723  loss_cls: 721.0600  loss_bbox: 579.7090  loss_dfl: 639.5034
2024/01/24 02:58:02 - mmengine - INFO - Epoch(train)  [83][1100/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:09:25  time: 0.5521  data_time: 0.0035  memory: 12057  grad_norm: 1262.8680  loss: 1954.8336  loss_cls: 736.2485  loss_bbox: 584.4282  loss_dfl: 634.1570
2024/01/24 02:58:32 - mmengine - INFO - Epoch(train)  [83][1150/2693]  base_lr: 2.0000e-03 lr: 3.9620e-04  eta: 8:08:03  time: 0.5946  data_time: 0.0035  memory: 12950  grad_norm: 1251.2104  loss: 1907.9043  loss_cls: 709.5933  loss_bbox: 568.8475  loss_dfl: 629.4635
2024/01/24 02:58:45 - mmengine - INFO - Exp name: yolow-v8_l_clipv2_frozen_te_noprompt_t2i_bn_2e-3adamw_scale_lr_wd_32xb16-100e_obj365v1_goldg_train_lviseval_second_20240124_024126

However, one forward takes about 1.x s in epoch 62, almost double the time:

2024/01/23 10:05:04 - mmengine - INFO - Epoch(train)  [62][1750/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 2:09:41  time: 1.2069  data_time: 0.0319  memory: 14225  grad_norm: 1088.3497  loss: 2036.3389  loss_cls: 777.0853  loss_bbox: 602.5995  loss_dfl: 656.6541
2024/01/23 10:05:52 - mmengine - INFO - Epoch(train)  [62][1800/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 2:08:57  time: 0.9690  data_time: 0.0038  memory: 11399  grad_norm: 1013.2351  loss: 1992.8268  loss_cls: 759.0890  loss_bbox: 582.5867  loss_dfl: 651.1510
2024/01/23 10:06:58 - mmengine - INFO - Epoch(train)  [62][1850/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 2:08:24  time: 1.3134  data_time: 0.0041  memory: 11626  grad_norm: 1116.8541  loss: 2003.2890  loss_cls: 762.8637  loss_bbox: 595.3914  loss_dfl: 645.0338
2024/01/23 10:07:41 - mmengine - INFO - Epoch(train)  [62][1900/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 2:07:36  time: 0.8466  data_time: 0.0037  memory: 11946  grad_norm: 1068.1516  loss: 1999.0786  loss_cls: 762.4909  loss_bbox: 592.7665  loss_dfl: 643.8213
2024/01/23 10:08:41 - mmengine - INFO - Epoch(train)  [62][1950/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 2:07:00  time: 1.2108  data_time: 0.0037  memory: 12172  grad_norm: 1228.8369  loss: 1993.5879  loss_cls: 761.4466  loss_bbox: 590.3838  loss_dfl: 641.7575
2024/01/23 10:09:36 - mmengine - INFO - Epoch(train)  [62][2000/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 2:06:21  time: 1.1058  data_time: 0.0037  memory: 12332  grad_norm: 932.2955  loss: 2062.0076  loss_cls: 791.1986  loss_bbox: 605.1141  loss_dfl: 665.6949
2024/01/23 10:10:24 - mmengine - INFO - Epoch(train)  [62][2050/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 2:05:36  time: 0.9550  data_time: 0.0037  memory: 12106  grad_norm: 1157.6741  loss: 1992.7866  loss_cls: 756.4549  loss_bbox: 591.0021  loss_dfl: 645.3296
2024/01/23 10:11:27 - mmengine - INFO - Epoch(train)  [62][2100/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 2:05:01  time: 1.2486  data_time: 0.0036  memory: 13146  grad_norm: 1051.7533  loss: 1998.7073  loss_cls: 766.7520  loss_bbox: 584.3259  loss_dfl: 647.6294
2024/01/23 10:12:10 - mmengine - INFO - Epoch(train)  [62][2150/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 2:04:14  time: 0.8600  data_time: 0.0037  memory: 11532  grad_norm: 1305.1072  loss: 2016.2837  loss_cls: 770.0039  loss_bbox: 594.5531  loss_dfl: 651.7267
2024/01/23 10:13:10 - mmengine - INFO - Epoch(train)  [62][2200/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 2:03:37  time: 1.2040  data_time: 0.0034  memory: 12386  grad_norm: 1055.7881  loss: 2037.4117  loss_cls: 780.8914  loss_bbox: 601.2645  loss_dfl: 655.2557
2024/01/23 10:14:03 - mmengine - INFO - Epoch(train)  [62][2250/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 2:02:56  time: 1.0601  data_time: 0.0036  memory: 12612  grad_norm: 1007.3806  loss: 2019.0150  loss_cls: 763.2315  loss_bbox: 603.8904  loss_dfl: 651.8931
2024/01/23 10:14:52 - mmengine - INFO - Epoch(train)  [62][2300/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 2:02:12  time: 0.9739  data_time: 0.0036  memory: 12532  grad_norm: 1037.2856  loss: 2030.2529  loss_cls: 771.3993  loss_bbox: 601.7083  loss_dfl: 657.1453
2024/01/23 10:15:51 - mmengine - INFO - Epoch(train)  [62][2350/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 2:01:35  time: 1.1829  data_time: 0.0035  memory: 12172  grad_norm: 1118.0638  loss: 1993.9789  loss_cls: 755.0526  loss_bbox: 593.9701  loss_dfl: 644.9561
2024/01/23 10:16:33 - mmengine - INFO - Epoch(train)  [62][2400/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 2:00:47  time: 0.8367  data_time: 0.0037  memory: 12052  grad_norm: 1067.1046  loss: 2053.1691  loss_cls: 788.0055  loss_bbox: 603.1601  loss_dfl: 662.0035
2024/01/23 10:17:36 - mmengine - INFO - Epoch(train)  [62][2450/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 2:00:12  time: 1.2647  data_time: 0.0036  memory: 11666  grad_norm: 1078.6088  loss: 2023.5629  loss_cls: 774.4113  loss_bbox: 594.3473  loss_dfl: 654.8043
2024/01/23 10:18:29 - mmengine - INFO - Epoch(train)  [62][2500/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 1:59:31  time: 1.0488  data_time: 0.0037  memory: 11346  grad_norm: 1055.5387  loss: 2017.3038  loss_cls: 771.6260  loss_bbox: 596.0453  loss_dfl: 649.6324
2024/01/23 10:19:18 - mmengine - INFO - Epoch(train)  [62][2550/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 1:58:48  time: 0.9857  data_time: 0.0036  memory: 11386  grad_norm: 1125.1726  loss: 2052.4825  loss_cls: 790.8651  loss_bbox: 602.1897  loss_dfl: 659.4277
2024/01/23 10:20:19 - mmengine - INFO - Epoch(train)  [62][2600/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 1:58:11  time: 1.2170  data_time: 0.0036  memory: 12079  grad_norm: 1092.1465  loss: 1996.4686  loss_cls: 751.9052  loss_bbox: 597.7674  loss_dfl: 646.7960
2024/01/23 10:21:01 - mmengine - INFO - Epoch(train)  [62][2650/2693]  base_lr: 2.0000e-03 lr: 8.1200e-04  eta: 1 day, 1:57:24  time: 0.8499  data_time: 0.0037  memory: 12666  grad_norm: 1014.7694  loss: 2012.6077  loss_cls: 768.7700  loss_bbox: 592.3503  loss_dfl: 651.4874
2024/01/23 10:22:00 - mmengine - INFO - Exp name: yolow-v8_l_clipv2_frozen_te_noprompt_t2i_bn_2e-3adamw_scale_lr_wd_32xb16-100e_obj365v1_goldg_train_lviseval_second_20240121_142319

Is there any way to shorten or stabilize the time of a single forward pass?
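
If it helps anyone reproduce the measurement, this is roughly how individual stages can be timed; it assumes a plain PyTorch DDP loop with placeholder `model`, `batch`, and `optimizer` (again, the real loop is inside mmengine's Runner), so treat it as a sketch:

```python
# Per-stage timing with CUDA events, to split forward from backward time.
# `model(batch)` returning a scalar loss is an assumption for illustration.
import torch

def timed_step(model, batch, optimizer):
    fwd_start = torch.cuda.Event(enable_timing=True)
    fwd_end = torch.cuda.Event(enable_timing=True)
    bwd_end = torch.cuda.Event(enable_timing=True)

    fwd_start.record()
    loss = model(batch)           # forward pass
    fwd_end.record()
    loss.backward()               # backward; DDP overlaps all-reduce here
    bwd_end.record()
    optimizer.step()
    optimizer.zero_grad()

    torch.cuda.synchronize()      # event timings are valid only after a sync
    print(f"forward {fwd_start.elapsed_time(fwd_end):.1f} ms | "
          f"backward+allreduce {fwd_end.elapsed_time(bwd_end):.1f} ms")
    return loss
```

In a DDP job the backward segment includes the gradient all-reduce, so if that number is the unstable one, the jitter is likely communication or straggler related rather than the forward itself.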

wondervictor (Collaborator) commented

Hi @LuletterSoul, I've noticed this! However, I cannot spare enough time right now to optimize the training framework or pipelines to resolve these problems. I'd like to add it to the TODO list and address it later, hopefully before too long.

LuletterSoul (Author) commented Mar 21, 2024

> Hi @LuletterSoul, I've noticed this! However, I cannot spare enough time right now to optimize the training framework or pipelines to resolve these problems. I'd like to add it to the TODO list and address it later, hopefully before too long.

@wondervictor Haha, it's OK. I really like this project and hope it keeps getting better. Maybe the mmyolo training framework is not well optimized, and as a result the training process is slow. I also spent time trying to locate the two issues above, but mmyolo is really too complicated for me.
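
For what it's worth, the first knobs I would try are the standard MMEngine dataloader settings; the values below are illustrative guesses, not this repo's actual config:

```python
# Illustrative MMEngine-style dataloader settings (values are guesses);
# these fields are forwarded to PyTorch's DataLoader by the Runner.
train_dataloader = dict(
    batch_size=16,            # per-GPU batch size
    num_workers=8,            # more workers can smooth data_time spikes
    persistent_workers=True,  # keep workers alive across epochs
    pin_memory=True,          # speeds up host-to-device copies
)
```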

wondervictor (Collaborator) commented

@LuletterSoul, super thanks!! I'll move on to this issue after I fix the fine-tuning bugs. We do have plans to get rid of mmyolo. If you have any new findings or questions related to this issue, I would greatly appreciate it if you could post them under this issue, as they will provide me with valuable guidance.
