During training, the training process keeps getting Killed. Checking the history shows it was killed for using too much memory.

[05/24 05:58:30] ppdet.engine INFO: Epoch: [8] [1000/2373] learning_rate: 0.000025 loss_class: 1.117397 loss_bbox: 0.252872 loss_giou: 0.447971 loss_class_aux: 3.277040 loss_bbox_aux: 1.026030 loss_giou_aux: 1.620483 loss_class_dn: 0.432132 loss_bbox_dn: 0.546347 loss_giou_dn: 0.661677 loss_class_aux_dn: 0.881084 loss_bbox_aux_dn: 1.264564 loss_giou_aux_dn: 1.543707 loss: 13.045103 eta: 22:27:16 batch_cost: 0.4796 data_cost: 0.2118 ips: 8.3410 images/s
[05/24 06:00:29] ppdet.engine INFO: Epoch: [8] [1200/2373] learning_rate: 0.000025 loss_class: 1.112336 loss_bbox: 0.260119 loss_giou: 0.458377 loss_class_aux: 3.301874 loss_bbox_aux: 1.002547 loss_giou_aux: 1.663409 loss_class_dn: 0.433255 loss_bbox_dn: 0.515790 loss_giou_dn: 0.658224 loss_class_aux_dn: 0.877041 loss_bbox_aux_dn: 1.190781 loss_giou_aux_dn: 1.539540 loss: 13.123555 eta: 22:26:01 batch_cost: 0.5571 data_cost: 0.3091 ips: 7.1795 images/s
Killed

dmesg | tail -10
[7061009.960277] [1406057] 0 1406057 10131 2570 122880 0 999 sh
[7061009.960279] [1406395] 0 1406395 14294 2592 147456 0 999 top
[7061009.960280] [1407839] 0 1407839 442590971 4671736 41353216 0 999 python
[7061009.960282] [1408100] 0 1408100 10515 2455 110592 0 999 orion_client_ex
[7061009.960284] [1410164] 0 1410164 10131 2580 126976 0 999 sh
[7061009.960285] [1410507] 0 1410507 10131 1663 110592 0 999 sh
[7061009.960287] [1410508] 0 1410508 307984 9444 290816 0 999 orion-nv-smi-na
[7061009.960289] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=docker-90d30e8297098c6fe06fb3c7b475b132d8f6ef895e545fb6474df6a6ad5640f0.scope,mems_allowed=0-1,oom_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod77278367_89c1_4f2d_b54f_4c09b0b644e9.slice,task_memcg=/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod77278367_89c1_4f2d_b54f_4c09b0b644e9.slice/docker-90d30e8297098c6fe06fb3c7b475b132d8f6ef895e545fb6474df6a6ad5640f0.scope,task=python,pid=1407839,uid=0
[7061009.960363] Memory cgroup out of memory: Killed process 1407839 (python) total-vm:1770363884kB, anon-rss:16492188kB, file-rss:1995400kB, shmem-rss:199356kB, UID:0 pgtables:40384kB oom_score_adj:999
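The oom-kill line reports constraint=CONSTRAINT_MEMCG, so the kill comes from the container's cgroup memory limit being hit, not from the whole host running out of RAM. As a quick sanity check (a sketch only; which path exists depends on whether the node uses cgroup v1 or v2), the limit can be read from inside the container:

cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # cgroup v1
cat /sys/fs/cgroup/memory.max                     # cgroup v2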
Additional note: virtual memory usage is already very large during training.
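For reading the numbers above: total-vm (roughly 1.7 TB) is virtual address space, which for GPU training typically includes large reservations by CUDA and the memory allocator and is not what the cgroup limit accounts against; the anon-rss value (roughly 16 GB) is the resident memory that actually triggered the kill. A rough way to watch the training process's resident memory while it runs (the PID below is a placeholder):

watch -n 5 'ps -o pid,vsz,rss,cmd -p <training_pid>'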
Choose a batch_size appropriate for your memory capacity, and enable AMP mode by appending --amp to the training command.
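For example (a sketch only; the config path and values are illustrative, not taken from this issue), the batch size and dataloader worker count can be overridden on the command line and AMP enabled in the same run:

python tools/train.py \
  -c configs/rtdetr/rtdetr_r50vd_6x_coco.yml \
  -o TrainReader.batch_size=2 worker_num=2 \
  --amp

Halving batch_size roughly halves peak activation memory, and fewer dataloader workers reduces the host RAM used for prefetching.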
It works after increasing the memory.