torch.multiprocessing.spawn.ProcessExitedException: process 3 terminated with signal SIGKILL #19
Comments
How does the RAM usage evolve throughout the training on MegaDepth?
About 50% (of 125 GB total) when I begin training on MegaDepth. But I was not inspecting the RAM usage when the training crashed.
Max RAM usage: 98% before the training crashed.
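For reference, a minimal sketch of how RAM usage could be tracked over iterations with `psutil` (the helper name and where it is called are assumptions, not part of glue-factory):

```python
import psutil  # third-party: pip install psutil

def log_ram_usage(logger, iteration, every=100):
    """Hypothetical helper: log system RAM usage every `every` iterations."""
    if iteration % every != 0:
        return
    mem = psutil.virtual_memory()
    logger.info(
        "[it %d] RAM %.1f%% used (%.1f / %.1f GB)",
        iteration, mem.percent, mem.used / 1e9, mem.total / 1e9,
    )
```

Calling something like this from the training loop makes it easy to see whether memory grows steadily (a leak, e.g. accumulated figures) or spikes at evaluation time.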
When I set `conf.plot` to `None` in the function `do_evaluation()`, everything goes OK; the max RAM usage drops to 75%.
I have optimized how we handle figures during training in PR #30; does this help?
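Independently of what PR #30 actually changed, a minimal sketch of the kind of cleanup that prevents figures from accumulating in RAM is to close each matplotlib figure once it has been written to the logger (the helper below is hypothetical):

```python
import matplotlib.pyplot as plt
from torch.utils.tensorboard import SummaryWriter

def log_and_close_figures(writer: SummaryWriter, figures: dict, step: int):
    """Hypothetical helper: log evaluation figures, then free their memory."""
    for name, fig in figures.items():
        # close=True releases the figure once it has been rendered to the event file
        writer.add_figure(name, fig, global_step=step, close=True)
    # Safety net: drop any figures created through pyplot but never logged
    plt.close("all")
```

Without an explicit close, pyplot keeps a reference to every figure it creates, so memory keeps growing with each evaluation round.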
When I train LightGlue using
```bash
python -m gluefactory.train sp+lg_megadepth \
    --conf gluefactory/configs/superpoint-open+lightglue_megadepth.yaml \
    train.load_experiment=sp+lg_homography \
    data.load_features.do=True --distributed
```
**the process gets killed after:**
```
[10/17/2023 04:26:12 gluefactory INFO] [E 4 | it 1000] loss {total 1.731E+00, last 7.856E-01, assignment_nll 7.856E-01, nll_pos 1.262E+00, nll_neg 3.087E-01, num_matchable 4.165E+02, num_unmatchable 7.160E+02, confidence 2.601E-01, row_norm 8.259E-01}
```
Can you offer me some advice on how to solve this problem? Thanks!