
OOM error when evaluating detection on LVIS minival #40

Closed
Flaick opened this issue Sep 20, 2021 · 9 comments

Flaick commented Sep 20, 2021

Hello, I am facing an out-of-memory problem when testing with the eval_lvis.py file.
My hardware setup is 8× 1080 Ti GPUs with PyTorch 1.5. I have managed to run the training code successfully with a batch size of 1, but when I try to evaluate detection performance on LVIS, an out-of-memory error is raised at line 76 of util/dist.py.
Can you help me with this problem? Thank you in advance.

alcinos (Collaborator) commented Sep 20, 2021

Hello @Flaick

Thanks for your interest in MDETR.
Could you clarify whether it is CPU or GPU memory that is running out?
Please keep in mind that LVIS evaluation, as we do it, is very memory-intensive, and you’ll likely need a fair amount of RAM.

Best

Flaick (Author) commented Sep 21, 2021

It is the GPU memory that is running out. The error output is as follows:

(screenshot of the GPU out-of-memory traceback)

Thank you!

alcinos (Collaborator) commented Sep 21, 2021

Try setting the environment variable "MDETR_CPU_REDUCE" to "1"; this should help with memory usage during the reduce step.
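
For reference, a minimal sketch of setting the variable programmatically; it has to be set before the distributed workers are spawned so that every rank inherits it (the snippet is illustrative, not part of MDETR's code):

```python
import os

# Must run before any distributed worker process is started,
# so that every rank sees the variable.
os.environ["MDETR_CPU_REDUCE"] = "1"
```

Equivalently, you can prefix whatever launch command you use in the shell, e.g. `MDETR_CPU_REDUCE=1 python eval_lvis.py ...` (arguments elided).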

Flaick (Author) commented Sep 22, 2021

Thank you for the help. Can you tell me how much memory is sufficient for the evaluation? Also, is there any way to reduce the memory usage a bit?
I am still running into the memory error.

alcinos (Collaborator) commented Sep 23, 2021

I don’t have precise figures, but it could be in the vicinity of 80 GB of RAM per card.

If memory is a concern, you could try modifying the code of the LVIS evaluator to avoid doing an "all_gather" (which is memory-costly) and instead have every process dump its results to disk, then reload them from the main process. This is a bit hacky, but it could help you move forward.
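
A minimal sketch of that idea, assuming each rank's predictions are a picklable list and all ranks share a filesystem (the function name, path, and calling convention here are illustrative, not MDETR's actual API):

```python
import os
import torch
import torch.distributed as dist

def gather_via_disk(predictions, tmp_dir="/tmp/lvis_eval"):
    """Disk-based alternative to all_gather: every rank saves its shard,
    then rank 0 reloads and merges everything on CPU."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    os.makedirs(tmp_dir, exist_ok=True)

    # Each process writes its own shard instead of holding all shards in memory.
    torch.save(predictions, os.path.join(tmp_dir, f"preds_rank{rank}.pth"))
    dist.barrier()  # make sure every shard is fully written before rank 0 reads

    if rank != 0:
        return None
    merged = []
    for r in range(world_size):
        shard = torch.load(os.path.join(tmp_dir, f"preds_rank{r}.pth"),
                           map_location="cpu")
        merged.extend(shard)
    return merged
```

Rank 0 can then feed `merged` to the LVIS evaluator in place of the `all_gather` result, while the other ranks skip the accumulation step.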

Flaick (Author) commented Sep 27, 2021

Thank you for the response, I will try the implementation that dumps the results to disk and gathers them from there. Now I am wondering about the training time when fine-tuning on the full (100%) LVIS dataset. Can you tell me the specification you used, such as the number of GPUs, and how long the training will probably take? Thanks in advance!

Flaick closed this as completed Sep 29, 2021
linhuixiao commented
It does not work.

linhuixiao commented
set "MDETR_CPU_REDUCE" to "1" is not work,I find that mdetr will occur the memory not release during the training epoch increasing。

yangcong356 commented
@linhuixiao Hello, have you addressed this issue? During the pre-training phase, an OOM issue arises after completing the second epoch. Additionally, after the first epoch finishes, there are 4-5 extra processes per card, each occupying 500 MB of GPU memory.
