
OOM error when evaluating detection on LVIS minival #40

Closed
Flaick opened this issue Sep 20, 2021 · 9 comments

Flaick commented Sep 20, 2021

Hello, I am facing an out-of-memory problem when testing with the eval_lvis.py file.
My hardware setup is 8× 1080 Ti GPUs with PyTorch 1.5. I have managed to run the training code successfully with a batch size of 1, but when I try to evaluate detection performance on LVIS, an out-of-memory error is raised at line 76 of util/dist.py.
Can you help me with this problem? Thank you in advance.

alcinos (Collaborator) commented Sep 20, 2021

Hello @Flaick

Thanks for your interest in MDETR.
Could you clarify whether it is CPU or GPU memory that is running out?
Please keep in mind that LVIS evaluation, as we do it, is very memory-intensive, and you’ll likely need a fair amount of RAM.

Best

Flaick (Author) commented Sep 21, 2021

It is the GPU memory that is running out. The error output is as follows:

(screenshot of the GPU out-of-memory traceback)

Thank you!

alcinos (Collaborator) commented Sep 21, 2021

Try setting the environment variable "MDETR_CPU_REDUCE" to "1"; this should help with memory usage during the reduce step.
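
For reference, a minimal sketch of setting the variable programmatically; it has to be set before the distributed workers are spawned so that every rank inherits it (the snippet is illustrative, not part of MDETR's code):

```python
import os

# Must run before any distributed worker process is started,
# so that every rank sees the variable.
os.environ["MDETR_CPU_REDUCE"] = "1"
```

Equivalently, you can prefix whatever launch command you use in the shell, e.g. `MDETR_CPU_REDUCE=1 python eval_lvis.py ...` (arguments elided).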

Flaick (Author) commented Sep 22, 2021

Thank you for the help. Can you tell me how much memory is sufficient for the evaluation? Also, is there any way to reduce the memory usage a bit?
I am still running into the memory error.

alcinos (Collaborator) commented Sep 23, 2021

I don’t have precise figures, but it could be in the vicinity of 80 GB of RAM per card.

If memory is a concern, you could try modifying the code of the LVIS evaluator to avoid doing an "all_gather" (which is memory-costly) and instead have every process dump its results to disk, then reload them from the main process. This is a bit hacky, but it could help you move forward.
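
A minimal sketch of that idea, assuming each rank's predictions are a picklable list and all ranks share a filesystem (the function name, path, and calling convention here are illustrative, not MDETR's actual API):

```python
import os
import torch
import torch.distributed as dist

def gather_via_disk(predictions, tmp_dir="/tmp/lvis_eval"):
    """Disk-based alternative to all_gather: every rank saves its shard,
    then rank 0 reloads and merges everything on CPU."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    os.makedirs(tmp_dir, exist_ok=True)

    # Each process writes its own shard instead of holding all shards in memory.
    torch.save(predictions, os.path.join(tmp_dir, f"preds_rank{rank}.pth"))
    dist.barrier()  # make sure every shard is fully written before rank 0 reads

    if rank != 0:
        return None
    merged = []
    for r in range(world_size):
        shard = torch.load(os.path.join(tmp_dir, f"preds_rank{r}.pth"),
                           map_location="cpu")
        merged.extend(shard)
    return merged
```

Rank 0 can then feed `merged` to the LVIS evaluator in place of the `all_gather` result, while the other ranks skip the accumulation step.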

Flaick (Author) commented Sep 27, 2021

Thank you for the response, I will try the implementation that dumps the results to disk and gathers them from there. Now I am wondering about the training time when fine-tuning on the full (100%) LVIS dataset. Can you tell me the specification you used, such as the number of GPUs, and how long the training will probably take? Thanks in advance!

Flaick closed this as completed Sep 29, 2021
linhuixiao commented
It does not work.

linhuixiao commented
set "MDETR_CPU_REDUCE" to "1" is not work,I find that mdetr will occur the memory not release during the training epoch increasing。

yangcong356 commented
@linhuixiao Hello, have you addressed this issue? During the pre-training phase, an OOM issue arises after completing the second epoch. Additionally, after the first epoch finishes, there are 4-5 extra processes per card, each occupying 500 MB of GPU memory.
