OOM error when evaluate detection on lvis minival #40
Comments
Hello @Flaick, thanks for your interest in MDETR. Best
Try setting the environment variable "MDETR_CPU_REDUCE" to "1"; this should help with memory during the reduce step.
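For anyone launching the evaluation from a Python wrapper, here is a minimal sketch of passing that variable to the child process. The helper name `env_with_cpu_reduce` is hypothetical, and how MDETR reads the variable internally is not shown in this thread, so treat this as an illustration only.

```python
import os


def env_with_cpu_reduce(base_env=None):
    """Return a copy of the environment with MDETR_CPU_REDUCE set to "1".

    Hypothetical helper: it only prepares the environment mapping and
    does not launch anything itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MDETR_CPU_REDUCE"] = "1"
    return env


# The resulting mapping can be passed as `env=` to subprocess.run(...)
# when starting the evaluation script.
launch_env = env_with_cpu_reduce()
```

The variable has to be in the environment before the evaluation process starts, so exporting it in the shell that launches the job works just as well.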
Thank you for the help. Can you tell me how much memory is enough for the evaluation? Also, is there any way to reduce the memory usage a bit?
I don’t have precise figures, but it could require in the vicinity of 80 GB of RAM per card. If memory is a concern, you could try modifying the LVIS evaluator code to avoid doing an "all_gather" (which is memory-costly) and instead have every process dump its results to disk, then reload them from the main process. This is a bit hacky but could help you move forward.
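A rough sketch of the dump-to-disk idea, using only the standard library. The function names, the file naming scheme, and the shape of the result records are all assumptions for illustration; this is not MDETR's actual evaluator code. In a real distributed run, `out_dir` would need to be on a filesystem visible to all processes, and a synchronization point such as `torch.distributed.barrier()` would be needed between the dump and the reload.

```python
import os
import pickle
import tempfile


def dump_rank_results(results, out_dir, rank):
    """Write one process's predictions to its own file (hypothetical naming)."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"results_rank{rank}.pkl")
    with open(path, "wb") as f:
        pickle.dump(results, f)
    return path


def load_all_results(out_dir, world_size):
    """Called on the main process only: read and concatenate every rank's file."""
    merged = []
    for rank in range(world_size):
        path = os.path.join(out_dir, f"results_rank{rank}.pkl")
        with open(path, "rb") as f:
            merged.extend(pickle.load(f))
    return merged


# Single-process simulation of two ranks, each holding one prediction.
with tempfile.TemporaryDirectory() as tmp:
    for rank in range(2):
        dump_rank_results([{"image_id": rank, "score": 0.5}], tmp, rank)
    merged = load_all_results(tmp, world_size=2)
```

With this layout only the main process ever holds the full result list in memory; the other processes hold only their own shard until it is written out.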
Thank you for the response. I will try dumping the results to disk and gathering them from there. I am now concerned about the training time when fine-tuning on 100% of the LVIS dataset. Can you tell me the setup you used, such as the number of GPUs, and roughly how long the training takes to finish? Thanks in advance!
It does not work.
Setting "MDETR_CPU_REDUCE" to "1" does not work. I find that MDETR does not release memory, and usage keeps growing as the training epochs increase.
@linhuixiao Hello, have you addressed this issue? During the pre-training phase, after completing the second epoch, an OOM issue arises. Additionally, after finishing the first epoch, there are 4-5 extra processes per card, with each process occupying 500 MB of GPU memory.
Hello, I am facing an out of memory problem when testing with the eval_lvis.py file.
My hardware setup is 8 × 1080 Ti GPUs with PyTorch 1.5. I have managed to run the training code successfully with a batch size of 1, but when I try to evaluate detection performance on LVIS, I get an out-of-memory error at line 76 of util/dist.py.
Can you help me with this problem? Thank you in advance.