
fix: reduce gpu memory https://github.com/ashkamath/mdetr/issues/41 #42

Open · wants to merge 1 commit into main

Conversation

ShoufaChen

Related to #41.

When pretraining with MDETR_CPU_REDUCE=1, the GPU memory usage before and after torch.load is:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2390563      C   ...da3/envs/mdetr/bin/python     9333MiB |
|    1   N/A  N/A   2390564      C   ...da3/envs/mdetr/bin/python     9031MiB |
|    2   N/A  N/A   2390565      C   ...da3/envs/mdetr/bin/python     8651MiB |
|    3   N/A  N/A   2390566      C   ...da3/envs/mdetr/bin/python     9857MiB |
+-----------------------------------------------------------------------------+

and

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2337080      C   ...da3/envs/mdetr/bin/python    13587MiB |
|    0   N/A  N/A   2337081      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    0   N/A  N/A   2337082      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    0   N/A  N/A   2337083      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    1   N/A  N/A   2337080      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    1   N/A  N/A   2337081      C   ...da3/envs/mdetr/bin/python    13301MiB |
|    1   N/A  N/A   2337082      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    1   N/A  N/A   2337083      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    2   N/A  N/A   2337080      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    2   N/A  N/A   2337081      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    2   N/A  N/A   2337082      C   ...da3/envs/mdetr/bin/python    11397MiB |
|    2   N/A  N/A   2337083      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    3   N/A  N/A   2337080      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    3   N/A  N/A   2337081      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    3   N/A  N/A   2337082      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    3   N/A  N/A   2337083      C   ...da3/envs/mdetr/bin/python    13251MiB |
+-----------------------------------------------------------------------------+

Using map_location=device resolves this issue.
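A minimal sketch of the idea behind the fix, assuming the checkpoint is loaded in each rank's process before the model is wrapped in DistributedDataParallel; the function name, checkpoint path, and the "model" key are illustrative, not the repository's exact code:

```python
import torch


def load_checkpoint(ckpt_path: str, model: torch.nn.Module, local_rank: int) -> None:
    """Load a checkpoint onto this rank's own GPU only.

    Without map_location, torch.load restores CUDA tensors onto the devices
    they were saved from, so every rank can end up allocating memory (and a
    CUDA context) on GPUs it does not own -- the extra ~1.1 GiB entries per
    process in the second table above.
    """
    device = torch.device(f"cuda:{local_rank}")
    # Remap all stored tensors straight onto this rank's device.
    checkpoint = torch.load(ckpt_path, map_location=device)
    model.load_state_dict(checkpoint["model"])
```

Mapping to "cpu" first and then moving the model to the local device would avoid the cross-GPU allocations as well, at the cost of a temporary host-memory copy of the checkpoint.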

@linhuixiao

I ran into the same bug. How was it resolved? Thank you!

@linhuixiao

@ShoufaChen According to the author's reply in #65, this issue may be hard to solve; we can only keep the batch size and the number of GPUs small.
