GPU memory for training #5

Closed
ChongWang1024 opened this issue Jan 26, 2024 · 4 comments

Comments

@ChongWang1024

Hi,
Thanks for sharing the code of this interesting work.

I am trying to train on the fastMRI dataset, but I get a CUDA out-of-memory error even with a batch size of 1.
My GPU is an NVIDIA A5000, which has 24 GB of memory.

Could you please tell me how much GPU memory is required to train with a batch size of 1?

By the way, I noticed that memory usage gradually increases with each iteration (batch).
Is that normal? Maybe it is related to the code itself in a way I haven't noticed.

Many thanks! Looking forward to your reply.

@hellopipu
Owner

Hi @ChongWang1024,

Approximately 26 GB of GPU memory is required for training on the FastMRI knee dataset. You can decrease the feature dimension to accommodate your GPU.

I haven't observed any gradual increase in memory usage from my end. Could you provide more details about this issue?
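One way to gather those details is to record a memory reading after every iteration and check whether the readings keep rising. The following is only a minimal sketch: the helper names `gpu_memory_mb` and `grows_steadily` are hypothetical and not part of this repository, and the GPU reading assumes PyTorch with CUDA is available (it returns None otherwise).

```python
def gpu_memory_mb():
    """Return CUDA memory currently allocated by tensors, in MiB.

    Returns None if PyTorch is not installed or no CUDA device is available.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.memory_allocated() / 2**20


def grows_steadily(samples, min_increase=0.0):
    """True if every reading exceeds the previous one by more than min_increase."""
    if len(samples) < 2:
        return False
    return all(b - a > min_increase for a, b in zip(samples, samples[1:]))
```

In a training loop, one could append `gpu_memory_mb()` to a list after each step and call `grows_steadily` on the readings taken after the first few warm-up iterations. Note that `torch.cuda.memory_allocated` counts memory held by live tensors, which is usually lower than the figure `nvidia-smi` reports, because PyTorch's caching allocator keeps freed blocks reserved.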

@hellopipu
Owner

Hi @ChongWang1024,

Please update the code and then add --low_mem to the training command. This lets you train with only ~22 GB of memory without modifying the model.

@hellopipu
Owner

hellopipu commented Feb 29, 2024

A likely cause of the memory leak is the pip build of the h5py package. You can fix it with conda install h5py or pip install h5py==3.3.
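Before reinstalling, it can help to check which h5py version is present and how it was installed. The sketch below uses only the standard library; `describe_package` is a hypothetical helper name, and it assumes the package recorded an `INSTALLER` file in its metadata (pip does this; other installers may not, in which case "unknown" is reported).

```python
from importlib import metadata


def describe_package(name):
    """Return (version, installer) for an installed package, or None if missing."""
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return None
    # INSTALLER is an optional metadata file naming the tool that installed it.
    installer = (dist.read_text("INSTALLER") or "").strip() or "unknown"
    return dist.version, installer


# Report what is installed in the current environment.
print("h5py:", describe_package("h5py"))
```

If the reported installer is pip and the version differs from 3.3, that matches the situation described above.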

reference:
facebookresearch/fastMRI#217
facebookresearch/fastMRI#215

@hellopipu hellopipu reopened this Feb 29, 2024
@ChongWang1024
Author

Hi,
Thanks for your detailed reply.
I have figured out the problem: it seems to have been caused by the wrong versions of pytorch-lightning and h5py.

Many thanks!
