Training aborts when saving checkpoint after epoch 1 #9
Comments
Sorry, I had rebuilt the container and forgot to change the |
Well, after 5/6 epochs the issue reappears even with the changed parameters. Is this a known problem? Should one further reduce the number of NMS samples or increase the threshold?
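(For illustration only, not a recommendation from the maintainers: the "NMS samples" and threshold mentioned above are presumably test-time settings in the model's `test_cfg`, which is used during the validation run at the end of each epoch. A minimal sketch of how such values could be overridden in an mmdetection3d-style config; the key names `nms_pre` and `score_thr`, and the base config filename, are assumptions that may not match the actual TD3D config.)

```python
# Hypothetical override of test-time NMS settings in an mmdetection3d-style
# Python config. The keys `nms_pre` and `score_thr` are assumed names --
# check the real test_cfg of the TD3D config before using this.
_base_ = ['./td3d_is_s3dis-3d-5class.py']

model = dict(
    test_cfg=dict(
        nms_pre=600,    # keep fewer proposals before NMS -> lower peak memory
        score_thr=0.2,  # higher score threshold -> fewer candidate instances
    ))
```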
Hi, @meyerjo. It looks a little bit strange, but there are several recommendations below that might help you avoid this problem:
Hi, I am currently trying to train the network on the S3DIS dataset using the td3d_is_s3dis-3d-5class config. Training works fine for all training steps in epoch 1, but at the end of the epoch, when the checkpoint is saved, GPU memory usage suddenly jumps from ~8/9 GB to 18 GB and training eventually fails when it reaches the 24 GB limit.
Is this a known issue?
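(Since the jump happens exactly at checkpoint time, which in mmdetection3d-based training typically also triggers a validation pass, it may help to confirm whether the spike comes from evaluation rather than from writing the checkpoint itself. Below is a minimal sketch using standard PyTorch CUDA memory counters; the placement around the end-of-epoch step is up to the user, and no TD3D-specific API is assumed.)

```python
# Minimal sketch for pinpointing where the memory spike happens,
# using only standard PyTorch CUDA memory utilities.
import torch

def report_cuda_memory(tag: str) -> None:
    """Print currently allocated and peak allocated CUDA memory in GB."""
    current = torch.cuda.memory_allocated() / 1024 ** 3
    peak = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"[{tag}] allocated: {current:.1f} GB, peak: {peak:.1f} GB")

torch.cuda.reset_peak_memory_stats()
report_cuda_memory("before checkpoint/eval")

# ... run the end-of-epoch checkpoint save / validation here ...

report_cuda_memory("after checkpoint/eval")
torch.cuda.empty_cache()  # release cached blocks back to the driver
```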