WARNING: attempting to recover from OOM in forward/backward pass #23

Open
pzhren opened this issue Nov 6, 2020 · 10 comments

Comments

pzhren commented Nov 6, 2020

Hi, I encountered the following warning during the self-critical sequence training (SCST) stage:
WARNING: attempting to recover from OOM in forward/backward pass
Is this because the GPU memory is not enough? It seems strange, because sometimes training runs normally.

Owner

krasserm commented Nov 6, 2020

Is this because the GPU memory is not enough?

Yes, this is the reason. The settings documented in the README are appropriate for 2 GTX 1080 cards (8 GB each).
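
The wording of the warning suggests that the training loop catches the out-of-memory error and skips the affected batch instead of aborting, which would explain why some runs continue normally. A minimal sketch of that general pattern follows. It assumes PyTorch-style behaviour, where a CUDA OOM surfaces as a `RuntimeError` whose message contains "out of memory". The names `run_step_with_oom_recovery`, `fake_step` and `on_oom` are hypothetical and are not taken from this project's code.

```python
def run_step_with_oom_recovery(step_fn, batch, on_oom=None):
    """Run one training step and skip the batch if it runs out of memory.

    step_fn and on_oom are hypothetical callables used for illustration.
    Returns (result, recovered); recovered is True if an OOM was caught.
    """
    try:
        return step_fn(batch), False
    except RuntimeError as err:
        # Assumption: the OOM is reported as a RuntimeError whose message
        # contains "out of memory". Any other error is re-raised unchanged.
        if "out of memory" not in str(err):
            raise
        print("WARNING: attempting to recover from OOM in forward/backward pass")
        if on_oom is not None:
            on_oom()  # e.g. reset gradients and release cached memory
        return None, True


def fake_step(batch):
    # Stand-in for a forward/backward pass: batches with more than
    # three items simulate running out of GPU memory.
    if len(batch) > 3:
        raise RuntimeError("CUDA out of memory (simulated)")
    return sum(batch)
```

With this sketch, a small batch returns its result and a large batch is skipped with the warning. Skipped batches contribute nothing to the update, so frequent OOMs still slow down or degrade training.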

Author

pzhren commented Nov 6, 2020

In fact, I used 3 GPUs with 11 GB of memory each. The strange thing is that sometimes training works normally, and sometimes it reports that there is not enough memory.

Owner

krasserm commented Nov 6, 2020

Did you pre-train the model with CE loss before running SCST?

Author

pzhren commented Nov 6, 2020

Yes. I passed --max-sentences 2 and it ran normally, but I am worried that this will hurt performance. Will it have a significant impact? Also, why not use .checkpoint/checkpoint_best.pt? Is this not the best set of weights?

Owner

krasserm commented Nov 6, 2020

Convergence improves with higher --max-sentences values (but also requires more memory). A value of 5 should work fine on 11 GB cards.
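
To make the trade-off concrete, here is a rough sketch of how many sentences contribute to one parameter update. It assumes that `--max-sentences` is an upper bound per GPU and that gradient accumulation is available through fairseq's `--update-freq` option. Whether `--update-freq` behaves as expected with SCST in this project is not confirmed in this thread. The helper `sentences_per_update` is hypothetical.

```python
def sentences_per_update(max_sentences, num_gpus, update_freq=1):
    """Upper bound on sentences per parameter update.

    Assumes max_sentences applies per GPU and update_freq is the number
    of batches accumulated before each optimizer step.
    """
    return max_sentences * num_gpus * update_freq


# 3 GPUs with --max-sentences 2: at most 6 sentences per update
small = sentences_per_update(max_sentences=2, num_gpus=3)

# 3 GPUs with --max-sentences 5: at most 15 sentences per update
large = sentences_per_update(max_sentences=5, num_gpus=3)

# Accumulating 3 batches at --max-sentences 2 gives a bound of 18,
# while per-step memory use stays at the --max-sentences 2 level
accumulated = sentences_per_update(max_sentences=2, num_gpus=3, update_freq=3)
```

Under these assumptions, accumulation raises the number of sentences per update without raising peak memory per step, at the cost of more steps per update.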

Regarding checkpoint_best.pt, this is the checkpoint with the best CE validation loss, but not necessarily the one with the best CIDEr score (or any other evaluation metric). Checkpoint selection based on a user-defined metric should be automated, but I have had other priorities. I hope I can resume work on it soon. Pull requests are welcome too, of course!
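
Until selection is automated, it can be done by hand once each checkpoint has been scored. The sketch below assumes the scores were already computed by a separate evaluation run. The function `select_best_checkpoint`, the file names and the score values are all hypothetical and serve only as an illustration.

```python
def select_best_checkpoint(scores, higher_is_better=True):
    """Return the checkpoint name with the best score.

    scores maps checkpoint file names to a metric value, e.g. CIDEr.
    """
    if not scores:
        raise ValueError("no checkpoint scores given")
    pick = max if higher_is_better else min
    return pick(scores, key=scores.get)


# Hypothetical CIDEr scores per checkpoint (made-up illustration values)
cider_scores = {
    "checkpoint18.pt": 1.08,
    "checkpoint19.pt": 1.10,
    "checkpoint20.pt": 1.12,
    "checkpoint21.pt": 1.11,
}

best = select_best_checkpoint(cider_scores)
```

For a metric where lower is better, such as validation loss, pass `higher_is_better=False`.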

Author

pzhren commented Nov 6, 2020

I see, thank you.

Author

pzhren commented Nov 6, 2020

image
In fact, we found that while SCST was running, the memory usage of one GPU suddenly became too high. There is a serious load imbalance between the GPUs. Do you have a good solution?

Author

pzhren commented Nov 6, 2020

Never mind, I found that memory usage increases gradually during the run. This is the running state at --max-sentences 3.
image
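
Gradual growth and imbalance of this kind can also be tracked in text form rather than through screenshots. The sketch below polls `nvidia-smi` through its `--query-gpu` interface and reports the gap between the most and least loaded GPU. The helper names `parse_gpu_memory`, `read_gpu_memory` and `imbalance_mib` are hypothetical, and the sample values in the comments are made up.

```python
import subprocess

# Ask nvidia-smi for one CSV line per GPU: index, used MiB, total MiB
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse lines such as '0, 10900, 11178' into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append({"gpu": index, "used_mib": used, "total_mib": total})
    return rows


def read_gpu_memory():
    """Run nvidia-smi once and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


def imbalance_mib(rows):
    """Difference in used memory between the most and least loaded GPU."""
    used = [row["used_mib"] for row in rows]
    return max(used) - min(used)
```

Calling `read_gpu_memory()` at regular intervals during SCST and logging the result would show whether usage climbs steadily on one device before the OOM warning appears.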

Owner

krasserm commented Nov 7, 2020

What is the frequency of OOMs when you run with --max-sentences 5 or 8?

Author

pzhren commented Nov 7, 2020

I encounter it almost every time. The strange thing is that it reports a memory error after SCST runs one or two.
