Memory Leakage in cudarc 0.9.x / dfdx 0.11.x #643
Awesome, thanks for including those details. Will check it out.
@jafioti the nsight report links to a .arrows file, is that the right file?
@coreylowman That's the file nsight exported. Is it supposed to be different?
I can't seem to import it into nsight. Is there a .nsys-rep export you could share instead?
Here's the nsys-rep file instead: https://drive.google.com/file/d/1DCHORiHTXVeqCgChmwpDT4g95E7gYBIu/view?usp=share_link
Great! Am I correct that you ran out of memory during either validation or generation? There's definitely a noticeable change in the memory graph at a certain point, and it looks like it's just forward operations. I would expect less memory to be used during inference than in training, since there are no gradients allocated for temporary tensors. Is the validation input data the same shape as the training data?
No, I run out of memory after a bunch (like 85) of training iterations. Actually, I have no issue running validation and generation with the same batch size.
Looking at the nsys-rep you shared, the end of it is definitely in validation or generation since it's only forward calls. Maybe the memory spike happens after one of those completes in the first training batch after validation?
Can you expand on this? If it were a memory leak I'd expect it to happen no matter the size of the model. I wonder if it's a spike in memory usage.
I think what you are seeing is that after it hits OOM, the loss function usually fails, so it continues its loop back to the beginning and runs another training example. It never reaches the .backward() call, so all you see is forward calls. That's still in the training segment, though. As for the memory leak, I think it's a limited leak, where only a certain amount of memory leaks; you can see an increase in memory usage partway through training. That said, I don't think it leaks an endless amount of memory, so on smaller models I have enough headroom to absorb the leak until it resets / stops leaking.
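Roughly the control flow being described, as a minimal sketch: the names below (`forward`, `loss_fn`, `backward`) are stand-ins for the real dfdx calls in the repro, not its API.

```rust
// Sketch of the loop shape: once the loss starts OOMing, the loop `continue`s
// and `.backward()` is never reached, so only forward kernels show up in the
// profile even though we are still in the training segment.
fn forward(_batch: usize) -> Result<f32, String> {
    Ok(0.0)
}

fn loss_fn(_logits: f32) -> Result<f32, String> {
    // Pretend the loss allocation is where the OOM surfaces.
    Err("CUDA_ERROR_OUT_OF_MEMORY".to_string())
}

fn backward(_loss: f32) {
    // Never reached once the loss starts failing.
}

fn main() {
    for batch in 0..100 {
        let logits = match forward(batch) {
            Ok(l) => l,
            Err(e) => {
                eprintln!("forward failed on batch {batch}: {e}");
                continue; // loop back to the next training example
            }
        };
        let loss = match loss_fn(logits) {
            Ok(l) => l,
            Err(e) => {
                eprintln!("loss failed on batch {batch}: {e}");
                continue; // .backward() is never reached from here on
            }
        };
        backward(loss);
    }
}
```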
@coreylowman Were you ever able to locate the issue that caused the leak? I'm noticing similar behavior in other projects, where older dfdx versions don't run out of memory whereas the newer one does.
@jafioti yeah, I'm still trying to figure out the cause. I suspect the difference is that cudarc 0.8 issued memFreeAsync on a separate stream from the default stream, meaning it could potentially free things more quickly than cudarc 0.9 (which inserts memFreeAsync calls on the default stream). I made the switch because it resulted in more regular memory usage, though. Here's the diff between 0.8 and 0.9 for completeness (it's basically only the free stream change): coreylowman/cudarc@v0.8.0...v0.9.0. Can you try some things for me?
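As a conceptual sketch of the stream difference being described (the `Stream` type below is a stand-in, not cudarc's API): a free enqueued on the default stream only runs after every kernel already queued there, while a free enqueued on a dedicated stream is not ordered behind unrelated kernels, so memory can be reclaimed sooner.

```rust
// Conceptual sketch only: a stream is modeled as a FIFO queue of work items
// that execute strictly in order.
#[derive(Default)]
struct Stream {
    pending: Vec<String>,
}

impl Stream {
    fn enqueue(&mut self, work: &str) {
        self.pending.push(work.to_string());
    }
}

fn main() {
    // cudarc 0.9 style: frees share the default stream with the kernels,
    // so they wait behind whatever the training step already launched.
    let mut default_stream = Stream::default();
    default_stream.enqueue("forward kernels (batch N)");
    default_stream.enqueue("forward kernels (batch N + 1)");
    default_stream.enqueue("memFreeAsync(temporary from batch N)");

    // cudarc 0.8 style: frees go on a dedicated stream, so they are not
    // ordered behind unrelated kernels on the default stream.
    let mut free_stream = Stream::default();
    free_stream.enqueue("memFreeAsync(temporary from batch N)");

    println!("default stream: {:?}", default_stream.pending);
    println!("free stream:    {:?}", free_stream.pending);
}
```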
@coreylowman Hmm, that's strange, I wouldn't think it would make that much of a difference. I'll try those diffs; it might be a little bit since I don't have access to my workstation atm, but I'll be using cloud GPUs.
Sounds good @jafioti - also, I just fixed an issue a couple hours ago (the linked PR) that was double-allocating gradients. You should see less memory usage after that, though I don't think it will entirely fix this issue.
Okay, FYI now that #670 is merged, dfdx uses a bit more memory with caching enabled. To disable caching you can call
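The exact call name appears to have been cut off above. As a hedged sketch, assuming the caching allocator from #670 is controlled through a `Cache` trait with `disable_cache`/`empty_cache` on the device (worth double-checking against the dfdx docs):

```rust
// Hedged sketch: assumes the device implements a `Cache` trait exposing
// `disable_cache` and `empty_cache`; check the current dfdx docs for the
// exact method names before relying on this.
use dfdx::prelude::*;

fn main() {
    let dev = AutoDevice::default(); // Cuda when the feature is enabled, else Cpu

    // Trade some speed for lower peak memory by turning the cache off entirely...
    dev.disable_cache();

    // ...or keep it enabled and periodically release cached allocations instead,
    // e.g. after each validation/generation pass.
    // dev.empty_cache();
}
```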
Awesome, I'll let you know how it goes.
I've been using an older dfdx with cudarc 0.8.0, which has worked fine, and I recently upgraded to the latest version on GitHub. I'm getting OOM errors, notably only after many iterations, so I believe it's a leak. It doesn't seem to happen when I use a smaller model, and it doesn't seem to be connected to gradient accumulation, because it still happens when grad_accum is set to 1.
I have a reproduction of it here: https://github.com/jafioti/lm_test/tree/bad_gpt2
Here is my nsight report: https://drive.google.com/file/d/1dCJHtA09cnZOaF3lH5-vhWJyX8Uvw2vA/view?usp=share_link