
Memory Leakage in cudarc 0.9.x / dfdx 0.11.x #643

Open
jafioti opened this issue Mar 30, 2023 · 16 comments

@jafioti
Contributor

jafioti commented Mar 30, 2023

I've been using an older dfdx with cudarc 0.8.0, which has worked fine, and I recently upgraded to the latest version on GitHub. I'm getting OOM errors, notably after many iterations, so I believe it's a leak. It doesn't seem to happen when I use a smaller model, and it doesn't seem to be connected to gradient accumulation, because it still happens when grad_accum is set to 1.

I have a reproduction of it here: https://github.com/jafioti/lm_test/tree/bad_gpt2

Here is my nsight report: https://drive.google.com/file/d/1dCJHtA09cnZOaF3lH5-vhWJyX8Uvw2vA/view?usp=share_link

@coreylowman
Owner

Awesome, thanks for including those details. Will check it out.

@coreylowman
Owner

@jafioti the nsight report links to a .arrows file, is that the right file?

@jafioti
Contributor Author

jafioti commented Mar 30, 2023

@coreylowman That's the file nsight exported. Is it supposed to be different?

@coreylowman
Owner

I can't seem to import it into Nsight. Is there a .nsys-rep file? That's what you had shared last time, I think.

@jafioti
Contributor Author

jafioti commented Mar 30, 2023

@coreylowman
Owner

Great! Am I correct that you ran out of memory during either validation or generation? There's definitely a noticeable change in the memory graph at a certain point, and it looks like it's just forward operations.

I would expect less memory to be used during inference than in training, since no gradients are allocated for temporary tensors. Is the validation input data the same shape as the training data?

@jafioti
Contributor Author

jafioti commented Mar 30, 2023

No, I run out of memory after a bunch (like 85) of training iterations. Actually, I have no issue running validation and generation with the same batch size.

@coreylowman
Owner

> No, I run out of memory after a bunch (like 85) of training iterations. Actually, I have no issue running validation and generation with the same batch size.

Looking at the nsys-rep you shared, the end of it is definitely in validation or generation since it's only forward calls. Maybe the memory spike happens after one of those completes in the first training batch after validation?

> It doesn't seem to happen when I use a smaller model, and it doesn't seem to be connected to gradient accumulation, because it still happens when grad_accum is set to 1.

Can you expand on this? If it were a memory leak, I'd expect it to happen no matter the size of the model. I wonder if it's a spike in memory usage.

@jafioti
Contributor Author

jafioti commented Mar 31, 2023

> Looking at the nsys-rep you shared, the end of it is definitely in validation or generation since it's only forward calls. Maybe the memory spike happens after one of those completes in the first training batch after validation?

I think what you are seeing is that after it hits OOM, the loss function usually fails, so it continues its loop back to the beginning and runs another training example. It never reaches the .backward() call, so all you see is forward calls. That's still in the training segment, though.

As for the memory leak, I think it's a limited leak, where only a certain amount of memory leaks. You can see an increase in memory usage partway through training. That said, I don't think it leaks an endless amount of memory, so on smaller models I have enough memory to compensate for the leak until it resets / stops leaking.
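
For illustration, here is a minimal, self-contained sketch of the control flow being described (all names are hypothetical, not taken from the linked repo): the loss step fails with an out-of-memory error, the loop skips to the next batch, and `.backward()` is never reached, which is why the profile only shows forward calls.

```rust
// Hypothetical training loop shape; the real code lives in the linked repo.
#[derive(Debug)]
enum StepError {
    OutOfMemory,
}

// Stand-in for "forward pass + loss"; pretend the device OOMs from batch 85 on.
fn forward_and_loss(batch: usize) -> Result<f32, StepError> {
    if batch >= 85 {
        Err(StepError::OutOfMemory)
    } else {
        Ok(0.5)
    }
}

fn main() {
    for batch in 0..100 {
        let _loss = match forward_and_loss(batch) {
            Ok(loss) => loss,
            Err(e) => {
                eprintln!("batch {batch}: {e:?}, skipping");
                continue; // loop restarts, so only forward kernels show up in nsys
            }
        };
        // loss.backward() and the optimizer update would go here; once the OOM
        // errors start, this point is never reached.
    }
}
```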

@coreylowman
Owner

> I think what you are seeing is that after it hits OOM, the loss function usually fails, so it continues its loop back to the beginning and runs another training example. It never reaches the .backward() call, so all you see is forward calls. That's still in the training segment, though.

Oh! Good details, that helps a lot. Seems like I need to focus on this area:
[screenshot: memory timeline region from the nsys report]

@jafioti
Contributor Author

jafioti commented Apr 4, 2023

@coreylowman Were you ever able to locate the cause of the leak? I'm noticing similar behavior on other projects, where older dfdx versions don't run out of memory whereas the newer one does.

@coreylowman
Owner

@jafioti yeah, I'm still trying to figure out the cause. I suspect the difference is that cudarc 0.8 ran memFreeAsync on a separate stream from the default stream, meaning it could potentially free things more quickly than cudarc 0.9, which inserts the memFreeAsync calls on the default stream (a rough sketch of the difference follows below). I made the switch because it resulted in more regular memory usage, though.

Here's the diff between 0.8 and 0.9 for completeness (it basically only has the free stream) coreylowman/cudarc@v0.8.0...v0.9.0.

Can you try some things for me?

  1. Try adding back in the free stream to cudarc. The PR that added this was Adding a free stream to concurrently free memory cudarc#79, and the PR that removed it was Reverting free stream, putting free_async calls on default stream cudarc#94
  2. Try removing the stream usage from binary operations. Here's the diff you can apply to do this: main...binary-unparallel
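
To make the stream-ordering point above concrete, here is a rough sketch. It assumes cudarc's `CudaDevice::new` and `alloc_zeros` calls from around 0.9; the comments are an interpretation of the 0.8-vs-0.9 difference being discussed, not documented behavior.

```rust
use cudarc::driver::CudaDevice;

fn main() -> Result<(), cudarc::driver::DriverError> {
    let dev = CudaDevice::new(0)?;

    {
        // 256 MiB scratch buffer (64M f32s)
        let scratch = dev.alloc_zeros::<f32>(64 * 1024 * 1024)?;
        // ... kernels using `scratch` would be launched on the default stream here ...
        drop(scratch);
        // cudarc 0.9: the cuMemFreeAsync for `scratch` is enqueued on the default
        // stream, so it only takes effect after all work already queued there.
        // cudarc 0.8 enqueued it on a dedicated free stream, so the memory could
        // be returned sooner, at the cost of less regular memory usage.
    }

    // If this allocation is issued while the free above is still queued behind
    // long-running kernels, peak device memory ends up higher than necessary.
    let _next = dev.alloc_zeros::<f32>(64 * 1024 * 1024)?;
    Ok(())
}
```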

@jafioti
Contributor Author

jafioti commented Apr 4, 2023

@coreylowman Hmm, that's strange; I wouldn't think it would make too much of a difference.

I'll try those diffs. It might be a little while since I don't have access to my workstation at the moment, but I will be using cloud GPUs.

@coreylowman
Owner

Sounds good @jafioti - also, I just fixed an issue a couple of hours ago (the linked PR) that was double-allocating gradients. You should see less memory usage after that, though I don't think it will entirely fix this issue.

@coreylowman
Owner

Okay, FYI: now that #670 is merged, dfdx uses a bit more memory with caching enabled. To disable caching you can call dev.disable_cache() after initialization. Curious how this will impact the behavior you are seeing.
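
For reference, a minimal sketch of where that call would go. `disable_cache()` is the call named above; constructing the device via `Default::default()` and the `dfdx::tensor::Cuda` path are assumptions about the 0.11.x API (with the `cuda` feature enabled).

```rust
use dfdx::tensor::Cuda;

fn main() {
    // Assumes the Cuda device implements Default (GPU ordinal 0).
    let dev: Cuda = Default::default();
    // Turn off the allocation cache added in #670 so freed buffers are returned
    // to the driver instead of being kept around for reuse.
    dev.disable_cache();
    // ... build the model and run training/validation as usual ...
}
```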

@jafioti
Contributor Author

jafioti commented Apr 13, 2023

Awesome, I'll let you know how it goes.
