
Memory Leakage in cudarc 0.9.x / dfdx 0.11.x #643

Open
jafioti opened this issue Mar 30, 2023 · 16 comments

@jafioti
Contributor

jafioti commented Mar 30, 2023

I've been using an older dfdx with cudarc 0.8.0, which has worked fine, and I recently upgraded to the latest version on GitHub. I'm getting OOM errors, notably after many iterations, so I believe it's a leak. It doesn't seem to happen when I use a smaller model, and it doesn't seem to be connected to gradient accumulation, because it still happens when grad_accum is set to 1.

I have a reproduction of it here: https://github.com/jafioti/lm_test/tree/bad_gpt2

Here is my nsight report: https://drive.google.com/file/d/1dCJHtA09cnZOaF3lH5-vhWJyX8Uvw2vA/view?usp=share_link

@coreylowman
Owner

Awesome, thanks for including those details. Will check it out.

@coreylowman
Owner

@jafioti the nsight report links to a .arrows file, is that the right file?

@jafioti
Contributor Author

jafioti commented Mar 30, 2023

@coreylowman That's the file nsight exported. Is it supposed to be different?

@coreylowman
Owner

I can't seem to import it into Nsight. Is there a .nsys-rep file? That's what you had shared last time, I think.

@jafioti
Contributor Author

jafioti commented Mar 30, 2023

@coreylowman
Owner

Great! Am I correct that you ran out of memory during either validation or generation? There's definitely a noticeable change in the memory graph at a certain point, and it looks like it's just forward operations.

I would expect less memory to be used during inference than in training, since no gradients are allocated for temporary tensors. Is the validation input data the same shape as the training data?

@jafioti
Contributor Author

jafioti commented Mar 30, 2023

No, I run out of memory after a bunch (like 85) of training iterations. Actually, I have no issue running validation and generation with the same batch size.

@coreylowman
Owner

> No, I run out of memory after a bunch (like 85) of training iterations. Actually, I have no issue running validation and generation with the same batch size.

Looking at the nsys-rep you shared, the end of it is definitely in validation or generation since it's only forward calls. Maybe the memory spike happens after one of those completes in the first training batch after validation?

> It doesn't seem to happen when I use a smaller model, and it doesn't seem to be connected to gradient accumulation, because it still happens when grad_accum is set to 1.

Can you expand on this? If it were a memory leak, I'd expect it to happen no matter the size of the model. I wonder if it's a spike in memory usage.

@jafioti
Contributor Author

jafioti commented Mar 31, 2023

> Looking at the nsys-rep you shared, the end of it is definitely in validation or generation since it's only forward calls. Maybe the memory spike happens after one of those completes in the first training batch after validation?

I think what you are seeing is that after it hits OOM, the loss function usually fails, so it continues its loop back to the beginning and runs another training example. It never reaches the .backward() call, so all you see is forward calls. That's still in the training segment, though.

As for the memory leak, I think it's a limited leak, where only a certain amount of memory leaks. You can see an increase in memory usage partway through training. That said, I don't think it leaks an endless amount of memory, so on smaller models I have enough memory to compensate for the leak until it resets / stops leaking.
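
For illustration, here is a minimal, self-contained sketch of the control flow being described (all names are hypothetical, not taken from the linked repo): the loss step fails with an out-of-memory error, the loop skips to the next batch, and `.backward()` is never reached, which is why the profile only shows forward calls.

```rust
// Hypothetical training loop shape; the real code lives in the linked repo.
#[derive(Debug)]
enum StepError {
    OutOfMemory,
}

// Stand-in for "forward pass + loss"; pretend the device OOMs from batch 85 on.
fn forward_and_loss(batch: usize) -> Result<f32, StepError> {
    if batch >= 85 {
        Err(StepError::OutOfMemory)
    } else {
        Ok(0.5)
    }
}

fn main() {
    for batch in 0..100 {
        let _loss = match forward_and_loss(batch) {
            Ok(loss) => loss,
            Err(e) => {
                eprintln!("batch {batch}: {e:?}, skipping");
                continue; // loop restarts, so only forward kernels show up in nsys
            }
        };
        // loss.backward() and the optimizer update would go here; once the OOM
        // errors start, this point is never reached.
    }
}
```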

@coreylowman
Owner

> I think what you are seeing is that after it hits OOM, the loss function usually fails, so it continues its loop back to the beginning and runs another training example. It never reaches the .backward() call, so all you see is forward calls. That's still in the training segment, though.

Oh! Good details, that helps a lot. Seems like I need to focus on this area:
[screenshot: memory timeline region from the nsys report]

@jafioti
Contributor Author

jafioti commented Apr 4, 2023

@coreylowman Were you ever able to locate the cause of the leak? I'm noticing similar behavior on other projects, where older dfdx versions don't run out of memory whereas the newer one does.

@coreylowman
Owner

@jafioti yeah, I'm still trying to figure out the cause. I suspect the difference is that cudarc 0.8 ran memFreeAsync on a separate stream from the default stream, meaning it could potentially free things more quickly than cudarc 0.9, which inserts the memFreeAsync calls on the default stream (a rough sketch of the difference follows below). I made the switch because it resulted in more regular memory usage, though.

Here's the diff between 0.8 and 0.9 for completeness (it basically only has the free stream) coreylowman/cudarc@v0.8.0...v0.9.0.

Can you try some things for me?

  1. Try adding back in the free stream to cudarc. The PR that added this was Adding a free stream to concurrently free memory cudarc#79, and the PR that removed it was Reverting free stream, putting free_async calls on default stream cudarc#94
  2. Try removing the stream usage from binary operations. Here's the diff you can apply to do this: main...binary-unparallel
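
To make the stream-ordering point above concrete, here is a rough sketch. It assumes cudarc's `CudaDevice::new` and `alloc_zeros` calls from around 0.9; the comments are an interpretation of the 0.8-vs-0.9 difference being discussed, not documented behavior.

```rust
use cudarc::driver::CudaDevice;

fn main() -> Result<(), cudarc::driver::DriverError> {
    let dev = CudaDevice::new(0)?;

    {
        // 256 MiB scratch buffer (64M f32s)
        let scratch = dev.alloc_zeros::<f32>(64 * 1024 * 1024)?;
        // ... kernels using `scratch` would be launched on the default stream here ...
        drop(scratch);
        // cudarc 0.9: the cuMemFreeAsync for `scratch` is enqueued on the default
        // stream, so it only takes effect after all work already queued there.
        // cudarc 0.8 enqueued it on a dedicated free stream, so the memory could
        // be returned sooner, at the cost of less regular memory usage.
    }

    // If this allocation is issued while the free above is still queued behind
    // long-running kernels, peak device memory ends up higher than necessary.
    let _next = dev.alloc_zeros::<f32>(64 * 1024 * 1024)?;
    Ok(())
}
```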

@jafioti
Contributor Author

jafioti commented Apr 4, 2023

@coreylowman Hmm, that's strange; I wouldn't think it would make too much of a difference.

I'll try those diffs. It might be a little while since I don't have access to my workstation at the moment, but I will be using cloud GPUs.

@coreylowman
Owner

Sounds good @jafioti - also, I just fixed an issue a couple of hours ago (the linked PR) that was double-allocating gradients. You should see less memory usage after that, though I don't think it will entirely fix this issue.

@coreylowman
Owner

Okay, FYI: now that #670 is merged, dfdx uses a bit more memory with caching enabled. To disable caching you can call dev.disable_cache() after initialization. Curious how this will impact the behavior you are seeing.
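
For reference, a minimal sketch of where that call would go. `disable_cache()` is the call named above; constructing the device via `Default::default()` and the `dfdx::tensor::Cuda` path are assumptions about the 0.11.x API (with the `cuda` feature enabled).

```rust
use dfdx::tensor::Cuda;

fn main() {
    // Assumes the Cuda device implements Default (GPU ordinal 0).
    let dev: Cuda = Default::default();
    // Turn off the allocation cache added in #670 so freed buffers are returned
    // to the driver instead of being kept around for reuse.
    dev.disable_cache();
    // ... build the model and run training/validation as usual ...
}
```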

@jafioti
Contributor Author

jafioti commented Apr 13, 2023

Awesome, I'll let you know how it goes.
