Flux Allocates Excessively #828
Any suggestions on how to debug or fix this? I've spent quite a bit of time on it and I'm out of ideas. #736 is blocking my use of Flux.
My naive suggestion is to throw the call to …
I really appreciate the response. A summary of the other issue would probably be useful here: it seems that Julia's garbage collector becomes very slow during training of moderately sized Flux models. That manifests as CuArrays' allocator becoming slow, since it calls the GC. It's difficult to debug because there isn't any documentation about the performance of the GC, and I've run out of things to try within CuArrays. I think Flux's millions of allocations per epoch is the next most obvious culprit.

The above code was on the CPU. I gave Cthulhu a shot, but it's crashing while traversing the code. Type stability was suggested as a cause of the excessive allocation, but I really don't know. Part of the problem with the code that I originally posted is that it's using … Maybe the cause is that …
If Cthulhu is crashing on any code, you should file an issue on its repo. It's really important that we have a working tool to easily inspect the typed IR; otherwise we'll keep going around in circles. For the GC, I would want to see how often CuArrays' allocator calls Julia's GC, since it's less likely that the implementation of Julia's GC itself is at fault here. It's possible that the Zygote-transformed code allocates in a way that CuArrays' GC was not designed for, so it may need some tweaks. My suggestion to patch it still stands 😄
I'll try to give some numbers tomorrow. Before things "go wrong" (see the plot in the CuArrays issue), the number of calls to the GC doesn't seem excessive. At one point I simplified my network down to the point that the whole model fit in GPU memory, and (IIRC) it was only calling the GC <10 times per epoch. Before performance gets really bad, the number of calls to the GC is static, but the length of time each call takes slowly increases. When performance suddenly falls off a cliff, both the number of calls and the time per call go up by about an order of magnitude. The sudden drop in performance also happens at essentially the same point, every time.

I agree that it's weird to point the finger at Julia's GC, but at this point I don't see any obvious alternatives. Flux probably exercises the garbage collector in ways that few other packages do. My network is creating several billion small allocations in ~30 minutes. It's still speculation, but it could be that some internal GC structure becomes slow under those conditions. For example, there could be a structure that runs in ~O(1) unless the number of elements is really, really big. The small size of each allocation could be causing a memory-pressure issue, or there could be a run-of-the-mill bug that only gets exposed under these extreme conditions.

It's unclear to me how anyone is training even moderately sized networks without experiencing this, so maybe there is some subtle mistake causing me and other people to hit this bug. Combing through some old issues showed lots of time being spent in the garbage collector, too, though, so who knows.
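One way to collect those per-epoch numbers without patching CuArrays is to diff `Base.gc_num()` around a unit of work. A small sketch (the `GC_Diff` field names here are assumed from recent Julia versions and may shift between releases):

```julia
# Count GC pauses and GC time around any piece of work (e.g. one training epoch).
# The Base.GC_Diff field names below are assumptions based on recent Julia versions.
function gcstats(f)
    before = Base.gc_num()
    f()
    d = Base.GC_Diff(Base.gc_num(), before)
    (pauses = d.pause, full_sweeps = d.full_sweep, gc_seconds = d.total_time / 1e9)
end

# Example on some allocation-heavy work; substitute a single training epoch here.
gcstats(() -> sum(rand(10^6) for _ in 1:100))
```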
I'm having similar issues.
In case it helps anyone: I had an `evalcb()` function that was storing losses as an array of TrackedArrays. Since I wasn't using `evalcb()` for backprop, I was able to change it to `loss(...).data` (leaving behind all the gradient stuff), and that fixed all my memory issues. Check how many tracked objects you have; mine weren't getting garbage collected.
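As a concrete illustration of that workaround, here is a minimal sketch assuming a Tracker-era Flux setup; the model, data, and throttle interval are made up, and only the callback is the point:

```julia
using Flux
using Flux: throttle

# Illustrative model and data; the fix is in the callback, which stores plain numbers
# rather than TrackedArrays, so old gradient state can be garbage-collected.
model = Chain(Dense(10, 5, relu), Dense(5, 1))
X, Y = rand(Float32, 10, 100), rand(Float32, 1, 100)
data = Iterators.repeated((X, Y), 200)
loss(x, y) = Flux.mse(model(x), y)

losses = Float32[]
evalcb() = push!(losses, loss(X, Y).data)  # strip the tracker; Tracker.data(...) is equivalent

Flux.train!(loss, Flux.params(model), data, ADAM(), cb = throttle(evalcb, 5))
```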
Joining the party. Training slows down heavily after 10 epochs at the latest.
Sorry for the late response. I had some things come up, and I wanted to put together a simple example showing what's going on. Here it is:
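A minimal sketch of that kind of reproduction, written against the Flux API of that era (the model, data sizes, and epoch count are illustrative assumptions, not the original example):

```julia
using Flux

# Illustrative model and data; sizes and epoch count are assumptions.
model = Chain(Dense(28^2, 128, relu), Dense(128, 10))
ps = Flux.params(model)
opt = ADAM()

X = rand(Float32, 28^2, 1024)
Y = rand(Float32, 10, 1024)
data = [(X[:, i:i+63], Y[:, i:i+63]) for i in 1:64:1024]

loss(x, y) = Flux.mse(model(x), y)

for epoch in 1:100
    # @time reports total time, the number of allocations, and the % of time spent in GC.
    @time Flux.train!(loss, ps, data, opt)
end
```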
I ran the above on Windows 10, CUDA 10.1, Julia 1.2 with Flux and CuArrays on master, and a 1080. Minor point, but I think the changes to locks in 1.2 made CuArrays' background task run far more often; it doesn't make a substantive difference compared to 1.1, though. I've also tested a similar setup on Ubuntu with a Titan Xp with similar results, so this issue doesn't appear to be platform-specific. Here's the output:
Note that performance slowly gets worse, all attributable to more time spent in the GC, and then eventually becomes 3x worse.
For comparison purposes, here is an example model that keeps slowing down during training, but it never becomes abruptly worse. Maybe these are two separate issues? The slowdown is still linked to the GC.
...many epochs later...
This would benefit from testing with the latest version of CUDA. Otherwise I don't think there's much actionable stuff to be had on the Flux side of things.
Flux is allocating quite a bit during calls to the AD, like in `gradient`. For example, on the Zygote branch with Zygote master:
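A sketch of how such a measurement might look with Zygote (the model and batch size are assumptions, not the original benchmark):

```julia
using Flux, Zygote

# Illustrative setup; the point is the allocation count reported by @time.
m = Chain(Dense(100, 50, relu), Dense(50, 10))
x, y = rand(Float32, 100, 64), rand(Float32, 10, 64)
loss(x, y) = Flux.mse(m(x), y)
ps = Flux.params(m)

Zygote.gradient(() -> loss(x, y), ps)        # warm up so compilation isn't measured
@time Zygote.gradient(() -> loss(x, y), ps)  # note the number of allocations, not just the time
```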
Bigger input doesn't sufficiently amortize the cost, either.
Tracker is doing the same thing:
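The corresponding Tracker measurement might look like this, assuming a Tracker-based Flux where layer parameters are tracked (again, the model is illustrative):

```julia
using Flux, Tracker

# Same illustrative setup as above, measured with Tracker instead of Zygote.
m = Chain(Dense(100, 50, relu), Dense(50, 10))
x, y = rand(Float32, 100, 64), rand(Float32, 10, 64)
loss(x, y) = Flux.mse(m(x), y)
ps = Flux.params(m)

Tracker.gradient(() -> loss(x, y), ps)        # warm up
@time Tracker.gradient(() -> loss(x, y), ps)
```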
It looks like there are some issues with type stability:
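A sketch of how one might look for that, using `@code_warntype` on an illustrative model and loss (Cthulhu's `@descend` gives the same view interactively once it works on this code):

```julia
using Flux, InteractiveUtils

# Illustrative check; the model and loss are assumptions, not the original code.
m = Chain(Dense(10, 5, relu), Dense(5, 2))
x, y = rand(Float32, 10, 16), rand(Float32, 2, 16)
loss(model, x, y) = Flux.mse(model(x), y)

# Look for `Any` or wide `Union` types in the output; those usually come with extra allocations.
@code_warntype loss(m, x, y)
```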
I suspect the large number of allocations is at least a contributing factor to #736 and the corresponding CuArrays issue https://github.com/JuliaGPU/CuArrays.jl/issues/323. That problem makes it very difficult and annoying to train some large networks.