Flux Allocates Excessively #828

jmulcahy opened this issue Aug 1, 2019 · 11 comments

jmulcahy commented Aug 1, 2019

Flux allocates quite a bit during calls to the AD, such as gradient. For example, on the Zygote branch with Zygote master:

julia> function test_zygote()
       t = ones(1, 1, 1, 1)
       f = Conv((1, 1), 1=>1)
       ps = params(f)
       Zygote.gradient(() -> sum(f(t)), ps)
       end
test_zygote (generic function with 1 method)

julia> test_zygote()
Grads(...)

julia> @time test_zygote()
  0.000771 seconds (857 allocations: 35.641 KiB)
Grads(...)

A bigger input doesn't amortize the cost sufficiently, either.

julia> function test_zygote_big()
       t = ones(200, 200, 1, 1)
       f = Conv((1, 1), 1=>1)
       ps = params(f)
       Zygote.gradient(() -> sum(f(t)), ps)
       end
test_zygote_big (generic function with 1 method)

julia> test_zygote_big()
Grads(...)

julia> @time test_zygote_big()
  0.650714 seconds (5.48 M allocations: 191.705 MiB, 6.02% gc time)
Grads(...)

Tracker is doing the same thing:

julia> function test_tracker_big()
       t = ones(200, 200, 1, 1)
       f = Conv((1, 1), 1=>1)
       ps = params(f)
       Flux.Tracker.gradient(() -> sum(f(t)), ps)
       end
test_tracker_big (generic function with 1 method)

julia> test_tracker_big()
Grads(...)


julia> @time test_tracker_big()
  0.600915 seconds (5.36 M allocations: 189.873 MiB, 7.95% gc time)
Grads(...)

It looks like there are some issues with type stability:

julia> function test_zygote_big()
       t = ones(200, 200, 1, 1)
       f = Conv((1, 1), 1=>1)
       ps = params(f)
       @code_warntype Zygote.gradient(() -> sum(f(t)), ps)
       end
test_zygote_big (generic function with 1 method)

julia> test_zygote_big()
Body::Zygote.Grads
1 ─ %1  = (getfield)(args, 1)::Params
│   %2  = Zygote.nothing::Nothing
│   %3  = Zygote.nothing::Nothing
│   %4  = %new(Zygote.Context, %2, %3)::Zygote.Context
│   %5  = invoke Zygote._forward(%4::Zygote.Context, _2::getfield(Main, Symbol("##9#10")){Array{Float64,4},Conv{2,4,typeof(identity),Array{Float32,4},Array{Float32,1}}})::Tuple{Any,typeof(∂(λ))}
│   %6  = (Base.getfield)(%5, 1)::Any
│   %7  = (Base.getfield)(%5, 2)::typeof(∂(λ))
│   %8  = Zygote.:(##42#43)::Type{getfield(Zygote, Symbol("##42#43"))}
│   %9  = (Core.typeof)(%7)::Type{#s70} where #s70<:typeof(∂(λ)) where _1
│   %10 = (Core.apply_type)(%8, Params, Zygote.Context, %9)::Type{getfield(Zygote, Symbol("##42#43")){Params,Zygote.Context,_1}} where _1
│   %11 = %new(%10, %1, %4, %7)::getfield(Zygote, Symbol("##42#43")){Params,Zygote.Context,_1} where _1
│   %12 = (Zygote.sensitivity)(%6)::Any
│   %13 = (%11)(%12)::Zygote.Grads
└──       return %13

I suspect the large number of allocations is at least a contributing factor to #736 and the corresponding CuArrays issue, https://github.com/JuliaGPU/CuArrays.jl/issues/323. That problem makes it very difficult and annoying to train some large networks.

jmulcahy commented Aug 6, 2019

Any suggestions on how to debug or fix this? I've spent quite a bit of time on it and I'm out of ideas. #736 is a blocking issue for my use of Flux.

@jpsamaroo

My naive suggestion is, at a minimum, to throw the call to gradient under Cthulhu to tease out where the type instability comes from. Whether or not that can be fixed, if the true issue is GPU allocations, then what's needed is to somehow track down the source of CuArrays' allocations. You may be able to patch CuArrays or CUDAdrv to record a stacktrace every time an allocation is done (probably in CUDAdrv.Mem.alloc?).
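
Not an existing API, but a minimal sketch of the kind of instrumentation suggested above: tally a stacktrace every time the allocator's entry point is reached in a dev'd copy of CuArrays/CUDAdrv. The exact hook point (e.g. CUDAdrv.Mem.alloc) and its signature may differ between versions, so treat the placement as an assumption.

# Hypothetical helper; call record_alloc_site!() from the top of the allocation
# function in a local checkout, then inspect the counts afterwards.
const ALLOC_SITES = Dict{Vector{Base.StackTraces.StackFrame},Int}()

function record_alloc_site!()
    st = stacktrace(backtrace())[3:end]      # drop the instrumentation frames
    ALLOC_SITES[st] = get(ALLOC_SITES, st, 0) + 1
    return nothing
end

# most frequent allocation call sites, highest count first
report_alloc_sites(n = 10) =
    sort(collect(ALLOC_SITES); by = last, rev = true)[1:min(n, length(ALLOC_SITES))]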

@jmulcahy

I really appreciate the response.

A summary of the other issue would probably be useful here. It seems that Julia's garbage collector becomes very slow during training of moderately sized Flux models. That manifests as the CuArrays allocator becoming slow, since it calls the GC. It's difficult to debug because there isn't any documentation about the performance of the GC. I've run out of things to try within CuArrays, so I think Flux's millions of allocations per epoch are the next most obvious culprit.

The above code was on the CPU. I gave Cthulhu a shot, but it crashes while traversing the code. Type stability was a suggestion for what's causing the excessive allocation, but I really don't know. Part of the problem with the code I originally posted is that it uses Float64s, which apparently have stability issues that Float32s don't. Still, even with Float32s there are hundreds of allocations. Even weirder, when I instrumented my training code I saw that apply! was allocating a lot, even though it's presumably supposed to operate in place.
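
For what it's worth, the optimiser update can be checked in isolation with @allocated; a minimal sketch, assuming the Flux.Optimise.apply!(opt, x, Δ) method and arbitrary stand-in arrays:

using Flux

w   = rand(Float32, 16, 16)   # stand-in parameter array
dw  = rand(Float32, 16, 16)   # stand-in gradient
opt = ADAM()

Flux.Optimise.apply!(opt, w, dw)                   # first call compiles and builds optimiser state
@show @allocated Flux.Optimise.apply!(opt, w, dw)  # steady-state allocations of one update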

Maybe the cause is that Grads keeps a {Any, Any} dictionary? I can see why it was typed that way, since the alternative looks like a pain. Unless someone really thinks that's the cause, I'm probably not going to try to change it, given how complicated it appears to be to untangle.
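
For reference, one can peek at the container behind Grads directly (grads is an internal field of Zygote.Grads, so this is an implementation detail that may change):

using Flux, Zygote

t  = ones(Float32, 1, 1, 1, 1)
f  = Conv((1, 1), 1 => 1)
ps = params(f)

gs = Zygote.gradient(() -> sum(f(t)), ps)
typeof(gs.grads)    # inspect the dictionary type that Grads wraps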

@jpsamaroo

If Cthulhu is crashing on any code, you should file an issue on its repo. It's really important that we have a working tool to easily inspect the typed IR; otherwise we'll be going around in circles.

For the GC, I would want to see how often CuArrays' allocator calls Julia's GC, since it's less likely that the implementation of Julia's GC is at fault here. It's possible that the Zygote-transformed code allocates in a way that CuArrays' GC was not designed for, so it may need some tweaks. My suggestion to patch it still stands 😄
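
Not CuArrays-specific, but one low-tech way to quantify the Julia-side GC activity a call triggers is to diff Base.gc_num() around it (the GC_Diff field names are Base internals and may change between Julia versions):

# how much GC work did f() trigger? returns pauses, full collections, GC seconds
function gc_activity(f)
    before = Base.gc_num()
    f()
    d = Base.GC_Diff(Base.gc_num(), before)
    return (pauses = d.pause, full_sweeps = d.full_sweep, gc_seconds = d.total_time / 1e9)
end

# e.g. gc_activity(() -> Flux.train!(loss, ps, data, opt))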

jmulcahy commented Aug 16, 2019

I'll try to give some numbers tomorrow. Before things "go wrong" (see the plot in the CuArrays issue), the number of calls to the GC doesn't seem excessive. At one point I simplified my network to the point that the whole model fit in GPU memory, and (IIRC) it was only calling the GC <10 times per epoch. Before performance gets really bad, the number of calls to the GC stays flat, but the time each call takes slowly increases. When performance suddenly falls off a cliff, both the number of calls and the time per call go up by about an order of magnitude. The sudden drop in performance also happens at essentially the same point, every time.

I agree that it's weird to point the finger at Julia's GC, but at this point I don't see any obvious solutions. Flux probably exercises the garbage collector in ways that few other packages do; my network creates several billion small allocations in ~30 minutes. It's still speculation, but it could be that some internal GC structure becomes slow under those conditions. For example, there could be some structure whose operations are ~O(1) until the number of elements gets really, really big. The small size of each allocation could be causing some memory-pressure issue. Or there could be a run-of-the-mill bug that only gets exposed under these extreme conditions.
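
A crude way to probe that speculation, independent of Flux: keep an increasingly large set of small live objects around and time a single incremental collection (a toy probe, not a reproduction of Flux's allocation pattern):

function gc_pause_vs_live_objects(sizes = (10^5, 10^6, 10^7))
    for n in sizes
        live = [Ref(i) for i in 1:n]   # many small, long-lived allocations
        GC.gc()                        # settle before timing
        t = @elapsed GC.gc(false)      # time one incremental collection
        println("live objects: $(length(live)), incremental GC pause: $(round(t * 1000, digits = 2)) ms")
    end
end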

It's unclear to me how anyone is training even moderately sized networks without experiencing this, so maybe there is some subtle mistake causing me and other people to hit this bug. Then again, combing through some old issues turned up plenty of reports of lots of time being spent in the garbage collector, so who knows.

@datnamer

I'm having similar issues.

@maccam912

In case it helps anyone, I had an evalcb() function that was storing losses as an array of TrackedArrays. Since I wasn't using evalcb() for backprop, I was able to change it to store loss(...).data (leaving behind all the gradient bookkeeping), and that fixed all my memory issues. Check how many tracked objects you're holding on to; mine weren't getting garbage collected.
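
In code, the pattern described above looks roughly like this (Tracker-era API, with a toy tracked value standing in for whatever loss(x, y) returns):

using Flux

tracked_loss = Flux.Tracker.param(1f0)   # stand-in for a tracked loss value

losses = Any[]
push!(losses, tracked_loss)                             # keeps the whole tracked graph alive

losses_scalar = Float32[]
push!(losses_scalar, Flux.Tracker.data(tracked_loss))   # stores just the number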

@dbadrian

Joining the party. Training slows down heavily after 10 epochs at the latest.
I'm still trying to figure out where it comes from in my case.

jmulcahy commented Aug 24, 2019

Sorry for the late response. I had some things come up, and I wanted to put together a simple example showing what's going on. Here it is:

using CuArrays, Flux, IterTools

function make_model()
    first_layer = Conv((1, 1), 1=>16)

    # performance becomes very bad during the 8th epoch
    layers = [Conv((3, 3), 16=>16, pad=(1, 1)) for _ in 1:50]
    last_layer = Conv((1, 1), 16=>1)

    Chain(first_layer, layers..., last_layer) |> gpu
end

function test_training()
    model = make_model()

    data = collect(repeatedly(() -> (CuArrays.randn(200, 200, 1, 1), CuArrays.randn(200, 200, 1, 1)), 2500))

    loss(x, y) = sum(model(y) - x)
    opt = ADAM()
    ps = params(model)

    epoch = 1
    while true
        @info "Epoch $epoch"
        CuArrays.pool_timings!()
        CuArrays.@time Flux.train!(loss, ps, data, opt)
        CuArrays.pool_timings()
        epoch += 1
    end
end

test_training()

I ran the above on Windows 10, CUDA 10.1, and Julia 1.2, with Flux and CuArrays on master, on a GTX 1080. Minor point, but I think the changes to locks in 1.2 mean CuArrays' background task runs far more often; it doesn't make a substantive difference compared to 1.1, though. I've also tested a similar setup on Ubuntu with a Titan Xp, with similar results, so this issue doesn't appear to be platform specific. Here's the output:

[ Info: Epoch 1
197.611938 seconds (271.13 M CPU allocations: 13.684 GiB, 4.25% gc time) (1.97 M GPU allocations: 1.493 TiB, 7.27% gc time of which 4.70% spent allocating)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           198s / 6.26%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 pooled alloc       1.97M    11.1s  89.6%  5.66μs
   2 gc(false)        606    10.1s  81.7%  16.7ms
   1 try alloc      4.32k    609ms  4.90%   141μs
 background task       99    1.31s  10.5%  13.2ms
   reclaim             99    113ms  0.91%  1.14ms
   scan                99   86.7μs  0.00%   876ns
 ────────────────────────────────────────────────
[ Info: Epoch 2
182.209291 seconds (226.09 M CPU allocations: 11.346 GiB, 6.41% gc time) (1.97 M GPU allocations: 1.493 TiB, 9.39% gc time of which 0.08% spent allocating)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           182s / 8.83%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 pooled alloc       1.97M    14.1s  87.4%  7.15μs
   2 gc(false)        575    13.6s  84.8%  23.7ms
   1 try alloc        991   45.4ms  0.28%  45.8μs
 background task       96    2.03s  12.6%  21.2ms
   reclaim             96   8.24ms  0.05%  85.8μs
   scan                96   89.9μs  0.00%   936ns
 ────────────────────────────────────────────────
[ Info: Epoch 3
186.791957 seconds (226.13 M CPU allocations: 11.314 GiB, 8.91% gc time) (1.97 M GPU allocations: 1.493 TiB, 11.56% gc time)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           187s / 11.3%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 pooled alloc       1.97M    18.6s  87.9%  9.46μs
   2 gc(false)        578    18.2s  86.0%  31.5ms
   1 try alloc        578   40.3ms  0.19%  69.7μs
 background task       97    2.56s  12.1%  26.4ms
   reclaim             97    149μs  0.00%  1.54μs
   scan                97   94.6μs  0.00%   975ns
 ────────────────────────────────────────────────
[ Info: Epoch 4
193.289831 seconds (226.11 M CPU allocations: 11.380 GiB, 12.05% gc time) (1.97 M GPU allocations: 1.493 TiB, 14.03% gc time)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           193s / 14.3%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 pooled alloc       1.97M    24.1s  87.0%  12.3μs
   2 gc(false)        576    23.7s  85.4%  41.1ms
   1 try alloc        576   40.6ms  0.15%  70.5μs
 background task      102    3.61s  13.0%  35.4ms
   reclaim            102    159μs  0.00%  1.56μs
   scan               102   95.4μs  0.00%   935ns
 ────────────────────────────────────────────────
[ Info: Epoch 5
202.257068 seconds (235.30 M CPU allocations: 11.759 GiB, 15.78% gc time) (1.97 M GPU allocations: 1.493 TiB, 16.15% gc time of which 0.02% spent allocating)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           202s / 17.9%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 pooled alloc       1.97M    29.6s  81.6%  15.1μs
   2 gc(false)        575    29.2s  80.4%  50.7ms
   1 try alloc        773   43.7ms  0.12%  56.5μs
 background task      152    6.67s  18.4%  43.9ms
   reclaim            152   4.47ms  0.01%  29.4μs
   scan               152    138μs  0.00%   905ns
 ────────────────────────────────────────────────
[ Info: Epoch 6
207.121693 seconds (226.11 M CPU allocations: 11.316 GiB, 18.95% gc time) (1.97 M GPU allocations: 1.493 TiB, 17.73% gc time)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           207s / 21.1%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 pooled alloc       1.97M    33.7s  76.9%  17.1μs
   2 gc(false)        577    33.3s  76.0%  57.7ms
   1 try alloc        577   40.7ms  0.09%  70.5μs
 background task      191    10.1s  23.1%  52.9ms
   reclaim            191    295μs  0.00%  1.55μs
   scan               191    179μs  0.00%   939ns
 ────────────────────────────────────────────────
[ Info: Epoch 7
213.311979 seconds (226.11 M CPU allocations: 11.316 GiB, 21.53% gc time) (1.97 M GPU allocations: 1.493 TiB, 19.73% gc time)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           213s / 23.7%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 pooled alloc       1.97M    39.0s  77.4%  19.9μs
   2 gc(false)        577    38.6s  76.5%  66.9ms
   1 try alloc        577   40.3ms  0.08%  69.9μs
 background task      192    11.4s  22.6%  59.5ms
   reclaim            192    298μs  0.00%  1.55μs
   scan               192    173μs  0.00%   903ns
 ────────────────────────────────────────────────
[ Info: Epoch 8
591.957613 seconds (226.11 M CPU allocations: 11.449 GiB, 71.73% gc time) (1.97 M GPU allocations: 1.493 TiB, 55.08% gc time)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           592s / 72.5%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 pooled alloc       1.97M     323s  75.3%   164μs
   2 gc(false)        576     323s  75.2%   560ms
   1 try alloc        576   41.4ms  0.01%  71.8μs
 background task      192     106s  24.7%   553ms
   reclaim            192    295μs  0.00%  1.54μs
   scan               192    195μs  0.00%  1.01μs
 ────────────────────────────────────────────────

Note that performance slowly gets worse, all of it attributable to more time spent in the GC, and then in the 8th epoch it abruptly becomes roughly 3x worse.

jmulcahy commented Aug 25, 2019

For comparison, here is an example model that keeps slowing down during training but never becomes abruptly worse. Maybe these are two separate issues? The slowdown is still linked to the GC.

using CuArrays, Flux, IterTools

function make_model1()
    first_layer = Conv((1, 1), 1=>16)

    # performance becomes very bad during the 8th epoch
    layers = [Conv((3, 3), 16=>16, pad=(1, 1)) for _ in 1:50]
    last_layer = Conv((1, 1), 16=>1)

    Chain(first_layer, layers..., last_layer) |> gpu
end

function make_model2()
    # performance slowly becomes worse, but not suddenly so
    layers = [Conv((3, 3), 1=>1, pad=(1, 1)) for _ in 1:50]

    Chain(layers...) |> gpu
end

function test_training()
    model = make_model2()

    data = collect(repeatedly(() -> (CuArrays.randn(200, 200, 1, 1), CuArrays.randn(200, 200, 1, 1)), 2500))

    loss(x, y) = sum(model(y) - x)
    opt = ADAM()
    ps = params(model)

    epoch = 1
    while true
        @info "Epoch $epoch"
        CuArrays.pool_timings!()
        CuArrays.@time Flux.train!(loss, ps, data, opt)
        CuArrays.pool_timings()
        epoch += 1
    end
end

test_training()
[ Info: Epoch 1
137.232261 seconds (258.23 M CPU allocations: 13.105 GiB, 5.25% gc time) (1.89 M GPU allocations: 95.019 GiB, 2.83% gc time of which 2.82% spent allocating)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           138s / 1.03%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 background task       97    989ms  69.8%  10.2ms
   reclaim             97   32.4ms  2.29%   334μs
   scan                97   88.9μs  0.01%   917ns
 pooled alloc       1.89M    442ms  31.2%   234ns
   1 try alloc      2.94k   77.7ms  5.49%  26.5μs
 ────────────────────────────────────────────────
[ Info: Epoch 2
128.520860 seconds (215.66 M CPU allocations: 10.894 GiB, 7.72% gc time) (1.89 M GPU allocations: 95.019 GiB, 3.23% gc time of which 3.25% spent allocating)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           129s / 1.62%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 background task       96    1.66s  79.7%  17.3ms
   reclaim             96   82.7ms  3.97%   862μs
   scan                96   87.9μs  0.00%   916ns
 pooled alloc       1.89M    423ms  20.3%   224ns
   1 try alloc      3.96k   53.6ms  2.57%  13.5μs
 ────────────────────────────────────────────────
[ Info: Epoch 3
131.765121 seconds (215.66 M CPU allocations: 10.861 GiB, 10.67% gc time) (1.89 M GPU allocations: 95.019 GiB, 3.11% gc time of which 0.10% spent allocating)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           132s / 2.04%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 background task       96    2.32s  86.2%  24.1ms
   reclaim             96   3.45ms  0.13%  36.0μs
   scan                96   91.6μs  0.00%   955ns
 pooled alloc       1.89M    370ms  13.8%   196ns
   1 try alloc         35    662μs  0.02%  18.9μs
 ────────────────────────────────────────────────
[ Info: Epoch 4
132.402783 seconds (215.66 M CPU allocations: 10.928 GiB, 14.39% gc time) (1.89 M GPU allocations: 95.019 GiB, 2.94% gc time of which 0.23% spent allocating)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           132s / 2.59%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 background task       96    3.05s  88.9%  31.8ms
   reclaim             96   6.46ms  0.19%  67.3μs
   scan                96   83.0μs  0.00%   864ns
 pooled alloc       1.89M    380ms  11.1%   201ns
   1 try alloc        191   2.51ms  0.07%  13.2μs
 ────────────────────────────────────────────────
[ Info: Epoch 5
136.397199 seconds (215.66 M CPU allocations: 10.861 GiB, 17.50% gc time) (1.89 M GPU allocations: 95.019 GiB, 2.90% gc time of which 0.10% spent allocating)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           136s / 3.18%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 background task       96    3.97s  91.5%  41.4ms
   reclaim             96   3.26ms  0.08%  33.9μs
   scan                96   78.5μs  0.00%   817ns
 pooled alloc       1.89M    368ms  8.48%   195ns
   1 try alloc         46    859μs  0.02%  18.7μs
 ────────────────────────────────────────────────
[ Info: Epoch 6
141.585927 seconds (215.67 M CPU allocations: 10.862 GiB, 20.06% gc time) (1.89 M GPU allocations: 95.019 GiB, 2.83% gc time of which 0.08% spent allocating)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           142s / 3.59%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 background task       96    4.70s  92.6%  49.0ms
   reclaim             96   2.84ms  0.06%  29.6μs
   scan                96   82.5μs  0.00%   859ns
 pooled alloc       1.89M    377ms  7.43%   200ns
   1 try alloc         40    642μs  0.01%  16.0μs
 ────────────────────────────────────────────────

...many epochs later...

[ Info: Epoch 47
447.433371 seconds (215.67 M CPU allocations: 10.865 GiB, 73.17% gc time) (1.89 M GPU allocations: 95.019 GiB, 0.92% gc time of which 3.11% spent allocating)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           447s / 74.2%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 background task       96     331s   100%   3.45s
   reclaim             96   62.5ms  0.02%   651μs
   scan                96   94.4μs  0.00%   983ns
 pooled alloc       1.89M    469ms  0.14%   248ns
   1 try alloc      3.02k   65.8ms  0.02%  21.8μs
 ────────────────────────────────────────────────
[ Info: Epoch 48
455.582128 seconds (215.67 M CPU allocations: 10.865 GiB, 73.56% gc time) (1.89 M GPU allocations: 95.019 GiB, 0.90% gc time of which 3.09% spent allocating)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           456s / 74.6%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 background task       96     339s   100%   3.53s
   reclaim             96   62.0ms  0.02%   646μs
   scan                96   93.6μs  0.00%   975ns
 pooled alloc       1.89M    454ms  0.13%   240ns
   1 try alloc      3.02k   65.3ms  0.02%  21.6μs
 ────────────────────────────────────────────────
[ Info: Epoch 49
465.005904 seconds (215.67 M CPU allocations: 10.865 GiB, 73.97% gc time) (1.89 M GPU allocations: 95.019 GiB, 0.89% gc time of which 3.15% spent allocating)
 ────────────────────────────────────────────────
                                   Time
                           ──────────────────────
     Tot / % measured:           465s / 75.0%

 Section           ncalls     time   %tot     avg
 ────────────────────────────────────────────────
 background task       96     348s   100%   3.63s
   reclaim             96   66.0ms  0.02%   688μs
   scan                96   93.4μs  0.00%   973ns
 pooled alloc       1.89M    455ms  0.13%   241ns
   1 try alloc      3.02k   64.5ms  0.02%  21.3μs
 ────────────────────────────────────────────────

@ToucheSir

This would benefit from testing with the latest version of CUDA.jl. Otherwise I don't think there's much that's actionable on the Flux side of things.
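
For anyone re-running this on the current stack, a rough port of the harness above to CUDA.jl might look like the following (a sketch only; Flux and CUDA.jl API details may have shifted since this thread):

using CUDA, Flux, IterTools

model = Chain([Conv((3, 3), 1 => 1, pad = (1, 1)) for _ in 1:50]...) |> gpu
data  = collect(repeatedly(() -> (CUDA.randn(200, 200, 1, 1), CUDA.randn(200, 200, 1, 1)), 2500))

loss(x, y) = sum(model(y) - x)
opt = ADAM()
ps  = Flux.params(model)

for epoch in 1:10
    @info "Epoch $epoch"
    CUDA.@time Flux.train!(loss, ps, data, opt)   # reports CPU and GPU allocations
    CUDA.memory_status()                          # memory pool usage after the epoch
end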
