IOBuffer/fileIO Memory leak with Threads.@spawn #49545

Closed
louisponet opened this issue Apr 27, 2023 · 14 comments · Fixed by #50144
Labels: GC (Garbage collector), regression (Regression in behavior compared to a previous version), regression 1.9 (Regression in the 1.9 release)
Milestone: 1.10
Comments

@louisponet (Contributor)

I managed to isolate a memory leak that caused runaway memory usage for long-running processes, due to some logging I was doing.

MWE:

function explode(lck, buffer, path, message; kwargs...)
    lock(lck) do  # guard the shared buffer with the lock
        println(buffer, message)
        open(path, append = true) do f
            write(f, take!(buffer))
        end
    end
end

let
    lck = ReentrantLock()
    buffer = IOBuffer()
    path = "test.log"
    while true
        Threads.@spawn begin
            explode(lck, buffer, path, "blabla")
        end
    end
end

versioninfo():

Julia Version 1.9.0-rc2
Commit 72aec423c2 (2023-04-01 10:41 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 8 on 8 virtual cores
Environment:
  LD_LIBRARY_PATH = /opt/intel/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin:/opt/intel/compilers_and_libraries_2020.4.304/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin
  JULIA_NUM_THREADS = 8

Compiled from source. Also tested on downloaded binaries of Julia 1.8.2 and 1.6.7.

@vtjnash (Member) commented Apr 27, 2023

Doesn't look like a memory leak to me. The while true loop looks like an application error that explicitly keeps references to all of the accessed memory (commonly known as a fork bomb).
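For illustration, a bounded variant of the reproducer — a rough sketch using Base.Semaphore to cap the number of in-flight tasks (the cap of 2 * Threads.nthreads() is arbitrary) — would look something like this:

# Rough sketch (not the original reproducer): cap the number of in-flight tasks
# with a Base.Semaphore so memory use reflects the logging path, not task pile-up.
let
    lck = ReentrantLock()
    buffer = IOBuffer()
    path = "test.log"
    limit = Base.Semaphore(2 * Threads.nthreads())  # arbitrary cap
    while true
        Base.acquire(limit)
        Threads.@spawn begin
            try
                explode(lck, buffer, path, "blabla")
            finally
                Base.release(limit)
            end
        end
    end
end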

@louisponet (Contributor, Author) commented Apr 27, 2023

So in my real code this executes much more slowly, of course, but it still keeps accumulating memory that never gets garbage collected. What would be the fix?

EDIT: I mean that even if I stop the loop after a while and run it again (or run something else), or trigger the GC manually, the memory does not get cleaned up. In my real-world application the process grows to 25 GB of occupied memory over the course of a week. In that application there are no while true loops without sleeping etc., just different threads logging messages. Boiled down, it's similar to this snippet.

EDIT2: see logging.jl for the real code

@louisponet (Contributor, Author)

I have tested the memory consumption with the following code, which avoids the fork bombing and is therefore closer to my real-world example:

function meminfo_julia!(d)
    push!(d["GC Live"], Base.gc_live_bytes() / 2^20)  # MiB
    push!(d["Max. RSS"], Sys.maxrss() / 2^20)         # MiB
end

function meminfo_procfs(d)
    pid = getpid()
    smaps = "/proc/$pid/smaps_rollup"
    if !isfile(smaps)
        error("`$smaps` not found. Maybe you are using an OS without procfs support or with an old kernel.")
    end

    rss = pss = 0
    for line in eachline(smaps)
        s = split(line)
        if s[1] == "Rss:"
            rss += parse(Int64, s[2])
        elseif s[1] == "Pss:"
            pss += parse(Int64, s[2])
        end
    end

    push!(d["RSS"], rss / 2^10)  # kB -> MiB
    push!(d["PSS"], pss / 2^10)  # kB -> MiB
end

function explode(buffer = IOBuffer())
    lck = ReentrantLock()
    path = "test.log"
    @sync for _ in 1:1000000
        Threads.@spawn begin
            lock(lck) do
                println(buffer, "blabla")
                open(path, append=true) do f
                    write(f, take!(buffer))
                end
            end
        end
    end
end

memdict = Dict("GC Live" => Float64[],
               "Max. RSS" => Float64[],
               "RSS" => Float64[],
               "PSS" => Float64[])
t = Threads.@spawn let
    buffer = IOBuffer()
    for i = 1:200
        explode(buffer)
        meminfo_julia!(memdict)
        meminfo_procfs(memdict)
    end
end

If I plot the results I get:

[plot: GC Live, Max. RSS, RSS and PSS (MiB) over the 200 iterations]

While it seems to stabilize, it keeps trending upwards over time, and I think this is exactly what's causing the long-term leak. Would you have any idea what might be causing this trend?

@louisponet (Contributor, Author)

I've done some more digging, and I think I have isolated the issue in the above piece of code to the write(f, take!(buffer)) part. It can be recreated with far fewer threads and simply a larger message, like:

const MSG = String(rand(UInt8, Int(10e6)))
function explode(buffer = IOBuffer())
    lck = ReentrantLock()
    out = ""
    @sync for _ in 1:2
        Threads.@spawn begin
            lock(lck) do
                println(buffer, MSG)
                out *= String(take!(buffer))
            end
        end
    end
    
    return out
end

I have run the above test in a loop, which leads to:

[plot: memory usage over repeated calls to explode]

In the case of writing to a file, I could work around it by writing buffer.data directly and manually resetting buffer.size = 0 and buffer.ptr = 1. But I don't think this behavior is expected...
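For reference, the workaround I mean looks roughly like this — a sketch that pokes at IOBuffer internals (data, size, ptr), which are not public API; the helper name flush_to_file! is just for illustration:

# Sketch of the workaround: write the buffer's backing array directly and reset it
# in place, instead of allocating a fresh Vector/String via take!.
# Relies on IOBuffer internals (data, size, ptr) -- not public API.
function flush_to_file!(buffer::IOBuffer, path)
    open(path, append = true) do f
        write(f, view(buffer.data, 1:buffer.size))
    end
    buffer.size = 0
    buffer.ptr = 1
    return nothing
end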

louisponet changed the title from "IOBuffer/fileIO Memory leak with Threads.@spawn" to "IOBuffer/fileIO to String Memory leak with Threads.@spawn" on Apr 28, 2023, then changed it back the same day.
@vtjnash (Member) commented Apr 28, 2023

I ran that on master and it worked fine for me. It stayed constant at about 500-600 MB RSS in htop.

@louisponet (Contributor, Author) commented Apr 28, 2023

OK, if I run it with -t 1 or -t 2 it stays fine; at -t 3 it starts going up like in the graph.

EDIT: on a freshly compiled master.

@louisponet (Contributor, Author) commented Apr 28, 2023

I just tested it on 1.8.5, and this particular issue is not present there. It is present on 1.9-rc4 and master.

Edit: I meant 1.9.0-rc2.

@gbaraldi (Member) commented Apr 28, 2023

On Linux I wasn't able to reproduce this, but on macOS I do see the resident set grow and grow until we start swapping. Calling GC.gc() does make it go back down significantly, though.
What is interesting is that a full collection never seems to happen during the benchmark, even though the heap keeps growing and growing.
I wonder what the difference is here; macOS doesn't do malloc trimming, which is maybe what's helping Linux.

Actually, I can reproduce on Linux, I just need more threads 😕

@d-netto this might be an interesting benchmark for heuristics.
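One way to check that, as a sketch: GC.enable_logging(true) (available since Julia 1.8) prints each GC pause as it happens, and GC.gc(true) forces a full collection, so you can compare the live and resident numbers before and after:

# Sketch: watch GC activity while running the reproducer, then force a full
# collection and compare the numbers afterwards.
GC.enable_logging(true)        # available since Julia 1.8; prints each GC pause
explode()                      # reproducer from the comments above
@show Base.gc_live_bytes()/2^20 Sys.maxrss()/2^20
GC.gc(true)                    # force a full collection
@show Base.gc_live_bytes()/2^20 Sys.maxrss()/2^20
GC.enable_logging(false)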

@louisponet (Contributor, Author) commented Apr 28, 2023

So I must say I was quite on the wrong path in the beginning, thinking it was just IOBuffers; now I think it's about how Strings are allocated during the locked (?) conversion from UInt8 vectors... It's also very weird that it only shows up when the number of threads is greater than 2; I'm not sure if that correlates with the range of the loop.

I have run everything on Linux. I verified with some of my colleagues who are running 1.8.x (on Linux), and it does not seem to happen there.
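As an aside, here is a sketch of a lower-allocation variant of the accumulation step — collecting into a second IOBuffer instead of repeatedly doing out *= String(take!(buffer)). The name explode_lowalloc is made up, and this only illustrates where the per-iteration String allocations come from; it is not a fix for the GC behaviour:

# Sketch: accumulate bytes into a second IOBuffer instead of concatenating Strings,
# so only one String is allocated at the end rather than one per iteration.
function explode_lowalloc(buffer = IOBuffer())
    lck = ReentrantLock()
    out = IOBuffer()
    @sync for _ in 1:2
        Threads.@spawn begin
            lock(lck) do
                println(buffer, MSG)
                write(out, take!(buffer))
            end
        end
    end
    return String(take!(out))
end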

@louisponet (Contributor, Author)

Tested it again today on Julia v1.9.0-rc3, and it's still present.

@KristofferC (Member)

You could try a bisect to see what commit introduced the problem.

@louisponet (Contributor, Author)

Oh, you're right, I'll give it a try. I also tried Profile.take_heap_snapshot() after seeing your workflow video, but it doesn't seem very illuminating (I can still supply it if helpful).

@louisponet (Contributor, Author)

Success! The bisect points at commit 80346c1 as the first bad one.

@KristofferC added the regression (Regression in behavior compared to a previous version) and GC (Garbage collector) labels on May 2, 2023
@KristofferC added this to the 1.10 milestone on May 4, 2023
@KristofferC added the regression 1.9 (Regression in the 1.9 release) label on Jun 9, 2023
@Moelf (Contributor) commented Jun 26, 2023

What's the outlook on fixing this?

oscardssmith added a commit that referenced this issue on Jul 23, 2023:

This PR implements GC heuristics based on the amount of pages allocated instead of live objects, as was done before. The heuristic for the new heap target is based on https://dl.acm.org/doi/10.1145/3563323 (in summary, it argues that the heap target should have square-root behaviour). From my testing this fixes #49545 and #49761.
KristofferC pushed a commit that referenced this issue on Jul 24, 2023 (the same change, cherry-picked from commit 32aa29f).
d-netto pushed a commit to RelationalAI/julia that referenced this issue on Oct 4, 2023 (the same change).