
Workers out of memory during optimization runs #387

Closed
daveweisberg opened this issue Aug 9, 2023 · 46 comments
Labels: bug, high priority, JuliaHub

@daveweisberg
Contributor

daveweisberg commented Aug 9, 2023

I have been seeing srun error: Out Of Memory on FUSE optimization runs that last longer than ~60 iterations. I don't understand why a worker node would be running out of memory, but we need to identify which process is causing this. It is preventing me from efficiently completing optimization runs that require 100s of iterations.

The following error corresponds to the optimization run located here:

/fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop48_HTS0.1_qpol2.75_A3.5_Zeff2.0/

where the last saved run folder is from iteration 56.

Error:

Iteration  30%|████████▍                   |  ETA: 6:51:07 ( 3.26  m/it) Worker 42 terminated.
srun: error: omega26: task 40: Out Of Memory
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#680")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:945
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:953
 [3] unsafe_read
   @ ./io.jl:759 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:758
 [5] read!
   @ ./io.jl:760 [inlined]
 [6] deserialize_hdr_raw
   @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:172
 [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133
 [9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed ./task.jl:484
daveweisberg added the bug and high priority labels on Aug 9, 2023
@orso82
Member

orso82 commented Aug 9, 2023

The new HFSsizing actor is taking a lot more memory than it used to... it now has the largest memory footprint of all actors:

See for example https://github.com/ProjectTorreyPines/FUSE.jl/actions/runs/5742584439/job/15564988399

[image]

This likely points to some type instability (which would also make it slow)

@daveweisberg
Contributor Author

OK, but in general, why does the allocated memory increase with the number of iterations? Shouldn't we stop this ratcheting effect somehow, maybe by resetting all variables after each iteration?

@smithsp

smithsp commented Aug 9, 2023

Are you requesting enough memory in the first place?

@daveweisberg
Contributor Author

Some more info. Compare these FUSE.timing outputs from the same Julia process (1218946) at the 1st, 2nd, and 56th iterations. By the last iteration, the total allocations are 1.37 TiB.

1st iteration (includes a lot of pre-compiling that is not repeated in subsequent iterations):
[screenshot]

2nd iteration:
[screenshot]

56th iteration:
[screenshot]

@bclyons12
Member

How about the 28th iteration, if you have it? This looks like a memory increase of 20-30 GB per iteration. Some sort of memory leak not captured by the garbage collector?

@daveweisberg
Contributor Author

28th iteration:

[screenshot]

@orso82
Member

orso82 commented Aug 9, 2023

Indeed, some memory does not get freed.
However, the memory usage within each actor does not seem to be growing.

@daveweisberg
Contributor Author

Note that in this case (which is typical for my optimization runs), there are 256 tasks being run in parallel on 4 Omega worker nodes. So whatever memory leakage is indicated in the above timing.txt files, multiply that by 256/4 = 64 to get the total amount for each node.

@bclyons12
Member

I see some issues online related to garbage collection and multithreading, though they appear closed; for example, JuliaLang/julia#31923. Perhaps it would be worth trying a single thread on a single core to see what the performance is. It should only take a few iterations to see whether the memory footprint is growing.

orso82 added a commit that referenced this issue Aug 10, 2023
@orso82
Member

orso82 commented Aug 10, 2023

@daveweisberg see 75089bc

I see that you are now writing a timing.txt file for each of the optimization outputs. Could you please add a memory.txt file that contains the output of the following function:

FUSE.varinfo(FUSE, all=true, imported=true, recursive=true, sortby=:size, minsize=1024)

This will contain output similar to the following (here shown with recursive=false), which we can then use to track down the memory issue that you are seeing:

| name                                 |        size | summary                                               |
|:------------------------------------ | -----------:|:----------------------------------------------------- |
| Interpolations                       |  20.557 MiB | Module                                                |
| Plots                                |  14.332 MiB | Module                                                |
| DataFrames                           |  13.287 MiB | Module                                                |
| FUSE                                 |  11.432 MiB | Module                                                |
| IMAS                                 |   9.902 MiB | Module                                                |
| OrderedCollections                   |   9.825 MiB | Module                                                |
| ForwardDiff                          |   9.316 MiB | Module                                                |
| LibGEOS                              |   9.124 MiB | Module                                                |
| SpecialFunctions                     |   9.111 MiB | Module                                                |
| CSV                                  |   9.036 MiB | Module                                                |
| Optim                                |   8.833 MiB | Module                                                |
| TEQUILA                              |   8.803 MiB | Module                                                |
| Random                               |   8.543 MiB | Module                                                |
| FiniteElementHermite                 |   8.507 MiB | Module                                                |
| Distributed                          |   8.477 MiB | Module                                                |
| GeoInterface                         |   8.475 MiB | Module                                                |
| Dates                                |   8.463 MiB | Module                                                |
| FileIO                               |   8.427 MiB | Module                                                |
| MXHEquilibrium                       |   8.360 MiB | Module                                                |
| JSON                                 |   8.206 MiB | Module                                                |
| AbstractTrees                        |   8.109 MiB | Module                                                |
| QuadGK                               |   8.099 MiB | Module                                                |
| TimerOutputs                         |   8.073 MiB | Module                                                |
| VacuumFields                         |   8.056 MiB | Module                                                |
| JpegTurbo                            |   8.004 MiB | Module                                                |
| NLsolve                              |   7.962 MiB | Module                                                |
| QED                                  |   7.911 MiB | Module                                                |
| LoggingExtras                        |   7.899 MiB | Module                                                |
| TGLFNN                               |   7.885 MiB | Module                                                |
| Logging                              |   7.874 MiB | Module                                                |
| BoundaryPlasmaModels                 |   7.872 MiB | Module                                                |
| EPEDNN                               |   7.867 MiB | Module                                                |
| TAUENN                               |   7.867 MiB | Module                                                |
| ClusterManagers                      |   7.863 MiB | Module                                                |
| CHEASE                               |   7.839 MiB | Module                                                |
| PolygonOps                           |   7.831 MiB | Module                                                |
| NNeutronics                          |   7.827 MiB | Module                                                |
| EFIT                                 |   7.823 MiB | Module                                                |
| elements                             | 331.727 KiB | PeriodicTable.Elements                                |
| ##meta#58                            | 279.988 KiB | IdDict{Any, Any} with 175 entries                     |
| memory                               | 123.646 KiB | FUSE.Memory                                           |
| timer                                |  33.063 KiB | TimerOutputs.TimerOutput                              |
| ##load_inflation_rate_memoized_cache |  15.069 KiB | IdDict{Any, Any} with 1 entry                         |
| SnoopPrecompile                      |  14.213 KiB | Module                                                |
| Memoize                              |  12.317 KiB | Module                                                |
| ConstraintFunctionsLibrary           |   1.964 KiB | Dict{Symbol, FUSE.ConstraintFunction} with 15 entries |
| ObjectiveFunctionsLibrary            |   1.720 KiB | Dict{Symbol, FUSE.ObjectiveFunction} with 11 entries  |

NOTE: FUSE.varinfo() behaves differently from the varinfo() available in InteractiveUtils. I found that the recursive=true option of InteractiveUtils.varinfo() does not work as advertised. @sjkelly maybe a bug in the JuliaLang source?
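For reference, a minimal sketch of how the requested memory.txt could be written from within each run (run_folder is a hypothetical placeholder for the per-case output directory, and FUSE.varinfo is assumed to return a printable table):

# hypothetical sketch: dump the FUSE.varinfo table to memory.txt in the run folder
open(joinpath(run_folder, "memory.txt"), "w") do io
    show(io, MIME"text/plain"(), FUSE.varinfo(FUSE, all=true, imported=true, recursive=true, sortby=:size, minsize=1024))
end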

You can now also do the following to see how memory gets used by your Julia process:

"""
    plot_memory(memory::Memory, n_big_jumps::Int=5, ignore_first_seconds::Int=0)

Plot the memory usage over time from a `Memory` object. 

# Arguments
- `n_big_jumps`: number of significant memory jumps to highlight in the plot.
- `ignore_first_seconds`: number of initial seconds to ignore in the plot. Memory usage will be plotted relative to the memory after this cutoff. Default is `0` (no seconds are ignored).

# Returns
A plot with the following characteristics:
- Time is displayed on the x-axis as delta seconds since the first recorded time (after ignoring the specified initial seconds).
- Memory usage is displayed on the y-axis in MB.
- Memory usage is shown as a scatter plot.
- The `n_big_jumps` largest jumps in memory are highlighted in red with annotations indicating the action causing each jump.
"""
plot(FUSE.memory, 5, 0)
[image]

I have not tried, but I suspect that you might be able to gather the memory history from each process and plot them this way:

using Distributed
p = plot(FUSE.memory, 5) #Main process
memories = [@fetchfrom worker FUSE.memory for worker in workers()]
for memory in memories
    plot!(memory, 5)
end
display(p)

@daveweisberg
Contributor Author

@orso82 Sorry, but I don't have the time to help diagnose this issue. Can you please recreate it using my established optimization workflow?

@orso82
Member

orso82 commented Aug 10, 2023

OK. Let's say whoever gets to it first will do it.

@daveweisberg
Contributor Author

daveweisberg commented Aug 29, 2023

@orso82 New runs with the varinfo.txt logging are available on Omega here:

/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop24_HTS0.1_qpol2.75_A3.5_Zeff2.0/opt_runs
/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop24_HTS0.1_qpol2.75_A3.5_Zeff2.0/opt_runs2
/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop24_HTS0.1_qpol2.75_A3.5_Zeff2.0/opt_runs3

These runs each have 60 generations and completed without any memory errors.

An ongoing run with 180 generations can be found here:

/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop24_HTS0.1_qpol2.75_A3.5_Zeff2.0/opt_runs4

@daveweisberg
Contributor Author

daveweisberg commented Aug 30, 2023

@orso82 Update: the 180-generation run errored out 64% of the way to completion. I've moved the run directory to this location:

/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop24_HTS0.1_qpol2.75_A3.5_Zeff2.0/opt_runs_fail

Error messages:

Iteration  64%|█████████████████▉          |  ETA: 4:06:56 ( 3.80  m/it) Worker 246 terminated.
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#680")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:945
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:953
 [3] unsafe_read
   @ ./io.jl:759 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:758
 [5] read!
   @ ./io.jl:760 [inlined]
 [6] deserialize_hdr_raw
   @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:172
 [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133
 [9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed ./task.jl:484
srun: error: omega29: task 244: Out Of Memory
Worker 116 terminated.
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base ./stream.jl:410
  [2] (::Base.var"#wait_locked#680")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base ./stream.jl:947
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base ./stream.jl:953
  [4] unsafe_read
    @ ./io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base ./io.jl:758
  [6] read!
    @ ./io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed ./task.jl:484
Worker 19 terminated.
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base ./stream.jl:410
  [2] (::Base.var"#wait_locked#680")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base ./stream.jl:947
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base ./stream.jl:953
  [4] unsafe_read
    @ ./io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base ./io.jl:758
  [6] read!
    @ ./io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed ./task.jl:484

ProcessExitedException(19)
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:871
  [2] wait()
    @ Base ./task.jl:931
  [3] wait(c::Base.GenericCondition{ReentrantLock})
    @ Base ./condition.jl:124
  [4] take_buffered(c::Channel{Any})
    @ Base ./channels.jl:416
  [5] take!(c::Channel{Any})
    @ Base ./channels.jl:410
  [6] take!(::Distributed.RemoteValue)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:726
  [7] remotecall_fetch(f::Function, w::Distributed.Worker, args::Vector{Float64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:461
  [8] remotecall_fetch(f::Function, w::Distributed.Worker, args::Vector{Float64})
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:454
  [9] remotecall_fetch(f::Function, id::Int64, args::Vector{Float64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:492
 [10] remotecall_fetch
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:492 [inlined]
 [11] remotecall_pool(rc_f::typeof(remotecall_fetch), f::Function, pool::WorkerPool, args::Vector{Float64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:123
 [12] remotecall_pool
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:120 [inlined]
 [13] #remotecall_fetch#200
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:229 [inlined]
 [14] remotecall_fetch
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:229 [inlined]
 [15] #208#209
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:274 [inlined]
 [16] #208
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:274 [inlined]
 [17] (::Base.var"#929#934"{Distributed.var"#208#210"{Distributed.var"#208#209#211"{WorkerPool, FUSE.var"#741#743"{FUSE.ParametersInits{Float64}, FUSE.ParametersActors{Float64}, typeof(my_workflow), Vector{FUSE.ObjectiveFunction}, Vector{FUSE.ConstraintFunction}, String, Bool, ProgressMeter.Progress}}}})(r::Base.RefValue{Any}, args::Tuple{Vector{Float64}})
    @ Base ./asyncmap.jl:100
 [18] macro expansion
    @ ./asyncmap.jl:234 [inlined]
 [19] (::Base.var"#945#946"{Base.var"#929#934"{Distributed.var"#208#210"{Distributed.var"#208#209#211"{WorkerPool, FUSE.var"#741#743"{FUSE.ParametersInits{Float64}, FUSE.ParametersActors{Float64}, typeof(my_workflow), Vector{FUSE.ObjectiveFunction}, Vector{FUSE.ConstraintFunction}, String, Bool, ProgressMeter.Progress}}}}, Channel{Any}, Nothing})()
    @ Base ./task.jl:484

Stacktrace:
  [1] (::Base.var"#939#941")(x::Task)
    @ Base ./asyncmap.jl:177
  [2] foreach(f::Base.var"#939#941", itr::Vector{Any})
    @ Base ./abstractarray.jl:2774
  [3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::Vector{Vector{Float64}})
    @ Base ./asyncmap.jl:177
  [4] wrap_n_exec_twice
    @ ./asyncmap.jl:153 [inlined]
  [5] #async_usemap#924
    @ ./asyncmap.jl:103 [inlined]
  [6] #asyncmap#923
    @ ./asyncmap.jl:81 [inlined]
  [7] pmap(f::Function, p::WorkerPool, c::Vector{Vector{Float64}}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::Vector{Any}, retry_check::Nothing)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:126
  [8] pmap
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:99 [inlined]
  [9] #pmap#233
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:156 [inlined]
 [10] pmap
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:156 [inlined]
 [11] optimization_engine(ini::FUSE.ParametersInits{Float64}, act::FUSE.ParametersActors{Float64}, actor_or_workflow::typeof(my_workflow), X::Matrix{Float64}, objectives_functions::Vector{FUSE.ObjectiveFunction}, constraints_functions::Vector{FUSE.ConstraintFunction}, save_folder::String, save_dd::Bool, p::ProgressMeter.Progress)
    @ FUSE /fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/src/optimization.jl:266
 [12] (::FUSE.var"#748#751"{String, Bool, FUSE.ParametersInits{Float64}, FUSE.ParametersActors{Float64}, typeof(my_workflow), Vector{FUSE.ObjectiveFunction}, Vector{FUSE.ConstraintFunction}, ProgressMeter.Progress})(X::Matrix{Float64})
    @ FUSE /fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/src/workflows/optimization_workflow.jl:101
 [13] create_solutions(X::Matrix{Float64}, problem::Metaheuristics.Problem{SearchSpaces.BoxConstrainedSpace{Float64}}; ε::Float64)
    @ Metaheuristics ~/.julia/packages/Metaheuristics/oeGM5/src/solutions/individual.jl:327
 [14] create_solutions
    @ ~/.julia/packages/Metaheuristics/oeGM5/src/solutions/individual.jl:323 [inlined]
 [15] update_state!(::Metaheuristics.State{Metaheuristics.xFgh_solution{Vector{Float64}}}, ::Metaheuristics.SPEA2, ::Metaheuristics.Problem{SearchSpaces.BoxConstrainedSpace{Float64}}, ::Metaheuristics.Information, ::Metaheuristics.Options; kargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Metaheuristics ~/.julia/packages/Metaheuristics/oeGM5/src/algorithms/SPEA2/SPEA2.jl:95
 [16] update_state!
    @ ~/.julia/packages/Metaheuristics/oeGM5/src/algorithms/SPEA2/SPEA2.jl:79 [inlined]
 [17] _during_optimization!(status::Metaheuristics.State{Metaheuristics.xFgh_solution{Vector{Float64}}}, parameters::Metaheuristics.SPEA2, problem::Metaheuristics.Problem{SearchSpaces.BoxConstrainedSpace{Float64}}, information::Metaheuristics.Information, options::Metaheuristics.Options, convergence::Vector{Metaheuristics.State{Metaheuristics.xFgh_solution{Vector{Float64}}}}, logger::Metaheuristics.var"#154#156", termination::Metaheuristics.Termination)
    @ Metaheuristics ~/.julia/packages/Metaheuristics/oeGM5/src/optimize/during.jl:5
 [18] optimize(f::Function, search_space::LinearAlgebra.Adjoint{Float64, Matrix{Float64}}, method::Metaheuristics.Algorithm{Metaheuristics.SPEA2}; logger::Function)
    @ Metaheuristics ~/.julia/packages/Metaheuristics/oeGM5/src/optimize/optimize.jl:52
 [19] optimize(f::Function, search_space::LinearAlgebra.Adjoint{Float64, Matrix{Float64}}, method::Metaheuristics.Algorithm{Metaheuristics.SPEA2})
    @ Metaheuristics ~/.julia/packages/Metaheuristics/oeGM5/src/optimize/optimize.jl:42
 [20] macro expansion
    @ ./timing.jl:262 [inlined]
 [21] workflow_multiobjective_optimization(ini::FUSE.ParametersInits{Float64}, act::FUSE.ParametersActors{Float64}, actor_or_workflow::typeof(my_workflow), objectives_functions::Vector{FUSE.ObjectiveFunction}, constraints_functions::Vector{FUSE.ConstraintFunction}; exploitation_vs_exploration::Float64, N::Int64, iterations::Int64, continue_state::Metaheuristics.State{Metaheuristics.xFgh_solution{Vector{Float64}}}, save_folder::String, save_dd::Bool)
    @ FUSE /fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/src/workflows/optimization_workflow.jl:101
 [22] top-level scope
    @ In[13]:22

orso82 added a commit that referenced this issue Sep 6, 2023
@orso82
Member

orso82 commented Sep 20, 2023

@daveweisberg pointed me to this memtrace:
/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/exploratory_new/opt_betaN_cost__Solovev_Kr_flattop48_HTS0.1_A3.5_B6.0/opt_runs2/99__2023-09-19T10:02:53.638__1294740/memtrace.txt

I did some analysis on it and I think the memory leak is occurring in Solovev

[image]

@lstagner could you please take a look and see if there's anything obvious? I have been burned by Memoize myself in the past... perhaps this is the case again?

@lstagner
Contributor

lstagner commented Sep 20, 2023

I made sure all uses of Memoize limited the number of instances it could save at once. If you think it's caused by Memoize, you can use MXHEquilibrium.set_cache_size!(0) to set the cache size to zero. The default cache size is five.

@orso82
Member

orso82 commented Sep 20, 2023

OK, thank you for the good insight @lstagner !

@daveweisberg the next time you run, can you please try to set FUSE.MXHEquilibrium.set_cache_size!(0) and point me again to the folder where you did the run?
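In a distributed run the cache size needs to be set on every worker process, since each worker has its own copy of the module state. A minimal sketch, assuming the workers are already added and FUSE is loaded on them:

using Distributed
@everywhere FUSE.MXHEquilibrium.set_cache_size!(0)   # disable the Memoize cache on all processes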

@lstagner
Contributor

lstagner commented Sep 20, 2023

Another thing to try is to periodically call GC.gc() and ccall(:malloc_trim, Cvoid, (Cint,), 0) to run the garbage collector and tell Linux to free memory that the GC misses.

@daveweisberg
Contributor Author

@orso82 I've finished a new optimization run with FUSE.MXHEquilibrium.set_cache_size!(0).

It can be found here: /fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/study_1/opt_betaN_cost__Solovev_flattop72/opt_runs1/.

@orso82
Member

orso82 commented Sep 23, 2023

Similar growth, but with an offset, which makes sense, since before we had a finite cache size of 5.
So, while this reduces the memory usage, it does not solve the problem.

[image]

Next, as @lstagner suggested, could you please try to add

GC.gc()
ccall(:malloc_trim, Cvoid, (Cint,), 0)

at the bottom of your workflow function that you feed to the FUSE.workflow_multiobjective_optimization() function?
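For concreteness, a minimal sketch of what that could look like; the function name, signature, and actor calls are illustrative placeholders rather than the actual workflow:

function my_workflow(dd, ini, act)
    FUSE.init(dd, ini, act)
    FUSE.ActorWholeFacility(dd, act)
    # run a full garbage collection, then ask glibc to return freed pages to the OS
    GC.gc()
    ccall(:malloc_trim, Cvoid, (Cint,), 0)
    return dd
end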

orso82 added a commit that referenced this issue Sep 25, 2023
@daveweisberg
Contributor Author

daveweisberg commented Sep 26, 2023

@orso82 New run directories with both of Luke's suggestions implemented can be found here:

/fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/study_2/opt_betaN_cost__flattop72_A3.5_H1.0/opt_runs1/

@orso82
Member

orso82 commented Sep 26, 2023

Better, but it's still increasing!

[image]

@orso82
Member

orso82 commented Sep 26, 2023

@daveweisberg can you please try to set up a run where you do not execute ActorWholeFacility and just do an init(dd, ini, act)? This will run ActorSolovev and ActorCXbuild (and a few other trivial actors), but none of the heavier actors. This would be informative to rule out places where the issue might be.
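A sketch of such a reduced workflow (hypothetical; same caveats as the sketch above):

function my_reduced_workflow(dd, ini, act)
    FUSE.init(dd, ini, act)   # runs ActorSolovev, ActorCXbuild, and a few other trivial actors, but none of the heavy ones
    GC.gc()
    ccall(:malloc_trim, Cvoid, (Cint,), 0)
    return dd
end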

@TimSlendebroek
Contributor

I have another memtrace showing similar behavior from a TEQUILA & FluxMatcher run on SAGA.

Here is the memtrace of the last case that was run:

memtrace.txt

@TimSlendebroek
Contributor

TimSlendebroek commented Oct 2, 2023

Plot of the memtrace:
[image]

@TimSlendebroek
Contributor

> @daveweisberg pointed me to this memtrace: /fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/exploratory_new/opt_betaN_cost__Solovev_Kr_flattop48_HTS0.1_A3.5_B6.0/opt_runs2/99__2023-09-19T10:02:53.638__1294740/memtrace.txt
>
> I did some analysis on it and I think the memory leak is occurring in Solovev
>
> [image] @lstagner could you please take a look and see if there's anything obvious? I have been burned by Memoize myself in the past... perhaps this is the case again?

I am not sure it is only Solovev, as I see the same problem when running with TEQUILA (although of course it could be both).

@TimSlendebroek
Contributor

[image]

My conclusion is that the memory leakage is related to the garbage collector being unable to do its job inside a pmap while also being inside a try-catch structure... see 5d8df73 for the fix.

@orso82
Member

orso82 commented Oct 11, 2023

help?> GC.gc
  GC.gc([full=true])

  Perform garbage collection. The argument full determines the kind of collection: A full collection (default) sweeps all objects, which
  makes the next GC scan much slower, while an incremental collection may only sweep so-called young objects.

  │ Warning
  │
  │  Excessive use will likely lead to poor performance.

@TimSlendebroek can you please try calling GC.gc(false) and see if this still works?

@orso82
Member

orso82 commented Oct 24, 2023

The issue has gotten better but is not fully resolved. It would be great to have @mbolognaJH and @sjkelly take a look.

@sjkelly

sjkelly commented Oct 24, 2023

This flag can be used to force Julia to collect once memory usage grows beyond a given threshold:

 --heap-size-hint=<size>    Forces garbage collection if memory usage is higher than that value.

@TimSlendebroek
Contributor

I'll add the minimal working example

  • local
  • saga (cluster)

But also add the optimization case for

  • local
  • saga (cluster)

Memory goes up directly with the number of dd's processed

@orso82
Member

orso82 commented Oct 24, 2023

heap-size-hint can be very handy. Set it to 1/2 or 1/4 of the system's memory. NOTE: since we have many Julia processes on each node, perhaps we need to set it to 1/100th of the memory.

How do we pass heap-size-hint to the worker nodes? @mbolognaJH help! Something like:

Julia args for Distributed.jl:
heap-size-hint = Sys.total_physical_memory() / Sys.CPU_THREADS

v1.10.0 was supposed to improve the GC, but the changes did not make the cut. v1.11.0 might address this.

We may be able to find workarounds for v1.9.0 and v1.10.0.

@TimSlendebroek
Contributor

TimSlendebroek commented Oct 24, 2023

I made an example notebook that runs locally and shows the same problems (without wrapping and GC) @mbolognaJH
See 76cfed0, examples/debugging/memory.ipynb

(Don't run the last two cells, as those run the optimization; the notebook already shows memory issues with a simple load and save.)

[image]
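For readers without access to the notebook, here is a hypothetical sketch of that kind of load/save loop (file names and the iteration count are illustrative); it repeatedly loads and saves a dd and prints the process memory high-water mark:

import IMAS
for k in 1:50
    dd = IMAS.json2imas("dd.json")      # load a dd from JSON
    IMAS.imas2json(dd, "dd_out.json")   # write it back out
    println("iteration $k: maxrss = $(round(Sys.maxrss() / 2^20, digits=1)) MiB")   # resident-memory high-water mark
end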

@mbolognaJH

@TimSlendebroek is there a specific dd.json file I should be using? Maybe I'm missing part of my setup for this example.

@TimSlendebroek
Contributor

TimSlendebroek commented Oct 25, 2023

Aah, sorry @mbolognaJH!

I can send you one, or you can do the following to create your own JSON:

using FUSE
import IMAS
dd, ini, act = FUSE.init(:KDEMO)
IMAS.imas2json(dd, "<your path>")  # imas2json writes dd out as a JSON file (json2imas reads one back into a dd)

@TimSlendebroek
Contributor

@mbolognaJH can we add each other on Zoom/Discord so I can respond a bit quicker if you have questions, need an explanation, or want me to send files?

@mbolognaJH

Yes sure thing, we can continue the conversation on Zoom (I haven't tried it before, but we'll see how it works): my email is michael.bologna@juliahub.com

@mbolognaJH

I have been able to reproduce the MWE. By limiting the heap size available to the Julia GC with --heap-size-hint we can avoid the memory growth behavior as shown below:
[image]

The root cause of this is the multi-threading and multi-processing design patterns used by FUSE. Currently, FUSE uses multi-processing exclusively via Distributed.jl, including within each node. This runs multiple Julia processes on each node, which requires substantial memory overhead, since a copy of the base system image and user-space code is loaded for each process. It also introduces the observed memory growth behavior, since each Julia process assumes that it is the only one on the node.

By setting --heap-size-hint to Sys.total_physical_memory() / Sys.CPU_THREADS, we get the expected behavior and the GC is able to collect at a regular interval. This also obviates the need for manual calls to GC.gc(). Within the addprocs command, the exeflags keyword argument can propagate a heap size hint to each child Julia process, e.g. exeflags="--heap-size-hint=2G".
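A sketch of how such a per-process hint could be computed and passed through addprocs (the worker count and the even split over Sys.CPU_THREADS are assumptions):

using Distributed
# give each worker roughly an equal share of the node's physical memory
hint_mb = Sys.total_physical_memory() ÷ Sys.CPU_THREADS ÷ 2^20
addprocs(4; exeflags="--heap-size-hint=$(hint_mb)M")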

The recommended long-term solution, which would also reduce memory overhead and improve performance, would be to use multi-threading within each node, as documented at https://docs.julialang.org/en/v1.9/manual/multi-threading/. This requires the work to be partitioned so that the master node is aware of the thread counts on each child node as well as the total memory on each child node.

@bclyons12
Member

Thanks @mbolognaJH, this makes sense to me. I've been looking into how to implement your long-term solution, but I can't find a good example. How can we launch one process per node that has access to all the cores/threads on that node?

@mbolognaJH

You can use the exeflags again to specify the number of threads you want to spawn:

For example:

# You won't need the heap-size-hint if you're doing the long-term (multi-threading) solution,
# but here's how to pass multiple exeflags in one `addprocs()` command.
using Distributed
num_procs = 1
num_threads = 8  # you can use "auto" here as well
addprocs(num_procs, exeflags=["--threads=$num_threads", "--heap-size-hint=2G"])
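A quick sanity check that each worker picked up the requested thread count:

using Distributed
for w in workers()
    println("worker $w has $(remotecall_fetch(Threads.nthreads, w)) threads")
end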

@orso82
Member

orso82 commented Nov 2, 2023

Thank you @mbolognaJH ! This solves a major headache!!!

@bclyons12 @mbolognaJH I consider this solved now. I'll add the heap-size-hint as you suggest, and have @TimSlendebroek test it.

Please don't spend further time on this now. I have more pressing questions for both of you, @bclyons12 and @mbolognaJH.
I'll open up some issues and tag you in those.

Thanks again!!!!

@bclyons12
Member

I still think there are very good reasons to figure out how to do the hybrid multiprocessing-multithreading implementation. I understand that it's not a priority for now, but we should open a separate issue for a future upgrade if we're going to close this one.

@orso82
Member

orso82 commented Nov 6, 2023

@TimSlendebroek reported that the --heap-size-hint fix seems to have fixed the issue!

[image]

@daveweisberg please note that, as part of d7c57a5, cpus_per_task=1 is now the default, which means that on OMEGA you'll be able to push through double the number of workers for the same number of nodes. This should help significantly!

Thanks again @mbolognaJH !

orso82 closed this as completed on Nov 6, 2023
@TimSlendebroek
Contributor

[image] On local with a 2G heap-size-hint.

@orso82
Member

orso82 commented Nov 7, 2023

Thank you @TimSlendebroek for the thorough testing!!

@daveweisberg
Contributor Author

daveweisberg commented Nov 18, 2023

Thanks everyone for the hard work in solving this issue! I've started testing this with Julia 1.10 on the Omega cluster, and so far it looks good!

A quick question for @TimSlendebroek or @orso82: Should I still include the following in the function I'm running on the distributed workers?

GC.gc()
ccall(:malloc_trim, Cvoid, (Cint,), 0)

Or is this no longer necessary?
