
Workers out of memory during optimization runs #387

Closed
daveweisberg opened this issue Aug 9, 2023 · 46 comments
Labels: bug, high priority, JuliaHub

@daveweisberg
Contributor

daveweisberg commented Aug 9, 2023

I have been seeing srun error: Out Of Memory on FUSE optimization runs that last longer than ~60 iterations. I don't understand why a worker node would be running out of memory, but we need to identify which process is causing this. It is preventing me from efficiently completing optimization runs that require 100s of iterations.

The following error corresponds to the optimization run located here:

/fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop48_HTS0.1_qpol2.75_A3.5_Zeff2.0/

where the last saved run folder is from iteration 56.

Error:

Iteration  30%|████████▍                   |  ETA: 6:51:07 ( 3.26  m/it) Worker 42 terminated.
srun: error: omega26: task 40: Out Of Memory
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#680")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:945
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:953
 [3] unsafe_read
   @ ./io.jl:759 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:758
 [5] read!
   @ ./io.jl:760 [inlined]
 [6] deserialize_hdr_raw
   @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:172
 [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133
 [9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed ./task.jl:484
daveweisberg added the bug and high priority labels on Aug 9, 2023
@orso82
Member

orso82 commented Aug 9, 2023

The new HFSsizing actor is taking a lot more memory than it used to... it now has the largest memory footprint of all actors:

See for example https://github.com/ProjectTorreyPines/FUSE.jl/actions/runs/5742584439/job/15564988399

[image]

This likely points to some type instability (which would also make it slow)

@daveweisberg
Contributor Author

OK, but in general, why does the allocated memory increase with the number of iterations? Shouldn't we stop this ratcheting effect somehow, maybe by resetting all variables after each iteration?

@smithsp

smithsp commented Aug 9, 2023

Are you requesting enough memory in the first place?

@daveweisberg
Contributor Author

Some more info. Compare these FUSE.timing outputs from the same Julia process (1218946) at the 1st, 2nd, and 56th iterations. By the last iteration, the total allocations are 1.37 TiB.

1st iteration (includes a lot of pre-compiling that is not repeated in subsequent iterations):
[screenshot]

2nd iteration:
[screenshot]

56th iteration:
[screenshot]

@bclyons12
Member

How about the 28th iteration, if you have it? This looks like a memory increase of 20-30 GB per iteration. Some sort of memory leak not captured by the garbage collector?

@daveweisberg
Contributor Author

28th iteration:

[screenshot]

@orso82
Member

orso82 commented Aug 9, 2023

Indeed, some memory does not get freed.
However, the memory usage within each actor does not seem to be growing.

@daveweisberg
Contributor Author

Note that in this case (which is typical for my optimization runs), there are 256 tasks being run in parallel on 4 Omega worker nodes. So whatever memory leakage is indicated in the above timing.txt files, multiply that by 256/4 = 64 to get the total amount for each node.

@bclyons12
Member

I see some issues online related to garbage collection and multithreading, though they appear closed; for example, JuliaLang/julia#31923. Perhaps it would be worth trying a single thread on a single core to see what the performance is. It should only take a few iterations to see whether the memory footprint is growing.

orso82 added a commit that referenced this issue Aug 10, 2023
@orso82
Member

orso82 commented Aug 10, 2023

@daveweisberg see 75089bc

I see that you are now writing a timing.txt file for each of the optimization outputs. Could you please add a memory.txt file that contains the output of the following function:

FUSE.varinfo(FUSE, all=true, imported=true, recursive=true, sortby=:size, minsize=1024)

This will contain output similar to the following (here shown with recursive=false), which we can then use to track down the memory issue that you are seeing:

| name                                 |        size | summary                                               |
|:------------------------------------ | -----------:|:----------------------------------------------------- |
| Interpolations                       |  20.557 MiB | Module                                                |
| Plots                                |  14.332 MiB | Module                                                |
| DataFrames                           |  13.287 MiB | Module                                                |
| FUSE                                 |  11.432 MiB | Module                                                |
| IMAS                                 |   9.902 MiB | Module                                                |
| OrderedCollections                   |   9.825 MiB | Module                                                |
| ForwardDiff                          |   9.316 MiB | Module                                                |
| LibGEOS                              |   9.124 MiB | Module                                                |
| SpecialFunctions                     |   9.111 MiB | Module                                                |
| CSV                                  |   9.036 MiB | Module                                                |
| Optim                                |   8.833 MiB | Module                                                |
| TEQUILA                              |   8.803 MiB | Module                                                |
| Random                               |   8.543 MiB | Module                                                |
| FiniteElementHermite                 |   8.507 MiB | Module                                                |
| Distributed                          |   8.477 MiB | Module                                                |
| GeoInterface                         |   8.475 MiB | Module                                                |
| Dates                                |   8.463 MiB | Module                                                |
| FileIO                               |   8.427 MiB | Module                                                |
| MXHEquilibrium                       |   8.360 MiB | Module                                                |
| JSON                                 |   8.206 MiB | Module                                                |
| AbstractTrees                        |   8.109 MiB | Module                                                |
| QuadGK                               |   8.099 MiB | Module                                                |
| TimerOutputs                         |   8.073 MiB | Module                                                |
| VacuumFields                         |   8.056 MiB | Module                                                |
| JpegTurbo                            |   8.004 MiB | Module                                                |
| NLsolve                              |   7.962 MiB | Module                                                |
| QED                                  |   7.911 MiB | Module                                                |
| LoggingExtras                        |   7.899 MiB | Module                                                |
| TGLFNN                               |   7.885 MiB | Module                                                |
| Logging                              |   7.874 MiB | Module                                                |
| BoundaryPlasmaModels                 |   7.872 MiB | Module                                                |
| EPEDNN                               |   7.867 MiB | Module                                                |
| TAUENN                               |   7.867 MiB | Module                                                |
| ClusterManagers                      |   7.863 MiB | Module                                                |
| CHEASE                               |   7.839 MiB | Module                                                |
| PolygonOps                           |   7.831 MiB | Module                                                |
| NNeutronics                          |   7.827 MiB | Module                                                |
| EFIT                                 |   7.823 MiB | Module                                                |
| elements                             | 331.727 KiB | PeriodicTable.Elements                                |
| ##meta#58                            | 279.988 KiB | IdDict{Any, Any} with 175 entries                     |
| memory                               | 123.646 KiB | FUSE.Memory                                           |
| timer                                |  33.063 KiB | TimerOutputs.TimerOutput                              |
| ##load_inflation_rate_memoized_cache |  15.069 KiB | IdDict{Any, Any} with 1 entry                         |
| SnoopPrecompile                      |  14.213 KiB | Module                                                |
| Memoize                              |  12.317 KiB | Module                                                |
| ConstraintFunctionsLibrary           |   1.964 KiB | Dict{Symbol, FUSE.ConstraintFunction} with 15 entries |
| ObjectiveFunctionsLibrary            |   1.720 KiB | Dict{Symbol, FUSE.ObjectiveFunction} with 11 entries  |

NOTE: FUSE.varinfo() behaves differently from the varinfo() available in InteractiveUtils. I found that the recursive=true option of InteractiveUtils.varinfo() does not work as advertised. @sjkelly maybe a bug in the JuliaLang source?
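For reference, a minimal sketch of how the requested memory.txt could be written from within each run (run_folder is a hypothetical placeholder for the per-case output directory, and FUSE.varinfo is assumed to return a printable table):

# hypothetical sketch: dump the FUSE.varinfo table to memory.txt in the run folder
open(joinpath(run_folder, "memory.txt"), "w") do io
    show(io, MIME"text/plain"(), FUSE.varinfo(FUSE, all=true, imported=true, recursive=true, sortby=:size, minsize=1024))
end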

You can now also do the following to see how memory gets used by your Julia process:

"""
    plot_memory(memory::Memory, n_big_jumps::Int=5, ignore_first_seconds::Int=0)

Plot the memory usage over time from a `Memory` object. 

# Arguments
- `n_big_jumps`: number of significant memory jumps to highlight in the plot.
- `ignore_first_seconds`: number of initial seconds to ignore in the plot. Memory usage will be plotted relative to the memory after this cutoff. Default is `0` (no seconds are ignored).

# Returns
A plot with the following characteristics:
- Time is displayed on the x-axis as delta seconds since the first recorded time (after ignoring the specified initial seconds).
- Memory usage is displayed on the y-axis in MB.
- Memory usage is shown as a scatter plot.
- The `n_big_jumps` largest jumps in memory are highlighted in red with annotations indicating the action causing each jump.
"""
plot(FUSE.memory, 5, 0)
[image]

I have not tried, but I suspect that you might be able to gather the memory history from each process and plot them this way:

using Distributed
p = plot(FUSE.memory, 5) #Main process
memories = [@fetchfrom worker FUSE.memory for worker in workers()]
for memory in memories
    plot!(memory, 5)
end
display(p)

@daveweisberg
Contributor Author

@orso82 Sorry, but I don't have the time to help diagnose this issue. Can you please recreate it using my established optimization workflow?

@orso82
Member

orso82 commented Aug 10, 2023

OK. Let's say whoever gets to it first will do it.

@daveweisberg
Contributor Author

daveweisberg commented Aug 29, 2023

@orso82 New runs with the varinfo.txt logging are available on Omega here:

/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop24_HTS0.1_qpol2.75_A3.5_Zeff2.0/opt_runs
/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop24_HTS0.1_qpol2.75_A3.5_Zeff2.0/opt_runs2
/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop24_HTS0.1_qpol2.75_A3.5_Zeff2.0/opt_runs3

These runs each have 60 generations and completed without any memory errors.

An ongoing run with 180 generations can be found here:

/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop24_HTS0.1_qpol2.75_A3.5_Zeff2.0/opt_runs4

@daveweisberg
Contributor Author

daveweisberg commented Aug 30, 2023

@orso82 Update: the 180-generation run errored out 64% of the way to completion. I've moved the run directory to this location:

/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/nominal/opt_betaN_cost__Solovev_Kr_flattop24_HTS0.1_qpol2.75_A3.5_Zeff2.0/opt_runs_fail

Error messages:

Iteration  64%|█████████████████▉          |  ETA: 4:06:56 ( 3.80  m/it) Worker 246 terminated.
Unhandled Task ERROR: EOFError: read end of file
Stacktrace:
 [1] (::Base.var"#wait_locked#680")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
   @ Base ./stream.jl:945
 [2] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
   @ Base ./stream.jl:953
 [3] unsafe_read
   @ ./io.jl:759 [inlined]
 [4] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
   @ Base ./io.jl:758
 [5] read!
   @ ./io.jl:760 [inlined]
 [6] deserialize_hdr_raw
   @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:167 [inlined]
 [7] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:172
 [8] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
   @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133
 [9] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
   @ Distributed ./task.jl:484
srun: error: omega29: task 244: Out Of Memory
Worker 116 terminated.
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base ./stream.jl:410
  [2] (::Base.var"#wait_locked#680")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base ./stream.jl:947
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base ./stream.jl:953
  [4] unsafe_read
    @ ./io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base ./io.jl:758
  [6] read!
    @ ./io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed ./task.jl:484
Worker 19 terminated.
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base ./stream.jl:410
  [2] (::Base.var"#wait_locked#680")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base ./stream.jl:947
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base ./stream.jl:953
  [4] unsafe_read
    @ ./io.jl:759 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base ./io.jl:758
  [6] read!
    @ ./io.jl:760 [inlined]
  [7] deserialize_hdr_raw
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed ./task.jl:484

ProcessExitedException(19)
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:871
  [2] wait()
    @ Base ./task.jl:931
  [3] wait(c::Base.GenericCondition{ReentrantLock})
    @ Base ./condition.jl:124
  [4] take_buffered(c::Channel{Any})
    @ Base ./channels.jl:416
  [5] take!(c::Channel{Any})
    @ Base ./channels.jl:410
  [6] take!(::Distributed.RemoteValue)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:726
  [7] remotecall_fetch(f::Function, w::Distributed.Worker, args::Vector{Float64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:461
  [8] remotecall_fetch(f::Function, w::Distributed.Worker, args::Vector{Float64})
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:454
  [9] remotecall_fetch(f::Function, id::Int64, args::Vector{Float64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:492
 [10] remotecall_fetch
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/remotecall.jl:492 [inlined]
 [11] remotecall_pool(rc_f::typeof(remotecall_fetch), f::Function, pool::WorkerPool, args::Vector{Float64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:123
 [12] remotecall_pool
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:120 [inlined]
 [13] #remotecall_fetch#200
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:229 [inlined]
 [14] remotecall_fetch
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:229 [inlined]
 [15] #208#209
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:274 [inlined]
 [16] #208
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/workerpool.jl:274 [inlined]
 [17] (::Base.var"#929#934"{Distributed.var"#208#210"{Distributed.var"#208#209#211"{WorkerPool, FUSE.var"#741#743"{FUSE.ParametersInits{Float64}, FUSE.ParametersActors{Float64}, typeof(my_workflow), Vector{FUSE.ObjectiveFunction}, Vector{FUSE.ConstraintFunction}, String, Bool, ProgressMeter.Progress}}}})(r::Base.RefValue{Any}, args::Tuple{Vector{Float64}})
    @ Base ./asyncmap.jl:100
 [18] macro expansion
    @ ./asyncmap.jl:234 [inlined]
 [19] (::Base.var"#945#946"{Base.var"#929#934"{Distributed.var"#208#210"{Distributed.var"#208#209#211"{WorkerPool, FUSE.var"#741#743"{FUSE.ParametersInits{Float64}, FUSE.ParametersActors{Float64}, typeof(my_workflow), Vector{FUSE.ObjectiveFunction}, Vector{FUSE.ConstraintFunction}, String, Bool, ProgressMeter.Progress}}}}, Channel{Any}, Nothing})()
    @ Base ./task.jl:484

Stacktrace:
  [1] (::Base.var"#939#941")(x::Task)
    @ Base ./asyncmap.jl:177
  [2] foreach(f::Base.var"#939#941", itr::Vector{Any})
    @ Base ./abstractarray.jl:2774
  [3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::Vector{Vector{Float64}})
    @ Base ./asyncmap.jl:177
  [4] wrap_n_exec_twice
    @ ./asyncmap.jl:153 [inlined]
  [5] #async_usemap#924
    @ ./asyncmap.jl:103 [inlined]
  [6] #asyncmap#923
    @ ./asyncmap.jl:81 [inlined]
  [7] pmap(f::Function, p::WorkerPool, c::Vector{Vector{Float64}}; distributed::Bool, batch_size::Int64, on_error::Nothing, retry_delays::Vector{Any}, retry_check::Nothing)
    @ Distributed /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:126
  [8] pmap
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:99 [inlined]
  [9] #pmap#233
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:156 [inlined]
 [10] pmap
    @ /fusion/ga/projects/ird/ptp/weisbergd/julia/juliaup/julia-1.8.5+0.x64.linux.gnu/share/julia/stdlib/v1.8/Distributed/src/pmap.jl:156 [inlined]
 [11] optimization_engine(ini::FUSE.ParametersInits{Float64}, act::FUSE.ParametersActors{Float64}, actor_or_workflow::typeof(my_workflow), X::Matrix{Float64}, objectives_functions::Vector{FUSE.ObjectiveFunction}, constraints_functions::Vector{FUSE.ConstraintFunction}, save_folder::String, save_dd::Bool, p::ProgressMeter.Progress)
    @ FUSE /fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/src/optimization.jl:266
 [12] (::FUSE.var"#748#751"{String, Bool, FUSE.ParametersInits{Float64}, FUSE.ParametersActors{Float64}, typeof(my_workflow), Vector{FUSE.ObjectiveFunction}, Vector{FUSE.ConstraintFunction}, ProgressMeter.Progress})(X::Matrix{Float64})
    @ FUSE /fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/src/workflows/optimization_workflow.jl:101
 [13] create_solutions(X::Matrix{Float64}, problem::Metaheuristics.Problem{SearchSpaces.BoxConstrainedSpace{Float64}}; ε::Float64)
    @ Metaheuristics ~/.julia/packages/Metaheuristics/oeGM5/src/solutions/individual.jl:327
 [14] create_solutions
    @ ~/.julia/packages/Metaheuristics/oeGM5/src/solutions/individual.jl:323 [inlined]
 [15] update_state!(::Metaheuristics.State{Metaheuristics.xFgh_solution{Vector{Float64}}}, ::Metaheuristics.SPEA2, ::Metaheuristics.Problem{SearchSpaces.BoxConstrainedSpace{Float64}}, ::Metaheuristics.Information, ::Metaheuristics.Options; kargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Metaheuristics ~/.julia/packages/Metaheuristics/oeGM5/src/algorithms/SPEA2/SPEA2.jl:95
 [16] update_state!
    @ ~/.julia/packages/Metaheuristics/oeGM5/src/algorithms/SPEA2/SPEA2.jl:79 [inlined]
 [17] _during_optimization!(status::Metaheuristics.State{Metaheuristics.xFgh_solution{Vector{Float64}}}, parameters::Metaheuristics.SPEA2, problem::Metaheuristics.Problem{SearchSpaces.BoxConstrainedSpace{Float64}}, information::Metaheuristics.Information, options::Metaheuristics.Options, convergence::Vector{Metaheuristics.State{Metaheuristics.xFgh_solution{Vector{Float64}}}}, logger::Metaheuristics.var"#154#156", termination::Metaheuristics.Termination)
    @ Metaheuristics ~/.julia/packages/Metaheuristics/oeGM5/src/optimize/during.jl:5
 [18] optimize(f::Function, search_space::LinearAlgebra.Adjoint{Float64, Matrix{Float64}}, method::Metaheuristics.Algorithm{Metaheuristics.SPEA2}; logger::Function)
    @ Metaheuristics ~/.julia/packages/Metaheuristics/oeGM5/src/optimize/optimize.jl:52
 [19] optimize(f::Function, search_space::LinearAlgebra.Adjoint{Float64, Matrix{Float64}}, method::Metaheuristics.Algorithm{Metaheuristics.SPEA2})
    @ Metaheuristics ~/.julia/packages/Metaheuristics/oeGM5/src/optimize/optimize.jl:42
 [20] macro expansion
    @ ./timing.jl:262 [inlined]
 [21] workflow_multiobjective_optimization(ini::FUSE.ParametersInits{Float64}, act::FUSE.ParametersActors{Float64}, actor_or_workflow::typeof(my_workflow), objectives_functions::Vector{FUSE.ObjectiveFunction}, constraints_functions::Vector{FUSE.ConstraintFunction}; exploitation_vs_exploration::Float64, N::Int64, iterations::Int64, continue_state::Metaheuristics.State{Metaheuristics.xFgh_solution{Vector{Float64}}}, save_folder::String, save_dd::Bool)
    @ FUSE /fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/src/workflows/optimization_workflow.jl:101
 [22] top-level scope
    @ In[13]:22

orso82 added a commit that referenced this issue Sep 6, 2023
@orso82
Member

orso82 commented Sep 20, 2023

@daveweisberg pointed me to this memtrace:
/fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/exploratory_new/opt_betaN_cost__Solovev_Kr_flattop48_HTS0.1_A3.5_B6.0/opt_runs2/99__2023-09-19T10:02:53.638__1294740/memtrace.txt

I did some analysis on it and I think the memory leak is occurring in Solovev

[image]

@lstagner could you please take a look and see if there's anything obvious? I have been burned by Memoize myself in the past... perhaps this is the case again?

@lstagner
Contributor

lstagner commented Sep 20, 2023

I made sure all uses of Memoize limited the number of instances it could save at once. If you think it's caused by Memoize, you can use MXHEquilibrium.set_cache_size!(0) to set the cache size to zero. The default cache size is five.

@orso82
Member

orso82 commented Sep 20, 2023

OK, thank you for the good insight @lstagner !

@daveweisberg the next time you run, can you please try to set FUSE.MXHEquilibrium.set_cache_size!(0) and point me again to the folder where you did the run?
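In a distributed run the cache size needs to be set on every worker process, since each worker has its own copy of the module state. A minimal sketch, assuming the workers are already added and FUSE is loaded on them:

using Distributed
@everywhere FUSE.MXHEquilibrium.set_cache_size!(0)   # disable the Memoize cache on all processes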

@lstagner
Contributor

lstagner commented Sep 20, 2023

Another thing to try is to periodically call GC.gc() and ccall(:malloc_trim, Cvoid, (Cint,), 0) to run the garbage collector and tell Linux to free memory that the GC misses.

@daveweisberg
Contributor Author

@orso82 I've finished a new optimization run with FUSE.MXHEquilibrium.set_cache_size!(0).

It can be found here: /fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/study_1/opt_betaN_cost__Solovev_flattop72/opt_runs1/.

@orso82
Member

orso82 commented Sep 23, 2023

Similar growth, but with an offset, which makes sense, since before we had a finite cache size of 5.
So, while this reduces the memory usage, it does not solve the problem.

[image]

Next, as @lstagner suggested, could you please try to add

GC.gc()
ccall(:malloc_trim, Cvoid, (Cint,), 0)

at the bottom of your workflow function that you feed to the FUSE.workflow_multiobjective_optimization() function?
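For concreteness, a minimal sketch of what that could look like; the function name, signature, and actor calls are illustrative placeholders rather than the actual workflow:

function my_workflow(dd, ini, act)
    FUSE.init(dd, ini, act)
    FUSE.ActorWholeFacility(dd, act)
    # run a full garbage collection, then ask glibc to return freed pages to the OS
    GC.gc()
    ccall(:malloc_trim, Cvoid, (Cint,), 0)
    return dd
end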

orso82 added a commit that referenced this issue Sep 25, 2023
@daveweisberg
Contributor Author

daveweisberg commented Sep 26, 2023

@orso82 New run directories with both of Luke's suggestions implemented can be found here:

/fusion/ga/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/study_2/opt_betaN_cost__flattop72_A3.5_H1.0/opt_runs1/

@orso82
Member

orso82 commented Sep 26, 2023

Better, but it's still increasing!

[image]

@orso82
Member

orso82 commented Sep 26, 2023

@daveweisberg can you please try to set up a run where you do not execute ActorWholeFacility and just do an init(dd, ini, act)? This will run ActorSolovev and ActorCXbuild (and a few other trivial actors), but none of the heavier actors. This would be informative to rule out places where the issue might be.
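A sketch of such a reduced workflow (hypothetical; same caveats as the sketch above):

function my_reduced_workflow(dd, ini, act)
    FUSE.init(dd, ini, act)   # runs ActorSolovev, ActorCXbuild, and a few other trivial actors, but none of the heavy ones
    GC.gc()
    ccall(:malloc_trim, Cvoid, (Cint,), 0)
    return dd
end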

@TimSlendebroek
Contributor

I have another memtrace showing similar behavior from a TEQUILA & FluxMatcher run on SAGA.

Here is the memtrace of the last case that was run:

memtrace.txt

@TimSlendebroek
Contributor

TimSlendebroek commented Oct 2, 2023

Plot of the memtrace:
[image]

@TimSlendebroek
Contributor

> @daveweisberg pointed me to this memtrace: /fusion/projects/ird/ptp/weisbergd/julia/dev/FUSE/playground/weisberg_FPP_opt/exploratory_new/opt_betaN_cost__Solovev_Kr_flattop48_HTS0.1_A3.5_B6.0/opt_runs2/99__2023-09-19T10:02:53.638__1294740/memtrace.txt
>
> I did some analysis on it and I think the memory leak is occurring in Solovev
>
> [image] @lstagner could you please take a look and see if there's anything obvious? I have been burned by Memoize myself in the past... perhaps this is the case again?

I am not sure it is only Solovev, as I see the same problem when running with TEQUILA (although of course it could be both).

@TimSlendebroek
Contributor

[image]

My conclusion is that the memory leakage is related to the garbage collector being unable to do its job inside a pmap while also being inside a try-catch structure... see 5d8df73 for the fix.

@orso82
Member

orso82 commented Oct 11, 2023

help?> GC.gc
  GC.gc([full=true])

  Perform garbage collection. The argument full determines the kind of collection: A full collection (default) sweeps all objects, which
  makes the next GC scan much slower, while an incremental collection may only sweep so-called young objects.

  │ Warning
  │
  │  Excessive use will likely lead to poor performance.

@TimSlendebroek can you please try calling GC.gc(false) and see if this still works?

@orso82
Member

orso82 commented Oct 24, 2023

The issue has gotten better but is not fully resolved. It would be great to have @mbolognaJH and @sjkelly take a look.

@sjkelly

sjkelly commented Oct 24, 2023

This flag can be used to force Julia to collect once memory usage grows beyond a given threshold:

 --heap-size-hint=<size>    Forces garbage collection if memory usage is higher than that value.

@TimSlendebroek
Contributor

I'll add the minimal working example

  • local
  • saga (cluster)

But also add the optimization case for

  • local
  • saga (cluster)

Memory goes up directly with the number of dd's processed

@orso82
Member

orso82 commented Oct 24, 2023

heap-size-hint can be very handy. Set it to 1/2 or 1/4 of the system's memory. NOTE: since we have many Julia processes on each node, perhaps we need to set it to 1/100th of the memory.

How do we pass heap-size-hint to the worker nodes? @mbolognaJH help! Something like:

Julia args for Distributed.jl:
heap-size-hint = Sys.total_physical_memory() / Sys.CPU_THREADS

v1.10.0 was supposed to improve the GC, but the changes did not make the cut. v1.11.0 might address this.

We may be able to find workarounds for v1.9.0 and v1.10.0.

@TimSlendebroek
Contributor

TimSlendebroek commented Oct 24, 2023

I made an example notebook that runs locally and shows the same problems (without wrapping and GC) @mbolognaJH
See 76cfed0, examples/debugging/memory.ipynb

(Don't run the last two cells, as those run the optimization; the notebook already shows memory issues with a simple load and save.)

[image]
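For readers without access to the notebook, here is a hypothetical sketch of that kind of load/save loop (file names and the iteration count are illustrative); it repeatedly loads and saves a dd and prints the process memory high-water mark:

import IMAS
for k in 1:50
    dd = IMAS.json2imas("dd.json")      # load a dd from JSON
    IMAS.imas2json(dd, "dd_out.json")   # write it back out
    println("iteration $k: maxrss = $(round(Sys.maxrss() / 2^20, digits=1)) MiB")   # resident-memory high-water mark
end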

@mbolognaJH

@TimSlendebroek is there a specific dd.json file I should be using? Maybe I'm missing part of my setup for this example.

@TimSlendebroek
Contributor

TimSlendebroek commented Oct 25, 2023

Aah, sorry @mbolognaJH!

I can send you one, or you can do the following to create your own JSON:

using FUSE
import IMAS
dd, ini, act = FUSE.init(:KDEMO)
IMAS.imas2json(dd, "<your path>")  # imas2json writes dd out as a JSON file (json2imas reads one back into a dd)

@TimSlendebroek
Contributor

@mbolognaJH can we add each other on Zoom/Discord so I can respond a bit quicker if you have questions, need an explanation, or want me to send files?

@mbolognaJH

Yes sure thing, we can continue the conversation on Zoom (I haven't tried it before, but we'll see how it works): my email is michael.bologna@juliahub.com

@mbolognaJH

I have been able to reproduce the MWE. By limiting the heap size available to the Julia GC with --heap-size-hint we can avoid the memory growth behavior as shown below:
[image]

The root cause of this is the multi-threading and multi-processing design patterns used by FUSE. Currently, FUSE uses multi-processing exclusively via Distributed.jl, including within each node. This runs multiple Julia processes on each node, which requires substantial memory overhead, since a copy of the base system image and user-space code is loaded for each process. It also introduces the observed memory growth behavior, since each Julia process assumes that it is the only one on the node.

By setting --heap-size-hint to Sys.total_physical_memory() / Sys.CPU_THREADS, we get the expected behavior and the GC is able to collect at a regular interval. This also obviates the need for manual calls to GC.gc(). Within the addprocs command, the exeflags keyword argument can propagate a heap size hint to each child Julia process, e.g. exeflags="--heap-size-hint=2G".
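A sketch of how such a per-process hint could be computed and passed through addprocs (the worker count and the even split over Sys.CPU_THREADS are assumptions):

using Distributed
# give each worker roughly an equal share of the node's physical memory
hint_mb = Sys.total_physical_memory() ÷ Sys.CPU_THREADS ÷ 2^20
addprocs(4; exeflags="--heap-size-hint=$(hint_mb)M")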

The recommended long-term solution, which would also reduce memory overhead and improve performance, would be to use multi-threading within each node, as documented at https://docs.julialang.org/en/v1.9/manual/multi-threading/. This requires the work to be partitioned so that the master node is aware of the thread counts on each child node as well as the total memory on each child node.

@bclyons12
Member

Thanks @mbolognaJH, this makes sense to me. I've been looking into how to implement your long-term solution, but I can't find a good example. How can we launch one process per node that has access to all the cores/threads on that node?

@mbolognaJH

You can use the exeflags again to specify the number of threads you want to spawn:

For example:

# You won't need the heap-size-hint if you're doing the long-term (multi-threading) solution,
# but here's how to pass multiple exeflags in one `addprocs()` command.
using Distributed
num_procs = 1
num_threads = 8  # you can use "auto" here as well
addprocs(num_procs, exeflags=["--threads=$num_threads", "--heap-size-hint=2G"])
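A quick sanity check that each worker picked up the requested thread count:

using Distributed
for w in workers()
    println("worker $w has $(remotecall_fetch(Threads.nthreads, w)) threads")
end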

@orso82
Member

orso82 commented Nov 2, 2023

Thank you @mbolognaJH ! This solves a major headache!!!

@bclyons12 @mbolognaJH I consider this solved now. I'll add the heap-size-hint as you suggest, and have @TimSlendebroek test it.

Please don't spend further time on this now. I have more pressing questions for both of you, @bclyons12 and @mbolognaJH.
I'll open up some issues and tag you in those.

Thanks again!!!!

@bclyons12
Member

I still think there are very good reasons to figure out how to do the hybrid multiprocessing-multithreading implementation. I understand that it's not a priority for now, but we should open a separate issue for a future upgrade if we're going to close this one.

@orso82
Member

orso82 commented Nov 6, 2023

@TimSlendebroek reported that the --heap-size-hint fix seems to have fixed the issue!

[image]

@daveweisberg please note that, as part of d7c57a5, cpus_per_task=1 is now the default, which means that on OMEGA you'll be able to push through double the number of workers for the same number of nodes. This should help significantly!

Thanks again @mbolognaJH !

orso82 closed this as completed on Nov 6, 2023
@TimSlendebroek
Contributor

[image] On local with a 2G heap-size-hint.

@orso82
Member

orso82 commented Nov 7, 2023

Thank you @TimSlendebroek for the thorough testing!!

@daveweisberg
Contributor Author

daveweisberg commented Nov 18, 2023

Thanks everyone for the hard work in solving this issue! I've started testing this with Julia 1.10 on the Omega cluster, and so far it looks good!

A quick question for @TimSlendebroek or @orso82: Should I still include the following in the function I'm running on the distributed workers?

GC.gc()
ccall(:malloc_trim, Cvoid, (Cint,), 0)

Or is this no longer necessary?
