[BUG]: Possible memory leakage & best practices for memory scaling? #490

Closed
paulomontero opened this issue Dec 20, 2023 · 5 comments · Fixed by #493
Labels: bug (Something isn't working)

Comments

@paulomontero

What happened?

Hi and thank you for this great tool! I have been using it enthusiastically for a few months.

Recently, we began running PySR on the Rusty cluster for large data regressions. However, we have run into an issue where jobs terminate early, before reaching the specified wall time or the maximum number of iterations, even though the early-stopping condition is never triggered.

Additionally, I would like to take the opportunity to ask about best practices for memory scaling in PySR.

Version

0.16.3

Operating System

Linux

Package Manager

Other (specify below)

Interface

Script (i.e., python my_script.py)

Relevant log output

We sometimes get an out-of-memory error:

Progress: 2291006 / 6464000000 total iterations (0.035%)
====================================================================================================
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3054365.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Extra Info

We are using Mamba.

The data has ~10,000 points with 8 features.

Our jobs are submitted as Python scripts, so this should not be related to the Jupyter issue in #460.

We use 1 node with 64 cores and set procs = num_cores.

We have experimented with various PySRRegressor configurations, including toggling turbo, batching, and multithreading, but saw only minimal improvement in either running until the wall time or completing a significant fraction of the desired iterations.

For reference, the specific out-of-memory error shown above occurred with the following configuration:

procs=num_cores,
multithreading=False,
cluster_manager='slurm',
batching=True,
turbo=True,
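
For context, here is a minimal, runnable sketch of how such a configuration is passed to PySRRegressor; the data arrays and comments are illustrative placeholders (matching the rough shape described above), not the exact settings used in these runs:

import numpy as np
from pysr import PySRRegressor

num_cores = 64

# Placeholder data of roughly the shape described above (~10,000 points, 8 features).
X = np.random.randn(10_000, 8)
y = np.random.randn(10_000)

model = PySRRegressor(
    procs=num_cores,          # one Julia worker process per core
    multithreading=False,     # use multiprocessing rather than multithreading
    cluster_manager="slurm",  # spawn workers through SLURM via ClusterManagers.jl
    batching=True,            # evaluate expressions on mini-batches of the data
    turbo=True,               # faster evaluation kernels
)
model.fit(X, y)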
@paulomontero added the bug label on Dec 20, 2023
@eelregit

Hi Miles! Hope you have been doing well :)

To add more detail: all distributed jobs on a single Icelake node ran out of memory, e.g.,

model = PySRRegressor(
    ...
    maxsize=35,
    niterations=1000000,
    populations=num_cores,  # 64
    ncyclesperiteration=10000,
    procs=num_cores,
    #multithreading=False,
    cluster_manager='slurm',
    #batching=True,
    #turbo=True,  # seems like OOM happened earlier with turbo on but I'm not 100% sure
    ...
)

However, the multithreaded job ran until the end of the allocation:

model = PySRRegressor(
    ...
    maxsize=45,
    niterations=1000000,
    populations=num_cores,  # 64
    ncyclesperiteration=10000,
    #procs=num_cores,
    #multithreading=False,
    #cluster_manager='slurm',
    batching=True,
    turbo=True,
    ...
)

When I SSH'ed into the distributed job's worker node before it crashed, htop showed heavy cache usage (the yellow fraction in htop): the 1 TB of memory was filled with cache, apart from the 100–200 GB of actual memory usage.

@MilesCranmer
Owner

Hey @eelregit and @paulomontero,

Thanks for reaching out about this. I have actually seen this too on Rusty, especially for long-running jobs. I think it is a Julia bug in the Distributed interface, which has some issues with garbage collection; I have opened a bug report here: JuliaLang/julia#50673. Basically, two Julia processes do not communicate how much memory each is using, so together they can sometimes exceed the total system memory limit if garbage collection is not triggered soon enough.

The workaround I have used, and which probably needs to be implemented directly in PySR/SymbolicRegression.jl for users (pull requests very much appreciated!!), is as follows. Take the following lines:

PySR/pysr/julia_helpers.py

Lines 321 to 322 in c9cc6d7

Main.eval(f"import ClusterManagers: addprocs_{cluster_manager}")
return Main.eval(f"addprocs_{cluster_manager}")

And apply the git diff:

     Main.eval(f"import ClusterManagers: addprocs_{cluster_manager}")
-    return Main.eval(f"addprocs_{cluster_manager}")
+    return Main.eval(f"(args...; kws...) -> addprocs_{cluster_manager}(args...; exeflags=`--heap-size-hint=1G`, kws...)")

What this is doing

It modifies how worker processes are created through ClusterManagers.jl and Distributed.jl so that --heap-size-hint=1G is automatically passed to every new Julia process. This hints to Julia that each process should use at most about 1 GB of memory before running aggressive garbage collection. That is well below the memory constraints of a Rusty node (assuming one process per core), so I have used it as a default with great success.
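
For reference, here is a sketch of what the patched helper could look like in full; the function name below is a placeholder, and it assumes PyJulia's Main is initialized the way pysr/julia_helpers.py does in 0.16.x:

from julia import Main  # PyJulia entry point used by PySR 0.16.x

def _load_cluster_manager_with_heap_hint(cluster_manager: str):
    # Import, e.g., ClusterManagers.addprocs_slurm into the Julia session.
    Main.eval(f"import ClusterManagers: addprocs_{cluster_manager}")
    # Return a wrapper that forwards all arguments but launches every worker
    # with `--heap-size-hint=1G`, so each one runs GC before ~1 GB of heap.
    return Main.eval(
        f"(args...; kws...) -> addprocs_{cluster_manager}"
        "(args...; exeflags=`--heap-size-hint=1G`, kws...)"
    )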

With this change I have never gotten an OOM error again for long-running jobs.

Let me know if this works for you!
Cheers,
Miles

@MilesCranmer
Owner

Pushed some code to automatically fix issues like this: MilesCranmer/SymbolicRegression.jl#270. Give it a week or so to work its way into PySR (needs to pass various tests).

@eelregit

Thanks Miles! This is super helpful!!

@paulomontero
Author

Nice!!! Thank you for the swift response/fix, looking forward to seeing it in action.
