[BUG]: Possible memory leakage & best practices for memory scaling? #490

Closed
paulomontero opened this issue Dec 20, 2023 · 5 comments · Fixed by #493
Labels: bug (Something isn't working)

Comments

@paulomontero

What happened?

Hi and thank you for this great tool! I have been using it enthusiastically for a few months.

Recently, we began running PySR on the Rusty cluster for large data regressions. However, we have run into an issue where jobs terminate early, before reaching the specified wall time or the maximum number of iterations, even though the early-stopping condition is never triggered.

Additionally, I would like to take the opportunity to ask about best practices for memory scaling in PySR.

Version

0.16.3

Operating System

Linux

Package Manager

Other (specify below)

Interface

Script (i.e., python my_script.py)

Relevant log output

We sometimes get an out-of-memory error:

Progress: 2291006 / 6464000000 total iterations (0.035%)
====================================================================================================
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3054365.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

Extra Info

We are using Mamba.

The data has ~10,000 points with 8 features.

Our jobs are submitted as Python scripts, so this should not be related to the Jupyter issue in #460.

We use 1 node with 64 cores and set procs = num_cores.

We have experimented with various PySRRegressor configurations, including toggling turbo, batching, and multithreading, but saw only minimal improvement in either running until the wall time or completing a significant fraction of the desired iterations.

For reference, the specific out-of-memory error shown above occurred with the following configuration:

procs=num_cores,
multithreading=False,
cluster_manager='slurm',
batching=True,
turbo=True,
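
For context, here is a minimal, runnable sketch of how such a configuration is passed to PySRRegressor; the data arrays and comments are illustrative placeholders (matching the rough shape described above), not the exact settings used in these runs:

import numpy as np
from pysr import PySRRegressor

num_cores = 64

# Placeholder data of roughly the shape described above (~10,000 points, 8 features).
X = np.random.randn(10_000, 8)
y = np.random.randn(10_000)

model = PySRRegressor(
    procs=num_cores,          # one Julia worker process per core
    multithreading=False,     # use multiprocessing rather than multithreading
    cluster_manager="slurm",  # spawn workers through SLURM via ClusterManagers.jl
    batching=True,            # evaluate expressions on mini-batches of the data
    turbo=True,               # faster evaluation kernels
)
model.fit(X, y)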
@paulomontero added the bug label on Dec 20, 2023
@eelregit

Hi Miles! Hope you have been doing well :)

To add more detail: all distributed jobs on a single Icelake node ran out of memory, e.g.,

model = PySRRegressor(
    ...
    maxsize=35,
    niterations=1000000,
    populations=num_cores,  # 64
    ncyclesperiteration=10000,
    procs=num_cores,
    #multithreading=False,
    cluster_manager='slurm',
    #batching=True,
    #turbo=True,  # seems like OOM happened earlier with turbo on but I'm not 100% sure
    ...
)

However, the multithreaded job ran until the end of the allocation:

model = PySRRegressor(
    ...
    maxsize=45,
    niterations=1000000,
    populations=num_cores,  # 64
    ncyclesperiteration=10000,
    #procs=num_cores,
    #multithreading=False,
    #cluster_manager='slurm',
    batching=True,
    turbo=True,
    ...
)

When I SSH'ed into the distributed job's worker node before it crashed, htop showed heavy cache usage (the yellow fraction in htop): the 1 TB of memory was filled with cache, apart from the 100–200 GB of actual memory usage.

@MilesCranmer
Owner

Hey @eelregit and @paulomontero,

Thanks for reaching out about this. I have actually seen this too on Rusty, especially for long-running jobs. I think it is a Julia bug in the Distributed interface, which has some issues with garbage collection; I have opened a bug report here: JuliaLang/julia#50673. Basically, two Julia processes do not communicate how much memory each is using, so together they can sometimes exceed the total system memory limit if garbage collection is not triggered soon enough.

The workaround I have used, and which probably needs to be implemented directly in PySR/SymbolicRegression.jl for users (pull requests very much appreciated!!), is as follows. Take the following lines:

PySR/pysr/julia_helpers.py

Lines 321 to 322 in c9cc6d7

Main.eval(f"import ClusterManagers: addprocs_{cluster_manager}")
return Main.eval(f"addprocs_{cluster_manager}")

And apply the git diff:

     Main.eval(f"import ClusterManagers: addprocs_{cluster_manager}")
-    return Main.eval(f"addprocs_{cluster_manager}")
+    return Main.eval(f"(args...; kws...) -> addprocs_{cluster_manager}(args...; exeflags=`--heap-size-hint=1G`, kws...)")

What this is doing

It modifies how worker processes are created through ClusterManagers.jl and Distributed.jl so that --heap-size-hint=1G is automatically passed to every new Julia process. This hints to Julia that each process should use at most about 1 GB of memory before running aggressive garbage collection. That is well below the memory constraints of a Rusty node (assuming one process per core), so I have used it as a default with great success.
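
For reference, here is a sketch of what the patched helper could look like in full; the function name below is a placeholder, and it assumes PyJulia's Main is initialized the way pysr/julia_helpers.py does in 0.16.x:

from julia import Main  # PyJulia entry point used by PySR 0.16.x

def _load_cluster_manager_with_heap_hint(cluster_manager: str):
    # Import, e.g., ClusterManagers.addprocs_slurm into the Julia session.
    Main.eval(f"import ClusterManagers: addprocs_{cluster_manager}")
    # Return a wrapper that forwards all arguments but launches every worker
    # with `--heap-size-hint=1G`, so each one runs GC before ~1 GB of heap.
    return Main.eval(
        f"(args...; kws...) -> addprocs_{cluster_manager}"
        "(args...; exeflags=`--heap-size-hint=1G`, kws...)"
    )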

With this change I have never gotten an OOM error again for long-running jobs.

Let me know if this works for you!
Cheers,
Miles

@MilesCranmer
Owner

Pushed some code to automatically fix issues like this: MilesCranmer/SymbolicRegression.jl#270. Give it a week or so to work its way into PySR (needs to pass various tests).

@eelregit

Thanks Miles! This is super helpful!!

@paulomontero
Author

Nice!!! Thank you for the swift response/fix, looking forward to seeing it in action.
