[BUG]: Possible memory leakage & best practices for memory scaling? #490
Comments
Hi Miles! Hope you have been doing well :)

To add more details, all distributed jobs on one icelake node ran OOM, e.g.,

```python
model = PySRRegressor(
    ...
    maxsize=35,
    niterations=1000000,
    populations=num_cores,  # 64
    ncyclesperiteration=10000,
    procs=num_cores,
    # multithreading=False,
    cluster_manager='slurm',
    # batching=True,
    # turbo=True,  # seems like OOM happened earlier with turbo on, but I'm not 100% sure
    ...
)
```

However, the multithreaded job ran until the end of the allocation:

```python
model = PySRRegressor(
    ...
    maxsize=45,
    niterations=1000000,
    populations=num_cores,  # 64
    ncyclesperiteration=10000,
    # procs=num_cores,
    # multithreading=False,
    # cluster_manager='slurm',
    batching=True,
    turbo=True,
    ...
)
```

When I ssh'ed into the distributed job's worker node before it crashed, htop showed heavy cache usage (the yellow fraction in htop): the 1 TB of memory was filled with cache apart from the 100-200 GB of actual memory usage.
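For reference, one way to separate page cache from the memory actually held by the process tree is to sum RSS over the Python driver and the Julia workers it spawned. A minimal sketch using psutil (assuming `psutil` is installed on the worker node; the helper name is just illustrative):

```python
import psutil

def total_rss_gb(pid=None):
    """Sum resident memory (in GB) over a process and all of its children,
    e.g. the Python driver plus the Julia worker processes it spawned."""
    root = psutil.Process(pid)  # pid=None means the current process
    procs = [root] + root.children(recursive=True)
    return sum(p.memory_info().rss for p in procs) / 1e9

print(f"RSS across process tree: {total_rss_gb():.1f} GB")
```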
Hey @eelregit and @paulomontero,

Thanks for reaching out about this. Actually, I have seen this too on Rusty, especially for long-running jobs. I think it is actually a Julia bug in their distributed interface, which has some issues with garbage collection. I have opened a bug report here: JuliaLang/julia#50673. Basically, two Julia processes do not communicate how much memory each is using, and can sometimes go over the total system memory limit if garbage collection is not triggered soon enough.

The workaround I have used, and probably need to implement directly into PySR / SymbolicRegression.jl for users to use (pull requests very much appreciated!!), is as follows. Take lines 321 to 322 in c9cc6d7 and apply the following git diff:

```diff
 Main.eval(f"import ClusterManagers: addprocs_{cluster_manager}")
-return Main.eval(f"addprocs_{cluster_manager}")
+return Main.eval(f"(args...; kws...) -> addprocs_{cluster_manager}(args...; exeflags=`--heap-size-hint=1G`, kws...)")
```
It's modifying the way processes are created in ClusterManagers.jl and Distributed.jl so that each worker is automatically passed the `--heap-size-hint=1G` flag. With this change I have never gotten an OOM error again for long-running jobs. Let me know if this works for you!
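The same idea can also be tried by hand before a patched release is out. A minimal sketch through PyJulia, assuming Julia ≥ 1.9 (where `--heap-size-hint` is available) and plain local workers rather than the SLURM cluster manager:

```python
from julia import Main  # PyJulia, the interface used by PySR 0.16.x

Main.eval("using Distributed")
# Spawn 4 local workers, each started with a ~1 GiB GC heap target,
# mirroring the exeflags change in the diff above.
Main.eval('addprocs(4; exeflags=`--heap-size-hint=1G`)')
```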
Pushed some code to automatically fix issues like this: MilesCranmer/SymbolicRegression.jl#270. Give it a week or so to work its way into PySR (it needs to pass various tests).
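For anyone reading this later: once that fix has made its way downstream, the heap target can presumably be requested directly from the Python side. A hedged sketch, where the `heap_size_hint_in_bytes` parameter name is an assumption based on later PySR releases rather than something confirmed in this thread:

```python
from pysr import PySRRegressor

num_cores = 64

model = PySRRegressor(
    niterations=1000000,
    populations=num_cores,
    procs=num_cores,
    cluster_manager="slurm",
    # Assumed parameter from later PySR releases: ask each Julia worker to keep
    # its GC heap near ~1 GB, analogous to --heap-size-hint=1G above.
    heap_size_hint_in_bytes=int(1e9),
)
```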
Thanks Miles! This is super helpful!!

Nice!!! Thank you for the swift response/fix, looking forward to seeing it in action.
What happened?
Hi and thank you for this great tool! I have been using it enthusiastically for a few months.
Recently, we began running PySR on the Rusty cluster for large data regressions. However, we have encountered an issue where jobs finish early before reaching the specified wall time or the maximum number of operations, even when the stop-early clause is not triggered.
Additionally, I would like to take the opportunity to ask about best practices for memory scaling in PySR.
Version
0.16.3
Operating System
Linux
Package Manager
Other (specify below)
Interface
Script (i.e., `python my_script.py`)
Relevant log output
We sometimes get an out-of-memory error:

```
Progress: 2291006 / 6464000000 total iterations (0.035%)
====================================================================================================
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3054365.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
```
Extra Info
We are using Mamba.
The data has ~10000 points with 8 features.
Our jobs are submitted as Python scripts, so this shouldn't involve the Jupyter-related issue #460.
We use 1 node with 64 cores and set `procs = num_cores`.
We have experimented with various PySRRegressor configurations, including toggling turbo, batching, and multithreading. However, we observed relatively little improvement in terms of running until the wall time or reaching a significant fraction of the desired iterations.
For reference, the specific out-of-memory error shown above occurred with the following configuration: