Checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import psutil
import os
import time
import polars as pl
import numpy as np
import gc
from tqdm import tqdm


def test_leak():
    assert pl.__version__ == "1.4.1"
    process = psutil.Process(os.getpid())

    n_cols = 100
    columns = [f"col_{i}" for i in range(n_cols)]
    df = pl.DataFrame(np.random.randn(200_000, n_cols), schema=columns)

    ram_usage = []
    time_usage = []
    initial_ram_usage = process.memory_info().rss / (1024 * 1024)
    try:
        with tqdm() as pbar:
            while True:
                t0 = time.perf_counter()
                row = pl.DataFrame(np.random.randn(1, n_cols), schema=columns)
                df = pl.concat((df[1:], row))
                time_usage.append(time.perf_counter() - t0)

                assert df.shape[0] == 200_000
                assert df.shape[1] == n_cols

                df.shrink_to_fit()  # Comment this for memory leak to proceed faster
                gc.collect()  # Comment this for memory leak to proceed faster

                ram_mb = process.memory_info().rss / (1024 * 1024)
                ram_usage.append(ram_mb)
                pbar.set_description(f"RSS = {ram_mb:.2f} MB")
                pbar.update()

                assert ram_mb < 2 * initial_ram_usage, "The used memory has doubled"
    except (KeyboardInterrupt, AssertionError):
        np.save("ram_log", ram_usage)
        np.save("time_log", time_usage)
        raise


if __name__ == "__main__":
    pl.show_versions()
    test_leak()
Log output
No response
Issue description
Let's consider the scenario above, where we have a simple rolling concatenation: df = pl.concat((df[1:], row)). The DataFrame shape is constant, so the memory consumed by the process should be constant as well. In reality, however, it grows indefinitely.
The RSS of the process always doubles after a certain period of time, and it does so especially quickly if no countermeasures are taken. By countermeasures I mean calls like df.shrink_to_fit() and/or gc.collect(), but they merely delay the inevitable.
P.S. Doubling the RAM may take a while, so some patience is advised. If in a rush, comment out the memory-saving calls and/or lower the memory multiplier threshold to 1.5x or less.
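For what it's worth, the growth is consistent with chunk accumulation: with rechunk left at its default, each pl.concat appends the new row as an extra chunk instead of copying everything into one contiguous buffer. A minimal sketch that makes this visible, assuming only the public pl.concat and DataFrame.n_chunks APIs:

import numpy as np
import polars as pl

n_cols = 100
columns = [f"col_{i}" for i in range(n_cols)]
df = pl.DataFrame(np.random.randn(1_000, n_cols), schema=columns)

for i in range(5):
    row = pl.DataFrame(np.random.randn(1, n_cols), schema=columns)
    df = pl.concat((df[1:], row))  # rechunk defaults to False here
    print(i, df.n_chunks())  # the chunk count is expected to grow each iteration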
Expected behavior
The expected behavior for this code:
RAM usage should plateau at some point before the 2x multiplier threshold
Time consumption of pl.concat should plateau at some point
I'm on mobile so I haven't tried this, but I think you need rechunk=True in those concats; otherwise it's expected that memory grows, because you're not reconstructing the underlying Series.
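In the repro's inner loop, that suggestion would look like the sketch below (rechunk is an existing pl.concat keyword; whether it fully stops the RSS growth here is exactly what this issue is probing):

import numpy as np
import polars as pl

n_cols = 100
columns = [f"col_{i}" for i in range(n_cols)]
df = pl.DataFrame(np.random.randn(200_000, n_cols), schema=columns)
row = pl.DataFrame(np.random.randn(1, n_cols), schema=columns)

# rechunk=True copies the concatenated pieces into a single contiguous
# chunk per column, so the memory of the dropped first row can be freed.
df = pl.concat((df[1:], row), rechunk=True)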