A memory leak in pl.concat() #18052

Closed · 2 tasks done
Q-c7 opened this issue Aug 5, 2024 · 3 comments
Assignees: deanm0000
Labels: documentation (Improvements or additions to documentation), python (Related to Python Polars)

Comments


Q-c7 commented Aug 5, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import psutil
import os
import time
import polars as pl
import numpy as np
import gc

from tqdm import tqdm


def test_leak():
    assert pl.__version__ == "1.4.1"

    process = psutil.Process(os.getpid())
    n_cols = 100
    columns = [f"col_{i}" for i in range(n_cols)]
    df = pl.DataFrame(np.random.randn(200_000, n_cols), schema=columns)
    ram_usage = []
    time_usage = []
    initial_ram_usage = process.memory_info().rss / (1024 * 1024)
    try:
        with tqdm() as pbar:
            while True:
                t0 = time.perf_counter()
                row = pl.DataFrame(np.random.randn(1, n_cols), schema=columns)
                df = pl.concat((df[1:], row))
                time_usage.append(time.perf_counter() - t0)
                assert df.shape[0] == 200_000
                assert df.shape[1] == n_cols
                df.shrink_to_fit()  # Comment this out for the memory leak to proceed faster
                gc.collect()  # Comment this out for the memory leak to proceed faster

                ram_mb = process.memory_info().rss / (1024 * 1024)
                ram_usage.append(ram_mb)
                pbar.set_description(f"RSS = {ram_mb:.2f} MB")
                pbar.update()

                assert ram_mb < 2 * initial_ram_usage, "The used memory has doubled"
    except (KeyboardInterrupt, AssertionError):
        np.save("ram_log", ram_usage)
        np.save("time_log", time_usage)
        raise


if __name__ == "__main__":
    pl.show_versions()
    test_leak()

Log output

No response

Issue description

Let's consider the scenario above, where we repeatedly run a simple concatenation: df = pl.concat((df[1:], row)). The DataFrame's shape stays constant, so the memory consumed by the process should stay constant as well. In reality, it grows indefinitely.

The RSS of the process always doubles after some period of time, especially if no countermeasures are taken. By countermeasures I mean calls such as df.shrink_to_fit() and/or gc.collect(), but they only delay the inevitable.

P.S. Doubling the RAM may take a while, so some patience is advised. If you are in a rush, comment out the memory-saving calls and/or lower the memory multiplier threshold to 1.5x or less.
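
For illustration (an editor's sketch, not part of the original report): the growth is observable in the DataFrame's chunk count via df.n_chunks(), which increases by one on every such concat:

import numpy as np
import polars as pl

# Sketch: df[1:] is a zero-copy slice over the existing chunks, and the
# appended row arrives as one more chunk, so the chunk count (and the
# memory retained by old chunks) grows on every iteration.
n_cols = 100
columns = [f"col_{i}" for i in range(n_cols)]
df = pl.DataFrame(np.random.randn(1_000, n_cols), schema=columns)
print(df.n_chunks())  # 1
for _ in range(5):
    row = pl.DataFrame(np.random.randn(1, n_cols), schema=columns)
    df = pl.concat((df[1:], row))
print(df.n_chunks())  # 6: one extra chunk per concat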

Expected behavior

The expected behavior for this code:

  • RAM usage should plateau at some point before the 2x multiplier threshold
  • Time consumption of pl.concat should plateau at some point

Installed versions

--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             Linux-5.15.0-105-generic-x86_64-with-glibc2.31
Python:               3.10.8 (main, Oct 20 2022, 02:23:58) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           0.3.1
deltalake:            0.17.4
fastexcel:            <not installed>
fsspec:               2023.12.1
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.6.2
nest_asyncio:         1.5.6
numpy:                1.23.5
openpyxl:             <not installed>
pandas:               1.5.2
pyarrow:              9.0.0
pydantic:             1.10.4
pyiceberg:            <not installed>
sqlalchemy:           1.4.46
torch:                2.1.1+cu118
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
Q-c7 added the bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), and python (Related to Python Polars) labels Aug 5, 2024
deanm0000 (Collaborator) commented

I'm on mobile so I haven't tried this, but I think you need rechunk=True in those concats; otherwise it's expected that memory grows, because you're never reconstructing the underlying Series.
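
A minimal sketch of this suggestion applied to the repro (assuming the df, row, and columns from the example above; rechunk is a documented pl.concat parameter):

# With rechunk=True, the accumulated chunks are copied into a single
# contiguous allocation, so the old buffers are dropped instead of
# piling up iteration after iteration.
df = pl.concat((df[1:], row), rechunk=True)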

Q-c7 (Author) commented Aug 6, 2024

Yeah, you're right, rechunk=True solves the problem, my bad.

However, could you please update the user guide? It contains outdated information about the default rechunking behavior, which is quite misleading.
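
As a usage note (an editor's addition, not from the thread): if rechunking on every call is too expensive, periodically calling DataFrame.rechunk() also keeps the chunk count, and hence memory, bounded. This assumes the df, n_cols, and columns from the repro above; the interval K is an illustrative knob:

K = 100  # hypothetical rechunk interval, not tuned
for i in range(10_000):
    row = pl.DataFrame(np.random.randn(1, n_cols), schema=columns)
    df = pl.concat((df[1:], row))  # cheap chunk append, no copy
    if i % K == K - 1:
        df = df.rechunk()  # merge all chunks into one contiguous allocation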

mcrumiller (Contributor) commented

The default was changed to rechunk=False in #16128 in May.

deanm0000 added the documentation (Improvements or additions to documentation) label and removed the bug and needs triage labels Aug 6, 2024
deanm0000 self-assigned this Aug 7, 2024