A memory leak in pl.concat() #18052

Closed · 2 tasks done
Q-c7 opened this issue Aug 5, 2024 · 3 comments
Assignees: deanm0000
Labels: documentation (Improvements or additions to documentation), python (Related to Python Polars)

Comments


Q-c7 commented Aug 5, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import psutil
import os
import time
import polars as pl
import numpy as np
import gc

from tqdm import tqdm


def test_leak():
    assert pl.__version__ == "1.4.1"

    process = psutil.Process(os.getpid())
    n_cols = 100
    columns = [f"col_{i}" for i in range(n_cols)]
    df = pl.DataFrame(np.random.randn(200_000, n_cols), schema=columns)
    ram_usage = []
    time_usage = []
    initial_ram_usage = process.memory_info().rss / (1024 * 1024)
    try:
        with tqdm() as pbar:
            while True:
                t0 = time.perf_counter()
                row = pl.DataFrame(np.random.randn(1, n_cols), schema=columns)
                df = pl.concat((df[1:], row))
                time_usage.append(time.perf_counter() - t0)
                assert df.shape[0] == 200_000
                assert df.shape[1] == n_cols
                df.shrink_to_fit()  # Comment this out for the memory leak to proceed faster
                gc.collect()  # Comment this out for the memory leak to proceed faster

                ram_mb = process.memory_info().rss / (1024 * 1024)
                ram_usage.append(ram_mb)
                pbar.set_description(f"RSS = {ram_mb:.2f} MB")
                pbar.update()

                assert ram_mb < 2 * initial_ram_usage, "The used memory has doubled"
    except (KeyboardInterrupt, AssertionError):
        np.save("ram_log", ram_usage)
        np.save("time_log", time_usage)
        raise


if __name__ == "__main__":
    pl.show_versions()
    test_leak()

Log output

No response

Issue description

Let's consider the scenario above, where we repeatedly run a simple concatenation: df = pl.concat((df[1:], row)). The DataFrame's shape stays constant, so the memory consumed by the process should stay constant as well. In reality, it grows indefinitely.

The RSS of the process always doubles after some period of time, especially if no countermeasures are taken. By countermeasures I mean calls such as df.shrink_to_fit() and/or gc.collect(), but they only delay the inevitable.

P.S. Doubling the RAM may take a while, so some patience is advised. If you are in a rush, comment out the memory-saving calls and/or lower the memory multiplier threshold to 1.5x or less.
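
For illustration (an editor's sketch, not part of the original report): the growth is observable in the DataFrame's chunk count via df.n_chunks(), which increases by one on every such concat:

import numpy as np
import polars as pl

# Sketch: df[1:] is a zero-copy slice over the existing chunks, and the
# appended row arrives as one more chunk, so the chunk count (and the
# memory retained by old chunks) grows on every iteration.
n_cols = 100
columns = [f"col_{i}" for i in range(n_cols)]
df = pl.DataFrame(np.random.randn(1_000, n_cols), schema=columns)
print(df.n_chunks())  # 1
for _ in range(5):
    row = pl.DataFrame(np.random.randn(1, n_cols), schema=columns)
    df = pl.concat((df[1:], row))
print(df.n_chunks())  # 6: one extra chunk per concat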

Expected behavior

The expected behavior for this code:

  • RAM usage should plateau at some point before the 2x multiplier threshold
  • Time consumption of pl.concat should plateau at some point

Installed versions

--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             Linux-5.15.0-105-generic-x86_64-with-glibc2.31
Python:               3.10.8 (main, Oct 20 2022, 02:23:58) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           0.3.1
deltalake:            0.17.4
fastexcel:            <not installed>
fsspec:               2023.12.1
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.6.2
nest_asyncio:         1.5.6
numpy:                1.23.5
openpyxl:             <not installed>
pandas:               1.5.2
pyarrow:              9.0.0
pydantic:             1.10.4
pyiceberg:            <not installed>
sqlalchemy:           1.4.46
torch:                2.1.1+cu118
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
Q-c7 added the bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), and python (Related to Python Polars) labels Aug 5, 2024
deanm0000 (Collaborator) commented

I'm on mobile so I haven't tried this, but I think you need rechunk=True in those concats; otherwise it's expected that memory grows, because you're never reconstructing the underlying Series.
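
A minimal sketch of this suggestion applied to the repro (assuming the df, row, and columns from the example above; rechunk is a documented pl.concat parameter):

# With rechunk=True, the accumulated chunks are copied into a single
# contiguous allocation, so the old buffers are dropped instead of
# piling up iteration after iteration.
df = pl.concat((df[1:], row), rechunk=True)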

Q-c7 (Author) commented Aug 6, 2024

Yeah, you're right, rechunk=True solves the problem, my bad.

However, could you please update the user guide? It contains outdated information about the default rechunking behavior, which is quite misleading.
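
As a usage note (an editor's addition, not from the thread): if rechunking on every call is too expensive, periodically calling DataFrame.rechunk() also keeps the chunk count, and hence memory, bounded. This assumes the df, n_cols, and columns from the repro above; the interval K is an illustrative knob:

K = 100  # hypothetical rechunk interval, not tuned
for i in range(10_000):
    row = pl.DataFrame(np.random.randn(1, n_cols), schema=columns)
    df = pl.concat((df[1:], row))  # cheap chunk append, no copy
    if i % K == K - 1:
        df = df.rechunk()  # merge all chunks into one contiguous allocation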

mcrumiller (Contributor) commented

The default was changed to rechunk=False in #16128 in May.

deanm0000 added the documentation (Improvements or additions to documentation) label and removed the bug and needs triage labels Aug 6, 2024
deanm0000 self-assigned this Aug 7, 2024