High memory usage reading single column with read_parquet #15098

Closed
2 tasks done
khwilson opened this issue Mar 16, 2024 · 4 comments · Fixed by #15229
Labels: bug (Something isn't working), python (Related to Python Polars)

Comments

@khwilson

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Copy the following into example.py

import sys

import polars as pl

def make_dataframe():
    df = pl.DataFrame(
        {
            "a": list(range(int(5e8))),
            "b": list(range(int(5e8), 2 * int(5e8))),
        }
    )
    df.write_csv("test.csv")
    df.write_parquet("test.parquet")


if __name__ == "__main__":
    if sys.argv[1] == "make":
        make_dataframe()
    if sys.argv[1] == "scan":
        pl.scan_parquet("test.parquet").select("a").collect().sum()
    if sys.argv[1] == "read":
        pl.read_parquet("test.parquet", columns=["a"]).sum()

Then run and compare the outputs of:

python3 example.py make

# This works on a Mac. Replace `-l` with `-v` if you have GNU time, e.g., on Linux
/usr/bin/time -l python3 example.py scan
/usr/bin/time -l python3 example.py read

Log output

No response

Issue description

Similar to #8925, the behaviour of scan_parquet/csv and read_parquet/csv for reading a single column is surprising. In particular, when reading a single column from a parquet file with 1/2 billion rows, using read_parquet(filename, columns=[col_name]) takes nearly 4x the memory usage and 2x the time of calling scan_parquet(filename).select(col_name).collect().

On the other hand, read_csv takes about half the memory and around 5/6 of the time of running scan_csv.collect().

Note that some of this may be due to the rechunk option, but setting rechunk=False in read_parquet still leads to higher memory and time usage than scan_parquet.collect().
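
To separate out the rechunk effect, the eager and lazy paths can also be compared directly with rechunking disabled. A minimal sketch (rechunk is a documented read_parquet parameter; the lazy line is included for comparison):

import polars as pl

# Eager reads of a single column, with and without rechunking the row groups
# into one contiguous buffer after reading.
eager_rechunked = pl.read_parquet("test.parquet", columns=["a"], rechunk=True)
eager_chunked = pl.read_parquet("test.parquet", columns=["a"], rechunk=False)

# Lazy path for comparison: in the measurements below it uses less memory
# than either eager call above.
lazy = pl.scan_parquet("test.parquet").select("a").collect()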

The detailed table below was generated by this gist on an M2 Mac running macOS 14.3.1.

function                               peak memory (MB)   real time (s)
read_parquet(rechunk=True).sum                  3116.73            0.66
read_csv(rechunk=True).sum                      1553.45            0.73
read_parquet(rechunk=False).sum                 1589.93            0.48
read_csv(rechunk=False).sum                      791.04            0.61
scan_parquet.collect.sum                         819.81            0.26
scan_csv.collect.sum                            1553.58            0.69
scan_parquet.sum.collect                         820.08            0.26
scan_csv.sum.collect                            1553.73            0.68
scan_parquet.sum.collect(streaming)              821.63            0.30
scan_csv.sum.collect(streaming)                  943.72            0.65

Here the function names correspond to (see the sketch after this list):

  • .collect.sum: func(filename).select("a").collect().sum()
  • .sum.collect: func(filename).select("a").sum().collect()
  • .sum.collect(streaming): func(filename).select("a").sum().collect(streaming=True)
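
Concretely, with scan_parquet as func, the three patterns correspond to calls like these (a sketch of the call shapes only; the full timing harness is in the gist above):

import polars as pl

# .collect.sum: materialize the column first, then aggregate eagerly
pl.scan_parquet("test.parquet").select("a").collect().sum()

# .sum.collect: push the aggregation into the lazy query plan
pl.scan_parquet("test.parquet").select("a").sum().collect()

# .sum.collect(streaming): same plan, executed by the streaming engine
pl.scan_parquet("test.parquet").select("a").sum().collect(streaming=True)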

Expected behavior

I would expect reading a single column of parquet with read_parquet to take less time but perhaps more memory than scan_parquet().collect() and similarly for read_csv and scan_csv.

Installed versions

--------Version info---------
Polars:               0.20.15
Index type:           UInt32
Platform:             macOS-14.3.1-arm64-arm-64bit
Python:               3.11.4 (main, Jul 30 2023, 21:55:46) [Clang 14.0.3 (clang-1403.0.22.14.1)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@khwilson added the bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), and python (Related to Python Polars) labels on Mar 16, 2024
@itamarst
Contributor

I can reproduce this locally, so I'll see if I can figure out what's going on.

@itamarst
Contributor

With rechunk=False, what I noticed is that the main factor is the existence of column b in the Parquet file. If the parquet file only has a single column a, scan and read take the same amount of memory.

With 3 columns, reading the single column a uses 3× the memory of scanning it.

So at first glance, this suggests that columns which aren't requested are still somehow being read by read_parquet.
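
A minimal way to check that observation, using a smaller row count and hypothetical file names purely for illustration:

import polars as pl

n = 10_000_000

# One file containing only column "a", and one containing "a" plus two extra columns.
pl.DataFrame({"a": range(n)}).write_parquet("only_a.parquet")
pl.DataFrame({"a": range(n), "b": range(n), "c": range(n)}).write_parquet("abc.parquet")

# With rechunk=False, reading "a" from only_a.parquet should match the scan path,
# while reading "a" from abc.parquet reportedly uses roughly 3x the memory.
pl.read_parquet("only_a.parquet", columns=["a"], rechunk=False)
pl.read_parquet("abc.parquet", columns=["a"], rechunk=False)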

@itamarst
Contributor

itamarst commented Mar 21, 2024

The implementation of read_parquet() is essentially scan_parquet().select(columns).collect(no_optimization=True), and that no_optimization=True appears to be the source of the high memory usage. So it can either simply be removed, or be replaced with disabling most optimizations while keeping the one that matters here.
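
In other words, the eager call behaves roughly like the lazy query below with the optimizer turned off. A simplified sketch of that equivalence, not the actual source:

import polars as pl

# Roughly what read_parquet("test.parquet", columns=["a"]) expands to today.
# With no_optimization=True, projection pushdown does not run, so more than
# just column "a" ends up being decoded.
eager_like = (
    pl.scan_parquet("test.parquet")
    .select(["a"])
    .collect(no_optimization=True)
)

# With the default optimizer enabled, only column "a" is read from the file.
optimized = pl.scan_parquet("test.parquet").select(["a"]).collect()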

@khwilson
Author

Thank you for figuring this out!

@stinodego removed the needs triage (Awaiting prioritization by a maintainer) label on Mar 22, 2024
ritchie46 pushed a commit that referenced this issue on Mar 28, 2024 (#15285)
Co-authored-by: Itamar Turner-Trauring <itamar@pythonspeed.com>
Co-authored-by: Stijn de Gooijer <stijndegooijer@gmail.com>