Performance drop in collect() for LazyFrame with pl.scan_ndjson in 0.20.25 #16141

Closed
dimitri-mar opened this issue May 9, 2024 · 2 comments · Fixed by #16154
Labels
bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

@dimitri-mar

dimitri-mar commented May 9, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import os
import json
import random
import string
import time

import polars as pl



### code to generate the dataset

# Function to generate a random string
def random_string(length=8):
    letters = string.ascii_letters + string.digits
    return ''.join(random.choice(letters) for _ in range(length))

# Create dataset directories
def create_dataset_structure(base_path, num_files):
    levels = ['level1', 'level2']
    files_per_dir = num_files // (len(levels) ** 2)
    count = 0

    # Create folders and JSON files
    for l1 in range(1, len(levels) + 1):
        for l2 in range(1, len(levels) + 1):
            dir_path = os.path.join(base_path, f'level1_{l1}', f'level2_{l2}')
            os.makedirs(dir_path, exist_ok=True)

            for _ in range(files_per_dir):
                file_name = f'{random_string(5)}.json'
                file_path = os.path.join(dir_path, file_name)

                json_data = {
                        'id': random_string(),
                        'title': {'value':random_string()},
                        'date': "1998-01-01",
                        'deeper_data': [
                            {'key': random_string(5), 'value': random_string(10)},
                            {'key': random_string(5), 'value': random_string(10)},
                            {'key': random_string(5), 'value': random_string(10)}
                        ]
                    }

                with open(file_path, 'w') as json_file:
                    json.dump(json_data, json_file)

                count += 1
                if count >= num_files:
                    return


# Generate the dataset
# Base path where all datasets will be created
base_dataset_path = 'dataset'
num_files_to_generate = 100000

# Generate the dataset structure with JSON files
create_dataset_structure(base_dataset_path, num_files_to_generate)

print(f"Generated {num_files_to_generate} JSON files.")

#### polars code
print("threadpool_size", pl.threadpool_size())
# time the metadata
start_time = time.time()
metadata = pl.scan_ndjson("dataset/*/*/*.json")
print(f"Metadata scan took {time.time() - start_time} seconds")

# time the metadata dataframe extraction
start_time_collection = time.time()
metadata_df = metadata.select(
    pl.col('id'),
    pl.col('title').struct["value"],
    pl.col('date').str.to_date("%Y-%m-%d")
).collect()
print(f"Metadata dataframe collection took {time.time() - start_time_collection} seconds")

Log output

# Version 0.19.12
Generated 100000 JSON files.
threadpool_size 12
Metadata scan took 1.2148280143737793 seconds
UNION: union is run in parallel
Metadata dataframe collection took 1.7806954383850098 seconds

# Version 0.20.25
Generated 100000 JSON files.
/home/dimitri/polars_tests/polars_test_performance.py:64: DeprecationWarning: `threadpool_size` is deprecated. It has been renamed to `thread_pool_size`.
  print("threadpool_size", pl.threadpool_size())
threadpool_size 12
Metadata scan took 0.2538626194000244 seconds
found multiple sources; run comm_subplan_elim
UNION: union is run in parallel
Metadata dataframe collection took 54.37629246711731 seconds
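
Side note on the deprecation warning in the 0.20.25 log: the message itself says the helper was renamed, so on recent versions the reproduction can call the new name (a minor tweak, unrelated to the slowdown):

# On 0.20.x the thread-pool helper was renamed, as reported by the warning above.
print("thread_pool_size", pl.thread_pool_size())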

Issue description

This is a follow-up to the previous issue #16067.
Compared to Polars 0.19.12, I expected the newer version to be faster. Instead, the collection step is far slower (1.7 s vs. 54 s).
I am working with a relatively large dataset of around 700k JSON files totalling ~3 GB, and this issue makes 0.20.25 unusable for me. The old version, by contrast, was fast, as the log output above shows.
I tested it on two different machines, both running Linux. Using a different Python version does not seem to make any noticeable difference.
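
One way to narrow down which plan node is slow (not part of the original report; a hedged sketch assuming the same generated dataset as above) is LazyFrame.profile(), which runs the query and returns per-node timings alongside the result:

# Hedged sketch, not from the original report: profile the same query on 0.20.25
# to see which node of the plan dominates the runtime.
import polars as pl

lf = pl.scan_ndjson("dataset/*/*/*.json").select(
    pl.col("id"),
    pl.col("title").struct["value"],
    pl.col("date").str.to_date("%Y-%m-%d"),
)
result_df, timings_df = lf.profile()  # timings_df lists start/end times per plan node
print(timings_df)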

Expected behavior

I expected Polars 0.20.25 to be at least as fast as 0.19.12. Instead, the collection step is far slower (1.7 s vs. 54 s).
Note that in the new version the initial scan with pl.scan_ndjson is much faster than in the old version (1.2 s -> 0.25 s) and does not seem to scale with the number of files.

Installed versions

For version 0.19.12:
--------Version info---------
Polars:              0.19.12
Index type:          UInt32
Platform:            Linux-6.8.6-arch1-1-x86_64-with-glibc2.39
Python:              3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
gevent:              <not installed>
matplotlib:          <not installed>
numpy:               <not installed>
openpyxl:            <not installed>
pandas:              <not installed>
pyarrow:             <not installed>
pydantic:            <not installed>
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
None

For version 0.20.25:

--------Version info---------
Polars:               0.20.25
Index type:           UInt32
Platform:             Linux-6.8.6-arch1-1-x86_64-with-glibc2.39
Python:               3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
None
dimitri-mar added the bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer) and python (Related to Python Polars) labels on May 9, 2024
@ritchie46
Member

How fast is it if you set comm_subplan_elim=False?
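
For reference, a minimal sketch of how that suggestion could be tested against the reproduction above (comm_subplan_elim is an optimizer flag on LazyFrame.collect; the dataset path is the one from the example):

import time

import polars as pl

metadata = pl.scan_ndjson("dataset/*/*/*.json")

start_time = time.time()
metadata_df = metadata.select(
    pl.col("id"),
    pl.col("title").struct["value"],
    pl.col("date").str.to_date("%Y-%m-%d"),
).collect(comm_subplan_elim=False)  # disable common-subplan elimination as suggested
print(f"Collection with comm_subplan_elim=False took {time.time() - start_time} seconds")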

@ritchie46
Member

Ah, I see what it is. It is accidentally quadratic behavior due to the sorted check. Will fix it.
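
(For illustration only, and purely hypothetical rather than Polars' actual internals: "accidentally quadratic" means the total work grows with the square of the number of scanned files, e.g. if a sortedness check walks everything accumulated so far on every append. A toy Python sketch of that scaling:)

import time

# Hypothetical toy example, not the real code path: re-checking sortedness over
# all chunks on every append does 1 + 2 + ... + n comparisons, i.e. O(n^2) work.
def build(n: int) -> float:
    chunks: list[int] = []
    start = time.time()
    for value in range(n):
        chunks.append(value)
        # full O(len(chunks)) sortedness pass on every append -> quadratic overall
        assert all(chunks[i] <= chunks[i + 1] for i in range(len(chunks) - 1))
    return time.time() - start

print(build(2_000))  # baseline
print(build(4_000))  # roughly 4x the baseline, not 2x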
