
Performance drop in collect() for LazyFrame with pl.scan_ndjson in 0.20.23 #16067

Closed
dimitri-mar opened this issue May 6, 2024 · 11 comments · Fixed by #16096
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

@dimitri-mar

dimitri-mar commented May 6, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import time


print("threadpool_size", pl.threadpool_size())  # renamed to pl.thread_pool_size() in 0.20.x
# time the metadata
start_time = time.time()
metadata = pl.scan_ndjson("dataset/*/*/*.json")
print(f"Metadata scan took {time.time() - start_time} seconds")

# time the metadata dataframe extraction
start_time_collection = time.time()
metadata_df = metadata.select(
    pl.col('id'),
    pl.col('title').struct["value"],
    pl.col('date').str.to_date("%Y-%m-%d")
).collect()
print(f"Metadata dataframe collection took {time.time() - start_time_collection} seconds")

Log output

# in polars 0.19.12
threadpool_size 20
Metadata scan took 14.438461780548096 seconds
UNION: union is run in parallel
Metadata dataframe collection took 16.68697190284729 seconds

# in polars '0.20.23'
.../polar_tests.py:5: DeprecationWarning: `threadpool_size` is deprecated. It has been renamed to `thread_pool_size`.
  print("threadpool_size", pl.threadpool_size())
threadpool_size 20
Metadata scan took 1.6403937339782715 seconds
found multiple sources; run comm_subplan_elim
# after ten minutes it is still running

Issue description

I am working with a relatively large dataset of around 700k JSON files with a total of ~3 GB (here is a script to generate a similar dataset).
When I was using Polars version 0.19.12, the collection operation took around 16 seconds and used all available threads. After updating to Polars 0.20.23, the collection process uses only a single thread, and after several minutes it is still running.

Expected behavior

I expected the execution time after the update to be the same, or less.

Installed versions

--------Version info---------
Polars:              0.19.12
Index type:          UInt32
Platform:            Linux-6.8.4-arch1-1-x86_64-with-glibc2.39
Python:              3.12.3 | packaged by Anaconda, Inc. | (main, Apr 19 2024, 16:50:38) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
gevent:              <not installed>
matplotlib:          <not installed>
numpy:               1.26.4
openpyxl:            <not installed>
pandas:              2.2.2
pyarrow:             16.0.0
pydantic:            <not installed>
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
None

while in the new installation

--------Version info---------
Polars:               0.20.23
Index type:           UInt32
Platform:             Linux-6.8.4-arch1-1-x86_64-with-glibc2.39
Python:               3.12.3 | packaged by Anaconda, Inc. | (main, Apr 19 2024, 16:50:38) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
None
@dimitri-mar added the bug, needs triage, and python labels on May 6, 2024
@ritchie46 (Member)

Could you provide a reproduction? Maybe share the dataset or create a custom one?

@dimitri-mar (Author)

Here is a script that can generate a similar dataset:

import os
import json
import random
import string

# Function to generate a random string
def random_string(length=8):
    letters = string.ascii_letters + string.digits
    return ''.join(random.choice(letters) for _ in range(length))

# Create dataset directories
def create_dataset_structure(base_path, num_files):
    levels = ['level1', 'level2']
    files_per_dir = num_files // (len(levels) ** 2)
    count = 0

    # Create folders and JSON files
    for l1 in range(1, len(levels) + 1):
        for l2 in range(1, len(levels) + 1):
            dir_path = os.path.join(base_path, f'level1_{l1}', f'level2_{l2}')
            os.makedirs(dir_path, exist_ok=True)

            for _ in range(files_per_dir):
                file_name = f'{random_string(5)}.json'
                file_path = os.path.join(dir_path, file_name)

                json_data = {
                        'id': random_string(),
                        'title': {'value':random_string()},
                        'date': "1998-01-01",
                        'deeper_data': [
                            {'key': random_string(5), 'value': random_string(10)},
                            {'key': random_string(5), 'value': random_string(10)},
                            {'key': random_string(5), 'value': random_string(10)}
                        ]
                    }

                with open(file_path, 'w') as json_file:
                    json.dump(json_data, json_file)

                count += 1
                if count >= num_files:
                    return

# Base path where all datasets will be created
base_dataset_path = 'dataset'
num_files_to_generate = 10000

# Generate the dataset structure with JSON files
create_dataset_structure(base_dataset_path, num_files_to_generate)

print(f"Generated {num_files_to_generate} JSON files.")


@dimitri-mar (Author)

dimitri-mar commented May 6, 2024

> Perhaps you can try `.collect(comm_subplan_elim=False)` to see if that is the specific problem.
>
> (Just noticed it is the last thing the log mentions - so it may help narrow things down.)

@cmdlineluser, you are right! That works! Thank you. The collect operation is now only slightly slower than in version 0.19. The default values of the `collect` method have not changed. It is not clear to me why setting `comm_subplan_elim=False` speeds up the computation so much, nor why it was not needed in 0.19.12.

@cmdlineluser (Contributor)

cmdlineluser commented May 6, 2024

(Apologies @dimitri-mar - I deleted the comment as I thought it was no longer useful.)

`comm_subplan_elim` is new, so it did not exist in 0.19.12.

@dimitri-mar (Author)

dimitri-mar commented May 6, 2024

@cmdlineluser, in 0.19.12 `comm_subplan_elim` was already there. I am wondering whether the need to set `comm_subplan_elim=False` is intended behavior. I would never have guessed it if you had not suggested it.

@cmdlineluser (Contributor)

I remembered wrongly - my bad.

Something new was introduced which did cause some changes (I think it was perhaps "Full Plan CSE" #15264)

Anyways - I'll leave you to the people who know what they are talking about :-D

@ritchie46 (Member)

I know what the culprit is. I will try to come up with a solution in a few days.

@dimitri-mar (Author)

> Perhaps you can try `.collect(comm_subplan_elim=False)` to see if that is the specific problem.
> (Just noticed it is the last thing the log mentions - so it may help narrow things down.)
>
> @cmdlineluser, you are right! That works! Thank you. The collect operation is only slightly slower now than in version 0.19.

I realized I made some mistakes while testing... Unfortunately, it does not solve the issue. Sorry.
The only thing that changes is the verbose output of the collection, which now says `UNION: union is run in parallel` like version 0.19, but it still runs seemingly forever.

The suspicious aspect to me is that the initial scan takes an order of magnitude less time than in version 0.19: from ~16 s down to ~1.6 s, as if it is not checking the schema of all the files. Sorry for the premature optimism 😅

@cmdlineluser (Contributor)

Not a problem @dimitri-mar - I was just trying to help triage the issue, not propose a solution.

Figuring out which optimization (if any) is preventing the collect from working sometimes helps narrow down where the problem is.

It didn't really help here - and Ritchie was already on the case, so it's all good.

@dimitri-mar (Author)

dimitri-mar commented May 7, 2024

Hi, sorry, I am back. I tried the new release, 0.20.24; it should incorporate the fix, right? But I still have the same problem.
I moved back to version 0.19.12 to be able to work on this dataset. Do you have any suggestions on what I can do to help debug it? Now that the issue is closed, do I need to open a new one?

@cmdlineluser (Contributor)

Yes, that fix was part of 0.20.24.

We can tag @ritchie46 in this case.

If there are no updates, I think opening a new issue is appropriate.
