
Performance drop in collect() for LazyFrame with pl.scan_ndjson in 0.20.23 #16067

Closed
dimitri-mar opened this issue May 6, 2024 · 11 comments · Fixed by #16096
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

@dimitri-mar

dimitri-mar commented May 6, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import time


print("threadpool_size", pl.threadpool_size())  # renamed to pl.thread_pool_size() in 0.20.x
# time the metadata
start_time = time.time()
metadata = pl.scan_ndjson("dataset/*/*/*.json")
print(f"Metadata scan took {time.time() - start_time} seconds")

# time the metadata dataframe extraction
start_time_collection = time.time()
metadata_df = metadata.select(
    pl.col('id'),
    pl.col('title').struct["value"],
    pl.col('date').str.to_date("%Y-%m-%d")
).collect()
print(f"Metadata dataframe collection took {time.time() - start_time_collection} seconds")

Log output

# in polars 0.19.12
threadpool_size 20
Metadata scan took 14.438461780548096 seconds
UNION: union is run in parallel
Metadata dataframe collection took 16.68697190284729 seconds

# in polars '0.20.23'
.../polar_tests.py:5: DeprecationWarning: `threadpool_size` is deprecated. It has been renamed to `thread_pool_size`.
  print("threadpool_size", pl.threadpool_size())
threadpool_size 20
Metadata scan took 1.6403937339782715 seconds
found multiple sources; run comm_subplan_elim
# after ten minutes it is still running

Issue description

I am working with a relatively large dataset of around 700k JSON files with a total of ~3 GB (here is a script to generate a similar dataset).
When I was using Polars version 0.19.12, the collection operation took around 16 seconds and used all available threads. After updating to Polars 0.20.23, the collection process uses only a single thread, and after several minutes it is still running.

Expected behavior

I expected the execution time after the update to be the same, or less.

Installed versions

--------Version info---------
Polars:              0.19.12
Index type:          UInt32
Platform:            Linux-6.8.4-arch1-1-x86_64-with-glibc2.39
Python:              3.12.3 | packaged by Anaconda, Inc. | (main, Apr 19 2024, 16:50:38) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
gevent:              <not installed>
matplotlib:          <not installed>
numpy:               1.26.4
openpyxl:            <not installed>
pandas:              2.2.2
pyarrow:             16.0.0
pydantic:            <not installed>
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
None

while in the new installation

--------Version info---------
Polars:               0.20.23
Index type:           UInt32
Platform:             Linux-6.8.4-arch1-1-x86_64-with-glibc2.39
Python:               3.12.3 | packaged by Anaconda, Inc. | (main, Apr 19 2024, 16:50:38) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
None
@dimitri-mar added the bug, needs triage, and python labels on May 6, 2024
@ritchie46 (Member)

Could you provide a reproduction? Maybe share the dataset or create a custom one?

@dimitri-mar (Author)

Here is a script that can generate a similar dataset:

import os
import json
import random
import string

# Function to generate a random string
def random_string(length=8):
    letters = string.ascii_letters + string.digits
    return ''.join(random.choice(letters) for _ in range(length))

# Create dataset directories
def create_dataset_structure(base_path, num_files):
    levels = ['level1', 'level2']
    files_per_dir = num_files // (len(levels) ** 2)
    count = 0

    # Create folders and JSON files
    for l1 in range(1, len(levels) + 1):
        for l2 in range(1, len(levels) + 1):
            dir_path = os.path.join(base_path, f'level1_{l1}', f'level2_{l2}')
            os.makedirs(dir_path, exist_ok=True)

            for _ in range(files_per_dir):
                file_name = f'{random_string(5)}.json'
                file_path = os.path.join(dir_path, file_name)

                json_data = {
                        'id': random_string(),
                        'title': {'value':random_string()},
                        'date': "1998-01-01",
                        'deeper_data': [
                            {'key': random_string(5), 'value': random_string(10)},
                            {'key': random_string(5), 'value': random_string(10)},
                            {'key': random_string(5), 'value': random_string(10)}
                        ]
                    }

                with open(file_path, 'w') as json_file:
                    json.dump(json_data, json_file)

                count += 1
                if count >= num_files:
                    return

# Base path where all datasets will be created
base_dataset_path = 'dataset'
num_files_to_generate = 10000

# Generate the dataset structure with JSON files
create_dataset_structure(base_dataset_path, num_files_to_generate)

print(f"Generated {num_files_to_generate} JSON files.")


@dimitri-mar (Author)

dimitri-mar commented May 6, 2024

> Perhaps you can try `.collect(comm_subplan_elim=False)` to see if that is the specific problem.
>
> (Just noticed it is the last thing the log mentions - so it may help narrow things down.)

@cmdlineluser, you are right! That works! Thank you. The collect operation is now only slightly slower than in version 0.19. The default values of the `collect` method have not changed. It is not clear to me why setting `comm_subplan_elim=False` speeds up the computation so much, nor why it was not needed in 0.19.12.

@cmdlineluser (Contributor)

cmdlineluser commented May 6, 2024

(Apologies @dimitri-mar - I deleted the comment as I thought it was no longer useful.)

`comm_subplan_elim` is new, so it did not exist in 0.19.12.

@dimitri-mar (Author)

dimitri-mar commented May 6, 2024

@cmdlineluser, in 0.19.12 `comm_subplan_elim` was already there. I am wondering whether the need to set `comm_subplan_elim=False` is intended behavior. I would never have guessed it if you had not suggested it.

@cmdlineluser (Contributor)

I remembered wrongly - my bad.

Something new was introduced which did cause some changes (I think it was perhaps "Full Plan CSE" #15264)

Anyways - I'll leave you to the people who know what they are talking about :-D

@ritchie46 (Member)

I know what the culprit is. I will try to come up with a solution in a few days.

@dimitri-mar (Author)

> Perhaps you can try `.collect(comm_subplan_elim=False)` to see if that is the specific problem.
> (Just noticed it is the last thing the log mentions - so it may help narrow things down.)
>
> @cmdlineluser, you are right! That works! Thank you. The collect operation is only slightly slower now than in version 0.19.

I realized I made some mistakes while testing... Unfortunately, it does not solve the issue. Sorry.
The only thing that changes is the verbose output of the collection, which now says `UNION: union is run in parallel` like version 0.19, but it still runs seemingly forever.

The suspicious aspect to me is that the initial scan takes an order of magnitude less time than in version 0.19: from ~16 s down to ~1.6 s, as if it is not checking the schema of all the files. Sorry for the premature optimism 😅

@cmdlineluser (Contributor)

Not a problem @dimitri-mar - I was just trying to help triage the issue, not propose a solution.

Figuring out which optimization (if any) is preventing the collect from working sometimes helps narrow down where the problem is.

It didn't really help here - and Ritchie was already on the case, so it's all good.

@dimitri-mar (Author)

dimitri-mar commented May 7, 2024

Hi, sorry, I am back. I tried the new release, 0.20.24; it should incorporate the fix, right? But I still have the same problem.
I moved back to version 0.19.12 to be able to work on this dataset. Do you have any suggestions on what I can do to help debug it? Now that the issue is closed, do I need to open a new one?

@cmdlineluser (Contributor)

Yes, that fix was part of 0.20.24.

We can tag @ritchie46 in this case.

If there are no updates, I think opening a new issue is appropriate.
