Performance drop in collect() for LazyFrame with pl.scan_ndjson in 0.20.25 #16141

Closed
dimitri-mar opened this issue May 9, 2024 · 2 comments · Fixed by #16154
Labels
bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

@dimitri-mar

dimitri-mar commented May 9, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import os
import json
import random
import string
import time

import polars as pl



### code to generate the dataset

# Function to generate a random string
def random_string(length=8):
    letters = string.ascii_letters + string.digits
    return ''.join(random.choice(letters) for _ in range(length))

# Create dataset directories
def create_dataset_structure(base_path, num_files):
    levels = ['level1', 'level2']
    files_per_dir = num_files // (len(levels) ** 2)
    count = 0

    # Create folders and JSON files
    for l1 in range(1, len(levels) + 1):
        for l2 in range(1, len(levels) + 1):
            dir_path = os.path.join(base_path, f'level1_{l1}', f'level2_{l2}')
            os.makedirs(dir_path, exist_ok=True)

            for _ in range(files_per_dir):
                file_name = f'{random_string(5)}.json'
                file_path = os.path.join(dir_path, file_name)

                json_data = {
                        'id': random_string(),
                        'title': {'value':random_string()},
                        'date': "1998-01-01",
                        'deeper_data': [
                            {'key': random_string(5), 'value': random_string(10)},
                            {'key': random_string(5), 'value': random_string(10)},
                            {'key': random_string(5), 'value': random_string(10)}
                        ]
                    }

                with open(file_path, 'w') as json_file:
                    json.dump(json_data, json_file)

                count += 1
                if count >= num_files:
                    return


# Generate the dataset
# Base path where all datasets will be created
base_dataset_path = 'dataset'
num_files_to_generate = 100000

# Generate the dataset structure with JSON files
create_dataset_structure(base_dataset_path, num_files_to_generate)

print(f"Generated {num_files_to_generate} JSON files.")

#### polars code
print("threadpool_size", pl.threadpool_size())
# time the metadata
start_time = time.time()
metadata = pl.scan_ndjson("dataset/*/*/*.json")
print(f"Metadata scan took {time.time() - start_time} seconds")

# time the metadata dataframe extraction
start_time_collection = time.time()
metadata_df = metadata.select(
    pl.col('id'),
    pl.col('title').struct["value"],
    pl.col('date').str.to_date("%Y-%m-%d")
).collect()
print(f"Metadata dataframe collection took {time.time() - start_time_collection} seconds")

Log output

# Version 0.19.12
Generated 100000 JSON files.
threadpool_size 12
Metadata scan took 1.2148280143737793 seconds
UNION: union is run in parallel
Metadata dataframe collection took 1.7806954383850098 seconds

# Version 0.20.25
Generated 100000 JSON files.
/home/dimitri/polars_tests/polars_test_performance.py:64: DeprecationWarning: `threadpool_size` is deprecated. It has been renamed to `thread_pool_size`.
  print("threadpool_size", pl.threadpool_size())
threadpool_size 12
Metadata scan took 0.2538626194000244 seconds
found multiple sources; run comm_subplan_elim
UNION: union is run in parallel
Metadata dataframe collection took 54.37629246711731 seconds
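
Side note on the deprecation warning in the 0.20.25 log: the message itself says the helper was renamed, so on recent versions the reproduction can call the new name (a minor tweak, unrelated to the slowdown):

# On 0.20.x the thread-pool helper was renamed, as reported by the warning above.
print("thread_pool_size", pl.thread_pool_size())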

Issue description

This is a follow-up to the previous issue #16067.
Compared to Polars 0.19.12, I expected the newer version to be faster. Instead, the collection step is far slower (1.7 s vs. 54 s).
I am working with a relatively large dataset of around 700k JSON files totalling ~3 GB, and this issue makes 0.20.25 unusable for me. The old version, by contrast, was fast, as the log output above shows.
I tested it on two different machines, both running Linux. Using a different Python version does not seem to make any noticeable difference.
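
One way to narrow down which plan node is slow (not part of the original report; a hedged sketch assuming the same generated dataset as above) is LazyFrame.profile(), which runs the query and returns per-node timings alongside the result:

# Hedged sketch, not from the original report: profile the same query on 0.20.25
# to see which node of the plan dominates the runtime.
import polars as pl

lf = pl.scan_ndjson("dataset/*/*/*.json").select(
    pl.col("id"),
    pl.col("title").struct["value"],
    pl.col("date").str.to_date("%Y-%m-%d"),
)
result_df, timings_df = lf.profile()  # timings_df lists start/end times per plan node
print(timings_df)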

Expected behavior

I expected Polars 0.20.25 to be at least as fast as 0.19.12. Instead, the collection step is far slower (1.7 s vs. 54 s).
Note that in the new version the initial scan with pl.scan_ndjson is much faster than in the old version (1.2 s -> 0.25 s) and does not seem to scale with the number of files.

Installed versions

For version 0.19.12:
--------Version info---------
Polars:              0.19.12
Index type:          UInt32
Platform:            Linux-6.8.6-arch1-1-x86_64-with-glibc2.39
Python:              3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
gevent:              <not installed>
matplotlib:          <not installed>
numpy:               <not installed>
openpyxl:            <not installed>
pandas:              <not installed>
pyarrow:             <not installed>
pydantic:            <not installed>
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
None

For version 0.20.25:

--------Version info---------
Polars:               0.20.25
Index type:           UInt32
Platform:             Linux-6.8.6-arch1-1-x86_64-with-glibc2.39
Python:               3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:38:13) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
None
dimitri-mar added the bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer) and python (Related to Python Polars) labels on May 9, 2024
@ritchie46
Member

How fast is it if you set comm_subplan_elim=False?
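
For reference, a minimal sketch of how that suggestion could be tested against the reproduction above (comm_subplan_elim is an optimizer flag on LazyFrame.collect; the dataset path is the one from the example):

import time

import polars as pl

metadata = pl.scan_ndjson("dataset/*/*/*.json")

start_time = time.time()
metadata_df = metadata.select(
    pl.col("id"),
    pl.col("title").struct["value"],
    pl.col("date").str.to_date("%Y-%m-%d"),
).collect(comm_subplan_elim=False)  # disable common-subplan elimination as suggested
print(f"Collection with comm_subplan_elim=False took {time.time() - start_time} seconds")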

@ritchie46
Member

Ah, I see what it is. It is accidentally quadratic behavior due to the sorted check. Will fix it.
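
(For illustration only, and purely hypothetical rather than Polars' actual internals: "accidentally quadratic" means the total work grows with the square of the number of scanned files, e.g. if a sortedness check walks everything accumulated so far on every append. A toy Python sketch of that scaling:)

import time

# Hypothetical toy example, not the real code path: re-checking sortedness over
# all chunks on every append does 1 + 2 + ... + n comparisons, i.e. O(n^2) work.
def build(n: int) -> float:
    chunks: list[int] = []
    start = time.time()
    for value in range(n):
        chunks.append(value)
        # full O(len(chunks)) sortedness pass on every append -> quadratic overall
        assert all(chunks[i] <= chunks[i + 1] for i in range(len(chunks) - 1))
    return time.time() - start

print(build(2_000))  # baseline
print(build(4_000))  # roughly 4x the baseline, not 2x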
