Performance drop in collect() for LazyFrame with pl.scan_ndjson in 0.20.23 #16067
Could you provide a reproduction? Maybe share the dataset or create a custom one?
Here is a script that can generate a similar dataset:

```python
import os
import json
import random
import string

# Function to generate a random string
def random_string(length=8):
    letters = string.ascii_letters + string.digits
    return ''.join(random.choice(letters) for _ in range(length))

# Create dataset directories
def create_dataset_structure(base_path, num_files):
    levels = ['level1', 'level2']
    files_per_dir = num_files // (len(levels) ** 2)
    count = 0
    # Create folders and JSON files
    for l1 in range(1, len(levels) + 1):
        for l2 in range(1, len(levels) + 1):
            dir_path = os.path.join(base_path, f'level1_{l1}', f'level2_{l2}')
            os.makedirs(dir_path, exist_ok=True)
            for _ in range(files_per_dir):
                file_name = f'{random_string(5)}.json'
                file_path = os.path.join(dir_path, file_name)
                json_data = {
                    'id': random_string(),
                    'title': {'value': random_string()},
                    'date': "1998-01-01",
                    'deeper_data': [
                        {'key': random_string(5), 'value': random_string(10)},
                        {'key': random_string(5), 'value': random_string(10)},
                        {'key': random_string(5), 'value': random_string(10)}
                    ]
                }
                with open(file_path, 'w') as json_file:
                    json.dump(json_data, json_file)
                count += 1
                if count >= num_files:
                    return

# Base path where all datasets will be created
base_dataset_path = 'dataset'
num_files_to_generate = 10000

# Generate the dataset structure with JSON files
create_dataset_structure(base_dataset_path, num_files_to_generate)
print(f"Generated {num_files_to_generate} JSON files.")
```
@cmdlineluser,
(Apologies @dimitri-mar - I deleted the comment as I thought it was no longer useful.)
@cmdlineluser, comm_subplan_elim was already there in 0.19.12. I am wondering whether needing comm_subplan_elim=False is intended behavior. I would never have guessed it if you had not suggested it.
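For anyone landing here, the workaround under discussion is a per-query toggle on collect(); a minimal sketch, assuming the same glob path as the scan example above:

```python
import polars as pl

lf = pl.scan_ndjson("dataset/**/*.json")  # as in the sketch above

# Disable common-subplan elimination for this one query via the
# collect() keyword of the same name.
df = lf.collect(comm_subplan_elim=False)
```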
I remembered wrongly - my bad. Something new was introduced which did cause some changes (I think it was perhaps "Full Plan CSE" #15264). Anyway, I'll leave you to the people who know what they are talking about :-D
I know what the culprit is. Will try to come up with a solution in a few days.
I realized I made some mistakes while testing... Unfortunately, it does not solve the issue. Sorry. What looks suspicious to me is that the initial scan now takes an order of magnitude less time than in version 0.19: from ~16s to 1.6s, as if it is not checking the schema of all the files. Sorry for the premature optimism 😅
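If schema inference is indeed the variable here, one knob worth isolating is scan_ndjson's infer_schema_length, which bounds how many rows are sampled to infer the schema. A sketch under that assumption (path as above):

```python
import polars as pl

# A small infer_schema_length keeps the initial scan cheap, at the risk
# of an incomplete schema if later files differ; a larger value forces a
# deeper, slower inference pass.
lf = pl.scan_ndjson("dataset/**/*.json", infer_schema_length=100)
print(lf.schema)  # resolving the schema triggers the inference work
```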
Not a problem @dimitri-mar - I was just trying to help triage the issue, not propose a solution. Figuring out which optimization (if any) is stopping the collect from working sometimes helps narrow down where the problem is. It didn't really help here - and Ritchie was already on the case, so it's all good.
Hi, sorry, I am back. I tried the new release, 0.20.24; it should incorporate the fix, right? But I still have the same problem.
Yes, that fix was part of 0.20.24. We can tag @ritchie46 in this case. If there are no updates, I think opening a new issue is appropriate.
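One quick sanity check before re-triaging, in case the environment did not actually pick up the new release:

```python
import polars as pl

print(pl.__version__)  # should print "0.20.24" if the fixed release is installed
pl.show_versions()     # full environment report, as requested in bug reports
```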
Checks
Reproducible example
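A minimal sketch consistent with the issue description; the dataset path and the timing harness are assumptions, not necessarily the reporter's original code:

```python
import time
import polars as pl

lf = pl.scan_ndjson("dataset/**/*.json")

start = time.perf_counter()
df = lf.collect()  # ~16 s on 0.19.12; still running after minutes on 0.20.23
print(f"collected {df.height} rows in {time.perf_counter() - start:.1f} s")
```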
Log output
Issue description
I am working with a relatively large dataset of around 700k JSON files totalling ~3 GB (see the script above for generating a similar dataset).
In Polars version 0.19.12, the collection operation took around 16 seconds and used all the threads available. On the other hand, after updating to Polars 0.20.23, the collection process uses only a single thread, and after several minutes, it is still running.
Expected behavior
I was expecting the execution time after the update to be the same or lower.
Installed versions
while in the new installation