Base parquet batch_size on parquet row group size #6701
Conversation
Thanks for making this more flexible.
Apart from the clarification on the new behavior, this looks good to me.
```python
except ValueError as e:
    logger.error(f"Failed to read file '{file}' with error {type(e)}: {e}")
    raise
if parquet_file.metadata.num_row_groups > 0:
```
This check was not present before. Is this behavior change on purpose? What if `not parquet_file.metadata.num_row_groups > 0`?
Yes, it's because I now check the metadata of the first row group, so I must first check that there is at least one ^^
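For context, here is a minimal sketch of the row-group-based batching this check supports (the file path, fallback value, and loop body are illustrative, not the PR's exact code):

```python
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("data.parquet")  # illustrative path

# metadata.row_group(0) would fail on a file with zero row groups,
# hence the num_row_groups guard discussed above.
if parquet_file.metadata.num_row_groups > 0:
    # One read batch ~ one row group, so the first example can be
    # yielded after downloading roughly a single row group.
    batch_size = parquet_file.metadata.row_group(0).num_rows
else:
    batch_size = 10_000  # fallback for empty files; the PR's default may differ

for record_batch in parquet_file.iter_batches(batch_size=batch_size):
    print(record_batch.num_rows)
```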
This allows streaming datasets like Major-TOM/Core-S2L2A, which have row groups with few rows (one row is ~10MB). Previously, the cold start would take a long time and OOM, because many row groups were downloaded before the first example was yielded.
I tried it on OpenOrca and imagenet-hard, and it doesn't affect overall throughput.
Even though the overall throughput doesn't change for datasets like imagenet-hard with big rows, note that this does create shorter and more frequent pauses to download the next row group. I find this fine, because previously the pauses were less frequent but very long (downloading multiple row groups at a time).
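As a usage illustration, this is the kind of streaming call the change speeds up (dataset name taken from the description above; the split name is an assumption):

```python
from datasets import load_dataset

# Stream a dataset whose row groups contain few, very large rows.
# With row-group-sized batches, the first example arrives after
# downloading roughly one row group instead of many.
ds = load_dataset("Major-TOM/Core-S2L2A", split="train", streaming=True)
first_example = next(iter(ds))
```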