[BUG] Fix parquet reads with limit across row groups #2751
Conversation
LGTM! Thanks again for looking into this
```python
(
    "parquet/test_parquet_limits_across_row_groups",
    "s3://daft-public-data/test_fixtures/parquet-dev/tpch-issue#2730.parquet",
),
```
To my understanding, reading this file should have worked before. It just used to error if we applied a limit, which is not done by any tests over the URLs in `DAFT_CAN_READ_FILES`.
Oh you're totally right. Forgot that this file doesn't apply a limit.
Moved the S3 tests
CodSpeed Performance Report: Merging #2751 will improve performance by ×3.
Benchmarks breakdown
@desmondcheongzx could you fix up the failing integration test and merge this in?
@kevinzwang sounds good. I fixed the integration test and will merge once CI passes.
When reading local parquet files containing multiple row groups with a limit applied, the resulting table sometimes does not respect the given limit, causing errors such as:

```
DaftError::ValueError While building a Table with Table::new_with_size, we found that the Series lengths did not match. Series named: col had length: 2048 vs the specified Table length: 1034
```

The issue was a small bug where each row group range being read would take the global limit passed into the parquet read, instead of the pre-computed per-row-group limit, which accounts for how many rows had already been read by previous row groups. This caused the parquet reader to read more rows from a row group than specified.

To fix this, we pass the pre-computed row group limit properly to the reader.
For example, consider a parquet file with the following layout:
When applying a `.limit(1050)` over this parquet file, with the bug, we would read 1024 rows each from row groups 0 and 1 (data pages `0-1` and `1-1`). Row groups 2 and 3 are skipped because the pre-computed row ranges see that the first two row groups already contain the required 1050 rows. However, the pre-computed row ranges are also aware that we only need 26 rows from row group 1, so we simply pass this information correctly into the reader.
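The per-row-group limit computation described above can be sketched as follows. This is a simplified illustration, not Daft's actual implementation; the helper name `row_group_limits` and its signature are hypothetical.

```python
def row_group_limits(row_group_sizes: list[int], global_limit: int) -> list[int]:
    """Compute how many rows to read from each row group so the total
    across all row groups respects the global limit.

    Hypothetical helper illustrating the fix: each row group receives
    its own pre-computed limit, accounting for rows already taken by
    earlier row groups, instead of the raw global limit.
    """
    limits = []
    remaining = global_limit
    for size in row_group_sizes:
        take = min(size, remaining)  # never read past the remaining budget
        limits.append(take)
        remaining -= take
    return limits


# Four row groups of 1024 rows each, with .limit(1050):
# row group 0 supplies 1024 rows, row group 1 supplies the remaining 26,
# and row groups 2 and 3 are skipped entirely.
print(row_group_limits([1024, 1024, 1024, 1024], 1050))  # → [1024, 26, 0, 0]
```

The bug was equivalent to reading `min(size, global_limit)` rows from every row group independently, which for this file would yield 1024 rows from each of the first two row groups (2048 total) against an expected 1050, producing the length-mismatch error shown above.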