
[BUG] Fix parquet reads with limit across row groups #2751

Merged

Conversation

desmondcheongzx
Contributor

When reading local parquet files containing multiple row groups with a limit applied, the resulting table sometimes does not respect the given limit, causing errors such as: `DaftError::ValueError While building a Table with Table::new_with_size, we found that the Series lengths did not match. Series named: col had length: 2048 vs the specified Table length: 1034`.

The issue was a small bug where each row group range being read would take the global limit passed into the parquet read, instead of the pre-computed per-row-group limit, which accounts for how many rows were already read by previous row groups. This caused the parquet reader to read more rows from a row group than specified.

To fix this, we pass the pre-computed row group limit properly to the reader.
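The pre-computation described above can be sketched in plain Python (hypothetical function name, not Daft's actual Rust code): the global limit is split into per-row-group limits up front, and each row group read then receives only its own share rather than the global limit.

```python
def compute_row_group_limits(row_group_sizes, global_limit):
    """Hypothetical sketch: split a global limit into per-row-group limits.

    Each row group takes as many rows as it can from the remaining budget;
    row groups after the limit is exhausted get 0 (they are skipped).
    """
    limits = []
    remaining = global_limit
    for size in row_group_sizes:
        take = min(size, remaining)
        limits.append(take)
        remaining -= take
    return limits

# Four row groups of 1024 rows with .limit(1050): only 26 rows are
# needed from row group 1, and row groups 2 and 3 are skipped.
print(compute_row_group_limits([1024, 1024, 1024, 1024], 1050))
# [1024, 26, 0, 0]
```

The fix is then just to hand `limits[i]` to the reader for row group `i`, instead of `global_limit`.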

For example, consider a parquet file with the following layout:

Column: col
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  S _  1       5.00 B     5 B       
  0-1    data  S R  1024    0.01 B     11 B                0       "b" / "b"
  1-D    dict  S _  1       5.00 B     5 B       
  1-1    data  S R  1024    0.01 B     11 B                0       "b" / "b"
  2-D    dict  S _  1       5.00 B     5 B       
  2-1    data  S R  1024    0.01 B     11 B                0       "b" / "b"
  3-D    dict  S _  1       5.00 B     5 B       
  3-1    data  S R  1024    0.01 B     11 B                0       "b" / "b"

When applying a .limit(1050) over this parquet file, the bug caused us to read 1024 rows each from row groups 0 and 1 (data pages 0-1 and 1-1), for 2048 rows in total. Row groups 2 and 3 are skipped because the pre-computed row ranges see that the first two row groups already contain the required 1050 rows. However, the pre-computed row ranges also know that we only need 26 rows from row group 1, so the fix is simply to pass this information through to the reader.
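The difference between the buggy and fixed paths can be illustrated with a small sketch (hypothetical function names; the real logic lives in Daft's Rust parquet reader). The buggy path clamps each selected row group against the global limit, while the fixed path consumes a running budget:

```python
def rows_read_buggy(row_group_sizes, global_limit):
    # Bug: each selected row group clamps against the *global* limit,
    # so every full row group before the cutoff is read in its entirety.
    total_read, total_seen = 0, 0
    for size in row_group_sizes:
        if total_seen >= global_limit:
            break  # row-range pre-computation skips later row groups
        total_read += min(size, global_limit)
        total_seen += size
    return total_read

def rows_read_fixed(row_group_sizes, global_limit):
    # Fix: each row group reads only its pre-computed share of the limit.
    total_read, remaining = 0, global_limit
    for size in row_group_sizes:
        if remaining == 0:
            break
        take = min(size, remaining)
        total_read += take
        remaining -= take
    return total_read

sizes = [1024, 1024, 1024, 1024]
print(rows_read_buggy(sizes, 1050))  # 2048 -- exceeds the limit
print(rows_read_fixed(sizes, 1050))  # 1050 -- as requested
```

With the bug, the reader produces 2048 rows while the table metadata expects fewer, which is exactly the length mismatch in the `Table::new_with_size` error above.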

@github-actions github-actions bot added the bug Something isn't working label Aug 27, 2024
Member

@kevinzwang kevinzwang left a comment


LGTM! Thanks again for looking into this

Comment on lines 177 to 180
(
"parquet/test_parquet_limits_across_row_groups",
"s3://daft-public-data/test_fixtures/parquet-dev/tpch-issue#2730.parquet",
),
Member


To my understanding, reading this file should have worked before. It just used to error if we applied a limit, which is not done by any tests over the URLs in DAFT_CAN_READ_FILES.

Contributor Author


Oh you're totally right. Forgot that this file doesn't apply a limit.

Moved the S3 tests


codspeed-hq bot commented Aug 28, 2024

CodSpeed Performance Report

Merging #2751 will improve performance by ×3

Comparing desmondcheongzx:fix-local-parquet-limits (c463352) with main (3a7a0b4)

Summary

⚡ 1 improvements
✅ 15 untouched benchmarks

Benchmarks breakdown

Benchmark | main | desmondcheongzx:fix-local-parquet-limits | Change
test_show[100 Small Files] | 154.2 ms | 51.1 ms | ×3

@kevinzwang
Member

@desmondcheongzx could you fix up the failing integration test and merge this in?

@desmondcheongzx
Contributor Author

@kevinzwang sounds good. I fixed the integration test and will merge once CI passes

@desmondcheongzx desmondcheongzx merged commit a97d871 into Eventual-Inc:main Aug 31, 2024
33 checks passed
@desmondcheongzx desmondcheongzx deleted the fix-local-parquet-limits branch August 31, 2024 02:56