
[BUG] Fix parquet reads with limit across row groups #2751

Merged

Conversation

desmondcheongzx
Contributor

When reading local parquet files containing multiple row groups with a limit applied, the resulting table sometimes does not respect the given limit, causing errors such as: `DaftError::ValueError While building a Table with Table::new_with_size, we found that the Series lengths did not match. Series named: col had length: 2048 vs the specified Table length: 1034`.

The issue was a small bug where each row group range being read would take the global limit passed into the parquet read, instead of the pre-computed per-row-group limit, which accounts for how many rows were already read by previous row groups. This caused the parquet reader to read more rows from a row group than specified.

To fix this, we pass the pre-computed row group limit properly to the reader.
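The pre-computation described above can be sketched in plain Python (hypothetical function name, not Daft's actual Rust code): the global limit is split into per-row-group limits up front, and each row group read then receives only its own share rather than the global limit.

```python
def compute_row_group_limits(row_group_sizes, global_limit):
    """Hypothetical sketch: split a global limit into per-row-group limits.

    Each row group takes as many rows as it can from the remaining budget;
    row groups after the limit is exhausted get 0 (they are skipped).
    """
    limits = []
    remaining = global_limit
    for size in row_group_sizes:
        take = min(size, remaining)
        limits.append(take)
        remaining -= take
    return limits

# Four row groups of 1024 rows with .limit(1050): only 26 rows are
# needed from row group 1, and row groups 2 and 3 are skipped.
print(compute_row_group_limits([1024, 1024, 1024, 1024], 1050))
# [1024, 26, 0, 0]
```

The fix is then just to hand `limits[i]` to the reader for row group `i`, instead of `global_limit`.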

For example, consider a parquet file with the following layout:

Column: col
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  S _  1       5.00 B     5 B       
  0-1    data  S R  1024    0.01 B     11 B                0       "b" / "b"
  1-D    dict  S _  1       5.00 B     5 B       
  1-1    data  S R  1024    0.01 B     11 B                0       "b" / "b"
  2-D    dict  S _  1       5.00 B     5 B       
  2-1    data  S R  1024    0.01 B     11 B                0       "b" / "b"
  3-D    dict  S _  1       5.00 B     5 B       
  3-1    data  S R  1024    0.01 B     11 B                0       "b" / "b"

When applying a .limit(1050) over this parquet file, the bug caused us to read 1024 rows each from row groups 0 and 1 (data pages 0-1 and 1-1), for 2048 rows in total. Row groups 2 and 3 are skipped because the pre-computed row ranges see that the first two row groups already contain the required 1050 rows. However, the pre-computed row ranges also know that we only need 26 rows from row group 1, so the fix is simply to pass this information through to the reader.
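The difference between the buggy and fixed paths can be illustrated with a small sketch (hypothetical function names; the real logic lives in Daft's Rust parquet reader). The buggy path clamps each selected row group against the global limit, while the fixed path consumes a running budget:

```python
def rows_read_buggy(row_group_sizes, global_limit):
    # Bug: each selected row group clamps against the *global* limit,
    # so every full row group before the cutoff is read in its entirety.
    total_read, total_seen = 0, 0
    for size in row_group_sizes:
        if total_seen >= global_limit:
            break  # row-range pre-computation skips later row groups
        total_read += min(size, global_limit)
        total_seen += size
    return total_read

def rows_read_fixed(row_group_sizes, global_limit):
    # Fix: each row group reads only its pre-computed share of the limit.
    total_read, remaining = 0, global_limit
    for size in row_group_sizes:
        if remaining == 0:
            break
        take = min(size, remaining)
        total_read += take
        remaining -= take
    return total_read

sizes = [1024, 1024, 1024, 1024]
print(rows_read_buggy(sizes, 1050))  # 2048 -- exceeds the limit
print(rows_read_fixed(sizes, 1050))  # 1050 -- as requested
```

With the bug, the reader produces 2048 rows while the table metadata expects fewer, which is exactly the length mismatch in the `Table::new_with_size` error above.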

@github-actions github-actions bot added the bug Something isn't working label Aug 27, 2024
Member

@kevinzwang kevinzwang left a comment


LGTM! Thanks again for looking into this

Comment on lines 177 to 180
(
"parquet/test_parquet_limits_across_row_groups",
"s3://daft-public-data/test_fixtures/parquet-dev/tpch-issue#2730.parquet",
),
Member


To my understanding, reading this file should have worked before. It just used to error if we applied a limit, which is not done by any tests over the URLs in DAFT_CAN_READ_FILES.

Contributor Author


Oh you're totally right. Forgot that this file doesn't apply a limit.

Moved the S3 tests


codspeed-hq bot commented Aug 28, 2024

CodSpeed Performance Report

Merging #2751 will improve performance by ×3

Comparing desmondcheongzx:fix-local-parquet-limits (c463352) with main (3a7a0b4)

Summary

⚡ 1 improvements
✅ 15 untouched benchmarks

Benchmarks breakdown

Benchmark | main | desmondcheongzx:fix-local-parquet-limits | Change
test_show[100 Small Files] | 154.2 ms | 51.1 ms | ×3

@kevinzwang
Member

@desmondcheongzx could you fix up the failing integration test and merge this in?

@desmondcheongzx
Contributor Author

@kevinzwang sounds good. I fixed the integration test and will merge once CI passes

@desmondcheongzx desmondcheongzx merged commit a97d871 into Eventual-Inc:main Aug 31, 2024
33 checks passed
@desmondcheongzx desmondcheongzx deleted the fix-local-parquet-limits branch August 31, 2024 02:56