Invalid ColumnIndex written in parquet #6310
I think the relevant code is arrow-rs/parquet/src/column/writer/mod.rs Lines 752 to 779 in ee2f75a
For the final page (with 30 values), arrow-rs/parquet/src/column/writer/mod.rs Lines 811 to 816 in ee2f75a
The chunk statistics look ok (min 1, max 1), so you'd think the page stats would similarly be ok. They are created here arrow-rs/parquet/src/column/writer/mod.rs Lines 889 to 902 in ee2f75a
Again, if the min/max were invalid in the page, then you'd expect garbage in the chunk stats. Perhaps some print statements or breakpoints would help here. If the original file isn't sensitive, could you share it here? cc @adriangb
Thanks @etseidl, my guess is that the problem comes because either
let null_page = (self.page_metrics.num_buffered_rows as u64) == self.page_metrics.num_page_nulls;
wrongly makes
The parquet file doesn't seem to have anything particularly sensitive in it, but I wouldn't be happy sharing it on GitHub. Happy to email it to you if you're interested?
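For reference, the quoted check can be looked at in isolation. This is only a minimal sketch: `PageMetrics` below is a hypothetical stand-in that models just the two counters named in the expression, the field types are assumptions, and `is_null_page` is an illustrative name rather than a function in the crate. It shows what the comparison computes, not whether it is the cause of the bug.

```rust
/// Hypothetical stand-in for the writer's per-page counters.
/// Only the two fields used by the quoted expression are modeled,
/// and their types are assumed for illustration.
#[derive(Debug, Default)]
struct PageMetrics {
    num_buffered_rows: u32,
    num_page_nulls: u64,
}

/// Mirrors the quoted expression: the page is flagged as a null page
/// only when the null count equals the number of buffered rows.
fn is_null_page(m: &PageMetrics) -> bool {
    (m.num_buffered_rows as u64) == m.num_page_nulls
}

fn main() {
    let all_null = PageMetrics { num_buffered_rows: 30, num_page_nulls: 30 };
    let mixed = PageMetrics { num_buffered_rows: 30, num_page_nulls: 7 };
    println!("all_null flagged as null page: {}", is_null_page(&all_null));
    println!("mixed flagged as null page: {}", is_null_page(&mixed));
}
```

The flag is a pure equality test on two counters, so any case where the two counters are not measured in the same unit would give a wrong answer in one direction or the other.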
Okay, ignore that suggestion. I've done some more digging and have a bit of progress, the key point from above is
I think this is saying that the last page has 7677 null values (which matches Sure enough, if I run I guess the next step is to build a parquet file with a
I am sorry, I am mostly on vacation this week so I haven't been following along as much as normal. This sounds to me like a bug that was introduced in #6105 but has not yet been released. Given that, I think it means we should hold 53.0.0 (#6016) until we have fixed this issue. @etseidl, is this something you can look into? It seems like #6315 is tracking a somewhat different issue.
I worked offline with @samuelcolvin and @adriangb and identified a potential fix. Hopefully a PR will be coming soon.
Thank you everyone -- #6319 is queued up and I plan to merge it shortly.
See #6295. We had an issue with MetadataLoader::load_page_index panicking on invalid metadata, which I "fixed" (Err instead of panic).

But since the invalid metadata was written by a very recent version of this crate, I also wanted to work out why invalid metadata was being written in the first place.

The problem (as shown in the test_invalid_column_index test in #6295) is an invalid ColumnIndex; specifically, the invalid data looked like this:

Note that the list item in null_pages is false, but all values in min_values and max_values are empty. That causes the Err from:

arrow-rs/parquet/src/file/page_index/index.rs
Lines 204 to 211 in ee2f75a

is_null is false, so from_le_slice::<T>(min) (and max) are called. 4 bytes are expected since T is i32, but the vec is empty.

I've tried in vain to work out where the code is that's writing that data.

cc @adriangb @alamb.
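The read-side failure described above can be reproduced in miniature. This is a sketch under stated assumptions, not the crate's code: `i32_from_le_slice`, `PageIndex`, and `decode_pages` are hypothetical names that imitate the shape of the logic (skip decoding when a page is flagged null, otherwise decode min and max as little-endian `i32`), and errors are plain `String`s rather than the crate's error type.

```rust
/// Hypothetical helper imitating a fixed-width little-endian decode:
/// an i32 needs exactly 4 bytes, anything else is an error.
fn i32_from_le_slice(bytes: &[u8]) -> Result<i32, String> {
    if bytes.len() != 4 {
        return Err(format!("expected 4 bytes for i32, got {}", bytes.len()));
    }
    Ok(i32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

/// Per-page min/max; `None` for a page flagged as all-null.
#[derive(Debug, PartialEq)]
struct PageIndex {
    min: Option<i32>,
    max: Option<i32>,
}

/// Walks the three parallel lists of a ColumnIndex-like structure.
/// Min/max bytes are only decoded when the page is NOT flagged null.
fn decode_pages(
    null_pages: &[bool],
    min_values: &[Vec<u8>],
    max_values: &[Vec<u8>],
) -> Result<Vec<PageIndex>, String> {
    null_pages
        .iter()
        .zip(min_values)
        .zip(max_values)
        .map(|((&is_null, min), max)| -> Result<PageIndex, String> {
            if is_null {
                Ok(PageIndex { min: None, max: None })
            } else {
                Ok(PageIndex {
                    min: Some(i32_from_le_slice(min)?),
                    max: Some(i32_from_le_slice(max)?),
                })
            }
        })
        .collect()
}

fn main() {
    // The invalid shape from the issue: not flagged null, yet min/max are empty.
    let invalid = decode_pages(&[false], &[vec![]], &[vec![]]);
    println!("invalid index: {:?}", invalid);

    // The same empty min/max is fine when the page is flagged null.
    let null_page = decode_pages(&[true], &[vec![]], &[vec![]]);
    println!("null page: {:?}", null_page);
}
```

Empty min/max bytes are only a problem in combination with a false entry in null_pages, which is why the writer setting that flag incorrectly is enough to make the file unreadable through this path.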