Invalid ColumnIndex written in parquet #6310

Closed
samuelcolvin opened this issue Aug 26, 2024 · 7 comments · Fixed by #6319
Labels
bug parquet Changes to the parquet crate

Comments

@samuelcolvin
Contributor

See #6295 — we had an issue with MetadataLoader::load_page_index panicking on invalid metadata, which I "fixed" (returning Err instead of panicking).

But since the invalid metadata was written by a very recent version of this crate, I also wanted to work out why invalid metadata was being written in the first place.

The problem (as shown in the test_invalid_column_index test in #6295) is an invalid ColumnIndex, specifically the invalid data looked like this:

[parquet/src/file/page_index/index_reader.rs:167:5] &index = ColumnIndex {
    null_pages: [
        true,
        true,
        true,
        true,
        true,
        true,
        false,
    ],
    min_values: [
        [],
        [],
        [],
        [],
        [],
        [],
        [],
    ],
    max_values: [
        [],
        [],
        [],
        [],
        [],
        [],
        [],
    ],
    boundary_order: BoundaryOrder(
        1,
    ),
    null_counts: Some(
        [
            10944,
            10240,
            10240,
            10240,
            10240,
            10240,
            7677,
        ],
    ),
    repetition_level_histograms: None,
    definition_level_histograms: Some(
        [
            10944,
            0,
            10240,
            0,
            10240,
            0,
            10240,
            0,
            10240,
            0,
            10240,
            0,
            7677,
            30,
        ],
    ),
}

Note that the last item in null_pages is false, but all values in min_values and max_values are empty; that causes the Err from:

let (min, max) = if is_null {
    (None, None)
} else {
    (
        Some(T::try_from_le_slice(min)?),
        Some(T::try_from_le_slice(max)?),
    )
};

is_null is false, so T::try_from_le_slice(min) (and max) are called; 4 bytes are expected since T is i32, but the vec is empty.
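A minimal standalone sketch of why the decode fails (the real try_from_le_slice lives in the parquet crate with a different signature; this simplified stand-in is only for illustration):

```rust
// Simplified stand-in for parquet's try_from_le_slice for T = i32:
// an i32 needs exactly 4 little-endian bytes, but the stored min/max
// bytes in the invalid ColumnIndex are empty slices.
fn try_from_le_slice(bytes: &[u8]) -> Result<i32, String> {
    let arr: [u8; 4] = bytes
        .try_into()
        .map_err(|_| format!("expected 4 bytes, got {}", bytes.len()))?;
    Ok(i32::from_le_bytes(arr))
}

fn main() {
    // A well-formed 4-byte value decodes fine.
    assert_eq!(try_from_le_slice(&[1, 0, 0, 0]), Ok(1));
    // An empty slice, as in the invalid ColumnIndex above, errors out.
    assert!(try_from_le_slice(&[]).is_err());
    println!("ok");
}
```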

I've tried in vain to work out where the code is that's writing that data.

cc @adriangb @alamb.

@samuelcolvin samuelcolvin changed the title Invalid ColumnIndex in parquet Invalid ColumnIndex written in parquet Aug 26, 2024
@etseidl
Contributor

etseidl commented Aug 26, 2024

I think the relevant code is

let null_page =
    (self.page_metrics.num_buffered_rows as u64) == self.page_metrics.num_page_nulls;
// a page contains only null values,
// and writers have to set the corresponding entries in min_values and max_values to byte[0]
if null_page && self.column_index_builder.valid() {
    self.column_index_builder.append(
        null_page,
        vec![0; 1],
        vec![0; 1],
        self.page_metrics.num_page_nulls as i64,
    );
} else if self.column_index_builder.valid() {
    // from page statistics
    // If can't get the page statistics, ignore this column/offset index for this column chunk
    match &page_statistics {
        None => {
            self.column_index_builder.to_invalid();
        }
        Some(stat) => {
            // Check if min/max are still ascending/descending across pages
            let new_min = stat.min_opt().unwrap();
            let new_max = stat.max_opt().unwrap();
            if let Some((last_min, last_max)) = &self.last_non_null_data_page_min_max {
                if self.data_page_boundary_ascending {
                    // If last min/max are greater than new min/max then not ascending anymore
                    let not_ascending = compare_greater(&self.descr, last_min, new_min)
                        || compare_greater(&self.descr, last_max, new_max);
                    if not_ascending {
                        ...

For the final page (with 30 values), null_page should be false, and we should wind up at

self.column_index_builder.append(
    null_page,
    stat.min_bytes_opt().unwrap().to_vec(),
    stat.max_bytes_opt().unwrap().to_vec(),
    self.page_metrics.num_page_nulls as i64,
);

The chunk statistics look ok (min 1, max 1), so you'd think the page stats would similarly be ok. They are created here:

let page_statistics = match (values_data.min_value, values_data.max_value) {
    (Some(min), Some(max)) => {
        // Update chunk level statistics
        update_min(&self.descr, &min, &mut self.column_metrics.min_column_value);
        update_max(&self.descr, &max, &mut self.column_metrics.max_column_value);
        (self.statistics_enabled == EnabledStatistics::Page).then_some(
            ValueStatistics::new(
                Some(min),
                Some(max),
                None,
                Some(self.page_metrics.num_page_nulls),
                false,
            ),
Again, if the min/max were invalid in the page, then you'd expect garbage in the chunk stats. Perhaps some print statements or breakpoints would help here.

If the original file isn't sensitive could you share it here?

cc @adriangb

@samuelcolvin
Contributor Author

samuelcolvin commented Aug 27, 2024

Thanks @etseidl, my guess is that the problem comes because either num_buffered_rows or num_page_nulls is wrong in the last page, hence

        let null_page =
            (self.page_metrics.num_buffered_rows as u64) == self.page_metrics.num_page_nulls;

wrongly makes null_page false when it should be true.

The parquet file doesn't seem to have anything particularly sensitive in it, but I wouldn't be happy sharing it on GitHub. Happy to email it to you if you're interested?

@samuelcolvin
Contributor Author

samuelcolvin commented Aug 27, 2024

my guess is that the problem comes because either num_buffered_rows or num_page_nulls is wrong in the last page, hence

Okay, ignore that suggestion. I've done some more digging and made a bit of progress; the key point from above is

    definition_level_histograms: Some(
        [
            ...
            7677,
            30,
        ],

I think this is saying that the last page has 7677 null values (which matches null_counts), and 30 non-null values.
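A small self-contained sketch of that reading, assuming the histogram stores one count per definition level per page and the max definition level is 1 (an optional, non-nested column, so level 0 = null and level 1 = non-null):

```rust
fn main() {
    // The definition_level_histograms values from the invalid ColumnIndex.
    let hist: [i64; 14] = [
        10944, 0, 10240, 0, 10240, 0, 10240, 0, 10240, 0, 10240, 0, 7677, 30,
    ];
    // Assumption: max definition level 1, so each page contributes
    // (max_def_level + 1) = 2 consecutive counts: [nulls, non-nulls].
    let max_def_level = 1usize;
    for (page, counts) in hist.chunks(max_def_level + 1).enumerate() {
        let nulls = counts[0];
        let non_nulls: i64 = counts[1..].iter().sum();
        println!("page {page}: {nulls} nulls, {non_nulls} non-null");
    }
    // The last line printed is "page 6: 7677 nulls, 30 non-null",
    // matching null_counts and the 30 non-null values in the file.
}
```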

Sure enough, if I run select count(*) from 'bad.parquet' where process_pid is not null; on the parquet file (process_pid is the problematic column), I get the result 30! All 30 non-null values are 1.

I guess the next step is to build a parquet file with an Int32 column that's mostly null except for one page, and see if we can reproduce the problem.

@alamb
Contributor

alamb commented Aug 28, 2024

I am sorry I am mostly on vacation this week so I haven't been following along as much as normal.

This sounds to me like a bug that was introduced in #6105 but has not yet been released. Given that, I think we should hold 53.0.0 (#6016) until we have fixed this issue.

@etseidl is this something you can look into? It seems like #6315 is tracking a somewhat different issue

@etseidl
Contributor

etseidl commented Aug 28, 2024

@etseidl is this something you can look into? It seems like #6315 is tracking a somewhat different issue

I worked offline with @samuelcolvin and @adriangb and identified a potential fix. Hopefully a PR will be coming soon.

@alamb
Contributor

alamb commented Aug 31, 2024

@etseidl is this something you can look into? It seems like #6315 is tracking a somewhat different issue

I worked offline with @samuelcolvin and @adriangb and identified a potential fix. Hopefully a PR will be coming soon.

Thank you everyone -- #6319 is queued up and I plan to merge it shortly

@alamb alamb added the parquet Changes to the parquet crate label Aug 31, 2024
@alamb
Contributor

alamb commented Aug 31, 2024

label_issue.py automatically added labels {'parquet'} from #6319
