Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(parquet)!: coerce_types flag for date64 #6313

Merged
merged 13 commits into from
Nov 26, 2024

Conversation

dsgibbons
Copy link
Contributor

@dsgibbons dsgibbons commented Aug 27, 2024

Which issue does this PR close?

Relates to #1938, which will be closed by future PRs.

What changes are included in this PR?

This PR introduces the coerce_types flag for WriterProperties. To start, I've only addressed the Date64 case. The desired behaviour for Date64 is captured in #1938. I've added tests to ensure that Date64 is handled correctly when coerce_types=false and coerce_types=true. I've also added some testing around Date32 to ensure I haven't accidentally broken anything there.

I've deliberately avoided the other types mentioned in #1938 because I wanted to make sure we are happy with how coerce_types works for Date64 first. Once accepted, I'll raise PRs for the remaining types to finally close out #1938.

One thing missing from this PR is validation on Date64. The C++ implementation has a full_validation option that checks that all Date64 lie on a date boundary (i.e., a multiple of 1000 * 60 * 60 * 24). I've started work on adding this in arrow-rs and intend to raise a separate PR for enabling Date64 validation in the arrow-array crate.

Are there any user-facing changes?

Breaking changes:

  • arrow_to_parquet_schema(schema: &Schema, coerce_types: bool) - I wasn't sure how else we could propagate the coerce_types option down to the parquet schema.
  • arrow_to_parquet_schema_with_root(schema: &Schema, root: &str, coerce_types: bool) - same as above.
  • coerce_types flag in WriterProperties (not a major issue as typically created via WriterPropertiesBuilder).

User facing changes:

  • The ability to read/write parquet files with the native Date64 logical type.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 27, 2024
@dsgibbons dsgibbons changed the title feat(parquet)!: coerce_types flag for date64 (#1938) feat(parquet)!: coerce_types flag for date64 Aug 27, 2024
@alamb alamb requested a review from tustvold August 27, 2024 17:11
@alamb
Copy link
Contributor

alamb commented Aug 27, 2024

Flagging @tustvold who filed the original issue #1938

Thank you @dsgibbons

@dsgibbons
Copy link
Contributor Author

@tustvold I know this isn't high priority, but just bumping this so I can get some feedback before working on accompanying PRs.

@alamb
Copy link
Contributor

alamb commented Sep 12, 2024

@dsgibbons do you have a usecase for this feature? There is a workaround which is for the user to explicitly call cast on any column types that are not supported

I wonder if you need this feature yourself or if you are trying to help workdown the arrow-rs backlog

@dsgibbons
Copy link
Contributor Author

I wonder if you need this feature yourself or if you are trying to help workdown the arrow-rs backlog

The latter

@alamb
Copy link
Contributor

alamb commented Sep 12, 2024

I wonder if you need this feature yourself or if you are trying to help workdown the arrow-rs backlog

The latter

Thank you. In that case I will let @tustvold comment as he filed the original ticket and I have no additional context

@tustvold
Copy link
Contributor

IIRC the request came from people coming from arrow2 which did something similar. I'm afraid I've lost most context on this and don't have time at the moment to review, but I'm sure one of the other maintainers will be able to advise on a path forward

@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

@etseidl
Copy link
Contributor

etseidl commented Sep 19, 2024

Thanks for taking this on @dsgibbons. I may be confused, but it appears that the approach you use for the non-coerced case is to write INT64 with a DATE annotation to the Parquet file. The problem is that the Parquet spec does not allow this (ref). I believe the approach called for in #1938 is to write un-annotated INT64, and rely on the encoded arrow schema to know how to interpret the column. For instance, in arrow/schema/mod.rs, perhaps the logic should be more like

        DataType::Date64 => {
            if coerce_types {
                Type::primitive_type_builder(name, PhysicalType::INT32)
                    .with_logical_type(Some(LogicalType::Date))
                    .with_repetition(repetition)
                    .with_id(id)
                    .build()
            } else {
                Type::primitive_type_builder(name, PhysicalType::INT64)
                    .with_repetition(repetition)
                    .with_id(id)
                    .build()
            }
        },

I don't think the write side is too far off.

On the read side, I think you'll still have to account for the INT32(DATE)->Date64 conversion for the case of coerced data. It seems to me like you've removed the code to handle this case, but again I may be confused.

@dsgibbons
Copy link
Contributor Author

dsgibbons commented Sep 20, 2024

Thank you for taking the time to look at this @etseidl. I'm still new to the project so I have plenty to learn.

From #1938:

If not coerce_types, write as Int64 and embed logical type in arrow schema only.

I think I interpreted this as the Parquet LogicalType. I hadn't seen that ref before.

I believe the approach called for in #1938 is to write un-annotated INT64, and rely on the encoded arrow schema to know how to interpret the column.

So if we can't embed the fact that the field refers to a date in the Parquet LogicalType, do we provide additional type information during/after reading to interpret INT64 columns as Date64? Is this what was meant by "embed logical type in arrow schema only" from #1938?

I thought that all type information was inferred from the Parquet file. I mistakenly removed the INT32(DATE)->Date64 code, because I didn't think there would be any way to know whether INT32(DATE) was coerced or not. Could you please give an example of how a reader uses an arrow schema to correctly interpret the columns?

On another note, are you OK with the breaking change introduced by: arrow_to_parquet_schema(schema: &Schema, coerce_types: bool)?

@etseidl
Copy link
Contributor

etseidl commented Sep 20, 2024

Thank you for taking the time to look at this @etseidl. I'm still new to the project so I have plenty to learn.

👋 Welcome! I'm pretty new here too, and still learning as well 😄.

I believe the approach called for in #1938 is to write un-annotated INT64, and rely on the encoded arrow schema to know how to interpret the column.

So if we can't embed the fact that the field refers to a date in the Parquet LogicalType, do we provide additional type information during/after reading to interpret INT64 columns as Date64? Is this what was meant by "embed logical type in arrow schema only" from #1938?

Yes. The Parquet format allows for key-value pairs in the metadata. What the arrow writers will do for certain types that aren't representable in Parquet (most of them time based), is to embed a base64 encoded arrow schema (which is serialized with flatbuffers) in one of these key/value pairs, with the key being ARROW:schema. Take a look at the code here, and trace backwards to see how it's used.

My hope is that the arrow schema will preserve the fact that in arrow it was a Date64 with no further changes (i.e. if coerce==true then do what is currently done, if false then don't scale the value and write as INT64 with no Parquet LogicalType). On the read side, I'm hoping the arrow type will still be Date64, and then you just need to account for the case where the Parquet type is INT64 (the INT32 case is already there since coercing is the current behavior).

Did that make sense? Hopefully someone with more arrow-side experience will correct me if I've messed up some details.

On another note, are you OK with the breaking change introduced by: arrow_to_parquet_schema(schema: &Schema, coerce_types: bool)?

Breaking changes are ok, but with the caveat that they're only allowed in major releases (with the next 54.0.0 release slated for December). IIUC what will happen is once this is PR ready for merge, it will be changed to "Draft" until development on 54.0.0 commences. I'll take another look and see if there's a way to avoid breaking the API. Perhaps new functions that accept the coerce flag, and the existing ones call the new ones passing true. I don't know offhand how many types are impacted, and whether they all currently behave as if coerce is true, though.

@dsgibbons
Copy link
Contributor Author

dsgibbons commented Sep 20, 2024

Makes sense. I didn't know about ARROW:schema. That helps a lot!

@dsgibbons
Copy link
Contributor Author

I finally got a chance to update this. I think this is correct now. Fortunately, this PR is a bit shorter now because some of the previous unit tests no longer apply.

I'm not too sure what to do with parquet/src/arrow/arrow_reader/statistics.rs. I made an attempt at this previously, but I'm not sure if what I've done is correct. Any idea what is required here @etseidl?

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review. Looking good so far! I still have to look at statistics.

@@ -279,6 +282,13 @@ impl WriterProperties {
self.statistics_truncate_length
}

/// Returns `coerce_types` boolean
///
/// `true` if type coercion enabled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to add some words about what type coercion does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done :)

parquet/src/arrow/schema/mod.rs Show resolved Hide resolved
parquet/src/arrow/schema/mod.rs Show resolved Hide resolved
[<$stat_type_prefix Int32StatsIterator>]::new($iterator)
.map(|x| x.map(|x| i64::from(*x) * 24 * 60 * 60 * 1000)),
[<$stat_type_prefix Int64StatsIterator>]::new($iterator).map(|x| x.copied()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Statistics struct keeps the parquet physical type info, but I'm not sure how to get at it here. 🤔

We have to know whether the physical type is INT32 or INT64, and if INT32 we still have to do the scaling of days to milliseconds (and use the Int32StatsIterator). Perhaps add physical type to the StatisticsConverter struct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've addressed this now. Could you please take another look? :)

@alamb alamb added the api-change Changes to the arrow API label Oct 1, 2024
@alamb
Copy link
Contributor

alamb commented Oct 1, 2024

I think this is an api change so marking it thusly

@dsgibbons dsgibbons force-pushed the feat/coerce-types-parquet-arrow-writer branch from a1f0ca6 to 169ba01 Compare October 6, 2024 02:42
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really good @dsgibbons! Thanks! Just one tiny nit in documentation.

cc @alamb for a second opinion

parquet/src/file/properties.rs Outdated Show resolved Hide resolved
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you for your patience whilst we waited for the next breaking release window.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants