Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2492: Add binary protocol extensions (any thrift message) #254

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

alkis
Copy link

@alkis alkis commented May 29, 2024

Specify a backwards/forward compatible way to extend any Thrift struct in Parquet.

ref Parquet Extensions

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@alkis alkis force-pushed the t3-metadata-experimentation branch from 2555aa8 to 5f12691 Compare June 1, 2024 09:34
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@alkis alkis force-pushed the t3-metadata-experimentation branch 3 times, most recently from 5ef488c to d187886 Compare June 6, 2024 16:16
@alkis alkis changed the title [T3] Add extension points for all thrift messages [PARQUET-2492] Add extension points for all thrift messages Jun 6, 2024
@alkis alkis changed the title [PARQUET-2492] Add extension points for all thrift messages PARQUET-2492: Add extension points for all thrift messages Jun 6, 2024
@alkis alkis force-pushed the t3-metadata-experimentation branch from d187886 to 056429a Compare June 6, 2024 16:37
@alkis alkis marked this pull request as ready for review June 7, 2024 06:46
@alkis
Copy link
Author

alkis commented Jun 7, 2024

I marked this ready for review. I have tested offline that this method works as expected and has virtually no impact in parse speed of the original FileMetaData thrift message.

How can we move this forward?

README.md Outdated Show resolved Hide resolved
@alkis alkis force-pushed the t3-metadata-experimentation branch from 056429a to 7994102 Compare June 12, 2024 05:48
README.md Outdated Show resolved Hide resolved
Copy link
Member

@julienledem julienledem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alkis, I left a few comment.
Minor nit: It is nice to add commits to the PR when addressing comments so that we can see the history of when the comment were added. When amending the same commit and force pushing like here, we can't see the history.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@alkis alkis force-pushed the t3-metadata-experimentation branch from a892a65 to 4b550f6 Compare July 19, 2024 05:16
@alkis alkis force-pushed the t3-metadata-experimentation branch from 8708ad3 to 77c6de7 Compare August 6, 2024 13:04
@alkis alkis force-pushed the t3-metadata-experimentation branch from 77c6de7 to e2ef70e Compare August 6, 2024 13:11
README.md Outdated Show resolved Hide resolved
ExtensionExamples.md Outdated Show resolved Hide resolved
ExtensionExamples.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@alkis alkis force-pushed the t3-metadata-experimentation branch from 91c2ce5 to 852f912 Compare August 7, 2024 12:03
Extensions.md Outdated Show resolved Hide resolved
Extensions.md Outdated Show resolved Hide resolved
Extensions.md Outdated Show resolved Hide resolved
Extensions.md Outdated Show resolved Hide resolved
Extensions.md Outdated Show resolved Hide resolved
Co-authored-by: Gang Wu <ustcwg@gmail.com>
@pitrou
Copy link
Member

pitrou commented Aug 15, 2024

Hi, can we perhaps find another term than "extension" for this feature? The reason is that we may also want to add support for extension types to Parquet (where "extension types" fits naturally with the corresponding feature in Apache Arrow), which might breed confusion when talking about "Parquet extensions".
https://lists.apache.org/thread/lcrjoymddllxf8wvq33ddm5fd0tk9mh7

(but perhaps I'm being overly cautious here, what do you think?)

@alkis
Copy link
Author

alkis commented Aug 15, 2024

Hi, can we perhaps find another term than "extension" for this feature? The reason is that we may also want to add support for extension types to Parquet (where "extension types" fits naturally with the corresponding feature in Apache Arrow), which might breed confusion when talking about "Parquet extensions". https://lists.apache.org/thread/lcrjoymddllxf8wvq33ddm5fd0tk9mh7

(but perhaps I'm being overly cautious here, what do you think?)

It would be beneficial to avoid confusion. I can't think of a great name for it though. If we name them "experiments" it will make them look risky and can reduce adoption when we use them for migration. "plugins" - I don't like it either. Got any suggestions?

For the extensions in the discussion, I do not have a lot of context: what's the advantage of such extensions vs adding the logical types to the spec?

@pitrou
Copy link
Member

pitrou commented Aug 15, 2024

For the extensions in the discussion, I do not have a lot of context: what's the advantage of such extensions vs adding the logical types to the spec?

The point is to decouple the Thrift / Format spec from the definition of new extension (logical) types. It allows third-party definitions of such types and gradual standardization thereof (either inside the Apache Parquet project itself, or by consensus inside a subcommunity). It also makes it easier for the implementations of said types to live outside of the core Parquet implementations, which is desirable for complex and/or highly-specific domains such as with GeoParquet.

(personally, I would rather Parquet C++ didn't have to reimplement logic for geospatial data :-))

@pitrou
Copy link
Member

pitrou commented Aug 15, 2024

It would be beneficial to avoid confusion. I can't think of a great name for it though. If we name them "experiments" it will make them look risky and can reduce adoption when we use them for migration. "plugins" - I don't like it either. Got any suggestions?

"Thrift protocol escapes"? "Thrift binary appends"?

@alkis
Copy link
Author

alkis commented Aug 15, 2024

The point is to decouple the Thrift / Format spec from the definition of new extension (logical) types. It allows third-party definitions of such types and gradual standardization thereof (either inside the Apache Parquet project itself, or by consensus inside a subcommunity). It also makes it easier for the implementations of said types to live outside of the core Parquet implementations, which is desirable for complex and/or highly-specific domains such as with GeoParquet.

(personally, I would rather Parquet C++ didn't have to reimplement logic for geospatial data :-))

Makes sense. Should we name these extensions type-extensions or third-party-types or external-types? Should we name the extensions in this PR format-extensions (not sure if this is specific enough though).

@pitrou
Copy link
Member

pitrou commented Aug 15, 2024

The reason to prefer "extension types" over "third-party types" or "external types" is that at some point some of them might get standardized inside Parquet, like Arrow does.

Though we could also dictate a policy that standardizing an extension type is done by creating a new logical type. @wgtmac

@alkis
Copy link
Author

alkis commented Aug 15, 2024

@pitrou wdyt about calling the extensions in this PR "Metadata Extensions" vs the other ones "Type Extensions"? Would that clear it?

Bonus is that if/when we add flatbuffers the extension points for flatbuffers will also fall under "Metadata Extensions".

@wgtmac
Copy link
Member

wgtmac commented Aug 16, 2024

The reason to prefer "extension types" over "third-party types" or "external types" is that at some point some of them might get standardized inside Parquet, like Arrow does.

Though we could also dictate a policy that standardizing an extension type is done by creating a new logical type. @wgtmac

Agreed. I think we can simply follow the rule of Arrow's extension type. A naive proposal would be adding a custom key-value metadata field to struct SchemaElement. WDYT?

@pitrou
Copy link
Member

pitrou commented Aug 19, 2024

@pitrou wdyt about calling the extensions in this PR "Metadata Extensions" vs the other ones "Type Extensions"? Would that clear it?

Perhaps "Binary protocol extensions" or "Unparsed protocol extensions"? "Metadata" is usually vague.

@alkis alkis changed the title PARQUET-2492: Add extension points for all thrift messages PARQUET-2492: Add binary protocol extensions (any thrift message) Aug 19, 2024
@alkis
Copy link
Author

alkis commented Aug 19, 2024

Perhaps "Binary protocol extensions" or "Unparsed protocol extensions"? "Metadata" is usually vague.

Qualified as "Binary Protocol Extensions".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants