Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-2489: Guidance on feature releases #258

Merged
merged 29 commits into from
Jul 25, 2024

Conversation

emkornfield
Copy link
Contributor

@emkornfield emkornfield commented Jun 5, 2024

Make sure you have checked all steps below.

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
Co-authored-by: Rok Mihevc <rok@mihevc.org>
CONTRIBUTING.md Outdated Show resolved Hide resolved
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the lead on this @emkornfield. It will be very helpful to have this codified.

CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
emkornfield and others added 2 commits June 5, 2024 23:19
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
@emkornfield
Copy link
Contributor Author

@etseidl thank you for the proof reading

CONTRIBUTING.md Outdated Show resolved Hide resolved
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is overly prescriptive wrt. timing of introducing and enabling new features. We should let every implementation choose their support policy, based on their user demographics and their respective targeted use cases.

cc @alamb

CONTRIBUTING.md Outdated

In some cases in addition to library level implementations it is
expected the changes will be justified via integration into a
processing engine to show their viability.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a "processing engine" in this context? Perhaps this is out of scope?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the intent here is to demonstrate some real world benefit to justify the cost. For instance, for the recent size statistics addition we probably should have required some benchmark numbers showing a clear benefit to justify the cost of creating and storing them. My GPU workflow was made much faster due to the inclusion of the unencoded byte array size information and could have provided hard numbers demonstrating that. Similarly, it would have been nice to have a pushdown example that was made possible because of the histograms. Simply adding the feature and showing that two implementations produced them in a compatible way was probably insufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to rephrase and in a new pull request moved this to step 1.

CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated
library.

2. Forwards compatible. A file written by a new version of the library can be read by an older version
of the library.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, this is true of almost no format enhancements, since the additional semantics couldn't be understood under an older version of the format.

Perhaps this should be relaxed a bit? For example:

Forwards compatible. A file written under a newer version of the format can be read under an older version of the format, but some information might be missing or performance might be suboptimal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to use your wording (will be present in next commit).

CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
@emkornfield
Copy link
Contributor Author

I believe this is overly prescriptive wrt. timing of introducing and enabling new features. We should let every implementation choose their support policy, based on their user demographics and their respective targeted use cases.

Would it address most of your concerns if I replaced the "SHOULD" language, I'll revise to do so and we can hopefully iterate from there.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for writing this up @emkornfield -- it is very much needed and helpful

CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated
Comment on lines 83 to 91
### Releases

The Parquet community aims to do releases of the format package only as needed
when new features are introduced. If multiple new features are being proposed
simultaneously some features might be consolidated into the same release.
Guidance is provided below on when implementations should enable features added
to the specification. Due to confusion in the past over parquet versioning it
is not expected that there will be a 3.0 release of the specification in the
foreseeable future.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given there seems to be confusion about what even constitutes a parquet "release" maybe we simply should say that the parquet-format has no overall release version identifier and instead is a collection of individual features?

Or maybe we should give it a version date ("parquet-format as of 2024-06-07" etc) 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at least for parquet-java we need a version identifier compatible with Maven because I think that is how it is consumed. We can maybe discuss separately just inlining the generated code there as well, as i think that is typically what other distributions do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @wgtmac thoughts on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about this, I think releases are also potentially helpful for any third-party java implementations beyond parquet java. I don't have strong feelings here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the version identifier of parquet-format is compatible with Maven, it looks much weaker than usual Java libraries. IMO, a new version of parquet-format usually includes a set of new feature additions and format clarification/correction, meaning that a new version should be a good superset of previous versions. Therefore we should always stick to the latest version of parquet-format wherever possible. An exception is the deprecation of features to the format where a major version bump is required.

CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated
the feature enabled cannot be read under and older version of the format
(e.g. Adding a new compression algorithm.

The Parquet community hopes that new features are widely beneficial to users of
Copy link
Contributor

@alamb alamb Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think trying to create a process to roll out a feature across all implementations is an admirable goal, however I am not sure there is sufficient consensus to include it in this document. If we include this (aspirational) process I fear there is a non trivial chance it would not be followed in practice, and thus its inclusion would be confusing

Thus I suggest we consider not including the rollout in docs

( I think this is similar to what @pitrou is saying here #258 (review))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to remove almost all prescriptions that actually require effort out of the suggestions. I feel that if we don't provide some guidance the ecosystem will never really move forward. If possible it would be nice to focus on concrete concerns. Are there any parts of what is documented here that you see as problematic as a maintainer from arrow-rs? Also @pitrou anything you see as specifically problematic from a parquet-cpp perspective?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the update. However, it is my opinion that the updated text still adds more potential for confusion than it adds. Specifically trying to proscribe the rollout cadence across loosly (at best) coordinated implementations in a document is unlikely to change behaviors and is likely to create confusion

Are there any parts of what is documented here that you see as problematic as a maintainer from arrow-rs?

One specific example is the mention of feature flags. While I am sure we could over come the technical details, as a maintainer I would not want to be in the position to try and dictate when a featre should be used by our users -- instead I would imagine that as new features were supported in parquet-rs there would be an appropriate writer configuration that users would decide to enable/disable as they deemed appropriate.

Another potential improvement (orthogonal to this disucssion) would be to label new format additions with the date in which they were added to the spec, and perhaps track the date when support was added to other implementations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically trying to proscribe the rollout cadence across loosly (at best) coordinated implementations in a document is unlikely to change behaviors and is likely to create confusion

Is there specific language here that you find problematic here? With the latest revision, my intent was to make the recommendation only about preserving compatibility. i.e. forward incompatible changes must wait at least 2 years before an implementation decides to turn them on by default. I wonder if changing the phrasing from MAY to MUST NOT USE before would help clarify?

instead I would imagine that as new features were supported in parquet-rs there would be an appropriate writer configuration that users would decide to enable/disable as they deemed appropriate.

I agree. Maybe this is a terminology issue. I view feature flag as a type of configuration, and I think I removed all recommendations for when, if ever, an implementation should turn something on by default (please let me know if something is still ambiguous or maybe I missed something in rephrasing).

Another potential improvement (orthogonal to this disucssion) would be to label new format additions with the date in which they were added to the spec, and perhaps track the date when support was added to other implementations

Agreed, I'll add this to the process I thought the feature matrix that was already underreview would be a good place for this? I would like to avoid putting too many dates directly in the spec to avoid confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took another read -- I think it was easy to miss the sentence that says the rollout dates are guidelines.

Therefore, the Parquet PMC gives the following guidance for changing a feature to be "on" by default:

So I guess I got confused that the text says it is providing guidance and then uses MAY which suggests to me the process as a whole is required but the individual steps have options. I think we should just make it clear these are recommended best practices

What about toning it down a bit like instead of

Forward compatible features/changes MAY be used by default in implementations
   once the parquet-format containing those changes has been formally released.
   For features that may pose a significant performance regression to prior
   format readers, libaries SHOULD consider delaying until 1 year after the
   release of the parquet-java implementation that contains the feature
   implementation.  Implementations MAY choose to do a major version bump when
   turning on a feature in this category.

Something like

Forward compatible features/changes may be used by default in implementations
   once the parquet-format containing those changes has been formally released.
   For features that may pose a significant performance regression to prior
   format readers, libraries should wait 1 year after the
   release of the first  parquet-java implementation that contains the feature
   implementation.  

I also think the major version bump suggestion is so library specific it is not worth putting in the guidance. The libraries have their own cadence (e.g. parquet-rs and parquet-cpp have major releases based on calendar cycles rather than feature content)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to rephrase to make the issues above clearer.

I also think the major version bump suggestion is so library specific it is not worth putting in the guidance. The libraries have their own cadence (e.g. parquet-rs and parquet-cpp have major releases based on calendar cycles rather than feature content)

My intent here was not to specify a release cadence but to specify that if changing a forward incompatible change to default the next release must be a major version release, I will try to clarify. I think parquet-rs and parquet-cpp trivially satisfy this given the current release policies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb thanks for feedback, I've update to clarify and remove upper case for SHOULD/MAY language, please let me know if the current iteration still raises concerns.

@pitrou
Copy link
Member

pitrou commented Jun 7, 2024

Would it address most of your concerns if I replaced the "SHOULD" language, I'll revise to do so and we can hopefully iterate from there.

I think it could be rephrased as an example of how to deal with feature additions, rather than a recommendation.

emkornfield and others added 3 commits June 7, 2024 09:29
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@emkornfield
Copy link
Contributor Author

I think it could be rephrased as an example of how to deal with feature additions, rather than a recommendation.

My goal here is to provide at least a little bit stronger then just an example, I think we do want to try to align implementations for compatiblity purposes and moving the ecosystem further. Maybe we can continue the conversation here on the @alamb's comment

CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
- Soften language on recommendations and add clarifications
- Fix some grammar issues.
CONTRIBUTING.md Outdated
discussion concrete. This step is complete when there is lazy consensus. Part
of the consensus is whether it sufficient to provide 2 working
implementations as outlined in step 2 or if demonstration of the feature with
a down-stream query engine is necessary to justify the feature (e.g.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is overly specific IMHO. Not all features are developed in relation with a "down-stream query engine". For example, the BYTE_STREAM_SPLIT additions were motivated by size measurements with different encodings and compression algorithms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I think a lot of features will not need this per @etseidl point on a previous revision, this helps avoid extraneous features that sound good in practice but don't provide much value in practice. Other areas where this is more likely to apply could be for more complicated indexing schemes.

CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more suggestions.

As to the most controversial section, I think the current wording strikes a good balance. Implementers should consider the ramifications of defaulting to breaking changes, and I like the idea of providing a deadline for downstream apps to adapt. Otherwise Parquet will always be trapped in amber, unable to steer users towards beneficial new features because some downstream project refuses to update for a decade. And these are now recommendations (unlike earlier revisions), so an implementation could still decide that a new feature is so beneficial for its user base that it's worth risking compatibility problems (so long as it clearly documents how to revert to non-breaking behavior). Most implementations already seem to be pretty conservative with their choice of defaults, so I don't think this guidance will be too burdensome.

Overall I'm +1 for the release process, and +0.5 for the last section.

CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
emkornfield and others added 3 commits June 26, 2024 14:12
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
CONTRIBUTING.md Outdated Show resolved Hide resolved
feature implementation.

3. Forwards incompatible features/changes should not be turned on by default
until 2 years after the parquet-java implementation containing the feature is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, the 2-year guidance is better than other options. We don't want individual votes or discussions for action of each new feature. In addition, developers (of downstream projects or other Parquet implementation not governed under Parquet PMC) usually do not actively keep an eye on Parquet activities. The general 2-year guidance may give people enough time to take action to changes on the Parquet side.

by default until the most conservative timelines outlined above have been
exceeded.

For features released prior to October 2024, target dates for each of these
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand this. Do we expect a parquet-java 2.0 release around October 2024? Doesn't any new feature should follow this guidance once published?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intent was to try for Parquet-java 2.0 in October. There are a lot of old features from the spec perspective that based on guidance would be ok to use (e.g. lz4-raw, byte stream split, data page v2, v2 encodings) that we need to decide on recommended dates (as a sample suggestion I would aim for 1 year for features we think we actually want to turn on)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a good timing to do Parquet-java 2.0 release? As per the discussion from https://lists.apache.org/thread/kttwbl5l7opz6nwb5bck2gghc2y3td0o, we intend to break the API compatibility by removing deprecated APIs (and remove Java 8 support) in the 2.0 release. If the intention is to set a target date for existing features without breaking anything, perhaps we should release 1.15.0 instead?

cc @gszadovszky @Fokko @julienledem

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO I think API compatibility is a separate concern and it is fine to set future dates and change API compatibility. I can rephrase this section to mention some other criteria for initial target dates if there is an alternative proposal?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be up for a Parquet 2.0 release in October and am eager to help! Do we also want a 1.15?

I'm trying to downstream 1.14.1 into various projects (apache/spark#46447, apache/iceberg#10209), but the new Jackson version is giving us some troubles. It feels to me that the different projects need some time to update. My argument is that releasing 1.15 right now would not give too many new features to the users, and I think it makes sense to jump to 2.0.0.

Since there are some delays mentioned earlier in this document:

  • The API that I suggested removing was marked as deprecated almost six years go: PARQUET-1452: Deprecate old logical types API parquet-java#535
  • We're still compiling against Hadoop 2.7.3 which was released in August 2016.
  • I think the Java8 deprecation might be a bit harder and might require more discussion, but there it feels like all the projects are looking at each other :) Spark/Avro are Java 17+ right now.

For my understanding, the things you mentioned (e.g. lz4-raw, byte stream split, data page v2, v2 encodings), they are already shipped with the last release, so the discussion is when to make them the default?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct in understanding that We intend to:

  • increment the major release of parquet-format when it has the definition of the new (forward incompatible) feature in it.
  • implement the above new feature in parquet-java under a minor release in the previous major but turning it off by default
  • increment the major release of parquet-java when we turn on the new encodings by default?

I think this is not unsound but it might be confusing. We should then expect to have several major releases coming: one for the new footer and one for each new encoding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@julienledem no I think there is a misunderstanding.

increment the major release of parquet-format when it has the definition of the new (forward incompatible) feature in it.

Per above, I don't think we expect a major version change for parquet-format in the near future.

implement the above new feature in parquet-java under a minor release in the previous major but turning it off by default

yes, this is correct. this provides read support.

increment the major release of parquet-java when we turn on the new encodings by default?

Yes.

I think this is not unsound but it might be confusing. We should then expect to have several major releases coming: one for the new footer and one for each new encoding.

I think these should be grouped. Per apache/parquet-site#61 the proposal for parquet-java is to have at most one major release per year. Early adopters can use features once a minor release containing them is out as long as they feel comfortable with understanding the scope of impact.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not following which part of the conversation "per above" refers to here.
Could you point to it and explain what you mean?

I was assuming for example that once we define a new footer in a new format we would increment the format spec. Would that be wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the releases section:

The Parquet PMC aims to do releases of the format package only as needed when
new features are introduced. If multiple new features are being proposed
simultaneously some features might be consolidated into the same release.
Guidance is provided below on when implementations should enable features added
to the specification.  Due to confusion in the past over Parquet versioning it
is not expected that there will be a 3.x release of the specification in the
foreseeable future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that sounds good to me.

Co-authored-by: Gang Wu <ustcwg@gmail.com>
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with most of the sections. +1 from my side.

@emkornfield emkornfield changed the title PARQUET-2489: Strawman guidance on feature releases PARQUET-2489: Guidance on feature releases Jul 3, 2024
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally ok with this. Thanks a lot @emkornfield for pushing this forward!

Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Copy link
Member

@julienledem julienledem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very Much @emkornfield, I think this is a great addition.
I do believe that we have a few element to finish build consensus on before committing.
I've left comments in line

CONTRIBUTING.md Outdated
Comment on lines 63 to 65
step. The implementations must be made available publicly, and they should be
fit for inclusion (for example, they were submitted as a pull request against
the target repository and committers gave positive reviews).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this section. Open source implementations are the first requirement. I think we should add that validation by proprietary engines is not required but welcome and useful. Third party vendors can publish reports of their analysis of the gains they observe in their proprietary non-open code base.
As Parquet is an open source standard that helps using open source and non-opensource solutions together in the wider industry, I think it is important to cover both. (but open source is still the requirement as you describe above)

CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated
Comment on lines 46 to 50
of the consensus is whether it is sufficient to provide two working
implementations as outlined in step 2, or if demonstration of the feature with
a downstream query engine is necessary to justify the feature (e.g.
demonstrate performance improvements in the Apache Arrow C++ Dataset library,
the Apache DataFusion query engine, or any other open source engine).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is always a trade-off in adding complexity to the format vs adding improvement in term of performance and/or compression.
In that phase we should have a discussion about this. Often, a prototype helps confirm some of the assumptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to address this with additional language, please let me know if I covered your intent.

CONTRIBUTING.md Outdated Show resolved Hide resolved
CONTRIBUTING.md Outdated
Comment on lines 106 to 108
2. Forward compatible. A file written under a newer version of the format with
the feature enabled can be read under an older version of the format, but
some information might be missing or performance might be suboptimal.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

semantically we mean that we can read the data back without loss.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the data should be readable back "without loss", does it imply that new logical types are not considered forward compatible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they should be and made this explicit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. a new logical type is not fully forward compatible even if there is a degraded mode (you can ignore it and read the "raw" type). Then we would recommend new logical types to not be turned on by default.

feature implementation.

3. Forwards incompatible features/changes should not be turned on by default
until 2 years after the parquet-java implementation containing the feature is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What @emkornfield is saying above makes sense to me.
We should always have new forward incompatible features off by default at first. Otherwise, this will cause breakages every time we add a new encoding (or similarly incompatible feature). How long is debatable. 2 years seems long to me.
I don't think we should give a duration guidance to other implementers/users. We should document why we are doing this and explain the constraints. How long is very dependent on their circumstances. In an environment where there is a single system reading and writing Parquet files, they can enable it right away without problems. In a "data lakehouse" environment where people have multiple systems reading and writing they want to make sure they have upgraded other consumers before they turn it on. The more time passes, the more it becomes acceptable to have such new features on by default and ask the user to turn them off because they have some legacy system that can only read old files.
We could have a setting to adjust this in environments where there is a single system reading/writing parquet: setForwardIncompatibleFeaturesOnByDefault(boolean)

CONTRIBUTING.md Show resolved Hide resolved
CONTRIBUTING.md Outdated Show resolved Hide resolved
by default until the most conservative timelines outlined above have been
exceeded.

For features released prior to October 2024, target dates for each of these
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I correct in understanding that We intend to:

  • increment the major release of parquet-format when it has the definition of the new (forward incompatible) feature in it.
  • implement the above new feature in parquet-java under a minor release in the previous major but turning it off by default
  • increment the major release of parquet-java when we turn on the new encodings by default?

I think this is not unsound but it might be confusing. We should then expect to have several major releases coming: one for the new footer and one for each new encoding.

CONTRIBUTING.md Outdated
@@ -90,16 +90,20 @@ demonstrate that the feature is mergeable to its implementation.
#### General guidelines/preferences on additions.

1. To the greatest extent possible changes should have an option for forward
compatibility (old readers can still read files).
compatibility (old readers can still read files). The 'compatibility and
feature enablement' section below provides more details on expectations for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can do [compatibility and feature enablement](#compatibility-and-feature-enablement)

by default until the most conservative timelines outlined above have been
exceeded.

For features released prior to October 2024, target dates for each of these
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that sounds good to me.

@emkornfield emkornfield merged commit 5a5c894 into apache:master Jul 25, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet