-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generalize storage formats in file metastore #17368
Conversation
ping @arunthirupathi |
storageFormat = Arrays.stream(HiveStorageFormat.values()) | ||
.filter(format -> tableFormat.equals(StorageFormat.fromHiveStorageFormat(format))) | ||
.findFirst(); | ||
storageFormat = Optional.of(partition.getStorage().getStorageFormat()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is always present, why have the storageFormat as optional then ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. The optional here is unnecessary. Have pruned.
storageFormat = Arrays.stream(HiveStorageFormat.values()) | ||
.filter(format -> tableFormat.equals(StorageFormat.fromHiveStorageFormat(format))) | ||
.findFirst(); | ||
storageFormat = Optional.of(tableFormat); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is backward incompatible.
JsonCreator is an annotation used for creating class dynamically from JSON. This change modifies the type. When existing JSON blobs are de-serialized they will fail due to the type change.
Also looks like no-one wrote the tests for it originally for serialization so none of the tests broke.
Can you please investigate how to make these changes without breaking the existing assumptions ? Generally you have to introduce a new property and deprecate the existing property and optionally after few months remove the deprecated property.
@arunthirupathi Thanks for your comments. I have updated the code to make the legacy json (that is serialized from the previous type definition) readable. The legacy constructors and methods are keeping existing but marked as deprecated. Could you please take a second review? |
Thanks for making this backward compatible, but I don't see tests that capture the old json and verify they are successfully deserialized. I want to see the following things in this PR.
Apologies for asking you to create tests, but this code is widely used and we need to be careful. This has a potential to break almost all our users and it is sad that the existing code did not capture the tests. |
21952ef
to
8b883c2
Compare
It makes sense to add tests to ensure good compatibility! After refactoring the code, I split it into two commits: the first one does json round trip tests and encodes the instances of legacy classes to files; the second one adds tests to decode the file by new classes. Could you take a look to check if this is ok? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some minor nits, once addressed I will merge this in.
serdeParameters, | ||
externalLocation, | ||
columnStatistics, | ||
eligibleToIgnore, sealedPartition); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
formatting: one parameter per line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
public StorageFormat deserialize(JsonParser p, DeserializationContext ctxt) | ||
throws IOException, JsonProcessingException | ||
{ | ||
if (p.currentToken() == VALUE_STRING) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment here, prior to version 0.271 this was HiveStorageFormat and this class is to ensure back ward compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
@arunthirupathi The PR is ready to merge. Could you take a look when you have time? Thanks in advance. |
In this PR, we changed the storage format class from
HiveStorageFormat
toStorageFormat
, in order to make it possible to create tables and partitions whose stoarge formats are not listed in theHiveStorageFormat
. For example, a Hudi COW table is stored in meatastore with the input format asHoodieParquetInputFormat
.This PR is extracted from the some legacy version of #17149
Test plan - Unit tests.