[SUPPORT] Implicit schema changes supported by Avro schema-resolution will not work properly if there are filegroups with old schema #7444
Comments
@voonhous Thanks for sharing a test to reproduce the issue!
@voonhous Tested with the master branch.
@codope @nsivabalan
@xiarixiaoyao Thank you for the reply. I tested your proposed fix and it works. Changes that need to be made to the tests I provided:

val commonOpts: Map[String, String] = Map(
HoodieWriteConfig.TBL_NAME.key -> "hoodie_avro_schema_resolution_support",
"hoodie.insert.shuffle.parallelism" -> "1",
"hoodie.upsert.shuffle.parallelism" -> "1",
DataSourceWriteOptions.TABLE_TYPE.key -> "COPY_ON_WRITE",
DataSourceWriteOptions.RECORDKEY_FIELD.key -> "id",
DataSourceWriteOptions.PRECOMBINE_FIELD.key -> "id",
DataSourceWriteOptions.PARTITIONPATH_FIELD.key -> "name",
DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key -> "org.apache.hudi.keygen.SimpleKeyGenerator",
HoodieMetadataConfig.ENABLE.key -> "false",
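  // The "2 parameters" referenced in the comments below; enabling both is the proposed fix: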
"hoodie.schema.on.read.enable" -> "true",
"hoodie.datasource.write.reconcile.schema" -> "true"
)

val readDf = spark.read.format("hudi").options(commonOpts).load(tempRecordPath)
@xiarixiaoyao Thanks for triaging and proposing the fix!
@codope While this issue can be fixed with the 2 parameters provided above, there is a possibility that implicit schema changes can still be made with the default parameter values (the 2 parameters set to false). I do believe this is not a "proper" fix for such cases: if these implicit schema changes have already been written to the table, there may be no recourse for users to "fix" the table. I believe the proper way of fixing this issue is to:
I am currently using approach (4) and will raise a PR for review tomorrow.
@codope Yeap, the proposed fix is working on the latest snapshot. However, the fix will not work for filegroups that have already been written without the 2 parameters enabled (even on master/latest snapshot), i.e. if the table (path) does not have any
Correct me if I'm wrong: given that users are allowed to perform implicit schema changes without enabling the 2 parameters, reading filegroups/base files with differing schemas should be supported without enabling the 2 parameters, no? If it is not supported, the existing validation should be changed.
@codope @xiarixiaoyao Raised a PR here: #7480
Have you gone through our FAQs? (Yes)
Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
If you have triaged this as a bug, then file an issue directly.
Implicit schema changes that do not write to the .schema folder will cause read issues on Spark's end.

The current implementation of Schema Evolution is as such:
If the schema change is supported by Avro's schema resolution, an ALTER TABLE DDL is not required.

The column type changes supported by Avro's schema resolution (per the Avro specification) are: int can be promoted to long, float, or double; long can be promoted to float or double; float can be promoted to double.
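For illustration, here is a minimal sketch (not from the original report; the record and schema names are invented) showing Avro's schema resolution reading a record that was written with an int field through a reader schema that declares the field as long:

    import java.io.ByteArrayOutputStream
    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
    import org.apache.avro.io.{DecoderFactory, EncoderFactory}

    // Writer schema: "id" is an int (the old schema).
    val writerSchema = new Schema.Parser().parse(
      """{"type":"record","name":"Rec","fields":[{"name":"id","type":"int"}]}""")
    // Reader schema: "id" is a long (the new schema).
    val readerSchema = new Schema.Parser().parse(
      """{"type":"record","name":"Rec","fields":[{"name":"id","type":"long"}]}""")

    // Encode one record with the old (writer) schema.
    val record = new GenericData.Record(writerSchema)
    record.put("id", 1)
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](writerSchema).write(record, encoder)
    encoder.flush()

    // Decode with the new (reader) schema; Avro promotes int -> long.
    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    val readBack = new GenericDatumReader[GenericRecord](writerSchema, readerSchema).read(null, decoder)
    println(readBack.get("id")) // 1, now read back as a long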
Caveat:
The current implementation is sufficient provided that ALL data is re-written with the new schema. However, if certain filegroups/partitions are still in the old schema when they are read, errors will be thrown.
As such, the current support for implicit schema changes is still a little buggy when it comes to column type changes.
To reproduce the issue, one can use the script below to test the schema evolution that is "allegedly" supported by Hudi's implicit schema change support.
The test writes a partition in the old schema, then inserts a row with a new schema into another partition.
Note: this mainly affects column type changes only.
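Since the original script is not preserved here, the following is a minimal sketch of the reproduction described above (the path, values, and option subset are illustrative assumptions):

    import org.apache.spark.sql.SparkSession
    import org.apache.hudi.DataSourceWriteOptions
    import org.apache.hudi.config.HoodieWriteConfig

    val spark = SparkSession.builder().appName("repro").master("local[1]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()
    import spark.implicits._

    val tempRecordPath = "/tmp/hoodie_avro_schema_resolution_support"
    val opts = Map(
      HoodieWriteConfig.TBL_NAME.key -> "hoodie_avro_schema_resolution_support",
      DataSourceWriteOptions.RECORDKEY_FIELD.key -> "id",
      DataSourceWriteOptions.PRECOMBINE_FIELD.key -> "id",
      DataSourceWriteOptions.PARTITIONPATH_FIELD.key -> "name")

    // Commit 1: old schema, id is an Int; the row lands in partition "aa".
    Seq((1, "aa")).toDF("id", "name")
      .write.format("hudi").options(opts).mode("overwrite").save(tempRecordPath)

    // Commit 2: new schema, id is a Long; the row lands in partition "bb".
    // Partition "aa" keeps its base files in the old (int) schema.
    Seq((2L, "bb")).toDF("id", "name")
      .write.format("hudi").options(opts).mode("append").save(tempRecordPath)

    // Full table scan: reading partition "aa" with the evolved (long) schema
    // is where the error surfaces.
    spark.read.format("hudi").load(tempRecordPath).show()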
Steps to reproduce the behavior:
Run testDataTypePromotion as a test case.

Expected behavior
Able to do a full table scan.
Environment Description
Hudi version : 0.10, 0.11, 0.12, 0.13
Spark version : 3.x
Hive version : NIL
Hadoop version : NIL
Storage (HDFS/S3/GCS..) : NIL
Running on Docker? (yes/no) : NO
Additional context
Stacktrace