[SUPPORT] why is the schema evolution done while not setting hoodie.schema.on.read.enable #8018
Hi, Hudi has an optional "schema on read" mode and "out of the box" schema evolution, which is the default.
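To make the distinction concrete, here is a minimal sketch of the relevant writer options, assuming a PySpark job (the table name and values are hypothetical; the option keys are the documented Hudi ones):

```python
# Hypothetical Hudi writer options contrasting the two modes described above.
hudi_options = {
    "hoodie.table.name": "my_table",                 # hypothetical table name
    "hoodie.datasource.write.operation": "upsert",
    # Left unset (default false): only implicit, Avro-compatible evolution
    # happens on write, e.g. new columns appearing in the incoming batch.
    # Set to "true": opt into full schema-on-read evolution, which also
    # supports explicit ALTER TABLE changes (renames, drops, type changes).
    "hoodie.schema.on.read.enable": "true",
}

# A typical write would then look like (not executed here):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```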
hi @kazdy, thanks for your reply. Sorry, can you please elaborate more? I understand that "out of the box" means that when we add new records with a new column it's automatically added to the table, and I've seen the other options mentioned in the link you shared, but how is schema on read different from this? If we enable hoodie.schema.on.read.enable, how does the behavior differ?
To add to this, it's not just new columns being added that is handled natively: adding complex types like arrays and maps, valid type conversions, extending enums, etc. I was just browsing the issues and was also planning on asking this same question!
It feels like target table schema enforcement is needed (this topic also comes up from time to time in the Hudi Slack).
hey @danny0405, thanks for your reply. I understood the description of hoodie.datasource.write.reconcile.schema, but unfortunately it's still adding the new column to the table. Is there any config option we need to add along with hoodie.datasource.write.reconcile.schema so that it keeps the schema? As it's written in the description: "writer-schema will be picked such that table's schema (after txn) is either kept the same or extended, meaning that we'll always prefer the schema that either adds new columns or stays the same." I don't understand how, if it's enabled, it can do both options (keeping it the same or extending); isn't extending the same as disabling the feature in the first place? So how do we keep it the same?
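For reference, enabling the reconcile option is just an extra writer config; a minimal sketch, assuming a PySpark writer with hypothetical table settings:

```python
# Sketch: with reconcile enabled, a batch that *omits* existing columns is
# reconciled back to the wider table schema (missing fields become null),
# while a batch that *adds* columns can still extend the table schema.
hudi_options = {
    "hoodie.table.name": "my_table",                        # hypothetical
    "hoodie.datasource.write.reconcile.schema": "true",
}

# df_partial.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```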
Did you change the table schema through the Spark catalog, or did you just ingest a data frame with a different schema?
@danny0405 we just ingested data with a different schema, like this:
and it's updating the schema |
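(The original snippet did not survive; the following is a hypothetical reconstruction of that kind of ingest, simulated with plain Python lists rather than actual Hudi calls:)

```python
# Schema of the table before the write, and of the incoming batch with an
# extra column (all names are hypothetical).
existing_table_columns = ["id", "name"]
incoming_columns = ["id", "name", "new_col"]

# In the Glue/PySpark job the write itself would look roughly like:
#   df = spark.createDataFrame(rows, incoming_columns)
#   df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
#
# Observed result: the table schema is extended even though
# hoodie.schema.on.read.enable was never set.
evolved_columns = existing_table_columns + [
    c for c in incoming_columns if c not in existing_table_columns
]
```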
Prior to Hudi Full Schema Evolution (HFSE), schema changes on write were handled by Avro schema resolution (ASR). What you are encountering is basically schema evolution performed via ASR. I've written some tests and fixes before (unrelated to your issue) that can give you an idea of how it differs from HFSE.

I noticed that you are using 0.12.*; not sure which minor version, but it doesn't really matter: in the 0.12.* branches, HFSE is an experimental feature, so the behaviour you wanted is not available to you there. I hope this answers your questions.

With regards to how one can prevent the schema from being evolved implicitly when ASR is used: I don't think it's possible in 0.12.*. (Anyone in the community who is more familiar with this, please correct me if I am wrong.)
I inspected the master code base and the option |
@danny0405 isn't that what the reconcile option does? When it comes to reconciling the schema: when I was playing with it in 0.10 and 0.11, it allowed wider schemas on write, but when incoming columns were missing, these were added back to match the "current" target schema. There's a hacky way to prevent schema evolution. Also, a MERGE INTO statement enforces the target table schema when writing records.
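A sketch of the MERGE INTO approach mentioned above, assuming Spark SQL with the Hudi catalog configured (table and column names are hypothetical). Because the statement lists the target columns explicitly, extra columns in the source are simply never written, so the target schema is not evolved:

```python
# The statement only writes the columns it names; anything else in the
# source view is ignored, which effectively enforces the target schema.
merge_stmt = """
MERGE INTO target_tbl AS t
USING source_view AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.name = s.name
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name)
"""

# spark.sql(merge_stmt)  # not executed here
```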
I guess we need a clear doc to elaborate the schema evolution details for 0.13.0 |
thank you @kazdy for your reply. I tried to pass the schema this way to the config you mentioned, but I get an error; I'm not totally sure how to pass it. Can you please help?
got the following error:
Seems like I shouldn't pass it as a string, but I couldn't tell from the doc how I should pass it. Also, regarding the part where you mentioned "missing columns -> add missing columns to match current table schema": did you need to add extra logic in your code, or were the missing columns added by default, just by adding 'hoodie.datasource.write.reconcile.schema': "true"? And if the missing columns were added without extra logic in your code, were you using PySpark + Glue, or what did you use exactly? Thanks
Outside of the original issue here, we'd find that extremely useful. We haven't found it clear from the release notes exactly what has changed, what the new behaviour is, or how we configure it; we also don't fully understand how the two modes interact. Ideally I'd like a table of possible operations for the "old" (Avro) and "new" schema evolutions, with examples of adding/removing columns on a sample dataset, and a direct comparison of how they differ.
+1 on @kazdy's notes above on ASR. Hudi has always supported some automatic schema evolution to deal with streaming data, similar to what the Kafka/Schema Registry model achieves. The reason was that users found it inconvenient to coordinate pausing pipelines and doing manual maintenance/backfills when, say, new columns were added. What we call full schema evolution/schema-on-read is orthogonal: it just allows more backwards-incompatible evolutions to go through as well. Now on 0.13, I think the reconcile flag simply allows skipping some columns in the incoming write (partial-write scenarios), and Hudi reconciles this with the table schema, while still respecting the automatic schema evolution. I think this is what 0.13 changes: https://hudi.apache.org/releases/release-0.13.0#schema-handling-in-write-path
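The reconcile behaviour described above can be illustrated with a small simulation (plain Python, not Hudi code; column names are hypothetical):

```python
# Partial write: the incoming batch omits a column that already exists in
# the table. With reconcile enabled, the writer schema is reconciled to the
# wider table schema instead of narrowing it; omitted fields become null.
table_schema = ["id", "name", "city"]
incoming_schema = ["id", "name"]            # "city" omitted by the writer

if set(incoming_schema) <= set(table_schema):
    reconciled_schema = table_schema        # pad back out to table schema
else:
    # New columns in the batch would still evolve the table schema.
    reconciled_schema = table_schema + [
        c for c in incoming_schema if c not in table_schema
    ]
```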
Note to @nfarah86 and @nsivabalan to cover this in the schema docs page that is being worked on now. |
@menna224 For your original issue on not adding the new column: it's not something that has come up before, so we would need to provide some way to alter the behavior to ignore the extra columns. Is that the behavior you expect?
We have a Glue streaming job that writes to a Hudi table, and we are trying out schema evolution. When we add a new column to any record, it works fine and the new column is shown when querying the table. The thing is, we expected it not to evolve the schema, because we didn't set the config hoodie.schema.on.read.enable, which as we understand is set to false by default. Per the Hudi docs:
"Enables support for Schema Evolution feature
Default Value: false (Optional)
Config Param: SCHEMA_EVOLUTION_ENABLE"
So when we didn't define it in our config, it shouldn't allow schema evolution and the addition of new columns, right?
We even tried to explicitly set it to false in our connection options, but still, when we add a new column it shows up in our table.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
It shouldn't show the added columns/attributes, since we disabled schema evolution; the column/attribute also shouldn't exist in the schema of the table in the data lake.
Environment Description
Hudi version : 0.12
Spark version : 3
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Glue version : 4
Additional context
our connection options are:
in glue streaming job we use:
and:
and the way we write our hudi table is:
sometimes we write it as follows but it gives the same behaviour:
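(The actual config and write snippets did not survive the report; the following is a hypothetical reconstruction of what a Glue streaming job writing to Hudi typically looks like, with all names and values illustrative only:)

```python
# Hypothetical Hudi connection options for a Glue streaming job; the report
# states schema-on-read was explicitly disabled, yet new columns still
# appeared in the table.
connection_options = {
    "hoodie.table.name": "my_table",                   # hypothetical
    "hoodie.datasource.write.recordkey.field": "id",   # hypothetical key field
    "hoodie.datasource.write.precombine.field": "ts",  # hypothetical field
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.schema.on.read.enable": "false",           # explicitly disabled
}

# A typical Glue write (not executed here):
# glueContext.write_dynamic_frame.from_options(
#     frame=dyf,
#     connection_type="custom.spark",
#     connection_options=connection_options,
# )
```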