[SPARK-23355][SQL] convertMetastore should not ignore table properties #20522
Conversation
```diff
   sessionCatalog.metastoreCatalog
     .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
 } else {
-  val options = relation.tableMeta.storage.properties
+  val options = relation.tableMeta.properties ++ relation.tableMeta.storage.properties
```
@cloud-fan and @gatorsmile . Unlike the `USING hive` syntax, the `STORED AS` syntax saves the property into `tableMeta.properties`. This PR considers both sets of properties. Please see the test case for an example.
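For illustration, the precedence implied by `++` in the diff above can be sketched in plain Scala. The two map names are stand-ins for `relation.tableMeta.properties` and `relation.tableMeta.storage.properties`, not the actual Spark internal types:

```scala
// Hypothetical stand-ins for tableMeta.properties (STORED AS puts the
// property here) and tableMeta.storage.properties (USING hive puts it here).
val tableProperties   = Map("compression" -> "zlib", "owner" -> "alice")
val storageProperties = Map("compression" -> "snappy")

// Scala's `++` keeps the right-hand value on key collisions, so the
// storage (serde) properties win over plain table properties.
val options = tableProperties ++ storageProperties

assert(options("compression") == "snappy") // storage property wins
assert(options("owner") == "alice")        // table-only property survives
```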
These are not the right things to do. The table properties should not always be put into the serde properties.
That is the reason why my previous PR #20120 was closed. We need a comprehensive fix for resolving all the DDL issues.
This place is only for `convertMetastore`. I think we inevitably need this to prevent a regression.
It's surprising that Apache Spark hasn't respected table properties in `convertMetastoreParquet` until now, given that the conversion has been `true` by default.
The issues have been defined in the PR description of #20120:
Currently, we ignore table-specific compression conf when the Hive serde tables are converted to the data source tables. We also ignore it when users set compression in the TBLPROPERTIES clause instead of the OPTIONS clause, even if the tables are native data source tables.
#20087 is also trying to resolve the related issues. Thus, we might still miss multiple critical bugs.
Simply adding the table properties to the options will introduce new bugs. For example, users might provide conflicting serde properties and change the properties through DDLs.
We can't blindly add the table properties to the options. This also introduces behavior changes. For example, some table properties might have identical names to serde properties but different semantics. Thus, what we can do is apply this only to the table properties we need.
Conceptually, the table properties should not carry the serde-related confs. However, Hive basically does not follow the rule it sets, which is pretty strange. No idea why Hive did it.
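The safer alternative discussed here, propagating only a known set of keys rather than all table properties, could look roughly like the sketch below. The key list is illustrative only; the actual patch narrows the scope to specific ORC and Parquet configuration key names:

```scala
// Illustrative whitelist of format-specific keys worth propagating.
// These exact names are an assumption for the example, not the final set.
val allowedKeys = Set("compression", "orc.compress", "parquet.compression")

val tableProperties = Map(
  "compression"           -> "snappy",
  "transient_lastDdlTime" -> "1517900000" // Hive bookkeeping; should not leak
)

// Merge only whitelisted table properties into the serde options, so
// unrelated or conflicting table properties cannot change behavior.
val options = tableProperties.filter { case (k, _) => allowedKeys.contains(k) }

assert(options == Map("compression" -> "snappy"))
```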
Test build #87130 has finished for PR 20522 at commit
Test build #87131 has finished for PR 20522 at commit
Just FYI, #19382 solves the issue of storage/serde properties. However, the issue here is table properties; they are not related. Please create a new JIRA instead of treating it as a follow-up.
Also cc @cloud-fan @gengliangwang
I agree with you on all the issues. BTW, how can we put these existing Spark bugs into the ORC migration guide? Given that users already experience this for
Hey, sorry, it's unclear to me exactly what isn't supported right now in 2.3. The JIRA and this PR mention table properties not being supported, but #19382 seems like it fixed that. The reason I'm asking is to know what is or is not supported in 2.3, to determine whether I should personally turn the config convertMetastoreOrc on or not.
@tgravescs .
@dongjoon-hyun Could you submit the PR to resolve all the related issues?
Sure. I'll try, @gatorsmile . It'll take some time for me.
@gatorsmile . Sorry for the late update. I updated this PR by narrowing the scope of the configuration key names specifically to ORC and Parquet. The test coverage is reading and writing non-partitioned tables.
Test build #89901 has finished for PR 20522 at commit
LGTM, merging to master!
Thank you for the review and merge, @cloud-fan !
## What changes were proposed in this pull request?
Before Apache Spark 2.3, table properties were ignored when writing data to a Hive table (created with STORED AS PARQUET/ORC syntax), because the compression configurations were not passed to the FileFormatWriter in hadoopConf. This was fixed in #20087. But for CTAS with USING PARQUET/ORC syntax, table properties were also ignored during convertMetastore, so the test case for CTAS was not enabled. Now that this has been fixed in #20522, the test case should be enabled too.

## How was this patch tested?
This only re-enables the test cases of the previous PR.

Closes #22302 from fjh100456/compressionCodec.

Authored-by: fjh100456 <fu.jinhua6@zte.com.cn>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 473f2fb)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
Previously, SPARK-22158 fixed this for the `USING hive` syntax. This PR aims to fix it for the `STORED AS` syntax. Although the test case covers the ORC part, the patch considers both `convertMetastoreOrc` and `convertMetastoreParquet`.

How was this patch tested?
Pass the newly added test cases.