
[SPARK-23355][SQL] convertMetastore should not ignore table properties #20522

Closed · wants to merge 1 commit

Conversation

dongjoon-hyun (Member) commented Feb 6, 2018

What changes were proposed in this pull request?

Previously, SPARK-22158 fixed this for the USING hive syntax. This PR aims to fix it for the STORED AS syntax. Although the test case covers only the ORC path, the patch handles both convertMetastoreOrc and convertMetastoreParquet.
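For illustration, a minimal sketch of the behavior being fixed (the table name is illustrative, and it assumes the ORC conversion is enabled; see the added test cases for the exact coverage):

```scala
// Assumes spark.sql.hive.convertMetastoreOrc=true, i.e. the conversion path
// this PR touches. Before this patch, the TBLPROPERTIES entry below was
// dropped during conversion, so files were written with the session default
// compression codec instead of ZLIB.
spark.sql(
  """CREATE TABLE t(id INT) STORED AS ORC
    |TBLPROPERTIES ('orc.compress' = 'ZLIB')""".stripMargin)
spark.sql("INSERT INTO t VALUES (1)")
// After this patch, the written ORC files honor 'orc.compress' = 'ZLIB'.
```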

How was this patch tested?

Pass the newly added test cases.

```diff
   sessionCatalog.metastoreCatalog
     .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
 } else {
-  val options = relation.tableMeta.storage.properties
+  val options = relation.tableMeta.properties ++ relation.tableMeta.storage.properties
```
dongjoon-hyun (Member, Author) commented on the diff, Feb 6, 2018

@cloud-fan and @gatorsmile: unlike the USING hive syntax, the STORED AS syntax saves the properties into tableMeta.properties. This PR considers both kinds of properties; please see the test case for an example (and the sketch below).
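A minimal sketch of where each clause lands in the catalog metadata (the table name is illustrative; the catalog calls are the standard Spark APIs):

```scala
import org.apache.spark.sql.catalyst.TableIdentifier

// STORED AS puts TBLPROPERTIES entries into CatalogTable.properties,
// which convertMetastore previously ignored; OPTIONS of the USING hive
// syntax land in CatalogTable.storage.properties instead.
spark.sql(
  """CREATE TABLE s(id INT) STORED AS ORC
    |TBLPROPERTIES ('orc.compress' = 'ZLIB')""".stripMargin)

val meta = spark.sessionState.catalog.getTableMetadata(TableIdentifier("s"))
meta.properties.get("orc.compress")          // Some(ZLIB)
meta.storage.properties.get("orc.compress")  // None: not a storage property
```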

gatorsmile (Member) commented:

This is not the right thing to do. The table properties should not always be put into the serde properties.

gatorsmile (Member) commented:

That is the reason why my previous PR #20120 was closed. We need a comprehensive fix that resolves all the DDL issues.

dongjoon-hyun (Member, Author) commented:

This code path is only for convertMetastore. I think we inevitably need this change to prevent a regression.

dongjoon-hyun (Member, Author) commented:

It's surprising that Apache Spark has not respected table properties in convertMetastoreParquet until now, given that the conversion has been enabled by default.

gatorsmile (Member) commented Feb 7, 2018

The issues have been defined in the PR description of #20120:

> Currently, we ignore table-specific compression conf when the Hive serde tables are converted to the data source tables. We also ignore it when users set compression in the TBLPROPERTIES clause instead of the OPTIONS clause, even if the tables are native data source tables.

#20087 is also trying to resolve related issues, so we might still be missing multiple critical bugs.

Simply adding the table properties to the options will introduce new bugs. For example, users might provide conflicting serde properties, or change the properties later through DDLs (see the sketch below).
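A hypothetical illustration of such a conflict (the table name and codec values are made up for the example):

```scala
// The same key can end up in both property maps with different values.
spark.sql(
  """CREATE TABLE c(id INT) STORED AS ORC
    |TBLPROPERTIES ('orc.compress' = 'ZLIB')""".stripMargin)
spark.sql("ALTER TABLE c SET SERDEPROPERTIES ('orc.compress' = 'SNAPPY')")

// With a blind `tableMeta.properties ++ tableMeta.storage.properties`,
// the serde value (SNAPPY) silently wins, because `m1 ++ m2` lets m2's
// entries override m1's on key collisions.
```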

gatorsmile (Member) commented:

We can't blindly add the table properties to the options; that also introduces behavior changes. For example, some table properties might have names identical to serde properties but different semantics.

Thus, what we can do is add only the table properties we actually need.

Conceptually, the table properties should not carry serde-related confs. However, Hive basically does not follow the rule it sets, which is pretty strange. No idea why Hive did it.

SparkQA commented Feb 6, 2018

Test build #87130 has finished for PR 20522 at commit 23d8205.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 7, 2018

Test build #87131 has finished for PR 20522 at commit 0f65eb9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member) commented:

Just FYI, #19382 solves the issue of storage/serde properties, but the issue here is table properties; they are not related. Please create a new JIRA instead of treating this as a follow-up.

gatorsmile (Member) commented:

Also cc @cloud-fan @gengliangwang

dongjoon-hyun (Member, Author) commented Feb 7, 2018

I agree with you on all the issues. BTW, how should we describe these existing Spark bugs in the ORC migration guide?

Given that users already experience this for STORED AS PARQUET Hive tables, is it okay to mention that ORC tables will behave the same way as Parquet tables?

dongjoon-hyun changed the title from "[SPARK-22158][SQL][FOLLOWUP] convertMetastore should not ignore table properties" to "[SPARK-23355][SQL] convertMetastore should not ignore table properties" on Feb 7, 2018
tgravescs (Contributor) commented:

Hey, sorry, it's unclear to me exactly what isn't supported right now in 2.3. The JIRA and this PR mention that table properties are not supported, but #19382 seems to have fixed that. The reason I'm asking is to know what is and is not supported in 2.3, so I can decide whether to turn the convertMetastoreOrc config on or not.

dongjoon-hyun (Member, Author) commented Feb 27, 2018

@tgravescs: yep, the title of SPARK-22158 was changed recently because it only supported table SerDe properties. If you set the ORC property in the table's storage SerDe properties (not in the table properties), it will work. This PR additionally covers table properties.

gatorsmile (Member) commented:

@dongjoon-hyun Could you submit the PR to resolve all the related issues?

dongjoon-hyun (Member, Author) commented:

Sure. I'll try, @gatorsmile . It'll take some time for me.

dongjoon-hyun (Member, Author) commented:

@gatorsmile, sorry for the late update. I updated this PR by narrowing the scope to configuration key names specific to ORC and Parquet (see the sketch below). The test coverage is reading and writing non-partitioned tables.
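A hedged sketch of the narrowed merge, assuming key-prefix predicates like the ones in the merged commit (check 8aa1d7b for the exact patterns; the sample maps here are made up):

```scala
// Keep only ORC-specific keys from the table properties; storage properties
// still supersede table properties on key collisions.
def isOrcProperty(key: String): Boolean =
  key.startsWith("orc.") || key.contains(".orc.")

val tableProps   = Map("orc.compress" -> "ZLIB", "comment" -> "not a serde conf")
val storageProps = Map("serialization.format" -> "1")

val options = tableProps.filterKeys(isOrcProperty).toMap ++ storageProps
// options == Map("orc.compress" -> "ZLIB", "serialization.format" -> "1")
```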

SparkQA commented Apr 26, 2018

Test build #89901 has finished for PR 20522 at commit 06a9a45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor) commented:

LGTM, merging to master!

asfgit closed this in 8aa1d7b on Apr 27, 2018
dongjoon-hyun (Member, Author) commented:

Thank you for the review and merge, @cloud-fan!

dongjoon-hyun deleted the SPARK-22158-2 branch on April 27, 2018 03:59
asfgit pushed a commit that referenced this pull request Sep 7, 2018
## What changes were proposed in this pull request?
Before Apache Spark 2.3, table properties were ignored when writing data to a Hive table (created with the STORED AS PARQUET/ORC syntax), because the compression configurations were not passed to the FileFormatWriter in hadoopConf. That was fixed in #20087. But for CTAS with the USING PARQUET/ORC syntax, table properties were also ignored during convertMetastore, so the test cases for CTAS could not be enabled.

Now that this has been fixed in #20522, the test cases should be enabled too.

## How was this patch tested?
This only re-enables the test cases from the previous PR.

Closes #22302 from fjh100456/compressionCodec.

Authored-by: fjh100456 <fu.jinhua6@zte.com.cn>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 473f2fb)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>