
[WIP][SPARK-25305][SQL] Respect attribute name in CollapseProject and ColumnPruning #22311

Closed

Conversation

gengliangwang
Member

What changes were proposed in this pull request?

Currently, the optimizer rule `CollapseProject` collapses the lower-level `Project` into the upper-level one, but the alias names from the lower level are propagated to the upper level.
In `ColumnPruning`, a `Project` is eliminated if its child's output attributes are `semanticEquals` to its own, even when the names don't match.

Let's see the following example:

        val location = "/tmp/t"
        val df = spark.range(10).toDF("id")
        df.write.format("parquet").saveAsTable("tbl")
        spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
        spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location $location")
        spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
        println(spark.read.parquet(location).schema)
        spark.table("tbl2").show()

The output column name in the schema will be `id` instead of `ID`, thus the last query shows nothing from `tbl2`.
By enabling the debug message we can see that the output naming is changed from `ID` to `id`, and then the `outputColumns` in `InsertIntoHadoopFsRelationCommand` is changed in `RemoveRedundantAliases`.
![wechatimg5](https://user-images.githubusercontent.com/1097932/44947871-6299f200-ae46-11e8-9c96-d45fe368206c.jpeg)

![wechatimg4](https://user-images.githubusercontent.com/1097932/44947866-56ae3000-ae46-11e8-8923-8b3bbe060075.jpeg)

With the fix proposed in this PR, the output naming `ID` won't be changed.
(screenshot: debug output with the fix applied, showing the output naming `ID` preserved)

How was this patch tested?

Unit test

@gengliangwang
Member Author

Thanks @wangyum for finding the issue. Before this PR, he proposed #22124 and #22287.
However, neither of them described the root cause of the regression, and their test cases were complex.
So I created this one.
We may also need to change the `outputColumns` in `DataWritingCommand` afterwards, to prevent it from being changed by the optimizer.

@gengliangwang
Member Author

```scala
AttributeMap(lower.zipWithIndex.collect {
  case (a: Alias, index: Int) =>
    a.toAttribute ->
      a.copy(name = upper(index).name)(a.exprId, a.qualifier, a.explicitMetadata)
})
```
Member Author


Is there a better way to do this than `Alias.copy(...)`?
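
For comparison, here is one alternative sketch (an illustration only, not code from this PR; it assumes the curried `Alias(child, name)(exprId, qualifier, explicitMetadata)` constructor is usable at this point): build a fresh `Alias` with the upper-level name instead of going through `copy`.

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeMap, NamedExpression}

// Hypothetical helper, only for illustration: rebuild each lower-level Alias with
// the corresponding upper-level name, preserving exprId, qualifier and explicit
// metadata so the renamed attribute stays semantically the same.
object AliasRenameSketch {
  def aliasMapWithUpperNames(
      upper: Seq[NamedExpression],
      lower: Seq[NamedExpression]): AttributeMap[Alias] = {
    AttributeMap(lower.zipWithIndex.collect {
      case (a: Alias, index) =>
        a.toAttribute ->
          Alias(a.child, upper(index).name)(a.exprId, a.qualifier, a.explicitMetadata)
    })
  }
}
```

Functionally this should be equivalent to the `copy` call above; whether it reads better is a matter of taste.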

@SparkQA

SparkQA commented Sep 1, 2018

Test build #95580 has finished for PR 22311 at commit f94fdf7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Sep 3, 2018

Does this behaviour depend on `spark.sql.caseSensitive`?

```diff
@@ -515,8 +515,7 @@ object PushProjectionThroughUnion extends Rule[LogicalPlan] with PredicateHelper
  */
 object ColumnPruning extends Rule[LogicalPlan] {
   private def sameOutput(output1: Seq[Attribute], output2: Seq[Attribute]): Boolean =
-    output1.size == output2.size &&
-      output1.zip(output2).forall(pair => pair._1.semanticEquals(pair._2))
+    output1.size == output2.size && output1.zip(output2).forall(pair => pair._1 == pair._2)
```
Member


I think we still need to check whether the `exprId`s they refer to are the same.
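
For illustration only, a sketch of the direction this comment suggests (not the code proposed in the PR): keep the `semanticEquals` comparison, which accounts for the `exprId`, and additionally require an exact name match.

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute

// Sketch of a stricter sameOutput: semantic equality (which covers exprId) plus
// an exact attribute-name comparison, instead of plain `==` on the attributes.
object SameOutputSketch {
  def sameOutput(output1: Seq[Attribute], output2: Seq[Attribute]): Boolean =
    output1.size == output2.size &&
      output1.zip(output2).forall { case (a1, a2) =>
        a1.semanticEquals(a2) && a1.name == a2.name
      }
}
```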

@cloud-fan
Contributor

> Does this behaviour depend on `spark.sql.caseSensitive`?

No. It's writing a column, not resolving one, so Spark should be case-preserving.

@maropu
Member

maropu commented Sep 3, 2018

ok, thanks.

@gengliangwang gengliangwang changed the title [SPARK-25305][SQL] Respect attribute name in CollapseProject and ColumnPruning [WIP][SPARK-25305][SQL] Respect attribute name in CollapseProject and ColumnPruning Sep 3, 2018
@gengliangwang
Member Author

gengliangwang commented Sep 3, 2018

> We may also need to change the `outputColumns` in `DataWritingCommand` afterwards, to prevent it from being changed by the optimizer.

On second thought, I will create a PR to fix `outputColumns` first, since it will take time to cover all the cases by changing the optimizer rules.
Marking this PR as WIP for now.

@SparkQA

SparkQA commented Sep 3, 2018

Test build #95602 has finished for PR 22311 at commit 96b4cca.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Sep 3, 2018

Test build #95605 has finished for PR 22311 at commit 96b4cca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gengliangwang added a commit to gengliangwang/spark that referenced this pull request Sep 6, 2018
## What changes were proposed in this pull request?

Let's see the following example:
```
        val location = "/tmp/t"
        val df = spark.range(10).toDF("id")
        df.write.format("parquet").saveAsTable("tbl")
        spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
        spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location $location")
        spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
        println(spark.read.parquet(location).schema)
        spark.table("tbl2").show()
```
The output column name in schema will be `id` instead of `ID`, thus the last query shows nothing from `tbl2`.
By enabling the debug message we can see that the output naming is changed from `ID` to `id`, and then the `outputColumns` in `InsertIntoHadoopFsRelationCommand` is changed in `RemoveRedundantAliases`.
![wechatimg5](https://user-images.githubusercontent.com/1097932/44947871-6299f200-ae46-11e8-9c96-d45fe368206c.jpeg)

![wechatimg4](https://user-images.githubusercontent.com/1097932/44947866-56ae3000-ae46-11e8-8923-8b3bbe060075.jpeg)

**To guarantee correctness**, we should change the output columns from `Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by optimizer.

I will fix project elimination related rules in apache#22311 after this one.

## How was this patch tested?

Unit test.

Closes apache#22320 from gengliangwang/fixOutputSchema.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request Sep 6, 2018
…put names

Port #22320 to branch-2.3
## What changes were proposed in this pull request?

Let's see the following example:
```
        val location = "/tmp/t"
        val df = spark.range(10).toDF("id")
        df.write.format("parquet").saveAsTable("tbl")
        spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
        spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location $location")
        spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
        println(spark.read.parquet(location).schema)
        spark.table("tbl2").show()
```
The output column name in schema will be `id` instead of `ID`, thus the last query shows nothing from `tbl2`.
By enabling the debug message we can see that the output naming is changed from `ID` to `id`, and then the `outputColumns` in `InsertIntoHadoopFsRelationCommand` is changed in `RemoveRedundantAliases`.
![wechatimg5](https://user-images.githubusercontent.com/1097932/44947871-6299f200-ae46-11e8-9c96-d45fe368206c.jpeg)

![wechatimg4](https://user-images.githubusercontent.com/1097932/44947866-56ae3000-ae46-11e8-8923-8b3bbe060075.jpeg)

**To guarantee correctness**, we should change the output columns from `Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by optimizer.

I will fix project elimination related rules in #22311 after this one.

## How was this patch tested?

Unit test.

Closes #22346 from gengliangwang/portSchemaOutputName2.3.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95778 has finished for PR 22311 at commit 96b4cca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Sep 8, 2018

Can this PR include the revert of the #22320 change? Or should we keep that change as a safeguard for a while? (I think the second approach is better, so this is just a question.)

@gengliangwang
Member Author

@maropu Let's keep the change of #22320 for a while. Thanks.

@maropu
Member

maropu commented Sep 8, 2018

ok, thanks for the check.
