
[WIP][SPARK-25305][SQL] Respect attribute name in CollapseProject and ColumnPruning #22311

Closed

Conversation

gengliangwang
Member

What changes were proposed in this pull request?

Currently, the optimizer rule `CollapseProject` collapses the lower-level `Project` into the upper-level one, but the alias names from the lower level are propagated to the upper level.
In `ColumnPruning`, a `Project` is eliminated if its child's output attributes are `semanticEquals` to its own, even when the names don't match.

Let's see the following example:

        val location = "/tmp/t"
        val df = spark.range(10).toDF("id")
        df.write.format("parquet").saveAsTable("tbl")
        spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
        spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location $location")
        spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
        println(spark.read.parquet(location).schema)
        spark.table("tbl2").show()

The output column name in the schema will be `id` instead of `ID`, thus the last query shows nothing from `tbl2`.
By enabling the debug message we can see that the output naming is changed from `ID` to `id`, and then the `outputColumns` in `InsertIntoHadoopFsRelationCommand` is changed in `RemoveRedundantAliases`.
![wechatimg5](https://user-images.githubusercontent.com/1097932/44947871-6299f200-ae46-11e8-9c96-d45fe368206c.jpeg)

![wechatimg4](https://user-images.githubusercontent.com/1097932/44947866-56ae3000-ae46-11e8-8923-8b3bbe060075.jpeg)

With the fix proposed in this PR, the output naming `ID` won't be changed.
(screenshot: debug output with the fix applied, showing the output naming `ID` preserved)

How was this patch tested?

Unit test

@gengliangwang
Member Author

Thanks @wangyum for finding the issue. Before this PR, he proposed #22124 and #22287.
However, neither of them described the root cause of the regression, and their test cases were complex.
So I created this one.
We may also need to change the `outputColumns` in `DataWritingCommand` afterwards, to prevent it from being changed by the optimizer.

@gengliangwang
Member Author

```scala
AttributeMap(lower.zipWithIndex.collect {
  case (a: Alias, index: Int) =>
    a.toAttribute ->
      a.copy(name = upper(index).name)(a.exprId, a.qualifier, a.explicitMetadata)
})
```
Member Author


Is there a better way to do this than `Alias.copy(...)`?
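
For comparison, here is one alternative sketch (an illustration only, not code from this PR; it assumes the curried `Alias(child, name)(exprId, qualifier, explicitMetadata)` constructor is usable at this point): build a fresh `Alias` with the upper-level name instead of going through `copy`.

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeMap, NamedExpression}

// Hypothetical helper, only for illustration: rebuild each lower-level Alias with
// the corresponding upper-level name, preserving exprId, qualifier and explicit
// metadata so the renamed attribute stays semantically the same.
object AliasRenameSketch {
  def aliasMapWithUpperNames(
      upper: Seq[NamedExpression],
      lower: Seq[NamedExpression]): AttributeMap[Alias] = {
    AttributeMap(lower.zipWithIndex.collect {
      case (a: Alias, index) =>
        a.toAttribute ->
          Alias(a.child, upper(index).name)(a.exprId, a.qualifier, a.explicitMetadata)
    })
  }
}
```

Functionally this should be equivalent to the `copy` call above; whether it reads better is a matter of taste.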

@SparkQA

SparkQA commented Sep 1, 2018

Test build #95580 has finished for PR 22311 at commit f94fdf7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Sep 3, 2018

Does this behaviour depend on `spark.sql.caseSensitive`?

```diff
@@ -515,8 +515,7 @@ object PushProjectionThroughUnion extends Rule[LogicalPlan] with PredicateHelper
  */
 object ColumnPruning extends Rule[LogicalPlan] {
   private def sameOutput(output1: Seq[Attribute], output2: Seq[Attribute]): Boolean =
-    output1.size == output2.size &&
-      output1.zip(output2).forall(pair => pair._1.semanticEquals(pair._2))
+    output1.size == output2.size && output1.zip(output2).forall(pair => pair._1 == pair._2)
```
Member


I think we still need to check whether the `exprId`s they refer to are the same.
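
For illustration only, a sketch of the direction this comment suggests (not the code proposed in the PR): keep the `semanticEquals` comparison, which accounts for the `exprId`, and additionally require an exact name match.

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute

// Sketch of a stricter sameOutput: semantic equality (which covers exprId) plus
// an exact attribute-name comparison, instead of plain `==` on the attributes.
object SameOutputSketch {
  def sameOutput(output1: Seq[Attribute], output2: Seq[Attribute]): Boolean =
    output1.size == output2.size &&
      output1.zip(output2).forall { case (a1, a2) =>
        a1.semanticEquals(a2) && a1.name == a2.name
      }
}
```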

@cloud-fan
Contributor

> Does this behaviour depend on `spark.sql.caseSensitive`?

No. It's writing a column, not resolving one, so Spark should be case-preserving.

@maropu
Member

maropu commented Sep 3, 2018

ok, thanks.

@gengliangwang gengliangwang changed the title [SPARK-25305][SQL] Respect attribute name in CollapseProject and ColumnPruning [WIP][SPARK-25305][SQL] Respect attribute name in CollapseProject and ColumnPruning Sep 3, 2018
@gengliangwang
Member Author

gengliangwang commented Sep 3, 2018

> We may also need to change the `outputColumns` in `DataWritingCommand` afterwards, to prevent it from being changed by the optimizer.

On second thought, I will create a PR to fix `outputColumns` first, since it will take time to cover all the cases by changing the optimizer rules.
Marking this PR as WIP for now.

@SparkQA

SparkQA commented Sep 3, 2018

Test build #95602 has finished for PR 22311 at commit 96b4cca.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Sep 3, 2018

Test build #95605 has finished for PR 22311 at commit 96b4cca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gengliangwang added a commit to gengliangwang/spark that referenced this pull request Sep 6, 2018
## What changes were proposed in this pull request?

Let's see the following example:
```
        val location = "/tmp/t"
        val df = spark.range(10).toDF("id")
        df.write.format("parquet").saveAsTable("tbl")
        spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
        spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location $location")
        spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
        println(spark.read.parquet(location).schema)
        spark.table("tbl2").show()
```
The output column name in schema will be `id` instead of `ID`, thus the last query shows nothing from `tbl2`.
By enabling the debug message we can see that the output naming is changed from `ID` to `id`, and then the `outputColumns` in `InsertIntoHadoopFsRelationCommand` is changed in `RemoveRedundantAliases`.
![wechatimg5](https://user-images.githubusercontent.com/1097932/44947871-6299f200-ae46-11e8-9c96-d45fe368206c.jpeg)

![wechatimg4](https://user-images.githubusercontent.com/1097932/44947866-56ae3000-ae46-11e8-8923-8b3bbe060075.jpeg)

**To guarantee correctness**, we should change the output columns from `Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by optimizer.

I will fix project elimination related rules in apache#22311 after this one.

## How was this patch tested?

Unit test.

Closes apache#22320 from gengliangwang/fixOutputSchema.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asfgit pushed a commit that referenced this pull request Sep 6, 2018
…put names

Port #22320 to branch-2.3
## What changes were proposed in this pull request?

Let's see the following example:
```
        val location = "/tmp/t"
        val df = spark.range(10).toDF("id")
        df.write.format("parquet").saveAsTable("tbl")
        spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
        spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location $location")
        spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
        println(spark.read.parquet(location).schema)
        spark.table("tbl2").show()
```
The output column name in schema will be `id` instead of `ID`, thus the last query shows nothing from `tbl2`.
By enabling the debug message we can see that the output naming is changed from `ID` to `id`, and then the `outputColumns` in `InsertIntoHadoopFsRelationCommand` is changed in `RemoveRedundantAliases`.
![wechatimg5](https://user-images.githubusercontent.com/1097932/44947871-6299f200-ae46-11e8-9c96-d45fe368206c.jpeg)

![wechatimg4](https://user-images.githubusercontent.com/1097932/44947866-56ae3000-ae46-11e8-8923-8b3bbe060075.jpeg)

**To guarantee correctness**, we should change the output columns from `Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by optimizer.

I will fix project elimination related rules in #22311 after this one.

## How was this patch tested?

Unit test.

Closes #22346 from gengliangwang/portSchemaOutputName2.3.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Sep 7, 2018

Test build #95778 has finished for PR 22311 at commit 96b4cca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Sep 8, 2018

Can this PR include the revert of the #22320 change? Or should we keep that change as a safeguard for a while? (I think the second approach is better, so this is just a question.)

@gengliangwang
Member Author

@maropu Let's keep the change of #22320 for a while. Thanks.

@maropu
Member

maropu commented Sep 8, 2018

ok, thanks for the check.
