Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-44355][SQL] Move WithCTE into command queries #42036

Closed
wants to merge 9 commits into from

Conversation

cloud-fan
Copy link
Contributor

This PR is based on #42022 to fix tests, as the PR author is on vacation.

What changes were proposed in this pull request?

In the PR, I propose to add new trait CTEInChildren and mix it to some commands that should have WithCTE on top of their children (queries) instead of main query. Also I modified the CTESubstitution rule and removed special handling of Commands and similar nodes. After the changes, Command, ParsedStatement and InsertIntoDir are handled in the same way as other queries by referring to CTE Defs. Only the difference is in WithCTE is not not placed on the top of main query but on top of command queries.

Closes #41922

Why are the changes needed?

To improve code maintenance. Right now the CTE resolution code path is diverged: query with commands go into CTE inline code path where non-command queries go into CTEDef code path.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

By running new test:

$ build/sbt "test:testOnly *InsertSuite"

@github-actions github-actions bot added the SQL label Jul 17, 2023
// CTE normally and don't need to force inline.
!commands.head.isInstanceOf[CTEInChildren]
} else if (commands.length > 1) {
// This can happen with the multi-insert statement. We should fall back to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: There is duplicated logic here.
To make the code more readable, we can always collect the commands first. If the length of commands is 1, there is a different behavior based on the legacy conf. Otherwise the logic is determined.

* children. There are two reasons:
* 1. Some rules will pattern match the root command nodes, and we should keep command
* as the root node to not break them.
* 2. `Dataset` eagerly executes the commands inside a query plan. However, the CTE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Shall we have an example for the eager execution?

@@ -0,0 +1,33 @@
-- WITH inside CTE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ: Is there a case will fail before the code change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, but the analyzed plan is different, as we always inline CTE before.

@gengliangwang
Copy link
Member

LGTM over

Copy link
Contributor

@peter-toth peter-toth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just minor comments.

override def left: LogicalPlan = child
override def right: LogicalPlan = query
override protected def withNewChildrenInternal(
newLeft: LogicalPlan, newRight: LogicalPlan): LogicalPlan =
copy(child = newLeft, query = newRight)

override def withCTEDefs(cteDefs: Seq[CTERelationDef]): LogicalPlan = {
withNewChildren(Seq(child, WithCTE(query, cteDefs)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just copy(query = WithCTE(... like at other places?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

withNewChildren can copy over the tree node tags.

"legacy behavior which may produce incorrect results because Spark may evaluate a CTE " +
"relation more than once, even if it's nondeterministic.")
.booleanConf
.createWithDefault(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need version here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

*/
trait CTEInChildren extends LogicalPlan {
def withCTEDefs(cteDefs: Seq[CTERelationDef]): LogicalPlan = {
withNewChildren(children.map(WithCTE(_, cteDefs)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it makes sense to assert that we have only 1 child.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems fine to have multiple children, we just duplicate the CTE relations. The current code does not allow it though, and go back to inline CTE.

Copy link
Contributor

@peter-toth peter-toth Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that it is always fine to duplicate CTE relations into multiple childrens.
For example, if we have a non-deterministic relation definition and 1-1 reference to it in 2 childrens of CTEInChildren and then here we duplicate the relations into the 2 childrens then I think the InlineCTE rule will decide to inline the relation 2 times, which is not correct.
But I agree with you, I don't see that this could happen now.

@cloud-fan
Copy link
Contributor Author

the test failure is unrelated, I'm merging it to master, thanks for the review!

@cloud-fan cloud-fan closed this in da84f81 Jul 25, 2023
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
This PR is based on apache#42022 to fix tests, as the PR author is on vacation.

### What changes were proposed in this pull request?
In the PR, I propose to add new trait `CTEInChildren` and mix it to some commands that should have `WithCTE` on top of their children (queries) instead of main query. Also I modified  the `CTESubstitution` rule and removed special handling of `Command`s and similar nodes. After the changes, `Command`, `ParsedStatement` and `InsertIntoDir` are handled in the same way as other queries by referring to CTE Defs. Only the difference is in `WithCTE` is not not placed on the top of main query but on top of command queries.

Closes apache#41922

### Why are the changes needed?
To improve code maintenance. Right now the CTE resolution code path is diverged: query with commands go into CTE inline code path where non-command queries go into CTEDef code path.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running new test:
```
$ build/sbt "test:testOnly *InsertSuite"
```

Closes apache#42036 from cloud-fan/help.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Jun 4, 2024
… original WithCTE node

### What changes were proposed in this pull request?

I noticed an outdated comment in the rule `InlineCTE`
```
      // CTEs in SQL Commands have been inlined by `CTESubstitution` already, so it is safe to add
      // WithCTE as top node here.
```

This is not true anymore after #42036 . It's not a big deal as we replace not-inlined CTE relations with `Repartition` during optimization, so it doesn't matter where we put the `WithCTE` node with not-inlined CTE relations, as it will disappear eventually. But it's still better to keep it at its original place, as third-party rules may be sensitive about the plan shape.

### Why are the changes needed?

to keep the plan shape as much as can after inlining CTE relations.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #46617 from cloud-fan/cte.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants