-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-44355][SQL] Move WithCTE into command queries #42036
Conversation
// CTE normally and don't need to force inline. | ||
!commands.head.isInstanceOf[CTEInChildren] | ||
} else if (commands.length > 1) { | ||
// This can happen with the multi-insert statement. We should fall back to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: There is duplicated logic here.
To make the code more readable, we can always collect the commands first. If the length of commands is 1, there is a different behavior based on the legacy conf. Otherwise the logic is determined.
* children. There are two reasons: | ||
* 1. Some rules will pattern match the root command nodes, and we should keep command | ||
* as the root node to not break them. | ||
* 2. `Dataset` eagerly executes the commands inside a query plan. However, the CTE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Shall we have an example for the eager execution?
@@ -0,0 +1,33 @@ | |||
-- WITH inside CTE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ: Is there a case will fail before the code change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, but the analyzed plan is different, as we always inline CTE before.
LGTM over |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, just minor comments.
override def left: LogicalPlan = child | ||
override def right: LogicalPlan = query | ||
override protected def withNewChildrenInternal( | ||
newLeft: LogicalPlan, newRight: LogicalPlan): LogicalPlan = | ||
copy(child = newLeft, query = newRight) | ||
|
||
override def withCTEDefs(cteDefs: Seq[CTERelationDef]): LogicalPlan = { | ||
withNewChildren(Seq(child, WithCTE(query, cteDefs))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just copy(query = WithCTE(...
like at other places?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
withNewChildren
can copy over the tree node tags.
"legacy behavior which may produce incorrect results because Spark may evaluate a CTE " + | ||
"relation more than once, even if it's nondeterministic.") | ||
.booleanConf | ||
.createWithDefault(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need version here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch!
*/ | ||
trait CTEInChildren extends LogicalPlan { | ||
def withCTEDefs(cteDefs: Seq[CTERelationDef]): LogicalPlan = { | ||
withNewChildren(children.map(WithCTE(_, cteDefs))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it makes sense to assert that we have only 1 child.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems fine to have multiple children, we just duplicate the CTE relations. The current code does not allow it though, and go back to inline CTE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that it is always fine to duplicate CTE relations into multiple childrens.
For example, if we have a non-deterministic relation definition and 1-1 reference to it in 2 childrens of CTEInChildren
and then here we duplicate the relations into the 2 childrens then I think the InlineCTE
rule will decide to inline the relation 2 times, which is not correct.
But I agree with you, I don't see that this could happen now.
the test failure is unrelated, I'm merging it to master, thanks for the review! |
This PR is based on apache#42022 to fix tests, as the PR author is on vacation. ### What changes were proposed in this pull request? In the PR, I propose to add new trait `CTEInChildren` and mix it to some commands that should have `WithCTE` on top of their children (queries) instead of main query. Also I modified the `CTESubstitution` rule and removed special handling of `Command`s and similar nodes. After the changes, `Command`, `ParsedStatement` and `InsertIntoDir` are handled in the same way as other queries by referring to CTE Defs. Only the difference is in `WithCTE` is not not placed on the top of main query but on top of command queries. Closes apache#41922 ### Why are the changes needed? To improve code maintenance. Right now the CTE resolution code path is diverged: query with commands go into CTE inline code path where non-command queries go into CTEDef code path. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running new test: ``` $ build/sbt "test:testOnly *InsertSuite" ``` Closes apache#42036 from cloud-fan/help. Lead-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… original WithCTE node ### What changes were proposed in this pull request? I noticed an outdated comment in the rule `InlineCTE` ``` // CTEs in SQL Commands have been inlined by `CTESubstitution` already, so it is safe to add // WithCTE as top node here. ``` This is not true anymore after #42036 . It's not a big deal as we replace not-inlined CTE relations with `Repartition` during optimization, so it doesn't matter where we put the `WithCTE` node with not-inlined CTE relations, as it will disappear eventually. But it's still better to keep it at its original place, as third-party rules may be sensitive about the plan shape. ### Why are the changes needed? to keep the plan shape as much as can after inlining CTE relations. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes #46617 from cloud-fan/cte. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This PR is based on #42022 to fix tests, as the PR author is on vacation.
What changes were proposed in this pull request?
In the PR, I propose to add new trait
CTEInChildren
and mix it to some commands that should haveWithCTE
on top of their children (queries) instead of main query. Also I modified theCTESubstitution
rule and removed special handling ofCommand
s and similar nodes. After the changes,Command
,ParsedStatement
andInsertIntoDir
are handled in the same way as other queries by referring to CTE Defs. Only the difference is inWithCTE
is not not placed on the top of main query but on top of command queries.Closes #41922
Why are the changes needed?
To improve code maintenance. Right now the CTE resolution code path is diverged: query with commands go into CTE inline code path where non-command queries go into CTEDef code path.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
By running new test: