-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-24497][SQL] Support recursive SQL #40744
Conversation
3582a91
to
a46c068
Compare
a46c068
to
8f18a77
Compare
042a018
to
33b6703
Compare
This PR is WIP as it contains #40856. Once that PR is merged I will rebase and remove the WIP flag. |
9302d52
to
38f8324
Compare
#40856 got merged and I've rebased this PR. I'm removing the WIP flag and the PR is ready for review. cc @cloud-fan, @wangyum, @maryannxue, @sigmod |
Thanks @peter-toth. I tested this patch locally. But it seem it throws
|
Thanks for testing this PR @wangyum. Iterestingly, I didn't encounter stack overflow when recursion level is <100. The error starts to appear at level ~170 in my tests. I guess this depends on your default stack size. Since recursion works in a way that each iteration depends on the previous iteration, the RDD lineage of the tasks are getting bigger and bigger and the deserialization of those tasks can throw stack overflow error at some point. Let me amend this PR with adding optional checkpointing so as to truncate RDD linage and be able to deal with deeper recursion... |
|
||
private def cacheAndCount(plan: LogicalPlan, limit: Option[Long]) = { | ||
val limitedPlan = limit.map(l => Limit(Literal(l.toInt), plan)).getOrElse(plan) | ||
val df = Dataset.ofRows(session, limitedPlan).persist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we replace persist()
with repartition()
to avoid stack overflow issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
repartition()
seems to be good option to truncate RDD lineage and decrease task sizes to avoid stack overflow. I added it as the default cache mode in 2c206a0.
var currentLimit = limit.map(_.toLong) | ||
var (prevDF, prevCount) = cacheAndCount(anchor, currentLimit) | ||
|
||
var currentLevel = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why currentLevel
is 0, not 1?
02f527d
to
206e9a8
Compare
Hey folks, |
This feature very likely won't make it into the next release (Spark 3.5) as tbe branch cut is in 2 weeks. But I will try to add it to the one after next (Spark 4.0). |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
8d0498d
to
386c038
Compare
@peter-toth thank you so much for sticking with this over three major versions and three separate pull requests. Recursive queries would be really nice to have in Spark SQL. |
@peter-toth Hi, we are very much expecting a recursive sql. We hope you will be able to complete this pull request :) |
3dfb1f6
to
ae25f5f
Compare
ae25f5f
to
a325020
Compare
@milastdbx do you think you can take over this PR? cc @cloud-fan, @mitkedb, @MaxGekk |
Yes, thank you.
Milan
…On Thu, Dec 21, 2023 at 12:29 PM Peter Toth ***@***.***> wrote:
@milastdbx <https://github.com/milastdbx> do you think you can take over
this PR?
cc @cloud-fan <https://github.com/cloud-fan>, @mitkedb
<https://github.com/mitkedb>
—
Reply to this email directly, view it on GitHub
<#40744 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/BD3GPBCAU24GF5QGXTQCYLLYKQMRHAVCNFSM6AAAAAAW2SUTGCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNRWGA4TENRUGU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Can this PR be merged? I also encountered this scenario |
If want to achieve hierarch query, you could try following while this PR is not available atm. |
Any update on this PR? |
@milastdbx are you still planning to take this up? |
@wangyum I see that you started the review last year and the issues you raised were addressed by Peter. Then @milastdbx was tagged to take over the PR, but I don't see the issue being assigned to you yet. How do we get this PR reviewed? cc @cloud-fan, @mitkedb, @MaxGekk |
@peter-toth I have not looked closely at the implementation but I do have a question about this: has the logic been implemented in some way similar to tail call optimization such that there is no recursion depth limit? |
Any update ? Thanks ! |
Let me close this PR as seemingly its open state causes some confusion. |
@peter-toth can you also update the status on the Jira ticket? https://issues.apache.org/jira/browse/SPARK-24497 |
What changes were proposed in this pull request?
This PR adds recursive query feature to Spark SQL.
A recursive query is defined using the
WITH RECURSIVE
keywords and referring the name of the common table expression within the query.The implementation complies with SQL standard and follows similar rules to other relational databases:
UNION
orUNION ALL
operators.Please see
cte-recursive.sql
for some examples.The implemetation has the same limiation that SPARK-36447 / #33671 has:
which means that recursive queries are not supported in SQL commands and DMLs.
With #42036 this restriction is lifted and a recursive CTE only doesn't work when the CTE is force inlined (
spark.sql.legacy.inlineCTEInCommands=true
or the command is a multi-insert statement).Why are the changes needed?
Recursive query is an ANSI SQL feature that is useful to process hierarchical data.
Does this PR introduce any user-facing change?
Yes, adds recursive query feature.
How was this patch tested?
Added new UTs and tests in
cte-recursion.sql
.