[SPARK-24497][SQL] Support recursive SQL #40744

peter-toth · 2023-04-11T17:39:07Z

What changes were proposed in this pull request?

This PR adds a recursive query feature to Spark SQL.

A recursive query is defined using the WITH RECURSIVE keywords and by referring to the name of the common table expression within the query.
The implementation complies with the SQL standard and follows rules similar to those of other relational databases:

A query is made of an anchor term followed by a recursive term.
The anchor term doesn't contain a self-reference and is used to initialize the query.
The recursive term contains a self-reference and is used to expand the current set of rows with new ones.
The anchor and recursive terms must be combined with the UNION or UNION ALL operator.
New rows can only be derived from the rows newly added in the previous iteration (or from the initial set of rows of the anchor term). This limitation implies that recursive references can't be used with some joins, aggregations, or subqueries.
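
The rules above can be illustrated with a small plain-Python sketch. It is not the code in this PR and does not use Spark; the function name `evaluate_recursive_cte`, the `distinct` flag (standing in for UNION versus UNION ALL) and the `max_iterations` guard are all assumptions made for the illustration. The point it tries to show is that each iteration feeds only the rows added by the previous iteration into the recursive term.

```python
from typing import Callable, Hashable, Iterable, List, Set, Tuple

Row = Tuple[Hashable, ...]


def evaluate_recursive_cte(
    anchor: Iterable[Row],
    recursive_term: Callable[[List[Row]], Iterable[Row]],
    distinct: bool = False,      # False ~ UNION ALL, True ~ UNION (hypothetical flag)
    max_iterations: int = 1000,  # hypothetical safety guard, not a Spark setting
) -> List[Row]:
    seen: Set[Row] = set()

    def admit(rows: Iterable[Row]) -> List[Row]:
        # UNION ALL keeps every row; UNION drops rows already produced.
        if not distinct:
            return list(rows)
        fresh = []
        for row in rows:
            if row not in seen:
                seen.add(row)
                fresh.append(row)
        return fresh

    # The anchor term has no self-reference and initializes the result.
    delta = admit(anchor)
    result = list(delta)

    iterations = 0
    while delta:
        if iterations >= max_iterations:
            raise RuntimeError("recursion did not terminate")
        iterations += 1
        # The recursive term only sees the rows added by the previous iteration.
        delta = admit(recursive_term(delta))
        result.extend(delta)
    return result


# WITH RECURSIVE t(n) AS (VALUES (1) UNION ALL SELECT n+1 FROM t WHERE n < 100)
counting = evaluate_recursive_cte(
    anchor=[(1,)],
    recursive_term=lambda delta: [(n + 1,) for (n,) in delta if n < 100],
)
total = sum(n for (n,) in counting)  # corresponds to SELECT sum(n) FROM t

# Nodes reachable from node 1 in a cyclic graph; UNION semantics make it terminate.
EDGES = {1: [2], 2: [3], 3: [1]}
reachable = evaluate_recursive_cte(
    anchor=[(1,)],
    recursive_term=lambda delta: [(dst,) for (src,) in delta for dst in EDGES.get(src, [])],
    distinct=True,
)
```

With `distinct=False` the cyclic example would keep producing rows until the guard trips, which is why deduplicating UNION is the usual choice for graph traversals that may contain cycles.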

Please see cte-recursive.sql for some examples.

The implementation has the same limitation as SPARK-36447 / #33671:

With-CTEs mixed with SQL commands or DMLs will still go through the old inline code path because of our non-standard language specs and not-unified command/DML interfaces.

This means that recursive queries are not supported in SQL commands and DMLs.
With #42036 this restriction is lifted, and a recursive CTE fails only when the CTE is force-inlined (spark.sql.legacy.inlineCTEInCommands=true, or the command is a multi-insert statement).

Why are the changes needed?

Recursive queries are an ANSI SQL feature that is useful for processing hierarchical data.

Does this PR introduce any user-facing change?

Yes, it adds the recursive query feature.

How was this patch tested?

Added new UTs and tests in cte-recursion.sql.

peter-toth · 2023-04-20T08:57:33Z

This PR is WIP as it contains #40856. Once that PR is merged I will rebase and remove the WIP flag.

peter-toth · 2023-04-27T11:38:46Z

#40856 got merged and I've rebased this PR. I'm removing the WIP flag and the PR is ready for review.

cc @cloud-fan, @wangyum, @maryannxue, @sigmod

wangyum · 2023-05-30T05:31:22Z

Thanks @peter-toth. I tested this patch locally, but it seems to throw a StackOverflowError.
How to reproduce:

./dev/make-distribution.sh --tgz  -Phive -Phive-thriftserver
tar -zxf spark-3.5.0-SNAPSHOT-bin-3.3.5.tgz
cd spark-3.5.0-SNAPSHOT-bin-3.3.5
bin/spark-sql

spark-sql (default)> WITH RECURSIVE t(n) AS (
                   >     VALUES (1)
                   > UNION ALL
                   >     SELECT n+1 FROM t WHERE n < 100
                   > )
                   > SELECT sum(n) FROM t;
23/05/30 13:21:21 ERROR Executor: Exception in task 0.0 in stage 265.0 (TID 199)
java.lang.StackOverflowError
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

peter-toth · 2023-05-30T15:06:29Z

Thanks for testing this PR @wangyum. Interestingly, I didn't encounter a stack overflow when the recursion level is <100. The error starts to appear at level ~170 in my tests. I guess this depends on your default stack size. Since recursion works in a way that each iteration depends on the previous iteration, the RDD lineage of the tasks gets bigger and bigger, and the deserialization of those tasks can throw a stack overflow error at some point. Let me amend this PR by adding optional checkpointing so as to truncate the RDD lineage and be able to deal with deeper recursion...
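
To make the lineage argument concrete, here is a plain-Python analogy, not Spark code. Every name in it (`Step`, `lineage_depth`, `iterate`, `truncate_every`) is hypothetical. Each iteration's result keeps a reference to the result it was derived from, and a recursive walk over that chain stands in for the recursive deserialization of a task; dropping the ancestry every few levels plays the role of checkpointing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """Result of one iteration, remembering the step it was derived from."""
    rows: List[int]
    parent: Optional["Step"] = None


def lineage_depth(step: Optional[Step]) -> int:
    # Recursive walk over the ancestry; a stand-in for recursive
    # deserialization of a deeply nested lineage.
    if step is None:
        return 0
    return 1 + lineage_depth(step.parent)


def iterate(levels: int, truncate_every: Optional[int] = None) -> Step:
    current = Step(rows=[1])
    for level in range(1, levels + 1):
        # Each new step depends on the previous one, so the chain grows by one.
        current = Step(rows=[r + 1 for r in current.rows], parent=current)
        if truncate_every and level % truncate_every == 0:
            # Analogy for a checkpoint: keep the rows, forget how they were derived.
            current = Step(rows=current.rows, parent=None)
    return current


deep = iterate(3000)                           # untruncated chain of 3001 steps
shallow = iterate(3000, truncate_every=100)    # ancestry cut every 100 levels

try:
    lineage_depth(deep)
    overflowed = False
except RecursionError:
    # With Python's default recursion limit the deep chain cannot be walked.
    overflowed = True
```

The rows are the same in both runs; only the length of the dependency chain differs, which is the property that matters for the stack depth.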

wangyum · 2023-06-01T03:49:51Z

sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala

+
+  private def cacheAndCount(plan: LogicalPlan, limit: Option[Long]) = {
+    val limitedPlan = limit.map(l => Limit(Literal(l.toInt), plan)).getOrElse(plan)
+    val df = Dataset.ofRows(session, limitedPlan).persist()


Could we replace persist() with repartition() to avoid the stack overflow issue?

peter-toth · 2023-06-01

repartition() seems to be a good option to truncate the RDD lineage and decrease task sizes to avoid the stack overflow. I added it as the default cache mode in 2c206a0.

wangyum · 2023-06-01T04:00:49Z

sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala

+    var currentLimit = limit.map(_.toLong)
+    var (prevDF, prevCount) = cacheAndCount(anchor, currentLimit)
+
+    var currentLevel = 0


Why is currentLevel 0, not 1?
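
The thread does not answer this question. As one possible reading only, the plain-Python sketch below starts the counter at 0 after the anchor has been evaluated, so the counter ends up counting how many times the recursive term has run. It is not the PR's code: `run_with_limit`, `take`, `remaining` and `max_level` are hypothetical names, and the way the row limit shrinks between iterations is an assumption suggested by the `currentLimit` variable in the diff.

```python
from typing import Callable, List, Optional, Tuple


def run_with_limit(
    anchor: List[int],
    step: Callable[[List[int]], List[int]],
    limit: Optional[int] = None,
    max_level: int = 1000,  # hypothetical guard against runaway recursion
) -> Tuple[List[int], int]:
    def take(rows: List[int], remaining: Optional[int]) -> List[int]:
        # Apply the remaining row limit, if any.
        return rows if remaining is None else rows[:remaining]

    remaining = limit
    prev = take(anchor, remaining)
    result = list(prev)
    level = 0  # the anchor has run; the recursive term has not run yet

    while prev:
        if remaining is not None:
            # Rows already produced reduce what later iterations may return.
            remaining -= len(prev)
            if remaining <= 0:
                break
        if level >= max_level:
            raise RuntimeError("recursion level limit reached")
        level += 1  # one more execution of the recursive term
        prev = take(step(prev), remaining)
        result.extend(prev)
    return result, level


def next_rows(rows: List[int]) -> List[int]:
    # SELECT n+1 FROM t WHERE n < 100
    return [n + 1 for n in rows if n < 100]


all_rows, levels_run = run_with_limit([1], next_rows)
first_five, levels_for_five = run_with_limit([1], next_rows, limit=5)
```

Under this reading, a counter that started at 1 would instead count the anchor as the first level; which convention the PR intended is not stated in the conversation.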

ksn06 · 2023-07-06T07:51:27Z

Hey folks,
So glad to see this feature is being worked on. Do you have an estimate of when this could be released?

peter-toth · 2023-07-06T08:21:07Z

This feature very likely won't make it into the next release (Spark 3.5) as the branch cut is in 2 weeks. But I will try to add it to the one after next (Spark 4.0).

github-actions · 2023-10-15T00:19:21Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

milimetric · 2023-10-24T20:48:52Z

@peter-toth thank you so much for sticking with this over three major versions and three separate pull requests. Recursive queries would be really nice to have in Spark SQL.

KamilKandzia · 2023-12-10T12:08:57Z

@peter-toth Hi, we are very much looking forward to recursive SQL. We hope you will be able to complete this pull request :)

peter-toth · 2023-12-21T11:29:07Z

@milastdbx do you think you can take over this PR?

cc @cloud-fan, @mitkedb, @MaxGekk

milastdbx · 2023-12-25T16:35:47Z

Yes, thank you. Milan

waywtdcc · 2024-03-04T02:25:03Z

Can this PR be merged? I also encountered this scenario.

firstim · 2024-03-06T18:52:38Z

If you want to run hierarchical queries, you could try the following while this PR is not available:

https://pypi.org/project/pyspark-connectby/

SvenRelijveld1995 · 2024-04-29T19:27:47Z

Any update on this PR?

jeremyjh · 2024-05-09T14:59:18Z

@milastdbx are you still planning to take this up?

jboarman · 2024-06-11T19:44:04Z

@wangyum I see that you started the review last year and the issues you raised were addressed by Peter.

Then @milastdbx was tagged to take over the PR, but I don't see the issue being assigned to you yet.

How do we get this PR reviewed?

cc @cloud-fan, @mitkedb, @MaxGekk

travis-leith · 2024-07-03T11:36:02Z

@peter-toth I have not looked closely at the implementation, but I do have a question about the stack overflow discussed above: has the logic been implemented in a way similar to tail call optimization, such that there is no recursion depth limit?

sb-mirakl · 2024-09-19T07:40:37Z

Any update? Thanks!

peter-toth · 2024-09-19T08:56:25Z

Let me close this PR, as its open state seems to cause some confusion.
Feel free to reuse the code if anyone wants to tackle this issue.

jeremyjh · 2024-09-24T10:42:06Z

@peter-toth can you also update the status on the Jira ticket? https://issues.apache.org/jira/browse/SPARK-24497

github-actions bot added CORE DOCS SQL labels Apr 11, 2023

peter-toth force-pushed the SPARK-24497-recursive-cte branch 3 times, most recently from 3582a91 to a46c068 Compare April 12, 2023 19:15

peter-toth force-pushed the SPARK-24497-recursive-cte branch from a46c068 to 8f18a77 Compare April 19, 2023 13:53

peter-toth mentioned this pull request Apr 19, 2023

[WIP][SPARK-24497][SQL] Support recursive SQL query #29210

Closed

peter-toth force-pushed the SPARK-24497-recursive-cte branch 2 times, most recently from 042a018 to 33b6703 Compare April 20, 2023 08:56

peter-toth force-pushed the SPARK-24497-recursive-cte branch 3 times, most recently from 9302d52 to 38f8324 Compare April 26, 2023 13:51

peter-toth changed the title ~~[WIP][SPARK-24497][SQL] Support recursive SQL~~ [SPARK-24497][SQL] Support recursive SQL Apr 27, 2023

wangyum reviewed Jun 1, 2023

peter-toth force-pushed the SPARK-24497-recursive-cte branch 2 times, most recently from 02f527d to 206e9a8 Compare June 2, 2023 08:54

github-actions bot added Stale and removed CORE labels Oct 15, 2023

peter-toth removed the Stale label Oct 15, 2023

[SPARK-24497][SQL] Support recursive SQL

386c038

peter-toth force-pushed the SPARK-24497-recursive-cte branch from 8d0498d to 386c038 Compare October 16, 2023 08:03

peter-toth force-pushed the SPARK-24497-recursive-cte branch 2 times, most recently from 3dfb1f6 to ae25f5f Compare December 14, 2023 15:30

Merge branch 'master' into SPARK-24497-recursive-cte

a325020

peter-toth force-pushed the SPARK-24497-recursive-cte branch from ae25f5f to a325020 Compare December 15, 2023 09:37

peter-toth closed this Sep 19, 2024
