
WITH RECURSIVE is inefficient #4771

Closed
lhofhansl opened this issue Aug 10, 2020 · 3 comments
lhofhansl (Member) commented Aug 10, 2020

First off, I am excited that Presto supports this now, and I get why it works the way it does. Just filing this for discussion.

Looking here: https://github.com/prestosql/presto/blob/340/presto-main/src/main/java/io/prestosql/sql/planner/QueryPlanner.java#L235, a single plan is built with size O(n^2), and it is always expanded all the way to maxRecursionDepth.
This is not a complaint, just pointing out that, as is, the feature is only partially useful.
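To make the quadratic size concrete, here is roughly what the expansion amounts to (an illustrative sketch in SQL; the planner of course builds plan nodes, not SQL text). Each recursion step re-inlines the full expansion of the previous step, so step k has size proportional to k, and n steps sum to O(n^2). For a simple counter CTE with a depth limit of 2:

-- WITH RECURSIVE t (n) AS (SELECT 1 UNION ALL SELECT n + 1 FROM t WHERE n < 10)
SELECT 1                                 -- step 0: the base case
UNION ALL
SELECT n + 1 FROM (SELECT 1) AS t (n)    -- step 1 inlines step 0
WHERE n < 10
UNION ALL
SELECT n + 1 FROM (                      -- step 2 inlines step 1,
  SELECT n + 1 FROM (SELECT 1) AS t (n)  -- which inlines step 0 again
  WHERE n < 10
) AS t (n)
WHERE n < 10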

As an engineer I totally get it: Presto has no place to store intermediate results.
(If we had one, we could create a temp table with an extra current_depth column and store/query intermediate results that way; we'd also need "looping" query execution. At the end we could query the whole temp table for UNION ALL, or do a DISTINCT scan over all columns but the depth for UNION.)
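A sketch of that hypothetical evaluation strategy, using the Fibonacci example further below (the table and column names are made up; Presto has neither temp tables nor the looping statement execution this requires):

CREATE TABLE tmp_fib AS
SELECT 0 AS cur, 1 AS prev, 0 AS current_depth;

-- "loop": re-run with an incremented depth until no rows are inserted
INSERT INTO tmp_fib
SELECT cur + prev, cur, current_depth + 1
FROM tmp_fib
WHERE current_depth = 0             -- the previous iteration's depth
  AND cur + prev < 1000000;

-- UNION ALL semantics: read everything back
SELECT cur, prev FROM tmp_fib;
-- UNION semantics: DISTINCT over all columns but the depth
SELECT DISTINCT cur, prev FROM tmp_fib;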

As a user, though, I would try something like the query below and be either surprised or disappointed.
The user is forced to anticipate the recursion depth fairly closely, or the query will either be inefficient (limit too high) or fail (limit too low).

WITH RECURSIVE fib (cur, prev) AS
(
  SELECT 0, 1
    UNION ALL 
  SELECT cur + prev, cur FROM fib WHERE cur + prev < 1000000
)
SELECT cur FROM fib ORDER BY cur;

In PostgreSQL:

  cur   
--------
      0
 [...]
 832040
(31 rows)

Time: 0.836 ms

In Presto:

Query 20200810_212624_00016_iampk, FAILED, 1 node
Splits: 18 total, 0 done (0.00%)
1.91 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20200810_212624_00016_iampk failed: Recursion depth limit exceeded (10). Use 'max_recursion_depth' session property to modify the limit.

Ok... So set session max_recursion_depth=100;

Now:

Query 20200810_212703_00018_iampk failed: statement is too large (stack overflow during analysis)

Ok... So set session max_recursion_depth=40;

Now it works, but it takes almost 35s:

  cur   
--------
      0 
 [...]
 832040 
(31 rows)

Query 20200810_212808_00020_iampk, FINISHED, 1 node
Splits: 685 total, 685 done (100.00%)
34.90 [0 rows, 0B] [0 rows/s, 0B/s]

And now, even when I change the 1000000 limit to 10, it still takes ~35s, since the plan is still expanded to the max depth property regardless of the predicate.
I don't know offhand how to improve this, unless we invent a place to store intermediate data; but at the very least we should document the limitation very carefully.

martint (Member) commented Aug 10, 2020

Indeed. That's why the feature is experimental :)

I wrote some details some time ago on what we need to implement it efficiently: #1122 (comment)

Basically, it boils down to:

  • Temporary tables/storage
  • Multi-step query plans, with strict execution sequencing

There are also probably some opportunities for algebraic optimization of recursive plans to remove the recursion altogether. Something to investigate.
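One possible example of such a rewrite (a sketch, not a committed design): a recursive CTE that merely enumerates integers produces the same rows as a recursion-free plan over sequence() and UNNEST:

WITH RECURSIVE t (n) AS (
  SELECT 1
  UNION ALL
  SELECT n + 1 FROM t WHERE n < 100
)
SELECT n FROM t;

-- recursion-free equivalent:
SELECT n FROM UNNEST(sequence(1, 100)) AS t (n);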

lhofhansl (Member, Author) commented

I see. I missed your comment in #1122 (even though I commented on that issue earlier myself).
Fully expanding the plan is nice for trying the feature out, at least.

dain (Member) commented Aug 13, 2020

Let's consolidate the discussion in #1122
