Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for LIMIT push down during query planning to reduce number of files scanned #1495

Closed
wants to merge 1 commit into from

Conversation

scottsand-db
Copy link
Collaborator

Description

This PR adds support for limit pushdown, where we will "push down" any LIMITs during query planning so that we scan the minimum number of files necessary.

How was this patch tested?

New test suite.

Does this PR introduce any user-facing changes?

No.


override def close(): Unit = {
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra line

@tdas
Copy link
Contributor

tdas commented Nov 22, 2022

Please make the title better. LimitPushdown into what?

@scottsand-db scottsand-db changed the title Add support for LimitPushDown Add support for LIMIT push down during query planning to reduce number of files scanned Nov 22, 2022
Copy link
Contributor

@tdas tdas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@vkorukanti vkorukanti added this to the 2.2.0 milestone Dec 5, 2022
@ciurlaro42
Copy link

@scottsand-db Is OFFSET also affected by this change? It seems plausible for optimization to occur, as files without the required data could be skipped.

For example, if there were two files with 1000 rows each, and an attempt was made to read them with OFFSET 1001, the first file could be easily skipped using the metadata information.

@zsxwing
Copy link
Member

zsxwing commented Apr 21, 2023

Good question. However, using OFFSET on a Delta table is an undefined behavior. There is no ordering in a Delta table, hence using OFFSET on a Delta table may get non-deterministic results.

@ciurlaro42
Copy link

@zsxwing What about z-ordered tables? I would assume that's a fairly common optimisation

@camlee
Copy link

camlee commented Nov 18, 2023

I'm wondering about a possible enhancement on top of this limit pushdown support. If the query specifies an ORDER BY, that isn't currently pushed down: the whole table is scanned. On either a partitioned or z-ordered table, couldn't some data skipping kick in? Using file statistics, could delta pick which partitions or files to check first and stop when enough records to satisfy the LIMIT are found? Or otherwise prune out substantial amounts of data. I'm wondering if this would make sense as a new Feature Request or if it's technically very challenging or something like that.
I think this is a common use case, for example, selecting the oldest or most recent records from a table that is partitioned on day or week or z-ordered on the timestamp column.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants