[Feature Request] Optimize common case: SELECT COUNT(*) FROM Table #1192
Comments
@zsxwing, you assigned this to @vkorukanti. Does that mean you plan to implement it?
@felipepessoto No, I am not planning to implement it. @zsxwing assigned it to me to look at it and provide guidance. We are currently busy with the next release of Delta and will get back to you after the release. I will assign this issue back to you, given that you are interested in implementing it.
@vkorukanti do you have any example of code where the query plan is replaced by an optimized version? I think it would be a good starting point.
I can confirm that this bug exists and it is impacting me as well. The strange thing is: we have a large bronze table and an even larger silver table, and counting the data by partition is very fast in silver (it reads the metadata), but extremely slow in bronze (it scans all the files). Not sure what is happening here, but if some of the maintainers could provide some guidance on how to implement it I would be willing to help out as well.
Hi @vkorukanti, I'm doing some experiments and have two different approaches. They are very high level only and I'm not sure they are feasible, so I'd like to hear your opinion on them. First:
Second:
I started working on option #1 and have a PoC working.
@felipepessoto - awesome! Can you post a PR?
@scottsand-db, PR is published: #1377
Thanks for working on this @felipepessoto. Is there any chance of subsequently expanding this to work when there are filters on partition columns?
@Tom-Newton yes, I plan to continue improving it. I'd like to finish this first step to have the foundation, and then expand the scenarios.
## Description

Follow up of #1192, which optimizes COUNT. This PR adds support for MIN/MAX as well.

Fix #2092

Created additional unit tests to cover MIN/MAX.

## Does this PR introduce _any_ user-facing changes?

Only performance improvement

Closes #1525

Signed-off-by: vkorukanti <venki.korukanti@gmail.com>
GitOrigin-RevId: 9b88f76bf99cc38bd4cf9d3397b7bb8ade822d0b
Feature request
Running the query "SELECT COUNT(*) FROM Table" should read only Delta logs
Overview
Running the query "SELECT COUNT(*) FROM Table" takes a long time on big tables: Spark scans all the Parquet files just to return the number of rows, even though that information is available from the Delta logs.
The same applies to "SELECT COUNT(*) FROM Table GROUP BY PartitionColumn".
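
To illustrate why the log alone can answer both queries, here is a minimal Python sketch. It is not Delta's implementation. It assumes the commit files are available as JSON lines, that each `add` action carries the optional `stats` string with a `numRecords` field, and it ignores Parquet checkpoints entirely. The function names `live_add_actions` and `count_from_log` are hypothetical.

```python
import json
from collections import defaultdict


def live_add_actions(commit_lines):
    """Replay add/remove actions in commit order; return live files keyed by path."""
    live = {}
    for line in commit_lines:
        action = json.loads(line)
        if "add" in action:
            live[action["add"]["path"]] = action["add"]
        elif "remove" in action:
            live.pop(action["remove"]["path"], None)
    return live


def count_from_log(commit_lines, partition_column=None):
    """Sum numRecords over live files, optionally grouped by one partition column.

    Returns None when any live file lacks numRecords, because per-file stats
    are optional and the caller would then have to fall back to a file scan.
    """
    counts = defaultdict(int)
    for add in live_add_actions(commit_lines).values():
        stats = json.loads(add["stats"]) if add.get("stats") else {}
        if "numRecords" not in stats:
            return None
        key = None
        if partition_column is not None:
            key = add.get("partitionValues", {}).get(partition_column)
        counts[key] += stats["numRecords"]
    if partition_column is None:
        return counts[None]
    return dict(counts)


def _add(path, date, num_records):
    # Hypothetical helper that builds one "add" action line for the sample log.
    return json.dumps({"add": {
        "path": path,
        "partitionValues": {"date": date},
        "stats": json.dumps({"numRecords": num_records}),
    }})


sample_log = [
    _add("date=2024-01-01/a.parquet", "2024-01-01", 100),
    _add("date=2024-01-02/b.parquet", "2024-01-02", 50),
    _add("date=2024-01-02/c.parquet", "2024-01-02", 25),
    json.dumps({"remove": {"path": "date=2024-01-02/b.parquet"}}),
]

total = count_from_log(sample_log)
by_partition = count_from_log(sample_log, partition_column="date")
```

Returning `None` when statistics are missing keeps the shortcut safe: a partial sum would silently undercount, so the caller has to scan the files in that case.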
Motivation
Huge performance overhead.
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?