Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Optimize common case: SELECT COUNT(*) FROM Table #1192

Closed
1 of 3 tasks
felipepessoto opened this issue Jun 10, 2022 · 10 comments
Closed
1 of 3 tasks
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@felipepessoto
Copy link
Contributor

Feature request

Running the query "SELECT COUNT(*) FROM Table" should read only Delta logs

Overview

Running the query "SELECT COUNT(*) FROM Table" takes a lot of time for big tables, Spark scan all the parquet files just to return the number of rows, that information is available from Delta Logs.

The same for "SELECT COUNT(*) FROM Table Group BY PartitionColumn"

Motivation

Huge performance overhead.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.
@felipepessoto felipepessoto added the enhancement New feature or request label Jun 10, 2022
@felipepessoto
Copy link
Contributor Author

@zsxwing, you assigned @vkorukanti, it means you plan to implement it?

@vkorukanti
Copy link
Collaborator

@felipepessoto No, I am not planning to implement. @zsxwing assigned it to me to look at it and provide any guidance. We are currently busy with the next release of Delta, will get back to you after the release? I will assign this issue back to you, given you are interested in implementing it.

@felipepessoto
Copy link
Contributor Author

@vkorukanti do you have any example of code where the query plan is replaced by a optimized version? I think it would be a good start point.

@moredatapls
Copy link
Contributor

I can confirm that this bug exists and it is impacting me as well. The strange thing is: we have a large bronze table and an even larger silver table, and counting the data by partition is very fast in silver (it reads the metadata), but extremely slow in bronze (it scans all the files). Not sure what is happening here, but if some of the maintainers could provide some guidance on how to implement it I would be willing to help out as well.

@felipepessoto
Copy link
Contributor Author

Hi @vkorukanti, I'm doing some experiments and I have two different approaches (it is very high level only, I not sure if they are feasible), I'd like to hear your opinion on it.

First:

Second:

  • In DataSkippingReader, change filesForScan to skip files if the query is a SELECT COUNT and use stats only.
  • The pros of doing this by file is I can also use it when the COUNT is by PartitionKey. But I'm not sure how to get the query plan here.

@felipepessoto
Copy link
Contributor Author

I started working on option #1. Have a PoC working

@scottsand-db scottsand-db self-assigned this Jul 26, 2022
@scottsand-db
Copy link
Collaborator

@felipepessoto - awesome! Can you post a PR?

@felipepessoto
Copy link
Contributor Author

felipepessoto commented Sep 12, 2022

@scottsand-db, PR is published: #1377

@Tom-Newton
Copy link
Contributor

Thanks for working on this @felipepessoto. Is there any chance of subsequently expanding this to work when there are filters on partition columns.

@felipepessoto
Copy link
Contributor Author

@Tom-Newton yes, I plan to continue improving it. I'd like to first finish this first step to have the foundation, and then expand the scenarios.

@zsxwing zsxwing added this to the 2.2.0 milestone Nov 29, 2022
vkorukanti pushed a commit that referenced this issue Jan 9, 2024
## Description

Follow up of #1192, which optimizes COUNT. This PR adds support for MIN/MAX as well.

Fix #2092

Created additional unit tests to cover MIN/MAX.

## Does this PR introduce _any_ user-facing changes?

Only performance improvement

Closes #1525

Signed-off-by: vkorukanti <venki.korukanti@gmail.com>
GitOrigin-RevId: 9b88f76bf99cc38bd4cf9d3397b7bb8ade822d0b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants