[Feature Request] Optimize common case: SELECT COUNT(*) FROM Table #1192

felipepessoto · 2022-06-10T21:30:02Z

Feature request

The query "SELECT COUNT(*) FROM Table" should read only the Delta log.

Overview

Running the query "SELECT COUNT(*) FROM Table" takes a long time on big tables: Spark scans all the Parquet files just to return the number of rows, even though that information is available from the Delta log.

The same applies to "SELECT COUNT(*) FROM Table GROUP BY PartitionColumn".
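To illustrate why the row count can come from the log alone, here is a minimal sketch, not Delta's implementation. It assumes the log is available as JSON commit lines whose `add` actions carry a JSON-encoded `stats` string with a `numRecords` field; checkpoints and deletion vectors are ignored. The function name `count_from_log` is hypothetical.

```python
import json

def count_from_log(log_lines):
    """Replay add/remove actions and sum numRecords over the live files.

    Returns None when any live file has no numRecords, meaning the count
    cannot be answered from metadata and the files would have to be scanned.
    """
    live = {}  # path -> numRecords (None when stats are missing)
    for line in log_lines:
        action = json.loads(line)
        if "add" in action:
            add = action["add"]
            stats = json.loads(add["stats"]) if add.get("stats") else {}
            live[add["path"]] = stats.get("numRecords")
        elif "remove" in action:
            live.pop(action["remove"]["path"], None)
    if any(n is None for n in live.values()):
        return None
    return sum(live.values())

# Toy log: two files added, one removed, one more added.
log = [
    json.dumps({"add": {"path": "a.parquet", "partitionValues": {},
                        "stats": json.dumps({"numRecords": 100})}}),
    json.dumps({"add": {"path": "b.parquet", "partitionValues": {},
                        "stats": json.dumps({"numRecords": 250})}}),
    json.dumps({"remove": {"path": "a.parquet"}}),
    json.dumps({"add": {"path": "c.parquet", "partitionValues": {},
                        "stats": json.dumps({"numRecords": 40})}}),
]

print(count_from_log(log))  # 290
```

Returning None instead of a partial sum keeps the sketch conservative: a file written without statistics forces the regular scan.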

Motivation

Huge performance overhead: a full scan of the data files is needed just to return a row count.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

Yes. I can contribute this feature independently.
Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
No. I cannot contribute this feature at this time.

felipepessoto · 2022-06-17T20:09:42Z

@zsxwing, you assigned this to @vkorukanti. Does that mean you plan to implement it?

vkorukanti · 2022-06-17T20:25:01Z

@felipepessoto No, I am not planning to implement it. @zsxwing assigned it to me so that I could look at it and provide guidance. We are currently busy with the next release of Delta and will get back to you after the release. I will assign this issue back to you, given that you are interested in implementing it.

felipepessoto · 2022-06-18T03:00:13Z

@vkorukanti do you have an example of code where the query plan is replaced by an optimized version? I think it would be a good starting point.

moredatapls · 2022-07-08T07:45:46Z

I can confirm that this bug exists and it is impacting me as well. The strange thing is: we have a large bronze table and an even larger silver table, and counting the data by partition is very fast in silver (it reads the metadata), but extremely slow in bronze (it scans all the files). Not sure what is happening here, but if some of the maintainers could provide some guidance on how to implement it I would be willing to help out as well.

felipepessoto · 2022-07-12T21:20:39Z

Hi @vkorukanti, I'm doing some experiments and I have two different approaches (very high level only; I'm not sure whether they are feasible). I'd like to hear your opinion on them.

First:

Change DeltaCatalog.loadTable to fill catalog stats. The relevant line is line 176 of delta/core/src/main/scala/org/apache/spark/sql/delta/catalog/DeltaCatalog.scala at commit ab0946e:

catalogTable = Some(v1.catalogTable),

Later, use the statistics to rewrite the query plan, replacing the Parquet file scan with a no-op, and return the count.
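The idea of this first approach can be sketched with a toy plan model. This is not Spark's Catalyst API: the classes `Scan`, `CountStar`, and `LocalRow` and the function `rewrite_count` are hypothetical names used only to show the shape of the rewrite, assuming the scan node carries a row count taken from catalog stats.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Scan:
    table: str
    num_rows: Optional[int]  # row count from catalog stats, None if unknown

@dataclass(frozen=True)
class CountStar:
    child: "Plan"  # COUNT(*) aggregate over its child plan

@dataclass(frozen=True)
class LocalRow:
    value: int  # a precomputed single-row result; no files are read

Plan = Union[Scan, CountStar, LocalRow]

def rewrite_count(plan):
    """Replace COUNT(*) over a bare scan with a constant when stats are known."""
    if (isinstance(plan, CountStar)
            and isinstance(plan.child, Scan)
            and plan.child.num_rows is not None):
        return LocalRow(plan.child.num_rows)
    return plan  # anything else is left untouched

print(rewrite_count(CountStar(Scan("events", 290))))   # LocalRow(value=290)
print(rewrite_count(CountStar(Scan("events", None))))  # plan unchanged
```

The rule only fires on the exact pattern it understands; a plan with unknown stats keeps its scan, so results stay correct even when metadata is incomplete.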

Second:

In DataSkippingReader, change filesForScan to skip files when the query is a SELECT COUNT and answer from stats only.
The advantage of doing this per file is that it also works when the COUNT is grouped by partition key. But I'm not sure how to get the query plan there.
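As a rough illustration of the per-file variant, the sketch below groups per-file row counts by a partition value. It assumes each live file is represented by its `add` action with `partitionValues` and a JSON-encoded `stats` string containing `numRecords`; `count_by_partition` is a hypothetical helper, not a Delta function.

```python
import json
from collections import defaultdict

def count_by_partition(add_actions, partition_column):
    """Sum numRecords per partition value using file-level stats only.

    Returns None if any file lacks numRecords, since the answer would
    then require reading that file.
    """
    counts = defaultdict(int)
    for add in add_actions:
        stats = json.loads(add["stats"]) if add.get("stats") else {}
        if "numRecords" not in stats:
            return None
        counts[add["partitionValues"][partition_column]] += stats["numRecords"]
    return dict(counts)

files = [
    {"path": "date=2022-07-01/a.parquet",
     "partitionValues": {"date": "2022-07-01"},
     "stats": json.dumps({"numRecords": 10})},
    {"path": "date=2022-07-01/b.parquet",
     "partitionValues": {"date": "2022-07-01"},
     "stats": json.dumps({"numRecords": 5})},
    {"path": "date=2022-07-02/c.parquet",
     "partitionValues": {"date": "2022-07-02"},
     "stats": json.dumps({"numRecords": 7})},
]

print(count_by_partition(files, "date"))
# {'2022-07-01': 15, '2022-07-02': 7}
```

Because every file belongs to exactly one partition, the grouping needs nothing beyond the metadata already attached to each file.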

felipepessoto · 2022-07-24T19:20:44Z

I started working on the first option and have a PoC working.

scottsand-db · 2022-08-10T02:50:57Z

@felipepessoto - awesome! Can you post a PR?

felipepessoto · 2022-09-12T18:38:02Z

@scottsand-db, PR is published: #1377

Tom-Newton · 2022-11-21T13:11:27Z

Thanks for working on this @felipepessoto. Is there any chance of subsequently expanding this to work when there are filters on partition columns?

felipepessoto · 2022-11-22T00:27:01Z

@Tom-Newton yes, I plan to continue improving it. I'd like to finish this first step to lay the foundation, and then expand the scenarios.

## Description

Follow up of #1192, which optimizes COUNT. This PR adds support for MIN/MAX as well.

Fix #2092

Created additional unit tests to cover MIN/MAX.

## Does this PR introduce _any_ user-facing changes?

Only performance improvement

Closes #1525

Signed-off-by: vkorukanti <venki.korukanti@gmail.com>
GitOrigin-RevId: 9b88f76bf99cc38bd4cf9d3397b7bb8ade822d0b
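A minimal sketch of how MIN/MAX could be derived from file-level statistics, under the assumption that each file's `stats` string carries `minValues` and `maxValues` maps keyed by column name. It is not the code from the PR: `min_max_from_stats` is a hypothetical name, only numeric columns are considered, and any file without both values (for example, a file where the column is entirely null) makes the sketch give up rather than guess.

```python
import json

def min_max_from_stats(add_actions, column):
    """Return (min, max) for a numeric column from per-file stats, or None."""
    mins, maxs = [], []
    for add in add_actions:
        stats = json.loads(add["stats"]) if add.get("stats") else {}
        lo = stats.get("minValues", {}).get(column)
        hi = stats.get("maxValues", {}).get(column)
        if lo is None or hi is None:
            return None  # metadata is incomplete; a scan would be needed
        mins.append(lo)
        maxs.append(hi)
    if not mins:
        return None  # no files, so no values to aggregate
    return min(mins), max(maxs)

files = [
    {"path": "a.parquet",
     "stats": json.dumps({"numRecords": 10,
                          "minValues": {"id": 1}, "maxValues": {"id": 10}})},
    {"path": "b.parquet",
     "stats": json.dumps({"numRecords": 8,
                          "minValues": {"id": 5}, "maxValues": {"id": 42}})},
]

print(min_max_from_stats(files, "id"))  # (1, 42)
```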

felipepessoto added the enhancement New feature or request label Jun 10, 2022

zsxwing assigned vkorukanti Jun 14, 2022

vkorukanti assigned felipepessoto Jun 17, 2022

scottsand-db self-assigned this Jul 26, 2022

felipepessoto mentioned this issue Sep 12, 2022

Optimize common case: SELECT COUNT(*) FROM Table Fix #1192 #1377

Closed

allisonport-db closed this as completed in 0c349da Nov 28, 2022

zsxwing added this to the 2.2.0 milestone Nov 29, 2022

felipepessoto mentioned this issue Dec 17, 2022

Optimize Min/Max using Delta metadata #1525

Closed

keen85 mentioned this issue Jul 17, 2023

[Feature Request] optimize COUNT(*) on partitioned tables #1916

Open

8 tasks

felipepessoto mentioned this issue Sep 22, 2023

[Feature Request][Spark] Optimize Min/Max using Delta metadata #2092

Closed

8 tasks

felipepessoto mentioned this issue Jan 31, 2024

[Feature Request][Spark][WIP] Metadata only queries - Umbrella issue #2589

Open

8 tasks
