Support ANALYZE stats composed of expressions #14222

Closed

findepi (Member) commented Sep 20, 2022

A connector may ask the engine to collect anything defined by the ColumnStatisticType SPI enum. This is convenient, but sometimes a connector needs to provide its own way of calculating statistics.

For example, Iceberg statistics include

apache-datasketches-theta-v1 blob type

A serialized form of a "compact" Theta sketch produced by the Apache DataSketches library. The sketch is obtained by constructing Alpha family sketch with default seed, and feeding it with individual distinct values converted to bytes using Iceberg's single-value serialization.

This has two components which are not supported today:

  • a new data sketch with a specific configuration (so that results can be shared with different query engines)
  • well-defined input pre-processing that relies on existing Iceberg concepts, which are alien to the Trino engine

This PR addresses the second limitation, building on top of #14220.
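
To make the second limitation concrete, here is a minimal sketch of connector-defined pre-processing expressed as a projection that runs before a generic aggregation. It is not Trino SPI code: the class and method names (`ProjectedDistinctStats`, `serializeLong`, `serializeString`, `distinctCount`) are hypothetical, the exact `HashSet` is only a stand-in for the Theta sketch, and the byte layouts (8-byte little-endian for `long`, UTF-8 bytes for strings) are what Iceberg's single-value serialization is assumed to specify for those types.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration only; none of these names exist in Trino or Iceberg.
class ProjectedDistinctStats {
    // Projection for long values: assumed to be 8 bytes, little-endian.
    static byte[] serializeLong(long value) {
        return ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(value)
                .array();
    }

    // Projection for string values: assumed to be plain UTF-8 bytes.
    static byte[] serializeString(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    // Stand-in aggregation: an exact distinct count over the projected bytes.
    // A real implementation would feed the bytes into a Theta sketch instead.
    static <T> long distinctCount(List<T> column, Function<T, byte[]> projection) {
        Set<ByteBuffer> seen = new HashSet<>();
        for (T value : column) {
            if (value != null) { // nulls do not contribute to distinct values
                // ByteBuffer compares by content, unlike a raw byte[].
                seen.add(ByteBuffer.wrap(projection.apply(value)));
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<Long> ids = Arrays.asList(1L, 2L, 2L, null, 3L);
        System.out.println(distinctCount(ids, ProjectedDistinctStats::serializeLong));

        List<String> names = Arrays.asList("a", "b", "a");
        System.out.println(distinctCount(names, ProjectedDistinctStats::serializeString));
    }
}
```

The point of the split is that the aggregation only ever sees bytes, so the connector-specific part is confined to the projection.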

@cla-bot cla-bot bot added the cla-signed label Sep 20, 2022
@findepi findepi marked this pull request as draft September 20, 2022 21:25
@findepi findepi force-pushed the findepi/arbitrary-stats-composable branch from 51d1904 to 0c07bc8 Compare September 21, 2022 12:22
@findepi findepi force-pushed the findepi/arbitrary-stats-composable branch from 0c07bc8 to ef71f43 Compare September 21, 2022 13:01
@findepi findepi force-pushed the findepi/arbitrary-stats-composable branch from ef71f43 to 1d4f9ee Compare September 23, 2022 12:56
Reduce code noise by moving Symbol -> SymbolReference conversion to the
construction method.

This allows a connector to compose statistics collection from an
aggregation function and a projection, reducing the need for specialized
aggregation functions just for stats collection. For example,
`$max_data_size_for_stats` and `$sum_data_size_for_stats` can now be
decomposed as a simple scalar plus either max or sum aggregation.
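
The decomposition described in the commit message can be sketched roughly as follows. This is plain Java over an in-memory list, not planner or SPI code: `ComposedDataSizeStats`, `dataSize`, `projectSizes`, `maxDataSize`, and `sumDataSize` are hypothetical names, and approximating a value's data size by its UTF-8 byte length is an assumption made only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.OptionalLong;
import java.util.stream.LongStream;

// Hypothetical illustration only; not the Trino implementation.
class ComposedDataSizeStats {
    // Scalar projection: size of one value (assumed here to be UTF-8 byte length).
    static long dataSize(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    // Shared projection step applied to every non-null value in the column.
    static LongStream projectSizes(List<String> column) {
        return column.stream()
                .filter(Objects::nonNull)
                .mapToLong(ComposedDataSizeStats::dataSize);
    }

    // "max data size" = generic max aggregation over the projection.
    static OptionalLong maxDataSize(List<String> column) {
        return projectSizes(column).max();
    }

    // "sum data size" = generic sum aggregation over the same projection.
    static long sumDataSize(List<String> column) {
        return projectSizes(column).sum();
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("a", "bcd", null, "hello");
        System.out.println(maxDataSize(column));
        System.out.println(sumDataSize(column));
    }
}
```

Both statistics share one scalar and differ only in the aggregation, which is why no dedicated aggregation function is needed for either.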

mosabua (Member) commented Jan 12, 2024

I assume this is still in progress @findepi ...

findepi (Member, Author) commented Jan 31, 2024

currently not needed.

@findepi findepi closed this Jan 31, 2024
@findepi findepi deleted the findepi/arbitrary-stats-composable branch January 31, 2024 16:24