You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
With #11011, trino aggregation operator partial step can be switched off if, for the rows processed so far, it did not reduce the number of rows much (most of the input rows are distinct).
If this happens, but the rows that are yet to be processed have a different distribution i.e. a small number of distinct values, we want the partial aggregation step to be re-enabled.
One idea of how to implement this is to calculate or estimate (e.g. using hyper log log) the number of distinct values in the split once in a while (possibly with exponential backoff for the window between calculations), and enable partial aggregation again if the number of distinct values to input position count is low for the given split.
Another idea, that may not be doable but I will just put it here, is that if we had per split statistics of the number of distinct values per column (with the correlation between column stats in a perfect world), we could decide to enable or disable partial aggregation on per split basis.
Parquet format has support for per column chunk and per page distinct_count but I suspect it's not populated in most real-life scenarios
The text was updated successfully, but these errors were encountered:
lukasz-stec
changed the title
Support enabling partial aggregation adaptively
Support re-enabling partial aggregation adaptively
Mar 8, 2022
With #11011, trino aggregation operator partial step can be switched off if, for the rows processed so far, it did not reduce the number of rows much (most of the input rows are distinct).
If this happens, but the rows that are yet to be processed have a different distribution i.e. a small number of distinct values, we want the partial aggregation step to be re-enabled.
One idea of how to implement this is to calculate or estimate (e.g. using hyper log log) the number of distinct values in the split once in a while (possibly with exponential backoff for the window between calculations), and enable partial aggregation again if the number of distinct values to input position count is low for the given split.
Another idea, that may not be doable but I will just put it here, is that if we had per split statistics of the number of distinct values per column (with the correlation between column stats in a perfect world), we could decide to enable or disable partial aggregation on per split basis.
Parquet format has support for per column chunk and per page
distinct_count
but I suspect it's not populated in most real-life scenariosThe text was updated successfully, but these errors were encountered: