feat: histogram aggregate function #14839

suimenno3002 · 2024-03-05T02:25:39Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Feat Feature: histogram aggregate function #14588

This draft PR exists only to show the task breakdown and current progress and to exchange ideas. The code quality is currently low; I will bring it to a reviewable state today.

Tests

Unit Test
Logic Test

Type of change

New Feature (non-breaking change which adds functionality)

Progress

Migrate Doris's equal-height histogram implementation and unit tests
Clean up the code
Ensure license compliance for the code taken from Doris
Histogram aggregation for the types covered by RangeIndex::supported_type
[-] move the equal-weight histogram code into a common path
[-] discussion on the output format of aggregation results
Add test cases for the aggregate function and complete testing
Make histograms of date, date_time, and decimal values readable
Docs for the histogram aggregate function

For Reviewers

The implementation is based on Doris's histogram function, so these may be helpful:

WIP

src/query/functions/src/aggregates/aggregate_histogram.rs

suimenno3002 · 2024-03-06T21:24:44Z

Is there any way to convert a ScalarRef to a &Scalar?

sundy-li · 2024-03-07T01:31:57Z

Is there any way to convert a ScalarRef to a &Scalar?

to_owned
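
A minimal sketch of what the `to_owned` suggestion implies, using simplified stand-in types rather than Databend's actual `Scalar` and `ScalarRef` definitions: a borrowed view cannot be reinterpreted as a reference to the owned type, but it can be copied into an owned value, which can then be borrowed as `&Scalar`. The enum variants and the `takes_scalar` helper are hypothetical.

```rust
// Simplified stand-ins for illustration; NOT Databend's real definitions.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int(i64),
    Text(String),
}

// Borrowed counterpart: holds references instead of owned data.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ScalarRef<'a> {
    Int(i64),
    Text(&'a str),
}

impl<'a> ScalarRef<'a> {
    // Copies the borrowed data into a new owned value.
    fn to_owned(&self) -> Scalar {
        match self {
            ScalarRef::Int(v) => Scalar::Int(*v),
            ScalarRef::Text(s) => Scalar::Text(s.to_string()),
        }
    }
}

// Hypothetical consumer that requires `&Scalar`.
fn takes_scalar(s: &Scalar) -> String {
    format!("{:?}", s)
}

fn main() {
    let backing = String::from("abc");
    let refs = [ScalarRef::Int(7), ScalarRef::Text(&backing)];
    for r in refs {
        // Own first, then borrow the owned value.
        let owned = r.to_owned();
        println!("{}", takes_scalar(&owned));
    }
}
```

The conversion allocates for string-like variants, so it is best kept out of hot loops where the borrowed form suffices.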

src/query/functions/src/aggregates/aggregate_histogram.rs

suimenno3002 · 2024-03-10T17:34:18Z

Is there any way to format a Scalar so that it is readable? I don't think either Scalar or ScalarRef implements Display. Or maybe I should change the histogram's output format from a one-line string to multiple lines, one flat bucket per line.
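
One possible direction, sketched on a simplified stand-in enum (not Databend's actual `Scalar`, whose variants and existing formatting helpers are not shown in this thread): implement `std::fmt::Display` so each variant renders as plain text instead of its debug form. The variant set here is an assumption for illustration only.

```rust
use std::fmt;

// Simplified stand-in for illustration; NOT Databend's real `Scalar`.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int(i64),
    Text(String),
    Null,
}

impl fmt::Display for Scalar {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Render the inner value without the variant wrapper.
            Scalar::Int(v) => write!(f, "{}", v),
            Scalar::Text(s) => write!(f, "{}", s),
            Scalar::Null => write!(f, "NULL"),
        }
    }
}

fn main() {
    let bounds = [
        Scalar::Int(1),
        Scalar::Text("2024-03-10".to_string()),
        Scalar::Null,
    ];
    for b in &bounds {
        // Compare the debug form with the readable form.
        println!("debug: {:?}  display: {}", b, b);
    }
}
```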

src/query/functions/src/aggregates/aggregate_histogram.rs

src/query/sql/src/planner/semantic/type_check.rs

suimenno3002 · 2024-03-14T23:30:11Z

= =, I found some small improvements. Should I submit a new PR directly or open an issue?

diff --git a/src/query/functions/src/aggregates/aggregate_histogram.rs b/src/query/functions/src/aggregates/aggregate_histogram.rs
index b1935c0a1d..7c0e2c4169 100644
--- a/src/query/functions/src/aggregates/aggregate_histogram.rs
+++ b/src/query/functions/src/aggregates/aggregate_histogram.rs
@@ -145,8 +145,8 @@ where
             &buckets
                 .drain(..)
                 .map(|raw| Bucket {
-                    lower: format_scalar(raw.lower.clone()),
-                    upper: format_scalar(raw.upper.clone()),
+                    lower: format_scalar(raw.lower),
+                    upper: format_scalar(raw.upper),
                     ndv: raw.ndv,
                     count: raw.count,
                     pre_sum: raw.pre_sum,
@@ -352,7 +352,7 @@ fn calculate_bucket_max_values<T: Ord>(value_map: &BTreeMap<T, u64>, num_buckets
     // Assume that the value map is not empty
     debug_assert!(!value_map.is_empty());
 
-    // Calculate the total number of values in the map using std::accumulate()
+    // Calculate the total number of values in the map
     let total_values = value_map.values().sum();
 
     // If there is only one bucket, then all values will be assigned to that bucket
diff --git a/src/query/sql/src/planner/semantic/type_check.rs b/src/query/sql/src/planner/semantic/type_check.rs
index c691f9a450..d11fe9b734 100644
--- a/src/query/sql/src/planner/semantic/type_check.rs
+++ b/src/query/sql/src/planner/semantic/type_check.rs
@@ -1726,7 +1726,7 @@ impl<'a> TypeChecker<'a> {
             } && arg_types[1].is_integer();
             if !is_positive_integer {
                 return Err(ErrorCode::SemanticError(
-                    "The delimiter of `histogram` must be a constant positive int",
+                    "The max_num_buckets of `histogram` must be a constant positive int",
                 ));
             }
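
The first two hunks can be illustrated with a self-contained sketch. The `RawBucket`, `Bucket`, `format_value`, `convert`, and `total_values` names below are hypothetical stand-ins, not the code from `aggregate_histogram.rs`. `Vec::drain(..)` yields each element by value, so its fields can be moved into the new struct without `.clone()`, and `Iterator::sum` totals the per-value counts of a `BTreeMap`.

```rust
use std::collections::BTreeMap;

// Hypothetical intermediate bucket with owned bounds.
struct RawBucket {
    lower: String,
    upper: String,
    count: u64,
}

// Hypothetical output bucket with formatted bounds.
#[derive(Debug, PartialEq)]
struct Bucket {
    lower: String,
    upper: String,
    count: u64,
}

// Stand-in formatter that takes its argument by value.
fn format_value(v: String) -> String {
    format!("[{}]", v)
}

// `drain(..)` hands out owned `RawBucket`s, so fields are moved, not cloned.
fn convert(buckets: &mut Vec<RawBucket>) -> Vec<Bucket> {
    buckets
        .drain(..)
        .map(|raw| Bucket {
            lower: format_value(raw.lower),
            upper: format_value(raw.upper),
            count: raw.count,
        })
        .collect()
}

// Total number of values: the sum of the per-key counts.
fn total_values<T: Ord>(value_map: &BTreeMap<T, u64>) -> u64 {
    value_map.values().sum()
}

fn main() {
    let mut raw = vec![RawBucket {
        lower: "1".to_string(),
        upper: "5".to_string(),
        count: 3,
    }];
    let out = convert(&mut raw);
    println!("{:?}, drained input is empty: {}", out, raw.is_empty());

    let mut value_map = BTreeMap::new();
    value_map.insert(1, 2u64);
    value_map.insert(4, 1u64);
    println!("total = {}", total_values(&value_map));
}
```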

BohuTANG · 2024-03-15T00:40:04Z

Cool, just a new PR is fine.

github-actions bot added the pr-feature label (this PR introduces a new feature to the codebase) Mar 5, 2024

suimenno3002 force-pushed the feat/histogram-aggregate-function branch from d1ac941 to 31e8547 Compare March 5, 2024 02:54

sundy-li reviewed Mar 5, 2024

src/query/functions/src/aggregates/aggregate_histogram.rs (outdated)

suimenno3002 force-pushed the feat/histogram-aggregate-function branch 3 times, most recently from 9441e43 to fba1126 Compare March 6, 2024 21:14

suimenno3002 force-pushed the feat/histogram-aggregate-function branch from fba1126 to 4675297 Compare March 6, 2024 21:26

sundy-li reviewed Mar 7, 2024

src/query/functions/src/aggregates/aggregate_histogram.rs

suimenno3002 force-pushed the feat/histogram-aggregate-function branch 2 times, most recently from 918b413 to 02daad4 Compare March 8, 2024 21:51

sundy-li reviewed Mar 11, 2024

src/query/functions/src/aggregates/aggregate_histogram.rs (outdated)

suimenno3002 force-pushed the feat/histogram-aggregate-function branch 4 times, most recently from a7ff579 to 1667ad2 Compare March 12, 2024 20:48

suimenno3002 marked this pull request as ready for review March 13, 2024 14:52

sundy-li approved these changes Mar 14, 2024

sundy-li requested a review from ariesdevil March 14, 2024 01:10

ariesdevil reviewed Mar 14, 2024

src/query/sql/src/planner/semantic/type_check.rs (outdated)

feat: histogram aggregate function

363af45

suimenno3002 force-pushed the feat/histogram-aggregate-function branch from 1667ad2 to 363af45 Compare March 14, 2024 21:32

sundy-li added this pull request to the merge queue Mar 14, 2024

Merged via the queue into databendlabs:main with commit 9cbe53d Mar 14, 2024
75 of 76 checks passed

suimenno3002 mentioned this pull request Mar 14, 2024

add documents for histogram aggregate functions databendlabs/databend-docs#636

Closed

suimenno3002 mentioned this pull request Mar 16, 2024

chore: improve histogram‘s implement and comments #14976

Merged

11 tasks

This was referenced Jul 18, 2024

Link Checker Report databendlabs/databend-docs#978

Closed

Link Checker Report databendlabs/databend-docs#991

Closed

wzslr321 mentioned this pull request Aug 17, 2024

Feature: histogram aggregate function #14588

Closed
