feat: histogram aggregate function #14839

Merged

Conversation

suimenno3002
Contributor

@suimenno3002 suimenno3002 commented Mar 5, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary


This draft PR exists only to show the task breakdown and current progress, and to exchange ideas. The code quality is currently low; I will bring it to a reviewable state today.

Tests

  • Unit Test
  • Logic Test

Type of change

  • New Feature (non-breaking change which adds functionality)

Progress

  • Migrate Doris's equal-height histogram implementation and unit tests
  • Clean up the code
  • Ensure license compliance for the code taken from Doris
  • Histogram aggregation for types covered by RangeIndex::supported_type
  • [-] Move the equal-weight histogram code into a common path
  • [-] Discuss the output format of aggregation results
  • Add test cases for the aggregate function and complete testing
  • Make histograms of date, date_time, and decimal values readable
  • Write docs for the histogram aggregate function
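
To make the term "equal-height histogram" in the list above concrete, here is a minimal sketch in Rust. It is not the code from this PR or from Doris: the function name `equal_height_buckets` and the greedy splitting rule are assumptions for illustration. Only the field names (`lower`, `upper`, `ndv`, `count`, `pre_sum`) and the `BTreeMap<T, u64>` input are borrowed from the diff later in this thread.

```rust
use std::collections::BTreeMap;

/// Illustrative bucket; field names follow the diff in this thread,
/// but the struct itself is a hypothetical stand-in.
#[derive(Debug, PartialEq)]
struct Bucket<T> {
    lower: T,     // smallest value in the bucket
    upper: T,     // largest value in the bucket
    ndv: u64,     // number of distinct values in the bucket
    count: u64,   // number of rows in the bucket
    pre_sum: u64, // number of rows in all earlier buckets
}

/// Hypothetical greedy splitter: walk the values in sorted order and close
/// a bucket once it holds at least ceil(total / num_buckets) rows.
fn equal_height_buckets<T: Ord + Clone>(
    value_map: &BTreeMap<T, u64>,
    num_buckets: u64,
) -> Vec<Bucket<T>> {
    if value_map.is_empty() || num_buckets == 0 {
        return Vec::new();
    }
    let total: u64 = value_map.values().sum();
    // Target height per bucket, rounded up.
    let target = (total + num_buckets - 1) / num_buckets;

    let mut buckets = Vec::new();
    let mut current: Option<Bucket<T>> = None;
    let mut pre_sum = 0u64;

    for (value, &n) in value_map {
        // Open a new bucket at this value if none is in progress.
        let bucket = current.get_or_insert_with(|| Bucket {
            lower: value.clone(),
            upper: value.clone(),
            ndv: 0,
            count: 0,
            pre_sum,
        });
        bucket.upper = value.clone();
        bucket.ndv += 1;
        bucket.count += n;
        if bucket.count >= target {
            pre_sum += bucket.count;
            buckets.push(current.take().unwrap());
        }
    }
    // Keep the partially filled last bucket, if any.
    if let Some(last) = current {
        buckets.push(last);
    }
    buckets
}
```

With six distinct values that each occur once and `num_buckets = 3`, this sketch yields three buckets of two rows each, with `pre_sum` values 0, 2, and 4. A heavily repeated value can still make one bucket taller than the target, since a single value is never split across buckets.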

For Reviewers

The implementation is based on Doris's histogram function, so these may be helpful:

WIP



@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Mar 5, 2024
@suimenno3002 suimenno3002 force-pushed the feat/histogram-aggregate-function branch from d1ac941 to 31e8547 Compare March 5, 2024 02:54
@suimenno3002 suimenno3002 force-pushed the feat/histogram-aggregate-function branch 3 times, most recently from 9441e43 to fba1126 Compare March 6, 2024 21:14
@suimenno3002
Contributor Author

Is there any way to convert a ScalarRef to a &Scalar?

@suimenno3002 suimenno3002 force-pushed the feat/histogram-aggregate-function branch from fba1126 to 4675297 Compare March 6, 2024 21:26
@sundy-li
Member

sundy-li commented Mar 7, 2024

Is there any way to convert a ScalarRef to a &Scalar?

to_owned
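
A minimal sketch of what this answer suggests, using toy types: the `Scalar` and `ScalarRef` enums below are simplified stand-ins, not Databend's actual definitions, and `describe` is a hypothetical consumer. The idea is that a borrowed view cannot be cast to a reference to the owned type, but it can be converted into an owned value, which can then be borrowed as `&Scalar`.

```rust
/// Toy owned value (assumption: NOT Databend's real `Scalar`).
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int(i64),
    Str(String),
}

/// Toy borrowed view over the same data (assumption: NOT Databend's real `ScalarRef`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ScalarRef<'a> {
    Int(i64),
    Str(&'a str),
}

impl<'a> ScalarRef<'a> {
    /// Build an owned `Scalar` by copying the borrowed data.
    fn to_owned(&self) -> Scalar {
        match self {
            ScalarRef::Int(v) => Scalar::Int(*v),
            ScalarRef::Str(s) => Scalar::Str(s.to_string()),
        }
    }
}

/// Hypothetical function that only accepts `&Scalar`.
fn describe(s: &Scalar) -> String {
    format!("{:?}", s)
}

/// Convert first, then borrow the owned value for the call.
fn describe_ref(r: ScalarRef<'_>) -> String {
    let owned = r.to_owned();
    describe(&owned)
}
```

The conversion allocates for string-like variants, so it is best kept out of per-row hot paths where possible.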

@suimenno3002 suimenno3002 force-pushed the feat/histogram-aggregate-function branch 2 times, most recently from 918b413 to 02daad4 Compare March 8, 2024 21:51
@suimenno3002
Contributor Author

suimenno3002 commented Mar 10, 2024

Is there any way to format a Scalar so that it is readable? I think neither Scalar nor ScalarRef implements Display. Or maybe I should change the histogram's output format from a single-line string to multiple lines, one flat bucket per line.

@suimenno3002 suimenno3002 force-pushed the feat/histogram-aggregate-function branch 4 times, most recently from a7ff579 to 1667ad2 Compare March 12, 2024 20:48
@suimenno3002 suimenno3002 marked this pull request as ready for review March 13, 2024 14:52
@sundy-li sundy-li requested a review from ariesdevil March 14, 2024 01:10
@suimenno3002 suimenno3002 force-pushed the feat/histogram-aggregate-function branch from 1667ad2 to 363af45 Compare March 14, 2024 21:32
@sundy-li sundy-li added this pull request to the merge queue Mar 14, 2024
Merged via the queue into databendlabs:main with commit 9cbe53d Mar 14, 2024
75 of 76 checks passed
@suimenno3002
Contributor Author

= =, I found some small improvements. Should I submit a new PR directly or open an issue?

diff --git a/src/query/functions/src/aggregates/aggregate_histogram.rs b/src/query/functions/src/aggregates/aggregate_histogram.rs
index b1935c0a1d..7c0e2c4169 100644
--- a/src/query/functions/src/aggregates/aggregate_histogram.rs
+++ b/src/query/functions/src/aggregates/aggregate_histogram.rs
@@ -145,8 +145,8 @@ where
             &buckets
                 .drain(..)
                 .map(|raw| Bucket {
-                    lower: format_scalar(raw.lower.clone()),
-                    upper: format_scalar(raw.upper.clone()),
+                    lower: format_scalar(raw.lower),
+                    upper: format_scalar(raw.upper),
                     ndv: raw.ndv,
                     count: raw.count,
                     pre_sum: raw.pre_sum,
@@ -352,7 +352,7 @@ fn calculate_bucket_max_values<T: Ord>(value_map: &BTreeMap<T, u64>, num_buckets
     // Assume that the value map is not empty
     debug_assert!(!value_map.is_empty());
 
-    // Calculate the total number of values in the map using std::accumulate()
+    // Calculate the total number of values in the map
     let total_values = value_map.values().sum();
 
     // If there is only one bucket, then all values will be assigned to that bucket
diff --git a/src/query/sql/src/planner/semantic/type_check.rs b/src/query/sql/src/planner/semantic/type_check.rs
index c691f9a450..d11fe9b734 100644
--- a/src/query/sql/src/planner/semantic/type_check.rs
+++ b/src/query/sql/src/planner/semantic/type_check.rs
@@ -1726,7 +1726,7 @@ impl<'a> TypeChecker<'a> {
             } && arg_types[1].is_integer();
             if !is_positive_integer {
                 return Err(ErrorCode::SemanticError(
-                    "The delimiter of `histogram` must be a constant positive int",
+                    "The max_num_buckets of `histogram` must be a constant positive int",
                 ));
             }
 

@BohuTANG
Member

= =, I found some small improvements. Should I submit a new PR directly or open an issue?

Cool, just a new PR is fine.
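
For readers following the first hunk of the diff, here is a minimal sketch of why the two `clone()` calls can be dropped. `drain(..)` yields each element by value, so its fields can be moved straight into a function that takes ownership. The types and the `format_scalar` body below are hypothetical stand-ins (plain `String` bounds instead of Databend's scalar type); only the shape of the `drain(..).map(...)` chain mirrors the diff.

```rust
/// Hypothetical raw bucket with owned bounds.
struct RawBucket {
    lower: String,
    upper: String,
    count: u64,
}

/// Hypothetical formatted bucket.
#[derive(Debug, PartialEq)]
struct Bucket {
    lower: String,
    upper: String,
    count: u64,
}

/// Stand-in formatter that takes its argument by value (an assumption).
fn format_scalar(s: String) -> String {
    format!("<{}>", s)
}

/// `drain(..)` hands out owned `RawBucket`s, so `raw.lower` and `raw.upper`
/// can be moved into `format_scalar` without cloning.
fn finish(buckets: &mut Vec<RawBucket>) -> Vec<Bucket> {
    buckets
        .drain(..)
        .map(|raw| Bucket {
            lower: format_scalar(raw.lower),
            upper: format_scalar(raw.upper),
            count: raw.count,
        })
        .collect()
}
```

After `finish` returns, the source vector is empty but keeps its allocation, which is the usual reason to prefer `drain(..)` over consuming the vector when it is reused.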
