Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL-2622: Implement multi-column COUNT #28

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

mattChiaravalloti
Copy link
Collaborator

@mattChiaravalloti mattChiaravalloti commented Feb 19, 2025

This PR adds support for multi-column COUNT. It supports this for distinct and non-distinct COUNT. It also updates the desugarer to properly handle single-column COUNT when that column MAY/MUST be a document, and to avoid conditionally checking literal values. It also added some new exprs and mql functions to the air since we needed them for our desugaring ($map, $objectToArray, $allElementsTrue, etc. you'll see them when you review).

This ended up being a bit larger than originally anticipated mainly because, in addition to implementing multi-column COUNT, I covered the other things described above. These changes touch a lot of files but the core of the new work is still in the accumulators.rs desugarer, the query spec tests, and a new rewriter pass.

@@ -440,6 +440,7 @@ pub enum GroupAccumulatorExpr {
SQLAccumulator {
distinct: bool,
var: Box<Expression>,
arg_is_possibly_doc: Option<String>,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the $sql-prefixed accumulators are not "real" MQL syntax, I feel comfortable adding this for testing convenience. Generally, although this file appears first I advise you do not review it first. Consider starting with the spec tests and the desugarer tests and working out from there.

let arg_is_possibly_doc = match arg_is_possibly_doc {
Some(s) if s.to_lowercase() == "must" => Satisfaction::Must,
Some(s) if s.to_lowercase() == "may" => Satisfaction::May,
_ => Satisfaction::Not,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just for testing, so I think it is ok to play a little fast and loose and use a wildcard.

Comment on lines +195 to +198
// No need to create a conditional if the argument is a literal value.
// We know null will result in the then case and non-null will result in the else.
Literal(LiteralValue::Null) => then,
Literal(_) => r#else,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a little drive-by optimization that helps avoid things like desugaring COUNT(1) into {$cond: [{$in: [{$type: {$literal: 1}}, ["missing", "null"]}, 0, 1]}. There's really no need to use the conditional when we statically know the value is not null or missing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also the obvious future extension here where we optimize at the mir-level whether or not we need to use null-checking semantics at all but that's for another time.

"$cond":
[
{
"$or":
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I flipped this from a conjunction of negative conditions to a disjunction of positive ones. This just helped line up the single- and multiple-column cases more easily, especially now that we account for single-columns that may be documents.

@@ -5177,6 +5177,7 @@ mod aggregation {
function: mir::AggregationFunction::Count,
distinct: false,
arg: mir::Expression::Literal(mir::LiteralValue::Integer(42)).into(),
arg_is_possibly_doc: Satisfaction::Not,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value is irrelevant for testing for the majority of the codebase. As noted where the field is defined, it is only relevant at desugar time. That's why you keep seeing it set to Not in most unit tests.

@@ -165,6 +174,59 @@ impl Visitor for AggregateUsageCheckVisitor {
}
}

#[derive(Default)]
pub struct MultiArgCountVisitor {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was not in the design doc. The algebrizer rejects aggregations with multiple arguments, and the mir and air only support single-argument accumulator exprs. I figured the smallest change was to syntactically rewrite multi-column COUNTs into COUNT(<doc>), thus making it an aggregation with a single argument. The previous ticket already enabled COUNT with document arguments so this change requires no further work down the line except desugarer updates.

.into_iter()
.map(|arg| {
let ast::Expression::Identifier(key) = arg.clone() else {
self.error = Some(Error::InvalidMultiArgCountArg);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided for now we will only support field references as the multiple arguments. It is doubtful anyone will want to do something like COUNT(a+1, WHEN b CASE 7 THEN "yes" ELSE c END, x). If someone requests this, we can support it then. For now, this is a very convenient restriction.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's a good call, no need to add too much complexity if it's not requested.

@@ -9,7 +9,7 @@ catalog_data:

multi:
- { "_id": 1, "a": 1, "b": 2, "c": 1 }
- { "_id": 2, "a": 2, "b": 2, "c": 2 }
- { "_id": 2, "a": 2, "b": 2, "c": 3 }
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this change for the sake of DISTINCT testing.

Copy link
Collaborator

@pmeredit pmeredit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I really like the syntax rewrite here, and now I'm pretty convinced that is how we should handle the new Unwind support, also.

Although, we don't have schema info so we might end up with extra unwinds if we go that way.

Copy link
Member

@bucaojit bucaojit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look great and the tests are thorough.
I had one question about supporting SELECT COUNT(DISTINCT col,...).

@@ -1106,15 +1107,18 @@ impl<'a> Algebrizer<'a> {
return Err(Error::StarInNonCount);
}
ast::FunctionArguments::Args(ve) => {
let arg = if ve.len() != 1 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm testing out the change and I see that it is still throwing an error if I execute the query:

SELECT COUNT(DISTINCT a,b) from multi

I see that we support the SELECT COUNT(DISTINCT *) so I think we'd also want to support the multi-column here by having a separate case for the Count function to allow multiple arguments.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's surprising! I wonder if it's a rewriter order thing 🤔 Like we do the accumulator rewrite before the select clause one. This spec test obviously passes:

- description: COUNT distinct multi column correctness test -- only counts rows with at least one non-nullish value
    query: "SELECT * FROM foo.multi AS m GROUP BY a AS a AGGREGATE COUNT(DISTINCT b, c) AS gcount"

so I'll need to figure out what's causing the COUNT(<multi>) in the select position to fail! Great catch, thank you!

MQLSemanticOperator(MQLSemanticOperator {
op: MQLOperator::In,
// Condition used to desugar a $sqlCount when the argument is not a document. This condition asserts
// that the arg is null or missing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice use of macros, that helped with keeping the code easy to understand 👍

.into_iter()
.map(|arg| {
let ast::Expression::Identifier(key) = arg.clone() else {
self.error = Some(Error::InvalidMultiArgCountArg);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's a good call, no need to add too much complexity if it's not requested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants