Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

single row implied aggregates default to values instead of records #4420

Merged
merged 5 commits into from
Mar 3, 2023

Conversation

mccanne
Copy link
Collaborator

@mccanne mccanne commented Mar 3, 2023

This commit changes the semantics of aggregate functions of an implied summarize op to simply emit the values of the aggregation instead of putting them in a record. This changes the UX for aggregations to be more Zed-like and less SQL-like when trying to form simple aggregations that result in a single result without one or more group-by keys.

Many of us found ourselves typing the pattern 'agg() | yield agg' and realized this approach would simplify things. This way, 'echo "1 2 3"' | sum(this)' results in 6 instead of {sum:6}.

The change here is rather small but the ramifications on tests are non-trivial.

Fixes #4418

This commit changes the semantics of aggregate functions of an implied
summarize op to simply emit the values of the aggregation instead of
putting them in a record.  This changes the UX for aggregqtions to be
more Zed-like and less SQL-like when trying to form simple aggregations
that result in a single result without one or more group-by keys.

Many of us found ourselves typing the pattern 'agg() | yield agg' and
realized this approach would simplify things.  This way,
'echo "1 2 3"' | sum(this)' results in 6 instead of {sum:6}.

The change here is rather small but the ramifications on tests are
non-trivial.
@mccanne mccanne requested review from jameskerr, philrz and a team March 3, 2023 00:33
@philrz
Copy link
Contributor

philrz commented Mar 3, 2023

I did just confirm that this change breaks a couple of the end-to-end Zui tests, but the fix looks trivial and is something I'll be able to take care of.

@nwt
Copy link
Member

nwt commented Mar 3, 2023

@mccanne: I think we should do this only if we have a single aggregation. I say that because I think

$ echo 1 2 3 | zq 'count(),sum(this)' -
{count:3(uint64),sum:6}

is both more intuitive and more useful than

$ echo 1 2 3 | zq 'count(),sum(this)' -
3(uint64)
6

@@ -18,7 +18,7 @@ echo 'true false true' | zq -z 'and(this)' -
```
=>
```mdtest-output
{and:false}
false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR can effectively be seen as one big bug fix. I see now that all this time the usage guidance for all our aggregate functions have looked like:

and(bool) -> bool

i.e., they've been showing the primitive return type this PR introduces and not the record type they've always returned. So now it finally matches. 😂

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whatever we ultimately decide, somewhere in the docs (probably in docs/language/operators/summarize.md) I think we should have some brief-but-explicit text explaining when they can expect what's returned to be a value vs. a record.

Ok, just pushed an update to summarize.md

@philrz
Copy link
Contributor

philrz commented Mar 3, 2023

Having read through the updated examples, I think I agree with @nwt's proposal of only doing it with a single aggregation.

Whatever we ultimately decide, somewhere in the docs (probably in docs/language/operators/summarize.md) I think we should have some brief-but-explicit text explaining when they can expect what's returned to be a value vs. a record.

@philrz philrz requested a review from a team March 3, 2023 01:57
@mccanne
Copy link
Collaborator Author

mccanne commented Mar 3, 2023

Agreed! Will fix.

Copy link
Member

@nwt nwt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@@ -30,6 +30,11 @@ A key may be either an expression or a field. If the key field is omitted,
it is inferred from the expression, e.g., the field name for `by lower(s)`
is `lower`.

When the result of `summarize` is a single value (e.g., a single aggregate
function without group-by keys) and there is no field name specified, then
output is the Zed value of that result rather than a single-field record
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
output is the Zed value of that result rather than a single-field record
the output is that single value rather than a single-field record

```

To format the output of a single-valued aggregation into a record, simply specify
an explicit field for the output as in:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
an explicit field for the output as in:
an explicit field for the output:

```

When multiple aggregate functions are specified, even without explicit field names,
a record result is generated from the implied fields names of the functions:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
a record result is generated from the implied fields names of the functions:
a record result is generated with field names implied by the functions:

@mccanne mccanne merged commit 739e271 into main Mar 3, 2023
@mccanne mccanne deleted the single-row-agg branch March 3, 2023 16:41
nwt added a commit that referenced this pull request Mar 19, 2023
With #4420, the DAG for "from a | count()" now contains a nested
sequential operator, which prevents Optimizer.parallelizeTrunk from
parallelizing the flowgraph.  Fix this by inlining nested sequential
operators in parallelizeTrunk.
nwt added a commit that referenced this pull request Mar 20, 2023
With #4420, the DAG for "from a | count()" now contains a nested
sequential operator, which prevents Optimizer.parallelizeTrunk from
parallelizing the flowgraph.  Fix this by inlining nested sequential
operators in parallelizeTrunk.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Returning value rather than record from aggregation functions
3 participants