bug: make order_by() have an effect after group_by() when aggregating #9716
Comments
Thanks for raising the issue. Personally, I prefer option 1 (insert the orderings into the individual aggs) for the following reasons:
We need to wait for #9170 to be merged. |
I think that makes sense. Aligning with the "projected" form of this construction for consistency seems like a good thing to have. The tricky bits here are going to be ignoring aggregates that don't support ordering. One approach to make that easier might be to allow every aggregate operation to be ordered, i.e., give the aggregation base class an
Make sure to look at how ordering key operations are spelled as inputs to the
Also make sure to look at what sqlglot does, with |
I think I disagree here, but I also don't really understand why our
In my mental model, an
In [1]: import ibis
In [2]: t = ibis.table({"x": "string", "y": "string", "z": "string"})
In [3]: from ibis import _
In [4]: expr = (
...: t
...: .group_by("x")
...: .agg(ys=_.y.collect(order_by="z")) # order the collect agg by z
...: .order_by("x") # order the output table by x
...: )
In [5]: ibis.to_sql(expr)
Out[5]:
SELECT
*
FROM (
SELECT
"t0"."x",
ARRAY_AGG("t0"."y" ORDER BY "t0"."z" ASC) FILTER(WHERE
"t0"."y" IS NOT NULL) AS "ys"
FROM "unbound_table_0" AS "t0"
GROUP BY
1
) AS "t1"
ORDER BY
"t1"."x" ASC
If I understand the semantics @cpcloud is describing above, the above query would be equivalent to:
In [6]: expr = (
...: t
...: .group_by("x")
...: .order_by("z") # order the aggregation by z
...: .agg(ys=_.y.collect()) # collect now implicitly ordered by z?
...: .order_by("x") # order the output table by x
...: )
I personally would find this confusing. We don't support implicit persistent ordering in the non-grouped context ( |
I have a similar feeling. I'm not sure whether it is good to remove this
|
The current use case of t.group_by(g).order_by(o) is automatic windowization (broadcasting in numpy lingo) of any aggregates in a subsequent
t.group_by(g).order_by(o).mutate(t.x.sum())
desugars to
t.mutate(t.x.sum().over(group_by=g, order_by=o))
This is taken directly from dplyr's |
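To make "windowization" concrete, here is a minimal plain-Python sketch of the broadcasting part only. It is a model of the semantics, not ibis code: the helper name `broadcast_group_sum` and the sample rows are made up for illustration. The ordering key is deliberately left out, because its effect on a windowed sum depends on the window frame, which this thread does not spell out.

```python
from collections import defaultdict


def broadcast_group_sum(rows, group_key, value_key, out_key):
    """Attach each group's sum to every row of that group (hypothetical helper).

    Unlike a group_by().agg(), the number of rows is unchanged: the
    aggregate is broadcast back onto the rows it was computed from.
    """
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    # Keep the input row order; only add the broadcast column.
    return [{**row, out_key: totals[row[group_key]]} for row in rows]


rows = [
    {"g": "a", "x": 1},
    {"g": "b", "x": 10},
    {"g": "a", "x": 2},
]
windowed = broadcast_group_sum(rows, "g", "x", "x_sum")
```

The point of the sketch is the shape of the result: a grouped mutate keeps one output row per input row, whereas a grouped agg collapses each group to a single row.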
I personally find the desugared version easier to read and understand, but 🤷. If we want to keep this method around, one "fix" would be to error on a method that converts a |
IMO it seems odd to have two nearly identical constructions, one that has well-defined behavior, and one that errors even though it could have well-defined behavior. |
I think I disagree that it has well defined/consistent behavior. There are many implications from having an
|
Could we do this
Oh, it is no
|
What happened?
Started in #9710, but this issue is for one specific fix.
yields (note there is no ORDER BY)
I want the agg to be ordered.
Question: Where should the ORDER BY be inserted? Do we need to push this down into individual aggs, i.e.
?
If so, then this might have to depend upon #9170 to land first?
If I understand #9710 correctly, if we did the ORDER BY in a subquery before the group_by, i.e.
then some backends don't give any guarantees that this works as expected.
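As a rough illustration of the "push the ordering into each aggregate" option, here is a plain-Python model of the intended semantics. It is not ibis or SQL; the function name `collect_ordered` and the sample rows are assumptions made up for this sketch. Each group sorts its own rows by the ordering key before collecting, so the result does not depend on the order in which rows arrive.

```python
from collections import defaultdict


def collect_ordered(rows, group_key, value_key, order_key):
    """Group rows, then sort within each group before collecting values.

    Hypothetical helper: models an aggregate that carries its own ordering,
    rather than relying on the input already being sorted.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    return {
        key: [r[value_key] for r in sorted(members, key=lambda r: r[order_key])]
        for key, members in groups.items()
    }


rows = [
    {"x": "a", "y": "q", "z": 2},
    {"x": "b", "y": "r", "z": 1},
    {"x": "a", "y": "p", "z": 1},
]
result = collect_ordered(rows, "x", "y", "z")
# Feeding the rows in a different order gives the same per-group lists.
result_reversed = collect_ordered(list(reversed(rows)), "x", "y", "z")
```

Sorting in a subquery before grouping would correspond to sorting `rows` once up front and hoping the grouping step preserves that order, which is the guarantee some backends do not give.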