bug: make `order_by()` have an effect after `group_by()` when aggregating #9716

NickCrews · 2024-07-29T17:23:01Z

What happened?

Started in #9710, but this issue is for one specific fix.

import ibis
from ibis import _
ibis.options.interactive = True
orders = ibis.memtable({"cust": [1, 1, 1, 2, 2], "date": [3, 2, 1, 9, 8], "amount": [1, 2, 3, 4, 5]})
e2 = orders.group_by("cust").order_by("date").agg(earliest_first=_.amount.collect())
print(ibis.to_sql(e2))

yields (note there is no ORDER BY)

SELECT
  "t0"."cust",
  ARRAY_AGG("t0"."amount") FILTER(WHERE
    "t0"."amount" IS NOT NULL) AS "earliest_first"
FROM "ibis_pandas_memtable_c6coi73fkzerleojlnpmmwq2va" AS "t0"
GROUP BY
  1

I want the agg to be ordered.

Question: Where should the ORDER BY be inserted? Do we need to push this down into individual aggs, ie

SELECT
  "t0"."cust",
  ARRAY_AGG("t0"."amount" ORDER BY "t0"."date" ASC) FILTER(WHERE
    "t0"."amount" IS NOT NULL) AS "earliest_first"
FROM "ibis_pandas_memtable_c6coi73fkzerleojlnpmmwq2va" AS "t0"
GROUP BY
  1

?

If so, then this might have to depend upon #9170 to land first?

If I understand #9710 correctly, if we did the ORDER BY in a subquery before the groupby, ie

SELECT
  "t1"."cust",
  ARRAY_AGG("t1"."amount") FILTER(WHERE
    "t1"."amount" IS NOT NULL) AS "earliest_first"
FROM (
  SELECT
    *
  FROM "ibis_pandas_memtable_c6coi73fkzerleojlnpmmwq2va" AS "t0"
  ORDER BY
    "t0"."date" ASC
) AS "t1"
GROUP BY
  1

then some backends don't give any guarantees that this works as expected.

Code of Conduct

I agree to follow this project's Code of Conduct

jitingxu1 · 2024-07-29T22:19:58Z

Thanks for raising the issue.

Personally, I prefer to the option 1: insert the orderings into the individual aggs for the following reasons:

it represents the users intent (wants the rows in agg ordered) and guarantees the ordering for the aggs
it aligns with the mutate and select for GroupedTable

We need to wait #9170 merged.

How do you think @cpcloud @gforsyth

cpcloud · 2024-07-29T22:40:13Z

I think that makes sense. Aligning with the "projected" form of this construction for consistency seems like a good thing to have.

The tricky bits here are going to be ignoring aggregates that don't support ordering.

One approach to make that easier might be to allow every aggregate operation to be ordered, i.e., give the aggregation base class an order_by field, but only expose an order_by argument where it makes sense, e.g. collect would have it, sum wouldn't. This would make the op changes relatively minor, which the main changes being in the compilers.

Make sure to look at how ordering key operations are spelled as inputs to the Sort operation. That should give some idea of how to spell them for aggregations.

Also make sure to look at what sqlglot does, with sg.parse_one('a query that contains order by in the agg') to see how to construct the various ordering objects it requires.

jcrist · 2024-07-30T17:33:04Z

I think I disagree here, but I also don't really understand why our GroupedTable object exposes an order_by method at all. Looking through the code it's not clear to me when the orderings attribute on the expr comes into effect. What's the current use of this method (probably a question for @cpcloud)?

In my mental model, an order_by on a table-like tells how to order the output table. In contrast, the ordering in an aggregate like collect describes how to perform the aggregate internally, but has no effect on the output ordering in the larger expression. This means that you can order an aggregate like collect by something completely different than the output table ordering (this example based on a WIP branch for #9170):

In [1]: import ibis

In [2]: t = ibis.table({"x": "string", "y": "string", "z": "string"})

In [3]: from ibis import _

In [4]: expr = (
   ...:     t
   ...:     .group_by("x")
   ...:     .agg(ys=_.y.collect(order_by="z"))  # order the collect agg by z
   ...:     .order_by("x")  # order the output table by x
   ...: )

In [5]: ibis.to_sql(expr)
Out[5]: 
SELECT
  *
FROM (
  SELECT
    "t0"."x",
    ARRAY_AGG("t0"."y" ORDER BY "t0"."z" ASC) FILTER(WHERE
      "t0"."y" IS NOT NULL) AS "ys"
  FROM "unbound_table_0" AS "t0"
  GROUP BY
    1
) AS "t1"
ORDER BY
  "t1"."x" ASC

If I understand the semantics @cpcloud is describing above, the above query would be equivalent to:

In [6]: expr = (
   ...:     t
   ...:     .group_by("x")
   ...:     .order_by("z")  # order the aggregation by z
   ...:     .agg(ys=_.y.collect())  # collect now implicitly ordered by z?
   ...:     .order_by("x")  # order the output table by x
   ...: )

I personally would find this confusing. We don't support implicit persistent ordering in the non-grouped context (t.order_by("z").agg(ys=_.y.collect()) won't implicitly use the z ordering), doing so in the grouped context would feel inconsistent and surprising. I also find explicitly passing the ordering that an aggregate should use when performing it clearer that the ordering affects the aggregation result, but not the output ordering of the table itself.

jitingxu1 · 2024-07-30T17:49:45Z

Have similar feeling, Not sure if it is good to remove this order_by() method and orderings attribute from the GroupedTable because:

It is confusing where the ordering should be performed (inside or outside aggregation), if we have to keep the order_by, I prefer to keep it inside the aggs.
Now we'll explicitly support order_by in agg.

but I also don't really understand why our GroupedTable object exposes an order_by method at all

cpcloud · 2024-07-30T17:50:37Z

The current use case of

t.group_by(g).order_by(o)

is automatic windowization (broadcasting in numpy lingo) of any aggregates in a subsequent mutate call.

t.group_by(g).order_by(o).mutate(t.x.sum())

desugars to

t.mutate(t.x.sum().over(group_by=g, order_by=o))

This is taken directly from dplyr's mutate + arrange semantics.

jcrist · 2024-07-30T17:54:59Z

... This is taken directly from dplyr's mutate + arrange semantics.

I personally find the desugared version easier to read and understand, but 🤷.

If we want to keep this method around, one "fix" would be to error on a method that converts a GroupedTable to a Table if there are orderings specfied but not used. So t.group_by(g).order_by(o).mutate(...) would still work, but t.group_by(g).order_by(o).agg(...) would error in an informative way telling the user what they probably wanted to do. This is easy to do, keeps our API explicit, and lets the user know that the order_by method they called isn't doing anything.

cpcloud · 2024-07-30T18:00:09Z

IMO it seems odd to have two nearly identical constructions, one that has well-defined behavior, and one that errors even though it could have well-defined behavior.

jcrist · 2024-07-30T18:11:51Z

I think I disagree that it has well defined/consistent behavior. There are many implications from having an order_by(...).agg(...) propagate the previous ordering into the aggregates:

We'd need to do it in both the grouped and ungrouped aggregation contexts (if t.group_by("c").order_by("a").agg(bs=_.b.collect()) makes the aggregate ordered by a, then t.order_by("a").agg(bs=_.b.collect()) should as well).
What about passing an aggregation to a mutate call (t.order_by("a").mutate(bs=_.b.collect()))? Is that collect ordered by a?
What about calls between the order_by and agg? Does the order still propagate? We're entering pandas/polars implicit ordering semantics, which is something we've avoided so far.

jitingxu1 · 2024-07-30T18:16:36Z

could we do this

t.group_by("c").mutate(bs=_.b.collect(order_by=_.a))

e2 = orders.group_by("cust").mutate(earliest_first=_.amount.collect(order_by=_.date))

loh, it is no

SELECT
  "t0"."cust",
  "t0"."date",
  "t0"."amount",
  ARRAY_AGG("t0"."amount") FILTER(WHERE
    "t0"."amount" IS NOT NULL) OVER (PARTITION BY "t0"."cust" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "earliest_first"
FROM "ibis_pandas_memtable_45dx6ahfyfbithx2xczy7lqmc4" AS "t0"

NickCrews added the bug Incorrect behavior inside of ibis label Jul 29, 2024

github-project-automation bot added this to Ibis planning and roadmap Jul 29, 2024

github-project-automation bot moved this to backlog in Ibis planning and roadmap Jul 29, 2024

NickCrews mentioned this issue Jul 29, 2024

feat/bug: order_by aggs in groupbys #9710

Closed

1 task

jitingxu1 self-assigned this Jul 29, 2024

jcrist unassigned jitingxu1 Jul 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: make `order_by()` have an effect after `group_by()` when aggregating #9716

bug: make `order_by()` have an effect after `group_by()` when aggregating #9716

NickCrews commented Jul 29, 2024 •

edited

Loading

jitingxu1 commented Jul 29, 2024

cpcloud commented Jul 29, 2024

jcrist commented Jul 30, 2024

jitingxu1 commented Jul 30, 2024

cpcloud commented Jul 30, 2024

jcrist commented Jul 30, 2024

cpcloud commented Jul 30, 2024

jcrist commented Jul 30, 2024

jitingxu1 commented Jul 30, 2024 •

edited

Loading

bug: make order_by() have an effect after group_by() when aggregating #9716

bug: make order_by() have an effect after group_by() when aggregating #9716

Comments

NickCrews commented Jul 29, 2024 • edited Loading

What happened?

Code of Conduct

jitingxu1 commented Jul 29, 2024

cpcloud commented Jul 29, 2024

jcrist commented Jul 30, 2024

jitingxu1 commented Jul 30, 2024

cpcloud commented Jul 30, 2024

jcrist commented Jul 30, 2024

cpcloud commented Jul 30, 2024

jcrist commented Jul 30, 2024

jitingxu1 commented Jul 30, 2024 • edited Loading

bug: make `order_by()` have an effect after `group_by()` when aggregating #9716

bug: make `order_by()` have an effect after `group_by()` when aggregating #9716

NickCrews commented Jul 29, 2024 •

edited

Loading

jitingxu1 commented Jul 30, 2024 •

edited

Loading