Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: make order_by() have an effect after group_by() when aggregating #9716

Open
1 task done
NickCrews opened this issue Jul 29, 2024 · 9 comments
Open
1 task done
Labels
bug Incorrect behavior inside of ibis

Comments

@NickCrews
Copy link
Contributor

NickCrews commented Jul 29, 2024

What happened?

Started in #9710, but this issue is for one specific fix.

import ibis
from ibis import _
ibis.options.interactive = True
orders = ibis.memtable({"cust": [1, 1, 1, 2, 2], "date": [3, 2, 1, 9, 8], "amount": [1, 2, 3, 4, 5]})
e2 = orders.group_by("cust").order_by("date").agg(earliest_first=_.amount.collect())
print(ibis.to_sql(e2))

yields (note there is no ORDER BY)

SELECT
  "t0"."cust",
  ARRAY_AGG("t0"."amount") FILTER(WHERE
    "t0"."amount" IS NOT NULL) AS "earliest_first"
FROM "ibis_pandas_memtable_c6coi73fkzerleojlnpmmwq2va" AS "t0"
GROUP BY
  1

I want the agg to be ordered.

Question: Where should the ORDER BY be inserted? Do we need to push this down into individual aggs, ie

SELECT
  "t0"."cust",
  ARRAY_AGG("t0"."amount" ORDER BY "t0"."date" ASC) FILTER(WHERE
    "t0"."amount" IS NOT NULL) AS "earliest_first"
FROM "ibis_pandas_memtable_c6coi73fkzerleojlnpmmwq2va" AS "t0"
GROUP BY
  1

?

If so, then this might have to depend upon #9170 to land first?

If I understand #9710 correctly, if we did the ORDER BY in a subquery before the groupby, ie

SELECT
  "t1"."cust",
  ARRAY_AGG("t1"."amount") FILTER(WHERE
    "t1"."amount" IS NOT NULL) AS "earliest_first"
FROM (
  SELECT
    *
  FROM "ibis_pandas_memtable_c6coi73fkzerleojlnpmmwq2va" AS "t0"
  ORDER BY
    "t0"."date" ASC
) AS "t1"
GROUP BY
  1

then some backends don't give any guarantees that this works as expected.

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the bug Incorrect behavior inside of ibis label Jul 29, 2024
@jitingxu1 jitingxu1 self-assigned this Jul 29, 2024
@jitingxu1
Copy link
Contributor

Thanks for raising the issue.

Personally, I prefer to the option 1: insert the orderings into the individual aggs for the following reasons:

  • it represents the users intent (wants the rows in agg ordered) and guarantees the ordering for the aggs
  • it aligns with the mutate and select for GroupedTable

We need to wait #9170 merged.

How do you think @cpcloud @gforsyth

@cpcloud
Copy link
Member

cpcloud commented Jul 29, 2024

I think that makes sense. Aligning with the "projected" form of this construction for consistency seems like a good thing to have.

The tricky bits here are going to be ignoring aggregates that don't support ordering.

One approach to make that easier might be to allow every aggregate operation to be ordered, i.e., give the aggregation base class an order_by field, but only expose an order_by argument where it makes sense, e.g. collect would have it, sum wouldn't. This would make the op changes relatively minor, which the main changes being in the compilers.

Make sure to look at how ordering key operations are spelled as inputs to the Sort operation. That should give some idea of how to spell them for aggregations.

Also make sure to look at what sqlglot does, with sg.parse_one('a query that contains order by in the agg') to see how to construct the various ordering objects it requires.

@jcrist
Copy link
Member

jcrist commented Jul 30, 2024

I think I disagree here, but I also don't really understand why our GroupedTable object exposes an order_by method at all. Looking through the code it's not clear to me when the orderings attribute on the expr comes into effect. What's the current use of this method (probably a question for @cpcloud)?


In my mental model, an order_by on a table-like tells how to order the output table. In contrast, the ordering in an aggregate like collect describes how to perform the aggregate internally, but has no effect on the output ordering in the larger expression. This means that you can order an aggregate like collect by something completely different than the output table ordering (this example based on a WIP branch for #9170):

In [1]: import ibis

In [2]: t = ibis.table({"x": "string", "y": "string", "z": "string"})

In [3]: from ibis import _

In [4]: expr = (
   ...:     t
   ...:     .group_by("x")
   ...:     .agg(ys=_.y.collect(order_by="z"))  # order the collect agg by z
   ...:     .order_by("x")  # order the output table by x
   ...: )

In [5]: ibis.to_sql(expr)
Out[5]: 
SELECT
  *
FROM (
  SELECT
    "t0"."x",
    ARRAY_AGG("t0"."y" ORDER BY "t0"."z" ASC) FILTER(WHERE
      "t0"."y" IS NOT NULL) AS "ys"
  FROM "unbound_table_0" AS "t0"
  GROUP BY
    1
) AS "t1"
ORDER BY
  "t1"."x" ASC

If I understand the semantics @cpcloud is describing above, the above query would be equivalent to:

In [6]: expr = (
   ...:     t
   ...:     .group_by("x")
   ...:     .order_by("z")  # order the aggregation by z
   ...:     .agg(ys=_.y.collect())  # collect now implicitly ordered by z?
   ...:     .order_by("x")  # order the output table by x
   ...: )

I personally would find this confusing. We don't support implicit persistent ordering in the non-grouped context (t.order_by("z").agg(ys=_.y.collect()) won't implicitly use the z ordering), doing so in the grouped context would feel inconsistent and surprising. I also find explicitly passing the ordering that an aggregate should use when performing it clearer that the ordering affects the aggregation result, but not the output ordering of the table itself.

@jitingxu1
Copy link
Contributor

Have similar feeling, Not sure if it is good to remove this order_by() method and orderings attribute from the GroupedTable because:

  • It is confusing where the ordering should be performed (inside or outside aggregation), if we have to keep the order_by, I prefer to keep it inside the aggs.
  • Now we'll explicitly support order_by in agg.

but I also don't really understand why our GroupedTable object exposes an order_by method at all

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2024

The current use case of

t.group_by(g).order_by(o)

is automatic windowization (broadcasting in numpy lingo) of any aggregates in a subsequent mutate call.

t.group_by(g).order_by(o).mutate(t.x.sum())

desugars to

t.mutate(t.x.sum().over(group_by=g, order_by=o))

This is taken directly from dplyr's mutate + arrange semantics.

@jcrist
Copy link
Member

jcrist commented Jul 30, 2024

... This is taken directly from dplyr's mutate + arrange semantics.

I personally find the desugared version easier to read and understand, but 🤷.

If we want to keep this method around, one "fix" would be to error on a method that converts a GroupedTable to a Table if there are orderings specfied but not used. So t.group_by(g).order_by(o).mutate(...) would still work, but t.group_by(g).order_by(o).agg(...) would error in an informative way telling the user what they probably wanted to do. This is easy to do, keeps our API explicit, and lets the user know that the order_by method they called isn't doing anything.

@cpcloud
Copy link
Member

cpcloud commented Jul 30, 2024

IMO it seems odd to have two nearly identical constructions, one that has well-defined behavior, and one that errors even though it could have well-defined behavior.

@jcrist
Copy link
Member

jcrist commented Jul 30, 2024

I think I disagree that it has well defined/consistent behavior. There are many implications from having an order_by(...).agg(...) propagate the previous ordering into the aggregates:

  • We'd need to do it in both the grouped and ungrouped aggregation contexts (if t.group_by("c").order_by("a").agg(bs=_.b.collect()) makes the aggregate ordered by a, then t.order_by("a").agg(bs=_.b.collect()) should as well).
  • What about passing an aggregation to a mutate call (t.order_by("a").mutate(bs=_.b.collect()))? Is that collect ordered by a?
  • What about calls between the order_by and agg? Does the order still propagate? We're entering pandas/polars implicit ordering semantics, which is something we've avoided so far.

@jitingxu1
Copy link
Contributor

jitingxu1 commented Jul 30, 2024

could we do this

t.group_by("c").mutate(bs=_.b.collect(order_by=_.a))

e2 = orders.group_by("cust").mutate(earliest_first=_.amount.collect(order_by=_.date))

loh, it is no

SELECT
  "t0"."cust",
  "t0"."date",
  "t0"."amount",
  ARRAY_AGG("t0"."amount") FILTER(WHERE
    "t0"."amount" IS NOT NULL) OVER (PARTITION BY "t0"."cust" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "earliest_first"
FROM "ibis_pandas_memtable_45dx6ahfyfbithx2xczy7lqmc4" AS "t0"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Status: backlog
Development

No branches or pull requests

4 participants