Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: grouping sets functionality #550

Open
wesm opened this issue Aug 17, 2015 · 4 comments · May be fixed by #9945
Open

feat: grouping sets functionality #550

wesm opened this issue Aug 17, 2015 · 4 comments · May be fixed by #9945
Assignees
Labels
analytics Higher-level data analysis API related issues or PRs feature Features or general enhancements

Comments

@wesm
Copy link
Member

wesm commented Aug 17, 2015

No description provided.

@wesm wesm added feature Features or general enhancements analytics Higher-level data analysis API related issues or PRs labels Aug 17, 2015
@wesm wesm added this to the 0.5 milestone Aug 17, 2015
@wesm wesm self-assigned this Aug 27, 2015
@wesm wesm modified the milestones: 0.7, 0.5 Sep 4, 2015
@cpcloud cpcloud modified the milestones: 0.11, Future Jun 12, 2017
@datapythonista datapythonista removed this from the Future milestone Nov 13, 2020
@cpcloud
Copy link
Member

cpcloud commented Dec 7, 2021

This could also capture generic GROUPING SETS idioms, not just ROLLUP.

@cpcloud cpcloud changed the title Simulate SQL ROLLUP with unions feat: grouping sets functionality Dec 28, 2021
@cpcloud
Copy link
Member

cpcloud commented Dec 28, 2021

GROUPING SETS is a useful tool for data scientists and engineers. It allows computing aggregations for different combinations of grouping keys, all in the same query.

PostgreSQL has great docs on the feature: https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-GROUPING-SETS.

I'm not entirely sure what the API should look like, but here are a few requirements and considerations for anyone who wants to work on this:

  • Whatever the API looks like, it needs to compose with the existing group_by method. It is both legitimate and useful to combine grouping sets, cube, rollup, and group by in a single query.
    For example (from the above postgres docs):
GROUP BY a, CUBE (b, c), GROUPING SETS ((d), (e))

which expands to

GROUP BY GROUPING SETS (
    (a, b, c, d), (a, b, c, e),
    (a, b, d),    (a, b, e),
    (a, c, d),    (a, c, e),
    (a, d),       (a, e)
)

In my mind this rules out a method that exists only on GroupedTableExpr (in an effort to prevent people from accessing the API without calling group_by) because

t.group_by(...).cube()

doesn't make a whole lot of sense unless we allow group_by to take zero arguments, itself a change that I think will lead to unmaintainable hacks.

There are two viable approaches I see.

  1. group_by grows the ability to accept special (new) objects ibis.grouping_sets(sequence_of_tuples), ibis.cube(tuple), ibis.rollup(tuple). The above postgres example would be written as:
import ibis

t = ibis.table([
    ("a", "string"),
    ("b", "string"),
    ("c", "string"),
    ("d", "string"),
    ("money", "float64"),
])

t.group_by([
    "a",
    ibis.cube(("b", "c")),
    ibis.grouping_sets([("d",), ("e",)])
])

I prefer this approach because it requires the least number of invasive changes and I believe will not require any hacks.

  1. TableExpr grows some new methods (grouping_sets, cube, rollup).

This is pretty tantalizing, and superficially seems to compose well until you start trying to ... compose these methods:

# using t from the previous example, seems nice but see below
t.group_by("a").cube(("b", "c")).grouping_sets([("d",), ("e",)])

# equivalent except for column ordering
t.cube(("b", "c")).group_by("a").grouping_sets([("d",), ("e",)])
# t.cube would need to have a single object that tracks all the grouping set "things"
# to make sure it doesn't generate `GROUP BY` multiple times.

Since cube et al are limited to grouping in SQL, we should not introduce a new API that allows grouping set-like things to be use in an apparently-outside-of-group-by-but-not-actually context.

@cpcloud cpcloud added this to the 4.0.0 milestone Apr 19, 2022
@cpcloud cpcloud modified the milestones: 4.0.0, 4.x Aug 31, 2022
@cpcloud cpcloud modified the milestones: 4.1, 5.0 Jan 24, 2023
@cpcloud cpcloud removed this from the 5.0 milestone Feb 2, 2023
@cpcloud cpcloud closed this as completed Mar 30, 2023
@cpcloud cpcloud added the unresolved Issues that are closed but still available for implementation label Mar 31, 2023
@cpcloud cpcloud reopened this Mar 31, 2023
@cpcloud cpcloud closed this as not planned Won't fix, can't repro, duplicate, stale Mar 31, 2023
@jcrist jcrist reopened this Jun 26, 2024
@jcrist
Copy link
Member

jcrist commented Jun 26, 2024

Reopening this, as we'll probably want it for full TPC-DS support (#9447).

@jcrist jcrist unassigned wesm Jun 26, 2024
@jcrist jcrist added this to the Q3 2024 milestone Jun 26, 2024
@landonAcuityMD
Copy link

Definite nice to have - was just looking for a way to do this today. :D

@cpcloud cpcloud self-assigned this Jul 3, 2024
@cpcloud cpcloud removed the unresolved Issues that are closed but still available for implementation label Jul 3, 2024
@lostmygithubaccount lostmygithubaccount removed this from the Q3 2024 milestone Jul 17, 2024
@cpcloud cpcloud linked a pull request Sep 11, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
analytics Higher-level data analysis API related issues or PRs feature Features or general enhancements
Projects
Status: backlog
Development

Successfully merging a pull request may close this issue.

6 participants