feat(api): Add `order_by` parameter to the `collect()` method
#9170
Comments
Thanks for opening this! I agree that supporting ordering in … I'm not sure how well an …
import ibis
t = ibis.table({"a": "int", "b": "int", "c": "int"})
t.a.collect(order_by=("b", "c"))  # here `b` and `c` refer to columns on the parent table of `a`
t.a.collect(order_by=(t.b, t.c))  # explicit column references also work
Typing all that out doesn't feel that odd, but AFAIK we don't have any other columnar operations that support an ordering specification, so we don't have an existing pattern to mirror here.
Edit: actually, we do have some precedent for binding args to a column method to the parent table - the …
A quick survey of backends and current ops shows those are the main 4 operations. I've taken a quick look at implementing this (needed a fun Friday break) and, like most things in ibis, it's turned up some internal cleanups we'll want to do first, but I think this should be only a medium-sized lift to implement.
…ss (#9222)
This is an initial refactor moving towards #9170.
Previously every backend implemented its own `_aggregate` function, and many of these were copy-pasted from each other with slight variations. Adding a new `order_by` kwarg to `_aggregate` would require editing all 16 copies of this function, which would be _annoying_.
This PR is an attempt to centralize their implementations into a common `Gen` class. The class takes a config flag to handle the common cases; the uncommon cases are then handled by backend-specific subclasses. This also extracts the `.agg` attribute out to be a class variable (err, descriptor) rather than an instance variable.
An alternative implementation would be adding boolean flags directly to the `SQLGlotCompiler` class, but IIRC we intentionally moved away from those in the SQLAlchemy -> SQLGlot refactor. Could go either way here.
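To make the shape of that refactor concrete, here is a minimal, string-based sketch of the idea: one shared generator object holds the aggregate-building logic, a config flag covers the common variation, and a subclass covers an uncommon one. Everything in it (`AggGen`, `supports_filter`, `NoOrderAggGen`, the compiler classes) is a hypothetical name chosen for illustration and is not the actual ibis implementation, which builds SQLGlot expressions rather than strings.

```python
class AggGen:
    """Hypothetical shared builder for aggregate calls (illustrative only)."""

    def __init__(self, supports_filter=False):
        # Config flag: can the backend express `agg(...) FILTER (WHERE ...)`?
        self.supports_filter = supports_filter

    def aggregate(self, name, *args, where=None, order_by=()):
        if where is not None and not self.supports_filter:
            # Fallback: push the predicate into each argument.
            args = tuple(f"CASE WHEN {where} THEN {arg} END" for arg in args)
        inner = ", ".join(args)
        if order_by:
            # The new kwarg is handled here once, not in every backend.
            inner += " ORDER BY " + ", ".join(order_by)
        call = f"{name}({inner})"
        if where is not None and self.supports_filter:
            call += f" FILTER (WHERE {where})"
        return call


class NoOrderAggGen(AggGen):
    """Uncommon case: a hypothetical backend that cannot order inside the call."""

    def aggregate(self, name, *args, where=None, order_by=()):
        if order_by:
            raise NotImplementedError("ordering is not supported here")
        return super().aggregate(name, *args, where=where)


class BaseCompiler:
    agg = AggGen()  # class variable, shared by all instances


class FilterCompiler(BaseCompiler):
    agg = AggGen(supports_filter=True)


class LimitedCompiler(BaseCompiler):
    agg = NoOrderAggGen()


plain = BaseCompiler().agg.aggregate("ARRAY_AGG", "a", where="b > 0", order_by=("b",))
filtered = FilterCompiler().agg.aggregate("ARRAY_AGG", "a", where="b > 0", order_by=("b",))
```

With this layout, supporting a new keyword means touching `AggGen.aggregate` and only those subclasses that genuinely differ.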
Is your feature request related to a problem?
The `collect()` method in Ibis is a convenient way to aggregate data into arrays when working with BigQuery backends. It translates to the `ARRAY_AGG` function in BigQuery SQL. However, there's currently no built-in way to specify an `ORDER BY` clause within the `ARRAY_AGG` aggregation. This means that the order of elements in the resulting array is not guaranteed, which can be problematic when the order of elements matters for the downstream analysis.
What is the motivation behind your request?
No response
Describe the solution you'd like
I propose adding an `order_by` parameter to the `collect()` method. This parameter would accept either:
This would allow users to explicitly control the ordering of elements within the aggregated array, making the results more predictable and useful.
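The requested semantics can be sketched in plain Python, independent of any backend. The `collect` function and the sample rows below are hypothetical stand-ins for illustration, not Ibis code: rows are sorted by the ordering keys before the target column is gathered into a list.

```python
from operator import itemgetter


def collect(rows, column, order_by=()):
    """Gather `column` into a list, sorting rows by the `order_by` keys first."""
    if order_by:
        rows = sorted(rows, key=itemgetter(*order_by))
    return [row[column] for row in rows]


# Hypothetical sample data with the same column names as the thread's example.
rows = [
    {"a": 10, "b": 2, "c": 1},
    {"a": 20, "b": 1, "c": 2},
    {"a": 30, "b": 1, "c": 1},
]

unordered = collect(rows, "a")                     # input order: [10, 20, 30]
ordered = collect(rows, "a", order_by=("b", "c"))  # sorted by (b, c): [30, 20, 10]
```

In a SQL backend the unordered case has no guaranteed element order at all, which is the gap this request is about.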
What version of ibis are you running?
9.0.0
What backend(s) are you using, if any?
BigQuery