Add sorting options to groupby #3253
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Thank you!
@testset "sorting API" begin
    # simple tests
    df = DataFrame(x=["b", "c", "b", "a", "c"])
    @test getindex.(keys(groupby(df, :x)), 1) == ["b", "c", "a"]
I think these can use `only` instead of `getindex(_, 1)`.
Indeed it could. I initially had this test implemented on groups, not on keys, and using `getindex` works on both.
    @test getindex.(keys(groupby(df, :x, sort=true)), 1) == [1, 2, 100]
    @test getindex.(keys(groupby(df, :x, sort=NamedTuple())), 1) == [1, 2, 100]
    @test getindex.(keys(groupby(df, :x, sort=false)), 1) == [2, 100, 1]
    @test getindex.(keys(groupby(df, order(:x))), 1) == [1, 2, 100]
Having so many equivalent ways to specify sorting does seem a bit much? Not sure if it's worth doing anything about.
The issue is that `sort` etc. provide that many ways to specify sort order, so we cannot do anything about it.
What is the rationale behind it:
- normally people will use the "global" settings like `(rev=true,)`, which apply to all columns
- however, there are cases when you want to specify the sorting order per column, e.g. `[order(:x, rev=true), :y]`, where you reverse `:x` but sort on `:y` in ascending order. Therefore `order` "per column" is needed.
In general, this complexity is needed when one has several columns.
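To make the distinction concrete, here is a minimal sketch of the two styles. It assumes a DataFrames.jl version that includes this PR; the toy data frame and the variable names `gdf_rev` and `gdf_mixed` are made up for illustration.

```julia
using DataFrames

# Toy data with four distinct (x, y) groups (illustrative only)
df = DataFrame(x=["b", "a", "b", "a"], y=[2, 1, 1, 2])

# "Global" setting: rev=true applies to all grouping columns
gdf_rev = groupby(df, [:x, :y], sort=(rev=true,))
Tuple.(keys(gdf_rev))    # [("b", 2), ("b", 1), ("a", 2), ("a", 1)]

# Per-column setting: reverse :x, keep :y ascending
gdf_mixed = groupby(df, [order(:x, rev=true), :y])
Tuple.(keys(gdf_mixed))  # [("b", 1), ("b", 2), ("a", 1), ("a", 2)]
```

With a single grouping column the two forms give the same order, which is why they look redundant in the tests above.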
@jariji - if you would be willing, improving the https://dataframes.juliadata.org/stable/man/sorting/ section of the manual would actually be welcome. I planned to do it at some point, but maybe you would be willing to give it a shot and give more in-depth coverage of all sorting options (this PR just inherits the complexity we allow for there).
Is there an advantage to sorting during the groupby versus sorting the groups afterwards?
There is no convenient way to sort the groups afterwards AFAICT. To get a desired order you would need to sort the data frame you
It is expensive (in terms of time and memory)
Sorting while grouping will be faster and more convenient if the user wants the groups sorted (and this is most likely the typical use case, where the user knows upfront how the groups should be sorted).
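As a rough illustration of the trade-off, the sketch below compares sorting while grouping with one possible workaround: sorting the whole data frame first and then grouping in order of appearance. The workaround is an assumption about what the alternative would look like, not something spelled out in this thread; it reuses the data from the test above and again assumes a DataFrames.jl version that includes this PR.

```julia
using DataFrames

df = DataFrame(x=["b", "c", "b", "a", "c"])

# Sorting while grouping: only the group order is sorted
gdf_sorted = groupby(df, :x, sort=true)
getindex.(keys(gdf_sorted), 1)     # ["a", "b", "c"]

# Assumed workaround: sort(df, :x) allocates a sorted copy of the whole
# data frame, and sort=false then keeps groups in order of appearance
gdf_presorted = groupby(sort(df, :x), :x, sort=false)
getindex.(keys(gdf_presorted), 1)  # ["a", "b", "c"]
```

Both give the same group order here, but the second form copies and reorders every row, and the parent of `gdf_presorted` is the sorted copy rather than the original `df`.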
Fixes #3251