-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add examples for issorted
docstrings.
#2941
Conversation
src/abstractdataframe/sort.jl
Outdated
@@ -336,13 +336,40 @@ Sort.defalg(df::AbstractDataFrame, o::Ordering; alg=nothing, cols=[]) = | |||
lt=isless, by=identity, rev::Bool=false, order::Ordering=Forward) | |||
|
|||
Test whether data frame `df` sorted by column(s) `cols`. | |||
If no column is specified, it checks if the `df` is sorted lexicographically. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an important comment. Can you please add it also to sort
and sort!
. Also can you make it more precise that "no column is specified" means "column selector selects no columns".
@nalimilan - this rule is non-intuitive, but changing it would be breaking. Should we deprecate this behavior for 2.0 release (it would be a separate PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Chandu-4444 - if what I write is not clear please comment and I will give you examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add it also to sort and sort!. Also can you make it more precise that "no column is specified" means "column selector selects no columns".
I get this and will do the same.
@nalimilan - this rule is non-intuitive, but changing it would be breaking. Should we deprecate this behavior for 2.0 release (it would be a separate PR).
This, I didn't understand (And I'm really curious). Can you please elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please elaborate?
sort
and related functions were developed in very early days of DataFrames.jl when there was no precise rules for the whole ecosystem. Therefore some inconsistencies slipped through.
A proper definition of sort
should be (I am skipping kwargs):
Base.sort(df::AbstractDataFrame, cols=All())
however it is
Base.sort(df::AbstractDataFrame, cols=[])
and as a consequence if you pass any column selector that selects no columns the data frame is sorted lexicographically.
While technically it should happen if you select All()
columns. If you select no columns a proper behavior should be that the data frame is not reordered (it is left as is as you have not requested on what the data frame should be sorted).
We have two options:
- leave things as is forever (probably there is not much harm with that as it is unlikely that someone does sorting and wants the data frame to be left untouched)
- make a deprecation of how selecting no columns is handled and in 2.0 release change it the way I describe above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get it now, thanks for that excellent explanation. Perhaps this column selector can be made a required argument (Like the one Python has in Pandas)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this column selector can be made a required argument
This would be even more breaking. By proposing the change to:
Base.sort(df::AbstractDataFrame, cols=All())
we are not breaking calls like sort(df)
which are quite likely to exist in current user code.
We try to limit breaking changes. Calls like sort(df, [])
are quite unlikely, so we could deprecate them and change their behavior in 2.0.
Let us wait for an opinion from @nalimilan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's indeed unfortunate. I'm not sure we should make a breaking change though given that it's not super likely to confuse people (if you call sort
with no columns, it's most likely a mistake anyway and you'll get unexpected results), so I'd opt for a deprecation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - let us then document that passing no columns is an undefined behavior.
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
when column selector is not specified.
src/abstractdataframe/sort.jl
Outdated
@@ -448,8 +476,10 @@ end | |||
|
|||
Return a permutation vector of row indices of data frame `df` that puts them in | |||
sorted order according to column(s) `cols`. | |||
If `cols` were not specified, it returns row indices that sorts `df` lexicographically. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than repeating this here (which would have to be put in all sorting functions), how about putting cols=All()
in the signature above? We could also change the actual implementation to make it clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I want to do. I will make these changes to this PR and then let @Chandu-4444 to sync the docs following your suggestions and writing that passing no columns leads to an undefined row order.
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@Chandu-4444 - I have updated the method signatures and added NEWS.md entry following our discussion. Now can you please sync the documentation accordingly? Thank you. I am marking it as 1.3 release PR (we wait for Julia 1.7 anyway and this should be quick to merge) |
Changing this to the line below would be fine I guess (In all the docstrings)?:
|
Rather than saying "undefined", wouldn't it be better to say what happens now and that this will change in the future? That's less scary and more useful. |
Yeah! Like this? :
|
Well the exact wording will depend on the function (cf. the suggestions I made above). So I'd just add "(this behavior is deprecated and will change in future versions)" to each of them. |
functions's docstrings.
I have reviewed the PR pushing appropriate changes (in particular introducing deprecation warning) and making them consistent across functions. Can you please have a look if all is OK? Thank you! |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Thank you! |
I have added a few examples in
issorted
docstrings. And mentioned about lexicographic sorting atissorted
andsortperm
.