Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add examples for issorted docstrings. #2941

Merged
merged 17 commits into from
Nov 21, 2021
Merged

Add examples for issorted docstrings. #2941

merged 17 commits into from
Nov 21, 2021

Conversation

Chandu-4444
Copy link
Contributor

I have added a few examples in issorted docstrings. And mentioned about lexicographic sorting at issorted and sortperm.

@bkamins bkamins added the doc label Nov 19, 2021
@bkamins bkamins added this to the 1.x milestone Nov 19, 2021
@bkamins bkamins linked an issue Nov 19, 2021 that may be closed by this pull request
@@ -336,13 +336,40 @@ Sort.defalg(df::AbstractDataFrame, o::Ordering; alg=nothing, cols=[]) =
lt=isless, by=identity, rev::Bool=false, order::Ordering=Forward)

Test whether data frame `df` sorted by column(s) `cols`.
If no column is specified, it checks if the `df` is sorted lexicographically.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an important comment. Can you please add it also to sort and sort!. Also can you make it more precise that "no column is specified" means "column selector selects no columns".

@nalimilan - this rule is non-intuitive, but changing it would be breaking. Should we deprecate this behavior for 2.0 release (it would be a separate PR).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Chandu-4444 - if what I write is not clear please comment and I will give you examples.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add it also to sort and sort!. Also can you make it more precise that "no column is specified" means "column selector selects no columns".

I get this and will do the same.

@nalimilan - this rule is non-intuitive, but changing it would be breaking. Should we deprecate this behavior for 2.0 release (it would be a separate PR).

This, I didn't understand (And I'm really curious). Can you please elaborate?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please elaborate?

sort and related functions were developed in very early days of DataFrames.jl when there was no precise rules for the whole ecosystem. Therefore some inconsistencies slipped through.

A proper definition of sort should be (I am skipping kwargs):

Base.sort(df::AbstractDataFrame, cols=All())

however it is

Base.sort(df::AbstractDataFrame, cols=[])

and as a consequence if you pass any column selector that selects no columns the data frame is sorted lexicographically.

While technically it should happen if you select All() columns. If you select no columns a proper behavior should be that the data frame is not reordered (it is left as is as you have not requested on what the data frame should be sorted).

We have two options:

  • leave things as is forever (probably there is not much harm with that as it is unlikely that someone does sorting and wants the data frame to be left untouched)
  • make a deprecation of how selecting no columns is handled and in 2.0 release change it the way I describe above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get it now, thanks for that excellent explanation. Perhaps this column selector can be made a required argument (Like the one Python has in Pandas)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this column selector can be made a required argument

This would be even more breaking. By proposing the change to:

Base.sort(df::AbstractDataFrame, cols=All())

we are not breaking calls like sort(df) which are quite likely to exist in current user code.

We try to limit breaking changes. Calls like sort(df, []) are quite unlikely, so we could deprecate them and change their behavior in 2.0.

Let us wait for an opinion from @nalimilan.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's indeed unfortunate. I'm not sure we should make a breaking change though given that it's not super likely to confuse people (if you call sort with no columns, it's most likely a mistake anyway and you'll get unexpected results), so I'd opt for a deprecation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - let us then document that passing no columns is an undefined behavior.

Chandu-4444 and others added 2 commits November 20, 2021 11:51
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
when column selector is not specified.
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
src/dataframe/sort.jl Outdated Show resolved Hide resolved
@@ -448,8 +476,10 @@ end

Return a permutation vector of row indices of data frame `df` that puts them in
sorted order according to column(s) `cols`.
If `cols` were not specified, it returns row indices that sorts `df` lexicographically.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than repeating this here (which would have to be put in all sorting functions), how about putting cols=All() in the signature above? We could also change the actual implementation to make it clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I want to do. I will make these changes to this PR and then let @Chandu-4444 to sync the docs following your suggestions and writing that passing no columns leads to an undefined row order.

Chandu-4444 and others added 7 commits November 20, 2021 21:42
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member

bkamins commented Nov 20, 2021

@Chandu-4444 - I have updated the method signatures and added NEWS.md entry following our discussion. Now can you please sync the documentation accordingly? Thank you. I am marking it as 1.3 release PR (we wait for Julia 1.7 anyway and this should be quick to merge)

@bkamins bkamins modified the milestones: 1.x, 1.3 Nov 20, 2021
@Chandu-4444
Copy link
Contributor Author

If cols were not specified, it returns row indices that sorts df lexicographically.

Changing this to the line below would be fine I guess (In all the docstrings)?:

Passing an empty columns selector leads to an undefined result.

@nalimilan
Copy link
Member

Rather than saying "undefined", wouldn't it be better to say what happens now and that this will change in the future? That's less scary and more useful.

@Chandu-4444
Copy link
Contributor Author

Rather than saying "undefined", wouldn't it be better to say what happens now and that this will change in the future? That's less scary and more useful.

Yeah! Like this? :

If cols were not specified, it returns row indices that sorts df lexicographically.
This behaviour will change from version 1.3.

@nalimilan
Copy link
Member

Well the exact wording will depend on the function (cf. the suggestions I made above). So I'd just add "(this behavior is deprecated and will change in future versions)" to each of them.

NEWS.md Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented Nov 20, 2021

I have reviewed the PR pushing appropriate changes (in particular introducing deprecation warning) and making them consistent across functions. Can you please have a look if all is OK? Thank you!

NEWS.md Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit 9791095 into JuliaData:main Nov 21, 2021
@bkamins
Copy link
Member

bkamins commented Nov 21, 2021

Thank you!

@Chandu-4444 Chandu-4444 deleted the ImproveDocs branch November 21, 2021 13:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve sort docs
3 participants