Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add examples for issorted docstrings. #2941

Merged
merged 17 commits into from
Nov 21, 2021
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 28 additions & 0 deletions src/abstractdataframe/sort.jl
Original file line number Diff line number Diff line change
Expand Up @@ -336,13 +336,40 @@ Sort.defalg(df::AbstractDataFrame, o::Ordering; alg=nothing, cols=[]) =
lt=isless, by=identity, rev::Bool=false, order::Ordering=Forward)

Test whether data frame `df` sorted by column(s) `cols`.
bkamins marked this conversation as resolved.
Show resolved Hide resolved
If no column is specified, it checks if the `df` is sorted lexicographically.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an important comment. Can you please add it also to sort and sort!. Also can you make it more precise that "no column is specified" means "column selector selects no columns".

@nalimilan - this rule is non-intuitive, but changing it would be breaking. Should we deprecate this behavior for 2.0 release (it would be a separate PR).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Chandu-4444 - if what I write is not clear please comment and I will give you examples.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add it also to sort and sort!. Also can you make it more precise that "no column is specified" means "column selector selects no columns".

I get this and will do the same.

@nalimilan - this rule is non-intuitive, but changing it would be breaking. Should we deprecate this behavior for 2.0 release (it would be a separate PR).

This, I didn't understand (And I'm really curious). Can you please elaborate?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please elaborate?

sort and related functions were developed in very early days of DataFrames.jl when there was no precise rules for the whole ecosystem. Therefore some inconsistencies slipped through.

A proper definition of sort should be (I am skipping kwargs):

Base.sort(df::AbstractDataFrame, cols=All())

however it is

Base.sort(df::AbstractDataFrame, cols=[])

and as a consequence if you pass any column selector that selects no columns the data frame is sorted lexicographically.

While technically it should happen if you select All() columns. If you select no columns a proper behavior should be that the data frame is not reordered (it is left as is as you have not requested on what the data frame should be sorted).

We have two options:

  • leave things as is forever (probably there is not much harm with that as it is unlikely that someone does sorting and wants the data frame to be left untouched)
  • make a deprecation of how selecting no columns is handled and in 2.0 release change it the way I describe above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get it now, thanks for that excellent explanation. Perhaps this column selector can be made a required argument (Like the one Python has in Pandas)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this column selector can be made a required argument

This would be even more breaking. By proposing the change to:

Base.sort(df::AbstractDataFrame, cols=All())

we are not breaking calls like sort(df) which are quite likely to exist in current user code.

We try to limit breaking changes. Calls like sort(df, []) are quite unlikely, so we could deprecate them and change their behavior in 2.0.

Let us wait for an opinion from @nalimilan.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's indeed unfortunate. I'm not sure we should make a breaking change though given that it's not super likely to confuse people (if you call sort with no columns, it's most likely a mistake anyway and you'll get unexpected results), so I'd opt for a deprecation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - let us then document that passing no columns is an undefined behavior.


`cols` can be any column selector ($COLUMNINDEX_STR; $MULTICOLUMNINDEX_STR).

If `rev` is `true`, reverse sorting is performed. To enable reverse sorting
only for some columns, pass `order(c, rev=true)` in `cols`, with `c` the
corresponding column index (see example below).

See other methods for a description of other keyword arguments.

# Examples
```jldoctest
julia> df = DataFrame(a = [1,2,3,4], b = [4,3,2,1])
bkamins marked this conversation as resolved.
Show resolved Hide resolved
4×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 3
3 │ 3 2
4 │ 4 1

julia> issorted(df)
true

julia> issorted(df, :a)
true

julia> issorted(df, :b)
false

julia> issorted(df, :b, rev = true)
Chandu-4444 marked this conversation as resolved.
Show resolved Hide resolved
true
```
"""
function Base.issorted(df::AbstractDataFrame, cols=[];
lt=isless, by=identity, rev=false, order=Forward)
Expand Down Expand Up @@ -448,6 +475,7 @@ end

Return a permutation vector of row indices of data frame `df` that puts them in
sorted order according to column(s) `cols`.
bkamins marked this conversation as resolved.
Show resolved Hide resolved
If `cols` were not specified, it returns row indices that sorts `df` lexicographically.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than repeating this here (which would have to be put in all sorting functions), how about putting cols=All() in the signature above? We could also change the actual implementation to make it clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I want to do. I will make these changes to this PR and then let @Chandu-4444 to sync the docs following your suggestions and writing that passing no columns leads to an undefined row order.


`cols` can be any column selector ($COLUMNINDEX_STR; $MULTICOLUMNINDEX_STR).

Expand Down