Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define sort! for AbstractDataFrame and fix issues of kwargs in sorting functions #2946

Merged
merged 13 commits into from
Nov 26, 2021

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Nov 23, 2021

@nalimilan - do you see any risks in making sort! more flexible and allow SubDataFrame?
(I will add tests if we agree on the design)

I have also proposed to be more careful and error if columns alias but are not identical, but we might decide not to add this extra check.

@bkamins bkamins added this to the 1.x milestone Nov 23, 2021
@bkamins bkamins marked this pull request as ready for review November 24, 2021 08:48
@bkamins bkamins requested a review from nalimilan November 24, 2021 08:48
@bkamins
Copy link
Member Author

bkamins commented Nov 24, 2021

@nalimilan - I have added tests. It should be good for a review

In general what I propose is I think better. Already the previous method has quadratic complexity in the number of columns, but I think it is safer to error immediately when unsafe aliasing is used.

@nalimilan
Copy link
Member

Looks good!

In general what I propose is I think better. Already the previous method has quadratic complexity in the number of columns, but I think it is safer to error immediately when unsafe aliasing is used.

I just wonder whether we should optimistically check and permute each column, and undo the changes if needed (as it should be super rare). But maybe that wouldn't make a difference for performance.

src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Nov 24, 2021

But maybe that wouldn't make a difference for performance.

This is the point. It would only improve performance if we get an error which should be super rare.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@nalimilan
Copy link
Member

This is the point. It would only improve performance if we get an error which should be super rare.

Wait, isn't that the contrary? It could be faster when no error happens, but it would be slower if it does because we would need to restore already permuted columns to their original state. But I agree the gain should be small anyway.

@bkamins
Copy link
Member Author

bkamins commented Nov 25, 2021

undo the changes if needed

But how would you detect you need to undo this operation. The cost of doing this detection is quadratic in number of columns. So if we want to detect it we can do this detection immediately (and this is what I do now). Additionally we need to detect exact aliases and avoid permuting them (this was the behavior that we already had).

A different situation is e.g. in push! where indeed we can cheaply detect a problem post factum (as we can just check if column lengths are correct). However, doing sort! only shuffles values so we have no easy way to detect the problem.

@nalimilan
Copy link
Member

I just meant we could have a single loop over columns with the contents of the two loops you have now, and in case an error happens we would roll back any already applied changes. But forget that, it's probably not worth it.

@bkamins bkamins changed the title Define sort! for AbstractDataFrame Define sort! for AbstractDataFrame and fix issues of kwargs in sorting functions Nov 25, 2021
@bkamins
Copy link
Member Author

bkamins commented Nov 25, 2021

I ended up having to standardize everything. Now all kwargs, following the rules we set in the 1.0 release have to be either scalars or vectors (tuples are not allowed - as it was announced we will not allow tuples). I have also improved docstrings, test coverage, and error checking.

@bkamins bkamins modified the milestones: 1.x, 1.3 Nov 25, 2021
@bkamins bkamins added the bug label Nov 25, 2021
@@ -14,6 +14,24 @@
# which allows a user to specify column specific orderings
# with "order(column, rev=true, ...)"

function _check_sort_args(lt, by, rev, order)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not put this in function signatures instead? People should be used to the kind of MethodError that is printed. If we start checking the type of all arguments manually the codebase is going to get quite large. :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agreed. For some reason I started copying the old design. Now all kwargs have proper type restrictions.

Comment on lines 357 to 358
`cols` selects no columns, check whether `df` is sorted on all columns (this
behaviour is deprecated and will change in future versions).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This text (that we just added) needs to be adapted a bit depending on the function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right - fixed

src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Nov 25, 2021

@nalimilan - let me know when you think it is OK and I will do a final check and merge (there is so much copy-paste in this PR that I want to double check everything before merging).

@bkamins bkamins mentioned this pull request Nov 25, 2021
src/abstractdataframe/sort.jl Outdated Show resolved Hide resolved
bkamins and others added 2 commits November 26, 2021 00:05
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit 421db4d into main Nov 26, 2021
@bkamins bkamins deleted the bk/improve_sort! branch November 26, 2021 10:15
@bkamins
Copy link
Member Author

bkamins commented Nov 26, 2021

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants