Define sort! for AbstractDataFrame and fix issues of kwargs in sorting functions #2946
@nalimilan - I have added tests. It should be good for a review. In general, I think what I propose is better. The previous method already had quadratic complexity in the number of columns, but I think it is safer to error immediately when unsafe aliasing is used.
Looks good!
I just wonder whether we should optimistically check and permute each column, and undo the changes if needed (as it should be super rare). But maybe that wouldn't make a difference for performance.
This is the point. It would only improve performance if we get an error, which should be super rare.
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Wait, isn't that the contrary? It could be faster when no error happens, but it would be slower if it does because we would need to restore already permuted columns to their original state. But I agree the gain should be small anyway.
But how would you detect that you need to undo this operation? The cost of this detection is quadratic in the number of columns. So if we want to detect it, we can do the detection immediately (and this is what I do now). Additionally, we need to detect exact aliases and avoid permuting them (this was the behavior that we already had). A different situation is e.g. in
I just meant we could have a single loop over columns with the contents of the two loops you have now, and in case an error happens we would roll back any already applied changes. But forget that, it's probably not worth it.
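The "check first, then permute" approach discussed above can be sketched in plain Julia. This is only an illustration, not the PR's code: `permute_columns!` is a hypothetical helper, the columns are passed as a plain vector of vectors, and `Base.mightalias` is an unexported Base function used here as a stand-in for whatever aliasing test the package actually applies.

```julia
# Hypothetical helper (not from the PR): validate all columns before mutating
# anything, so that no rollback is ever needed.
function permute_columns!(cols::AbstractVector, p::AbstractVector{Int})
    # Pass 1: pairwise alias detection. This is the part that is quadratic
    # in the number of columns.
    for i in eachindex(cols), j in (i + 1):lastindex(cols)
        a, b = cols[i], cols[j]
        # Identical columns (a === b) are fine; partially overlapping ones are not.
        if a !== b && Base.mightalias(a, b)
            throw(ArgumentError("columns $i and $j alias but are not identical"))
        end
    end
    # Pass 2: permute every distinct column exactly once, skipping exact aliases.
    seen = IdDict{Any,Nothing}()
    for col in cols
        haskey(seen, col) && continue
        seen[col] = nothing
        permute!(col, p)
    end
    return cols
end

x = [3, 1, 2]
y = [30, 10, 20]
permute_columns!(AbstractVector[x, y, x], [2, 3, 1])  # x is stored twice, permuted once
```

Merging the two passes into one loop, as suggested, would require restoring the already permuted columns when a later pair turns out to alias, which is the rollback cost both reviewers agreed is not worth it.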
I ended up having to standardize everything. Now all kwargs, following the rules we set in the 1.0 release, have to be either scalars or vectors (tuples are not allowed, as we announced they would not be). I have also improved docstrings, test coverage, and error checking.
src/abstractdataframe/sort.jl
Outdated
@@ -14,6 +14,24 @@
# which allows a user to specify column specific orderings
# with "order(column, rev=true, ...)"

function _check_sort_args(lt, by, rev, order)
Why not put this in function signatures instead? People should be used to the kind of MethodError that is printed. If we start checking the type of all arguments manually the codebase is going to get quite large. :-)
Totally agreed. For some reason I started copying the old design. Now all kwargs have proper type restrictions.
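As a rough illustration of restricting kwargs in the signature rather than in a manual check, here is a Base-only sketch. The function `sortperm_sketch` and its argument layout are hypothetical and not the package's API; only the "scalar or vector, no tuples" rule is taken from the discussion.

```julia
# Hypothetical function: `rev` is restricted in the signature to a scalar Bool
# or a vector of Bool, so a tuple is rejected by Julia itself.
function sortperm_sketch(cols::AbstractVector{<:AbstractVector};
                         rev::Union{Bool, AbstractVector{Bool}}=false)
    # Expand a scalar to one flag per column.
    revs = rev isa Bool ? fill(rev, length(cols)) : rev
    length(revs) == length(cols) ||
        throw(ArgumentError("rev must have one entry per column"))
    p = collect(eachindex(first(cols)))
    # Applying stable sorts from the last key to the first yields a
    # lexicographic ordering over all columns.
    for k in reverse(eachindex(cols))
        p = p[sortperm(cols[k][p]; rev=revs[k], alg=MergeSort)]
    end
    return p
end

cols = [[2, 1, 1], [1, 2, 3]]
p_asc = sortperm_sketch(cols)                       # both columns ascending
p_mixed = sortperm_sketch(cols; rev=[false, true])  # second column descending
```

Calling `sortperm_sketch(cols; rev=(false, true))` fails before the function body runs, with no hand-written type check involved.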
src/abstractdataframe/sort.jl
Outdated
`cols` selects no columns, check whether `df` is sorted on all columns (this
behaviour is deprecated and will change in future versions).
This text (that we just added) needs to be adapted a bit depending on the function.
right - fixed
@nalimilan - let me know when you think it is OK and I will do a final check and merge (there is so much copy-paste in this PR that I want to double check everything before merging).
Thank you!
@nalimilan - do you see any risks in making `sort!` more flexible and allowing `SubDataFrame`? (I will add tests if we agree on the design.)
I have also proposed to be more careful and error if columns alias but are not identical, but we might decide not to add this extra check.
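For intuition about what in-place sorting of a view implies, here is an analogy using Base arrays only. It is not DataFrames code, and the assumption that `sort!` on a `SubDataFrame` would behave the same way (reordering only the selected rows of the parent) is mine.

```julia
# Base-only analogy: sorting a view in place reorders just the viewed
# elements of the parent array and leaves the rest untouched.
parent_vec = [5, 4, 3, 2, 1]
window = view(parent_vec, 2:4)  # elements 4, 3, 2
sort!(window)
```

After this call `parent_vec` is `[5, 2, 3, 4, 1]`: positions 1 and 5 are unchanged. The aliasing concern from the discussion also applies here, since a view shares memory with its parent by construction.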