-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of dropmissing #3256
Conversation
Please also make sure to review tests so that they adequately cover any new code paths that might be needed. |
Yes, of course. I'll review/add tests as necessary. I have quickly reviewed how it works now and I'm surprised you expect benefits from fusing the copy rows into Re.
Thanks for the tip! Let me know if you have any more tips or ideas. I'll get into it in a few days when I'm done travelling. |
Your comments are very good. I just threw some ideas that I had. Unfortunately, as you say, it is often the case that one needs to run benchmarks before one can find the best approach as things sometimes are surprising. But for sure there is room for improvement, as DuckDB is noticeably faster. Two additional notes:
|
Quick update from my side
Single-pass Implementation
Benchmarking results
Benchmarking setup
EDIT: The timings of |
In general 10k rows is small. Of course it would be good to optimize also for such case, but I would say priority is to optimize against 10^6+ rows. Maybe a good idea would be to start with profilling |
Just as a curiosity - I was curious how it would look if Threading was always on (
|
We wanted to be on a safe side (i.e. do not use threads unless there is a clear benefit). In particular - threading benefit might vary across machines on which code is run. Still - the optimal threshold might be different for different operations so it does not have to be the same everywhere.
We need to keep Julia 1.6 compatibility, where |
Hi @bkamins, I've tried to integrate the new implementation of dropmissing into the DF codebase. My thinking:
Sharp edges:
I've tested correctness against the default function, but once we agree on the design, I'll add some individual tests. Next steps: Side observations:
Performance results:
|
Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>
Uses Tables.allocatecolumn constructor
@svilupp - note that I have pushed some changes to the PR. |
@svilupp - can you please, when you have time, check the changes in the code I made (I have made them in my head, but they pass tests, so hopefully they are OK 😄). Thank you! |
All looks good! I really like the change to |
I made some more small tweaks (in particular to use Performance improvement on a real case:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Thank you! I hope you enjoyed the process. You are welcome to open other PRs. |
Fixes #3254
TODO:
[x] change dropmissing[x] change dropmissing! (no opportunities found)[x] add/augment testing suite[x] check docs and examples for consistency