-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add return_inverse to cython-unique; unify unique/factorize-code #23400
Conversation
Hello @h-vetinari! Thanks for submitting the PR.
|
Codecov Report
@@ Coverage Diff @@
## master #23400 +/- ##
=======================================
Coverage 92.31% 92.31%
=======================================
Files 161 161
Lines 51483 51483
=======================================
Hits 47525 47525
Misses 3958 3958
Continue to review full report at Codecov.
|
I'll take a look at this. @h-vetinari a couple of preliminary, totally uninformed questions. Does it make sense to implement asvs for this? Can any of the tempita-using code be implemented with fused types (which I for one find more readable)? |
Finally got an ASV to run through, and not be too noisy (compare #23412 about an earlier attempt).
What remains is noise IMO, with the likely exception of The reason that I'm quite confident that it's noise is that I've had a couple different runs (and specifically checking the index part again as well), where the above time increases did not show up, e.g.:
|
Thanks! :)
Absolutely, haven't done so yet.
I totally agree, but I gave up after some preliminary investigation (and may not be fully aware of all the capabilities of fused types). There's a bunch of classes
Not sure how flexible the fused type machinery is (i.e. having to use tuples or structs to catch all the required fields to template over?), but it seems to me that in any case, one would probably need to first fuse-type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be more amenable to this if it were templated on the type. you are duplicating lots of code for the object / non-object cases. Further you haven't proven your performance claims here. You are adding lots of paths to reduce signature and such, but it is just making it more complicated. We want less complicated code, sure performance matters, and some hotspot optimizations are worth it, but unless they can be fully templatized / use fused types they just add complexity.
Once code is added to pandas and it is complex, it is rarely maintained. This is why we bend over backwards to have code simplicity.
@jreback
This is already templated on the dtype. The code for strings/object is already different (and uses different constructions, i.e. no
How can I prove it other than by ASVs and a reproducible(!) timing code just for the most sensitive
The separate functions for
Having
Fair enough - I'd love for this to work without templating (not templating on In any case, I think one function (albeit templated) is easier to maintain than having slightly different versions for each of |
I agree. I am willing to have 10% penalty for this, though I think this is basically what we have now? |
No, not yet. Adding a |
@jreback |
Here's the ASV output of the lastest run:
|
Doesn't look too bad, IMO (failures in ASV are pretty arbitrary when they happen, the test suite ran fine). Another look, @jreback @jbrockmendel, please? :) |
This reverts commit 30de418.
Reverting 30de418 due to the performance degradation. Was just for demo purposes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok i am onboard with the implementation. it is straightforward. change the order of the returns in unique to match factorize. you may not like it but that's how it is. having different return orderings is not going to work.
@jreback if you are asking that only for the internal hashtable method, I don't bother that much (@h-vetinari so you can do it here as Jeff asks, for the internal code it doesn't matter that much). |
That's my point exactly, and even more so: adding the inverse as the first component of the tuple based on a kwarg is bound to lead to lots and lots of confusion. The only sane option IMO is the way I think we have to live with this discrepancy, and then, it just becomes a question where the signature break is in the code. I'd argue that it should be at the lowest level possible (i.e. |
Care to take another look? :) |
|
||
Returns | ||
------- | ||
labels : ndarray[int64] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok i agree now, flip the order here and just change it where it is called (e.g. in the python code)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback
Just so I understand, you'd like to have the signature break between Hashtable.factorize
(returning uniques, labels
) and pandas.factorize
(returning labels, uniques
)? Then there'd still be a couple of places in the pandas code where this would have to be explained.
If the break were only between Hashtable._unique
and Hashtable.factorize
, then the break would be contained to this module, and would not be visible to anyone who's not working on the cython code. I think that would be the cleanest solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then there'd still be a couple of places in the pandas code where this would have to be explained.
I wouldn't explain it, rather I would simply change it. The idea would be to change it everywhere internally, the only place this would not be the case would be the actual pd.factorize()
where we can never change this, but that is user facing and not a big deal.
|
||
Returns | ||
------- | ||
labels : ndarray[int64] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
uniques, labels = self._unique(values, uniques_vector, | ||
na_sentinel=na_sentinel, | ||
na_value=na_value, ignore_na=True, | ||
return_inverse=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above, just the cython signature, we are not going to (nor need to) change the user facing signature
return np.asarray(labels) | ||
if return_inverse: | ||
return uniques.to_array(), np.asarray(labels) | ||
return uniques.to_array() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the non-constant signature here (and above) avoidable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, now that I read the conversation I see this has been addressed several times. The only change in the python code is in the tests; what part of the non-cython code is affected by this decision?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel
The non-constant signature follows np.unique
and will eventually extend to pd.unique
, Series.unique
etc. This PR prepares the cython backend to have the necessary capability. This will i.a. also allow to solve #21720, and work towards #21357 / #22824
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback
Thanks for the review. Please have a look at my argument to contain the signature change to this module. If you disagree, I'll change as requested.
@jbrockmendel
Thanks for taking a look, I responded in the comment
|
||
Returns | ||
------- | ||
labels : ndarray[int64] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback
Just so I understand, you'd like to have the signature break between Hashtable.factorize
(returning uniques, labels
) and pandas.factorize
(returning labels, uniques
)? Then there'd still be a couple of places in the pandas code where this would have to be explained.
If the break were only between Hashtable._unique
and Hashtable.factorize
, then the break would be contained to this module, and would not be visible to anyone who's not working on the cython code. I think that would be the cleanest solution.
return np.asarray(labels) | ||
if return_inverse: | ||
return uniques.to_array(), np.asarray(labels) | ||
return uniques.to_array() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel
The non-constant signature follows np.unique
and will eventually extend to pd.unique
, Series.unique
etc. This PR prepares the cython backend to have the necessary capability. This will i.a. also allow to solve #21720, and work towards #21357 / #22824
I remembered this question, and checked the ASV benchmarks for existing calls to unique. There's nothing in
Seems one should add some ASVs to |
The failure is a flaky hypothesis test. |
|
||
Returns | ||
------- | ||
labels : ndarray[int64] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then there'd still be a couple of places in the pandas code where this would have to be explained.
I wouldn't explain it, rather I would simply change it. The idea would be to change it everywhere internally, the only place this would not be the case would be the actual pd.factorize()
where we can never change this, but that is user facing and not a big deal.
@jreback |
@jreback All green. |
@jreback |
thanks @h-vetinari |
Cool, thanks. |
This is a continuation/split-off from #22986, where I tried to deduplicate the cython code for
unique
/factorize
, and add areturn_inverse
kwarg tounique
at the same time. This didn't fully work because there was a performance impact that was deemed unacceptable. I've opened cython/cython#2660 to figure out why (resp. have a directive to allow force-compilation of different parameter values, which would solve this more elegantly), and @robertwb told me that it's likely the use of kwargs, but also suggested the possibility to use explicit templating onreturn_inverse
.First I tried unification without any kwargs in 858f54e - this still had the same perf impact as in #22986.
Then I added the explict templating in 4ed354a, which successfully avoids the perf impact, but is a bit uglier in the
pxi.in
. Finally, I've readded the kwargs in 906cd50 to keep the diff here smaller, and this doesn't have an impact.I've had many unsuccessful tries to run the ASVs for all those combinations over the weekend, so I'm going back to explicitly testing the unique-code here (code at the end). The overview is as follows (for completeness I've added the #22986 and the commit immediately prior on master; all results in milliseconds):
or, relatively to the commit on master I was basing myself on:
So long story short, this PR prepares the hashtable-backend to support
return_inverse=True
, which plays into #21357 #21645 #22824, and will also allow to easily solve #21720.Code for the above timings: