-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: isin incorrectly casting ints to datetimes #37528
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this going to need perf tests as this is very sensitive
also i am surprised this is needed at all where you did it
# Dispatch to DatetimeLikeIndexMixin.isin | ||
from pandas import Index | ||
|
||
return Index(comps).isin(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you sure?
ensure_data and below does this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the way we call ensure_data ends up casting dt64 to int64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't you fix it in _ensure_data then, I am not averse to dispatching to the EA itself, which is what we are already doing for Categorical, or is that the plan. If that is the plan, then I would think you could actually remove all off this and simply dispatch (e.g. most of this logic then would actually be handled by PandasArray or the EA as appropriate).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a bunch of GH issues about isin (added a new label yesterday), many of which are about unwanted casting. I haven't looked at them all closely tet.
why can't you fix it in _ensure_data then
Even if we fix that (which yes, i'll take a look, but probably in a separate PR), we still need to dispatch to the (Period|Timedelta|Datetime)Index
to get correct casting of the other
(assuming we want algos.isin(dtlike, other)
to match dtlike.isin(other)
, which I for one do)
am not averse to dispatching to the EA itself, which is what we are already doing for Categorical, or is that the plan
ATM Categorical has isin
but thats it. #20617 suggests adding isin to EA, but thats definitely outside the scope of this PR.
will run the appropriate asvs |
actually, i think this can be simplified to just use _from_sequence following #37179 |
# Dispatch to DatetimeLikeIndexMixin.isin | ||
from pandas import Index | ||
|
||
return Index(comps).isin(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't you fix it in _ensure_data then, I am not averse to dispatching to the EA itself, which is what we are already doing for Categorical, or is that the plan. If that is the plan, then I would think you could actually remove all off this and simply dispatch (e.g. most of this logic then would actually be handled by PandasArray or the EA as appropriate).
gentle ping. id like to see if we can share some code between this and |
Found we don't have any asvs that hit this, so just added a few. No change for cases with matching dtype, nice pickup for others:
|
I guess as a short term measure this is ok as you fixed some bugs. the problem here is that you are bouncing between algos and the indexes, where everything should be in algos. this feels like a step backwards, if its tactical, then ok. |
its not unusual for core.algorithms functions to call EA methods. There's no way the casting logic belongs in core.algorithms. |
that's not what's happening here, these are calling Index methods. This seems quite unusual and a very odd depenency. |
Good point, my mistake. At some point this should be consolidated in the EA. IIRC theres an issue about adding
because the relevant logic is casting logic that belongs in the DTA. My current thought is that once we get the _from_sequence-strictness PR in we'll be able to de-duplicate some logic between isin+equals using that. |
The new hashtables have me looking at algorithms._ensure_data, and I'm pretty confident that this will allow us to simplify it quite a bit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok fair
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
De-duplication in follow-up.