REF: Back IntervalArray by array instead of Index #36310
pandas/core/arrays/interval.py
Outdated
right = self.right.fillna(value=value.right)
from pandas import Index

left = Index(self.left).fillna(value=value.left)
umm why do you need to coerce to .fillna?
ATM self.left is an ndarray, which doesn't have fillna
ahh
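For readers following the thread above, a minimal sketch of why the `Index` wrapper was needed: a plain NumPy array has no `fillna`, while `pandas.Index` does. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative endpoint data with one missing value (not from the PR).
left = np.array([0.0, np.nan, 2.0])

# A plain ndarray does not provide a fillna method.
has_fillna = hasattr(left, "fillna")

# Wrapping the array in an Index gives access to Index.fillna.
filled = pd.Index(left).fillna(1.0)
```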
AFAICT remaining test failures are from arrow
cc @jorisvandenbossche are the pyarrow failures here something we can address on our end?
These are our own tests (testing the conversion that is implemented in pandas itself), so that's something that needs to be fixed in this PR. Looking at the failure, it seems that setting a missing value in an IntervalArray does not/no longer set a missing value in both left and right arrays. But looking at the changes in
For backwards compatibility, we could keep .left and .right returning an Index? (since the arrays are actually stored as ._left and ._right)
Nice idea, +1 to changing the storage. We shouldn't change
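The backwards-compatibility idea in this thread can be sketched roughly as follows. `IntervalStore` is a hypothetical stand-in class, not pandas' `IntervalArray`; it only illustrates storing raw arrays privately while the public accessors keep returning an `Index`.

```python
import numpy as np
import pandas as pd


class IntervalStore:
    """Hypothetical container: arrays stored privately, Index exposed publicly."""

    def __init__(self, left, right):
        # Private storage is plain ndarrays.
        self._left = np.asarray(left)
        self._right = np.asarray(right)

    @property
    def left(self) -> pd.Index:
        # Public accessor keeps returning an Index for existing callers.
        return pd.Index(self._left, copy=False)

    @property
    def right(self) -> pd.Index:
        return pd.Index(self._right, copy=False)


store = IntervalStore([0, 1, 2], [1, 2, 3])
```

Callers that relied on `Index` methods such as `store.left.fillna(...)` would keep working, while internal code can use `store._left` directly.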
pandas/_testing.py
Outdated
assert_index_equal(left.right, right.right, exact=exact, obj=f"{obj}.left")
if left._left.dtype.kind in ["m", "M"]:
    # We have a DatetimeArray or Timed
    # TODO: `exact` keyword?
maybe better to
kwargs = {}
if left._left.dtype.kind in ["m", "M"]:
    kwargs["check_freq"] = False
....
pandas/core/arrays/interval.py
Outdated
new_right = self.right.astype(dtype.subtype)
# We need to use Index rules for astype to prevent casting
# np.nan entries to int subtypes
new_left = Index(self._left).astype(dtype.subtype)
copy=False ?
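To illustrate the code comment in this hunk about "Index rules for astype", here is a small sketch with made-up values. It assumes a pandas version in which `Index.astype` rejects NaN when casting to an integer dtype by raising a `ValueError` (or a subclass of it); the raw NumPy cast does not raise and yields an unspecified integer in the NaN slot.

```python
import numpy as np
import pandas as pd

values = np.array([0.0, np.nan, 2.0])

# Index.astype refuses to cast NaN entries to an integer dtype.
try:
    pd.Index(values).astype("int64")
    index_raised = False
except ValueError:
    index_raised = True

# The plain ndarray cast goes through; the NaN slot ends up holding
# an unspecified integer, so no particular value is asserted here.
with np.errstate(invalid="ignore"):
    raw = values.astype("int64")
```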
pandas/core/arrays/interval.py
Outdated
fill_value = self.left._na_value
from pandas import Index

fill_value = Index(self._left)._na_value
why can't we get the fill value rather than doing this? if we need to do this add copy=False
pandas/core/indexes/interval.py
Outdated
@@ -865,6 +863,22 @@ def _convert_list_indexer(self, keyarr):

    # --------------------------------------------------------------------

    @cache_readonly
    def left(self) -> Index:
        return Index(self._values.left)
copy=False on these?
can you add this?
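As a rough illustration of what the `copy=False` request is about: with an explicit `copy=False`, an `Index` built from an int64 ndarray is expected to reuse the array's memory, whereas `copy=True` allocates a new buffer. The array below is illustrative only.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3], dtype="int64")

# copy=False asks the constructor not to duplicate the data.
shared = pd.Index(arr, copy=False)
# copy=True forces a separate buffer.
copied = pd.Index(arr, copy=True)

shares_with_shared = np.shares_memory(arr, shared.values)
shares_with_copied = np.shares_memory(arr, copied.values)
```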
i think comments have been addressed, LMK if i missed anything
@@ -201,6 +197,8 @@ class IntervalIndex(IntervalMixin, ExtensionIndex):
    _mask = None

    _data: IntervalArray
    _values: IntervalArray
umm, now i am confused, what is different about these?
without this, mypy thinks _values is ExtensionArray and has a bunch of new complaints since we access self._values.left below
kk, should try to remove this at some point
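The annotation pattern discussed in this thread can be sketched with stand-in classes. All class names below are hypothetical and only mirror the shape of the pandas hierarchy: the subclass re-declares `_values` with a narrower type so that a type checker accepts attribute access such as `self._values.left`.

```python
class ExtensionArrayLike:
    """Hypothetical generic base array type."""


class IntervalArrayLike(ExtensionArrayLike):
    """Hypothetical array type that exposes a `left` attribute."""

    def __init__(self, left, right):
        self.left = left
        self.right = right


class BaseIndexLike:
    # The base class only promises the generic array type.
    _values: ExtensionArrayLike

    def __init__(self, values):
        self._values = values


class IntervalIndexLike(BaseIndexLike):
    # Re-declaring the attribute narrows its type for the type checker;
    # it has no effect on runtime behaviour.
    _values: IntervalArrayLike

    @property
    def left(self):
        return self._values.left


idx = IntervalIndexLike(IntervalArrayLike([0, 1], [1, 2]))
```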
this is marked POC, but i assume want to merge it?
Yes, updated the title.
any more comments here?
Looks good generally. Just one change needed in the whatsnew.
This should improve the performance of IntervalArray.__setitem__, right? That could be added in the release notes.
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -120,6 +120,7 @@ Other enhancements
- `Styler` now allows direct CSS class name addition to individual data cells (:issue:`36159`)
- :meth:`Rolling.mean()` and :meth:`Rolling.sum()` use Kahan summation to calculate the mean to avoid numerical problems (:issue:`10319`, :issue:`11645`, :issue:`13254`, :issue:`32761`, :issue:`36031`)
- :meth:`DatetimeIndex.searchsorted`, :meth:`TimedeltaIndex.searchsorted`, :meth:`PeriodIndex.searchsorted`, and :meth:`Series.searchsorted` with datetimelike dtypes will now try to cast string arguments (listlike and scalar) to the matching datetimelike type (:issue:`36346`)
- :func:`pandas._testing.assert_datetime_array_equal` and :func:`pandas._testing.assert_timedelta_array_equal` now have a ``check_freq=True`` keyword that allows disabling the check for matching ``freq`` attribute (:issue:`36310`)
The docs should only reference public functions. How would users actually pass this through a public API? I suspect it's impossible, since check_freq in assert_frame_equal applies to the index rather than values of an array.
So assert_extension_array_equal could perhaps take kwargs and pass it through. But that's maybe not worth the effort.
so remove this note?
yep, unless this is user facing in some way (e.g. is assert_index_equal changed)?
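For context on the public side of `check_freq`, a small sketch using `pandas.testing.assert_series_equal`, which accepts the same keyword and applies it to the index. The data is illustrative, and the sketch assumes a pandas version that provides `check_freq` (1.1 or later).

```python
import pandas as pd

# Same timestamps, but only the first index carries a freq.
with_freq = pd.date_range("2020-01-01", periods=3, freq="D")
without_freq = pd.DatetimeIndex(with_freq.to_numpy())  # freq is not inferred

left = pd.Series([1, 2, 3], index=with_freq)
right = pd.Series([1, 2, 3], index=without_freq)

# With the default check_freq=True the freq mismatch is reported.
try:
    pd.testing.assert_series_equal(left, right)
    default_passes = True
except AssertionError:
    default_passes = False

# Disabling the check compares only the values and index labels.
pd.testing.assert_series_equal(left, right, check_freq=False)
```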
@@ -588,7 +595,7 @@ def __eq__(self, other):
         if is_interval_dtype(other_dtype):
             if self.closed != other.closed:
                 return np.zeros(len(self), dtype=bool)
-            return (self.left == other.left) & (self.right == other.right)
+            return (self._left == other.left) & (self._right == other.right)
Is other here known to be a particular type (like IntervalArray), or is it something like Union[IntervalArray, Series, Index, Interval]? If it's just IntervalArray it'd be a bit faster to compare with other._left and other._right.
at this point other can be an Interval or IntervalArray
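The comparison logic in this hunk can be sketched on plain NumPy arrays. `interval_eq` is a hypothetical helper, not a pandas function; it mirrors the shape of the diff: mismatched `closed` values compare all-False, otherwise both endpoints must match elementwise.

```python
import numpy as np


def interval_eq(left_a, right_a, closed_a, left_b, right_b, closed_b):
    """Hypothetical elementwise equality of two interval collections."""
    if closed_a != closed_b:
        # Intervals closed on different sides are never equal.
        return np.zeros(len(left_a), dtype=bool)
    # Otherwise an interval matches only if both endpoints match.
    return (left_a == left_b) & (right_a == right_b)


left_a, right_a = np.array([0, 1, 2]), np.array([1, 2, 3])
left_b, right_b = np.array([0, 1, 5]), np.array([1, 9, 3])

same_closed = interval_eq(left_a, right_a, "right", left_b, right_b, "right")
diff_closed = interval_eq(left_a, right_a, "right", left_a, right_a, "left")
```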
note added to whatsnew.
pandas/_testing.py
Outdated
# We have a DatetimeArray or TimedeltaArray
kwargs["check_freq"] = False

# TODO: `exact` keyword?
Does this TODO need to be solved first? It was there before, but you now removed it? (so it is not being ignored?)
i think we're ok without the keyword, will remove the comment
from pandas.core.ops.array_ops import maybe_upcast_datetimelike_array

left = maybe_upcast_datetimelike_array(left)
left = extract_array(left, extract_numpy=True)
Above we are first ensuring that the arrays passed to _simple_new are an index, and then we extract the array again. Is this roundtrip to index and back needed?
I think we can avoid the roundtrip eventually, will be best accomplished by being stricter in what we pass to _simple_new
Sounds good, would be a nice follow-up
pandas/core/arrays/interval.py
Outdated
right._values[key] = value_right
self._right = right
self._left[key] = value_left
self._right[key] = value_right  # TODO: needs tests for not breaking views
Isn't the un-xfail-ed test doing that?
(or if not, can you add a test now?)
good catch, will remove comment
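A short usage sketch of the `__setitem__` behaviour under discussion, using the public `pandas.arrays.IntervalArray` API with made-up values: assigning an `Interval` updates both endpoints, and assigning a missing value should mark both the left and right sides as missing.

```python
import numpy as np
import pandas as pd

# Float breaks so that missing values can be stored.
arr = pd.arrays.IntervalArray.from_breaks([0.0, 1.0, 2.0, 3.0])

# Assigning an Interval updates left and right together.
arr[0] = pd.Interval(10.0, 11.0)

# Assigning NaN should make both endpoints missing.
arr[1] = np.nan

left_missing = bool(np.isnan(arr.left[1]))
right_missing = bool(np.isnan(arr.right[1]))
```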
updated per comments
minor comment
new_right = self.right.astype(dtype.subtype)
# We need to use Index rules for astype to prevent casting
# np.nan entries to int subtypes
new_left = Index(self._left, copy=False).astype(dtype.subtype)
could add copy=False to .astype (not sure how much any of this matters though)
merge is ok now and can see if this matters on followup
The benefit I have in mind here is that we could back it by a single 2xN array and a) avoid the kludge needed to make __setitem__ atomic, b) do a view to get native types for e.g. uniqueness checks, c) possibly share some methods with NDArrayBackedExtensionArray. Also just in principle having EAs not depend on Index is preferable dependency-structure-wise.
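A rough NumPy sketch of points a) and b), under stated assumptions: the layout here is N x 2 (one row per interval) rather than 2 x N, the endpoints are float64, and the complex128 view is just one possible "native type" trick. None of this is taken from the PR itself.

```python
import numpy as np

# Row i holds (left_i, right_i); C-contiguous float64 storage.
bounds = np.array([[0.0, 1.0], [1.0, 2.0], [0.0, 1.0]])

# a) A single assignment writes both endpoints of one interval,
#    so left and right cannot get out of sync between two statements.
bounds[1] = (5.0, 6.0)

# b) Reinterpret each (left, right) pair as one complex128 value,
#    which lets ordinary 1-D routines such as np.unique operate on pairs.
as_complex = bounds.view(np.complex128).ravel()
n_unique = len(np.unique(as_complex))
is_unique = n_unique == len(bounds)
```

Note that a view like this compares values numerically, so NaN endpoints and signed zeros would need separate handling in a real implementation.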
cc @jschendel