Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: Back IntervalArray by array instead of Index #36310

Merged
merged 23 commits into from
Oct 2, 2020

Conversation

jbrockmendel
Copy link
Member

The benefit I have in mind here is that we could back it by a single 2xN array and a) avoid the kludge needed to make __setitem__ atomic, b) do a view to get native types for e.g uniqueness checks, c) possibly share some methods with NDarrayBackedExtensionArray.

Also just in principle having EAs not depend on Index is preferable dependency-structure-wise.

cc @jschendel

@jreback jreback added the Interval Interval data type label Sep 12, 2020
right = self.right.fillna(value=value.right)
from pandas import Index

left = Index(self.left).fillna(value=value.left)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm why do you need to coerce to .fillna?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ATM self.left is an ndarray which doesnt have fillna

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh

pandas/core/arrays/interval.py Outdated Show resolved Hide resolved
@jbrockmendel
Copy link
Member Author

AFAICT remaining test failures are on from arrow

@jbrockmendel
Copy link
Member Author

cc @jorisvandenbossche are the pyarrow failures here something we can address on our end?

@jorisvandenbossche
Copy link
Member

It are our own tests (testing the conversion that is implemented in pandas itself), so that's something that need to be fixed in this PR.

Looking at the failure, it seems that setting a missing value in an IntervalArray does not/no longer set a missing value in both left and right arrays. But looking at the changes in __setitem__ in the diff, I don't directly see why that would be the case.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For backwards compatibility, we could keep .left and .right returning an Index? (since the arrays are actually stored as ._left and ._right)

pandas/core/arrays/interval.py Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor

Nice idea, +1 to changing the storage.

We shouldn't change IntervalArray.left from an index to ndarray without warning. I'm fine with just continuing to return an Index. Or if we really want to change it we can deprecate IntervalArray.left in favor of IntervalArray.left_array.

assert_index_equal(left.right, right.right, exact=exact, obj=f"{obj}.left")
if left._left.dtype.kind in ["m", "M"]:
# We have a DatetimeArray or Timed
# TODO: `exact` keyword?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe better to

kwargs = {}
if left._left.dtype.kind in ["m", "M"]:
    kwargs['check_freq'] = False
....

pandas/_testing.py Show resolved Hide resolved
pandas/core/arrays/interval.py Show resolved Hide resolved
pandas/core/arrays/interval.py Show resolved Hide resolved
new_right = self.right.astype(dtype.subtype)
# We need to use Index rules for astype to prevent casting
# np.nan entries to int subtypes
new_left = Index(self._left).astype(dtype.subtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy=False ?

fill_value = self.left._na_value
from pandas import Index

fill_value = Index(self._left)._na_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we get the fill value rather than doing this? if we need to do this add copy=False

@@ -865,6 +863,22 @@ def _convert_list_indexer(self, keyarr):

# --------------------------------------------------------------------

@cache_readonly
def left(self) -> Index:
return Index(self._values.left)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy=False on these?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this?

@jbrockmendel
Copy link
Member Author

i think comments have been addressed, LMK if i missed anything

@jreback jreback added this to the 1.2 milestone Sep 22, 2020
@@ -201,6 +197,8 @@ class IntervalIndex(IntervalMixin, ExtensionIndex):
_mask = None

_data: IntervalArray
_values: IntervalArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm, now i am confused, what is different about these?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without this, mypy thinks _values is ExtensionArray and has a bunch of new complaints since we access self._values.left below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk, should try to remove this at some point

@@ -865,6 +863,22 @@ def _convert_list_indexer(self, keyarr):

# --------------------------------------------------------------------

@cache_readonly
def left(self) -> Index:
return Index(self._values.left)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this?

@jreback
Copy link
Contributor

jreback commented Sep 24, 2020

this is marked POC, but i assume want to merge it?

@jbrockmendel jbrockmendel changed the title POC: back IntervalArray by array instead of Index Back IntervalArray by array instead of Index Sep 24, 2020
@jbrockmendel
Copy link
Member Author

this is marked POC, but i assume want to merge it?

Yes, updated the title.

@jbrockmendel
Copy link
Member Author

any more comments here?

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good generally. Just one change needed in the whatsnew.

This should improve the performance of IntervalArray.__setitem__, right? That could be added in the release notes.

@@ -120,6 +120,7 @@ Other enhancements
- `Styler` now allows direct CSS class name addition to individual data cells (:issue:`36159`)
- :meth:`Rolling.mean()` and :meth:`Rolling.sum()` use Kahan summation to calculate the mean to avoid numerical problems (:issue:`10319`, :issue:`11645`, :issue:`13254`, :issue:`32761`, :issue:`36031`)
- :meth:`DatetimeIndex.searchsorted`, :meth:`TimedeltaIndex.searchsorted`, :meth:`PeriodIndex.searchsorted`, and :meth:`Series.searchsorted` with datetimelike dtypes will now try to cast string arguments (listlike and scalar) to the matching datetimelike type (:issue:`36346`)
- :func:`pandas._testing.assert_datetime_array_equal` and :func:`pandas._testing.assert_timedelta_array_equal` now have a ``check_freq=True`` keyword that allows disabling the check for matching ``freq`` attribute (:issue:`36310`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs should only reference public functions. How would users actually pass this through a public API? I suspect it's impossible, since check_freq in assert_frame_equal applies to the index rather than values of an array.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So assert_extension_array_equal could perhaps take kwargs and pass it through. But that's maybe not worth the effort.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so remove this note?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, unless this is user facing in some way (e.g. is assert_index_equal changed)?

@@ -588,7 +595,7 @@ def __eq__(self, other):
if is_interval_dtype(other_dtype):
if self.closed != other.closed:
return np.zeros(len(self), dtype=bool)
return (self.left == other.left) & (self.right == other.right)
return (self._left == other.left) & (self._right == other.right)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is other here known to be a particular type (like IntervalArray), or is it something like Union[IntervalArray, Series, Index,Interval]. If it's just IntervalArray it'd be a bit faster to compare with other._left and right._left.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at this point other can be an Interval or IntervalArray

@jbrockmendel
Copy link
Member Author

This should improve the performance of IntervalArray.setitem, right? That could be added in the release notes.

In [2]: ii = pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]) 
In [3]: ia = ii._data                                                           
In [4]: val = ia[0]                                                             

In [5]: %timeit ia[-1] = val                                                    
38.4 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- master
23.9 µs ± 335 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- PR

note added to whatsnew.

@@ -120,6 +120,7 @@ Other enhancements
- `Styler` now allows direct CSS class name addition to individual data cells (:issue:`36159`)
- :meth:`Rolling.mean()` and :meth:`Rolling.sum()` use Kahan summation to calculate the mean to avoid numerical problems (:issue:`10319`, :issue:`11645`, :issue:`13254`, :issue:`32761`, :issue:`36031`)
- :meth:`DatetimeIndex.searchsorted`, :meth:`TimedeltaIndex.searchsorted`, :meth:`PeriodIndex.searchsorted`, and :meth:`Series.searchsorted` with datetimelike dtypes will now try to cast string arguments (listlike and scalar) to the matching datetimelike type (:issue:`36346`)
- :func:`pandas._testing.assert_datetime_array_equal` and :func:`pandas._testing.assert_timedelta_array_equal` now have a ``check_freq=True`` keyword that allows disabling the check for matching ``freq`` attribute (:issue:`36310`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, unless this is user facing in some way (e.g. is assert_index_equal changed)?

@jorisvandenbossche jorisvandenbossche changed the title Back IntervalArray by array instead of Index REF: Back IntervalArray by array instead of Index Oct 1, 2020
# We have a DatetimeArray or TimedeltaArray
kwargs["check_freq"] = False

# TODO: `exact` keyword?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this TODO needs to be solved first? It was there before, but you now removed it? (so is not being ignored?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we're ok without the keyword, will remove the comment

from pandas.core.ops.array_ops import maybe_upcast_datetimelike_array

left = maybe_upcast_datetimelike_array(left)
left = extract_array(left, extract_numpy=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above we are first ensuring that the arrays passed to _simple_new are an index, and then we extract the array again. Is this roundtrip to index and back needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can avoid the roundtrip eventually, will be best accomplished by being stricter in what we pass to _simple_new

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, would be a nice follow-up

pandas/core/indexes/interval.py Show resolved Hide resolved
right._values[key] = value_right
self._right = right
self._left[key] = value_left
self._right[key] = value_right # TODO: needs tests for not breaking views
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the un-xfail-ed test doing that?
(or if not, can you add a test now?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, will remove comment

@jbrockmendel
Copy link
Member Author

updated per comments

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment

new_right = self.right.astype(dtype.subtype)
# We need to use Index rules for astype to prevent casting
# np.nan entries to int subtypes
new_left = Index(self._left, copy=False).astype(dtype.subtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could add copy=False to .astype (not sure how much any of this matters though)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge is ok now and can see if this matters on followup

pandas/core/arrays/interval.py Show resolved Hide resolved
@@ -201,6 +197,8 @@ class IntervalIndex(IntervalMixin, ExtensionIndex):
_mask = None

_data: IntervalArray
_values: IntervalArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk, should try to remove this at some point

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Interval Interval data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants