REF: Internal / External values #19558
Changes from 30 commits
@@ -89,6 +89,25 @@ not check (or care) whether the levels themselves are sorted. Fortunately, the
 constructors ``from_tuples`` and ``from_arrays`` ensure that this is true, but
 if you compute the levels and labels yourself, please be careful.

+Values
+~~~~~~
+
+Pandas extends NumPy's type system with custom types, like ``Categorical`` or
+datetimes with a timezone, so we have multiple notions of "values". For 1-D
+containers (``Index`` classes and ``Series``) we have the following convention:
+
+* ``cls._ndarray_values`` is *always* a NumPy ``ndarray``. Ideally,
+  ``_ndarray_values`` is cheap to compute. For example, for a ``Categorical``,
+  this returns the codes, not the array of objects.
+* ``cls._values`` refers to the "best possible" array. This could be an
+  ``ndarray``, ``ExtensionArray``, or an ``Index`` subclass (note: we're in
+  the process of removing the index subclasses here so that it's always an
+  ``ndarray`` or ``ExtensionArray``).
+
+So, for example, ``Series[category]._values`` is a ``Categorical``, while
+``Series[category]._ndarray_values`` is the underlying codes.
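The distinction described above can be seen with a small example. This is a sketch using public analogues of these private attributes (``Series.array`` and ``Series.cat.codes``), since ``_ndarray_values`` is internal and has changed across pandas releases:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

# The "best possible" array: a Categorical, not a plain ndarray.
best = s.array
print(type(best).__name__)  # Categorical

# The cheap ndarray representation: the integer codes.
codes = np.asarray(s.cat.codes)
print(codes)                # [0 1 0]
```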
Review comment: I think this section does not belong in the same document as "how to subclass", as this part is really internals for contributors to pandas. But that's for a separate PR (I can start with that), so for here the above is fine for me.

Review comment: I think that "How to subclass", and the eventual "extending pandas with custom array types", would be better in developer.rst, which is labeled as "This section will focus on downstream applications of pandas."

Reply: Ah, yes, I didn't see that the accessor documentation is actually already there (although I personally don't find the parquet section that fitting in there, as it is not something you typically need to know when extending pandas. I can already start with moving it to the bottom of the file :-))
 .. _ref-subclassing-pandas:

 Subclassing pandas Data Structures
@@ -13,6 +13,7 @@
     is_list_like,
     is_scalar,
     is_datetimelike,
+    is_categorical_dtype,
     is_extension_type)

 from pandas.util._validators import validate_bool_kwarg

@@ -710,7 +711,7 @@ def transpose(self, *args, **kwargs):
     @property
     def shape(self):
         """ return a tuple of the shape of the underlying data """
-        return self._values.shape
+        return self._ndarray_values.shape

     @property
     def ndim(self):

@@ -738,22 +739,22 @@ def data(self):
     @property
     def itemsize(self):
         """ return the size of the dtype of the item of the underlying data """
-        return self._values.itemsize
+        return self._ndarray_values.itemsize

     @property
     def nbytes(self):
         """ return the number of bytes in the underlying data """
-        return self._values.nbytes
+        return self._ndarray_values.nbytes
Reply: I think this caused issues for CI, but re-running tests now with this change.
     @property
     def strides(self):
         """ return the strides of the underlying data """
-        return self._values.strides
+        return self._ndarray_values.strides

     @property
     def size(self):
         """ return the number of elements in the underlying data """
-        return self._values.size
+        return self._ndarray_values.size

     @property
     def flags(self):
@@ -768,8 +769,21 @@ def base(self):
         return self.values.base

     @property
-    def _values(self):
-        """ the internal implementation """
+    def _ndarray_values(self):
+        """The data as an ndarray, possibly losing information.
+
+        The expectation is that this is cheap to compute.
+
+        - categorical -> codes
+
+        See '_values' for more.
+        """
+        # type: () -> np.ndarray
+        from pandas.core.dtypes.common import is_categorical_dtype
Review comment: this does NOT belong here. You already have a subclass EA for Categorical that can simply override this.

Reply: This raises the question for me, though, what this will return for external extension types.

Reply: I've moved it all the way to […]
+
+        if is_categorical_dtype(self):
+            return self._values.codes
+
         return self.values

     @property

Review comment: why do we have this at all, e.g. […]

Reply: Because […] But I agree it points to something we should think about how to organize, as e.g. for periods there will also be a special case here in the future. So maybe we need a property on our own extension arrays that gives back this ndarray? (which is not necessarily part of the external interface for extension arrays)
@@ -819,8 +833,10 @@ def tolist(self):

         if is_datetimelike(self):
             return [com._maybe_box_datetimelike(x) for x in self._values]
+        elif is_categorical_dtype(self):
+            return self.values.tolist()
         else:
-            return self._values.tolist()
+            return self._ndarray_values.tolist()

     def __iter__(self):
         """

Review comment: this should be overridden in EA, rather than specific dispatching via if/else here; IOW it should be part of the interface, or be defined as […]

Reply: I am not sure it is needed to add a […]
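The datetimelike branch above boxes each element before returning the list. With the public API this behavior is visible directly (a sketch, not tied to the private helpers in the diff):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2018-01-01", "2018-01-02"]))
out = s.tolist()

# Elements come back boxed as pd.Timestamp, not raw np.datetime64 values.
print(type(out[0]).__name__)  # Timestamp
```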
@@ -978,7 +994,9 @@ def value_counts(self, normalize=False, sort=True, ascending=False,
     def unique(self):
         values = self._values

+        # TODO: Make unique part of the ExtensionArray interface.
         if hasattr(values, 'unique'):
             result = values.unique()
         else:
             from pandas.core.algorithms import unique1d
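The duck-typed dispatch in ``unique`` above can be sketched standalone. ``unique_values`` is a hypothetical helper, with the public ``pd.unique`` standing in for pandas' internal ``unique1d``:

```python
import numpy as np
import pandas as pd

def unique_values(values):
    # Pandas array types (Categorical, etc.) implement .unique() themselves;
    # plain ndarrays fall back to a generic path that preserves the order
    # of first appearance.
    if hasattr(values, "unique"):
        return values.unique()
    return pd.unique(values)

print(list(unique_values(pd.Categorical(["a", "b", "a"]))))  # ['a', 'b']
print(list(unique_values(np.array([3, 1, 3]))))              # [3, 1]
```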
@@ -480,20 +480,22 @@ def _concat_datetimetz(to_concat, name=None):

 def _concat_index_same_dtype(indexes, klass=None):
     klass = klass if klass is not None else indexes[0].__class__
-    return klass(np.concatenate([x._values for x in indexes]))
+    return klass(np.concatenate([x._ndarray_values for x in indexes]))
Review comment: This one is only used for numeric indices, so […]
 def _concat_index_asobject(to_concat, name=None):
     """
     concat all inputs as object. DatetimeIndex, TimedeltaIndex and
     PeriodIndex are converted to object dtype before concatenation
     """
+    from pandas import Index
+    from pandas.core.arrays import ExtensionArray

-    klasses = ABCDatetimeIndex, ABCTimedeltaIndex, ABCPeriodIndex
+    klasses = (ABCDatetimeIndex, ABCTimedeltaIndex, ABCPeriodIndex,
+               ExtensionArray)
     to_concat = [x.astype(object) if isinstance(x, klasses) else x
                  for x in to_concat]

-    from pandas import Index
     self = to_concat[0]
     attribs = self._get_attributes_dict()
     attribs['name'] = name
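The effect of ``_concat_index_asobject`` is visible through the public ``Index.append``: combining a datetime-like index with an incompatible one falls back to object dtype (a sketch using public APIs only):

```python
import pandas as pd

dti = pd.DatetimeIndex(["2018-01-01", "2018-01-02"])
other = pd.Index(["x", "y"])

# Incompatible dtypes: both sides are converted to object before concat.
combined = dti.append(other)
print(combined.dtype)  # object
```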
@@ -31,12 +31,14 @@
     is_object_dtype,
     is_categorical_dtype,
     is_interval_dtype,
+    is_period_dtype,
     is_bool,
     is_bool_dtype,
     is_signed_integer_dtype,
     is_unsigned_integer_dtype,
     is_integer_dtype, is_float_dtype,
     is_datetime64_any_dtype,
+    is_datetime64tz_dtype,
     is_timedelta64_dtype,
     needs_i8_conversion,
     is_iterator, is_list_like,

@@ -412,7 +414,7 @@ def _simple_new(cls, values, name=None, dtype=None, **kwargs):
         values = np.array(values, copy=False)
         if is_object_dtype(values):
             values = cls(values, name=name, dtype=dtype,
-                         **kwargs)._values
+                         **kwargs)._ndarray_values

     result = object.__new__(cls)
     result._data = values
@@ -594,6 +596,40 @@ def values(self):
         """ return the underlying data as an ndarray """
         return self._data.view(np.ndarray)

+    @property
+    def _values(self):
+        # type: () -> Union[ExtensionArray, Index]
+        # TODO: remove index types as they become is extension arrays
Review comment: remove 'is'
+        """The best array representation.
+
+        This is an ndarray, ExtensionArray, or Index subclass. This differs
+        from ``_ndarray_values``, which always returns an ndarray.
+
+        Both ``_values`` and ``_ndarray_values`` are consistent between
+        ``Series`` and ``Index``.
+
+        It may differ from the public '.values' method.
+
+        index             | values          | _values      | _ndarray_values |
+        ----------------- | --------------- | ------------ | --------------- |
+        CategoricalIndex  | Categorical     | Categorical  | codes           |
+        DatetimeIndex[tz] | ndarray[M8ns]   | DTI[tz]      | ndarray[M8ns]   |
+
+        For the following, ``._values`` is currently ``ndarray[object]``,
+        but will soon be an ``ExtensionArray``:
+
+        index             | values          | _values      | _ndarray_values |
+        ----------------- | --------------- | ------------ | --------------- |
+        PeriodIndex       | ndarray[object] | ndarray[obj] | ndarray[int]    |
Review comment: For the […] (but not sure where this is currently actually used)

Reply: No, I don't think so, since […]

Reply: Yes, but that is because we don't have a PeriodBlock that is backed by a PeriodIndex, as is the case for DatetimeTZ, so at the moment you cannot really have Period values in a series.

Reply: I'm confused what you're proposing / saying then. What we have now "works", but is internally inconsistent. I'm updating […]

Reply: Yes. So regardless of what […]

Reply: in theory these should return an […]
+        IntervalIndex     | ndarray[object] | ndarray[obj] | ndarray[object] |
+
+        See Also
+        --------
+        values
+        _ndarray_values
+        """
+        return self.values

     def get_values(self):
         """ return the underlying data as an ndarray """
         return self.values
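The first row of the table above can be checked with public attributes; ``CategoricalIndex.codes`` corresponds to what the docstring calls ``_ndarray_values``:

```python
import pandas as pd

ci = pd.CategoricalIndex(["a", "b", "a"])

# values / _values: the Categorical itself.
print(type(ci.values).__name__)  # Categorical
# _ndarray_values: the integer codes backing it.
print(list(ci.codes))            # [0, 1, 0]
```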
@@ -664,7 +700,7 @@ def ravel(self, order='C'):
         --------
         numpy.ndarray.ravel
         """
-        return self._values.ravel(order=order)
+        return self._ndarray_values.ravel(order=order)

     # construction helpers
     @classmethod

@@ -1597,7 +1633,7 @@ def _constructor(self):
     @cache_readonly
     def _engine(self):
         # property, for now, slow to look up
-        return self._engine_type(lambda: self._values, len(self))
+        return self._engine_type(lambda: self._ndarray_values, len(self))

     def _validate_index_level(self, level):
         """
@@ -2228,27 +2264,37 @@ def union(self, other):
             other = other.astype('O')
             return this.union(other)

+        # TODO: setops-refactor, clean all this up
Review comment: this if/then needs to all be completely removed.

Reply: Can't be removed until setops are fixed more generally. Like I said, I have a PR started, but it's a bigger issue than what's on the critical path for ExtensionArray. There's no way to avoid the if statements until set ops are completely refactored.

Reply: ok, let's do this next then. These if/then/else are really smelly.

Reply: it might be worth it to put this in a single method now (not sure if it's possible) to isolate this code, maybe […]
+        if is_period_dtype(self) or is_datetime64tz_dtype(self):
+            lvals = self._ndarray_values
+        else:
+            lvals = self._values
+        if is_period_dtype(other) or is_datetime64tz_dtype(other):
+            rvals = other._ndarray_values
+        else:
+            rvals = other._values

         if self.is_monotonic and other.is_monotonic:
             try:
-                result = self._outer_indexer(self._values, other._values)[0]
+                result = self._outer_indexer(lvals, rvals)[0]
             except TypeError:
                 # incomparable objects
-                result = list(self._values)
+                result = list(lvals)

                 # worth making this faster? a very unusual case
-                value_set = set(self._values)
-                result.extend([x for x in other._values if x not in value_set])
+                value_set = set(lvals)
+                result.extend([x for x in rvals if x not in value_set])
         else:
             indexer = self.get_indexer(other)
             indexer, = (indexer == -1).nonzero()

             if len(indexer) > 0:
-                other_diff = algos.take_nd(other._values, indexer,
+                other_diff = algos.take_nd(rvals, indexer,
                                            allow_fill=False)
-                result = _concat._concat_compat((self._values, other_diff))
+                result = _concat._concat_compat((lvals, other_diff))

                 try:
-                    self._values[0] < other_diff[0]
+                    lvals[0] < other_diff[0]
                 except TypeError as e:
                     warnings.warn("%s, sort order is undefined for "
                                   "incomparable objects" % e, RuntimeWarning,

@@ -2260,7 +2306,7 @@ def union(self, other):
             result.sort()

         else:
-            result = self._values
+            result = lvals

             try:
                 result = np.sort(result)
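The lvals/rvals selection pattern above can be sketched as a standalone helper. ``setop_values`` is hypothetical, using stable public dtype checks and ``asi8`` as a stand-in for the diff's ``_ndarray_values`` on period and tz-aware indexes:

```python
import pandas as pd

def setop_values(index):
    # Period and tz-aware datetime indexes fall back to their integer
    # (i8) representation for set ops; everything else uses the values
    # directly, mirroring the if/else in the diff.
    if isinstance(index.dtype, (pd.PeriodDtype, pd.DatetimeTZDtype)):
        return index.asi8
    return index.values

pi = pd.period_range("2000", periods=3, freq="Y")
print(setop_values(pi).dtype)  # int64
```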
@@ -2311,20 +2357,30 @@ def intersection(self, other):
             other = other.astype('O')
             return this.intersection(other)

+        # TODO: setops-refactor, clean all this up
+        if is_period_dtype(self):
+            lvals = self._ndarray_values
Review comment: same
+        else:
+            lvals = self._values
+        if is_period_dtype(other):
+            rvals = other._ndarray_values
+        else:
+            rvals = other._values

         if self.is_monotonic and other.is_monotonic:
             try:
-                result = self._inner_indexer(self._values, other._values)[0]
+                result = self._inner_indexer(lvals, rvals)[0]
                 return self._wrap_union_result(other, result)
             except TypeError:
                 pass

         try:
-            indexer = Index(other._values).get_indexer(self._values)
+            indexer = Index(rvals).get_indexer(lvals)
             indexer = indexer.take((indexer != -1).nonzero()[0])
         except Exception:
             # duplicates
             indexer = algos.unique1d(
-                Index(other._values).get_indexer_non_unique(self._values)[0])
+                Index(rvals).get_indexer_non_unique(lvals)[0])
             indexer = indexer[indexer != -1]

         taken = other.take(indexer)
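Regardless of which internal representation the set ops pick, the public behavior is unchanged:

```python
import pandas as pd

a = pd.Index([1, 2, 3])
b = pd.Index([2, 3, 4])

# Monotonic inputs take the fast _inner_indexer / _outer_indexer path.
print(list(a.intersection(b)))  # [2, 3]
print(list(a.union(b)))         # [1, 2, 3, 4]
```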
@@ -2700,7 +2756,7 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):
             raise ValueError('limit argument only valid if doing pad, '
                              'backfill or nearest reindexing')

-        indexer = self._engine.get_indexer(target._values)
+        indexer = self._engine.get_indexer(target._ndarray_values)

         return _ensure_platform_int(indexer)
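``get_indexer`` maps target labels to positions in the index, with -1 for labels that are missing; the change above only affects which array representation the engine sees, not this behavior:

```python
import pandas as pd

idx = pd.Index([10, 20, 30])
target = pd.Index([20, 10, 40])

# 20 is at position 1, 10 at position 0, and 40 is absent (-1).
print(list(idx.get_indexer(target)))  # [1, 0, -1]
```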
@@ -2716,12 +2772,13 @@ def _get_fill_indexer(self, target, method, limit=None, tolerance=None):
         if self.is_monotonic_increasing and target.is_monotonic_increasing:
             method = (self._engine.get_pad_indexer if method == 'pad' else
                       self._engine.get_backfill_indexer)
-            indexer = method(target._values, limit)
+            indexer = method(target._ndarray_values, limit)
         else:
             indexer = self._get_fill_indexer_searchsorted(target, method,
                                                           limit)
         if tolerance is not None:
-            indexer = self._filter_indexer_tolerance(target._values, indexer,
+            indexer = self._filter_indexer_tolerance(target._ndarray_values,
+                                                     indexer,
                                                      tolerance)
         return indexer

@@ -2812,7 +2869,7 @@ def get_indexer_non_unique(self, target):
             self = Index(self.asi8)
             tgt_values = target.asi8
         else:
-            tgt_values = target._values
+            tgt_values = target._ndarray_values

         indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
         return _ensure_platform_int(indexer), missing
@@ -3247,16 +3304,17 @@ def _join_multi(self, other, how, return_indexers=True):
     def _join_non_unique(self, other, how='left', return_indexers=False):
         from pandas.core.reshape.merge import _get_join_indexers

-        left_idx, right_idx = _get_join_indexers([self._values],
-                                                 [other._values], how=how,
+        left_idx, right_idx = _get_join_indexers([self._ndarray_values],
+                                                 [other._ndarray_values],
+                                                 how=how,
                                                  sort=True)

         left_idx = _ensure_platform_int(left_idx)
         right_idx = _ensure_platform_int(right_idx)

-        join_index = np.asarray(self._values.take(left_idx))
+        join_index = np.asarray(self._ndarray_values.take(left_idx))
         mask = left_idx == -1
-        np.putmask(join_index, mask, other._values.take(right_idx))
+        np.putmask(join_index, mask, other._ndarray_values.take(right_idx))

         join_index = self._wrap_joined_index(join_index, other)

@@ -3403,8 +3461,8 @@ def _join_monotonic(self, other, how='left', return_indexers=False):
         else:
             return ret_index

-        sv = self._values
-        ov = other._values
+        sv = self._ndarray_values
+        ov = other._ndarray_values

         if self.is_unique and other.is_unique:
             # We can perform much better than the general case
@@ -3756,7 +3814,7 @@ def insert(self, loc, item):
             item = self._na_value

         _self = np.asarray(self)
-        item = self._coerce_scalar_to_index(item)._values
+        item = self._coerce_scalar_to_index(item)._ndarray_values
         idx = np.concatenate((_self[:loc], item, _self[loc:]))
         return self._shallow_copy_with_infer(idx)
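``Index.insert`` itself behaves the same from the outside; the diff only changes which array representation the coerced scalar goes through:

```python
import pandas as pd

idx = pd.Index([1, 2, 4])

# Inserting at position 2 shifts later labels right.
print(list(idx.insert(2, 3)))  # [1, 2, 3, 4]
```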
Review comment: you could add section tags