-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: Internal / External values #19558
Changes from 39 commits
41f09d8
29cfd7c
3185f4e
5a59591
476f75d
b15ee5a
659073f
7accb67
9b8d2a5
9fbac29
55305dc
0e63708
fbbbc8a
46a0a49
2c4445a
5612cda
b012c19
d49e6aa
d7d31ee
7b89f1b
b0dbffd
66b936f
32ee0ef
a9882e2
f53652a
2425621
512fb89
170d0c7
402620f
d9e8dd6
815d202
a727b21
f368c29
d74c5c9
8104ee5
f8e29b9
0cd9faa
8fcdb70
34a6a22
c233c28
d6e8051
3af8a21
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -89,6 +89,25 @@ not check (or care) whether the levels themselves are sorted. Fortunately, the | |
constructors ``from_tuples`` and ``from_arrays`` ensure that this is true, but | ||
if you compute the levels and labels yourself, please be careful. | ||
|
||
Values | ||
~~~~~~ | ||
|
||
Pandas extends NumPy's type system with custom types, like ``Categorical`` or | ||
datetimes with a timezone, so we have multiple notions of "values". For 1-D | ||
containers (``Index`` classes and ``Series``) we have the following convention: | ||
|
||
* ``cls._ndarray_values`` is *always* a NumPy ``ndarray``. Ideally, | ||
``_ndarray_values`` is cheap to compute. For example, for a ``Categorical``, | ||
this returns the codes, not the array of objects. | ||
* ``cls._values`` refers is the "best possible" array. This could be an | ||
``ndarray``, ``ExtensionArray``, or in ``Index`` subclass (note: we're in the | ||
process of removing the index subclasses here so that it's always an | ||
``ndarray`` or ``ExtensionArray``). | ||
|
||
So, for example, ``Series[category]._values`` is a ``Categorical``, while | ||
``Series[category]._ndarray_values`` is the underlying codes. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this section does not belong in the same document as "how to subclass", as this part is really internals for contributors to pandas. But that's for a separate PR (I can start with that), so for here the above is fine for me. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think that "How to subclass", and the eventual "extending pandas with custom array types" would be better in developer.rst, which is labeled as "This section will focus on downstream applications of pandas.". There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, yes, I didn't see that the accessor documentation is actually already there (although, I personally don't find the parquet section that fitting in there, as it is not something typical you need to know when extending pandas. I can already start with moving it to the bottom of the file :-)) |
||
|
||
|
||
.. _ref-subclassing-pandas: | ||
|
||
Subclassing pandas Data Structures | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -266,3 +266,15 @@ def _can_hold_na(self): | |
Setting this to false will optimize some operations like fillna. | ||
""" | ||
return True | ||
|
||
@property | ||
def _ndarray_values(self): | ||
# type: () -> np.ndarray | ||
"""Internal pandas method for lossy conversion to a NumPy ndarray. | ||
|
||
This method is not part of the pandas interface. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I find it a bit strange that we say this is not part of the interface, but still provide here a default implementation and say what it is. If it is not part of the interface, we could also define it on our own subclasses without defining it here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This simplifies the implementation since a.) we don't have to have a My preference is to leave it out of the interface until someone sees an actual need for it. I suspect this could come up if / when we start allowing custom indexes with their own indexing engines, but that seems like a ways down the road... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I understand those reasons, but that means it is part of the interface (in the sense that if somebody would be stupid to implement a To give an example, for GeoPandas, I was wondering if I would overwrite this property to return my There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see your point. Your GeoPandas concern is a good one. But I'm not sure how to proceeded :/ Do you have a preference for a.) Adding it to the interface
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe the best option is to leave it for now? :) Ideally, the Series machinery should not use And this is already the case I think. From a quick scan, currently in the PR, I see
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Your assessment looks correct. Nothing that's series specific currently uses I can change There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i agree with @jorisvandenbossche assessment here. maybe if we can ultimatley remove this would be good, and simply dispatch to the array object. If Index is a proper EA then this would be possible. |
||
|
||
The expectation is that this is cheap to compute, and is primarily | ||
used for interacting with our indexers. | ||
""" | ||
return np.array(self) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,7 +13,8 @@ | |
is_list_like, | ||
is_scalar, | ||
is_datetimelike, | ||
is_extension_type) | ||
is_extension_type, | ||
is_extension_array_dtype) | ||
|
||
from pandas.util._validators import validate_bool_kwarg | ||
|
||
|
@@ -738,17 +739,17 @@ def data(self): | |
@property | ||
def itemsize(self): | ||
""" return the size of the dtype of the item of the underlying data """ | ||
return self._values.itemsize | ||
return self._ndarray_values.itemsize | ||
|
||
@property | ||
def nbytes(self): | ||
""" return the number of bytes in the underlying data """ | ||
return self._values.nbytes | ||
return self.values.nbytes | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why not There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It depends on what we do with If we change the public If we leave the public So far we've put off that decision, but we'll need to make a choice soon. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You have to override this in DatetimeIndex (tz) then for now, this is the reason CI is failing (recursion error) On the original comment: I think it is better for now to use |
||
|
||
@property | ||
def strides(self): | ||
""" return the strides of the underlying data """ | ||
return self._values.strides | ||
return self._ndarray_values.strides | ||
|
||
@property | ||
def size(self): | ||
|
@@ -768,8 +769,17 @@ def base(self): | |
return self.values.base | ||
|
||
@property | ||
def _values(self): | ||
""" the internal implementation """ | ||
def _ndarray_values(self): | ||
"""The data as an ndarray, possibly losing information. | ||
|
||
The expectation is that this is cheap to compute, and is primarily | ||
used for interacting with our indexers. | ||
|
||
- categorical -> codes | ||
""" | ||
# type: () -> np.ndarray | ||
if is_extension_array_dtype(self): | ||
return self.values._ndarray_values | ||
return self.values | ||
|
||
@property | ||
|
@@ -978,7 +988,9 @@ def value_counts(self, normalize=False, sort=True, ascending=False, | |
def unique(self): | ||
values = self._values | ||
|
||
# TODO: Make unique part of the ExtensionArray interface. | ||
if hasattr(values, 'unique'): | ||
|
||
result = values.unique() | ||
else: | ||
from pandas.core.algorithms import unique1d | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -31,12 +31,14 @@ | |
is_object_dtype, | ||
is_categorical_dtype, | ||
is_interval_dtype, | ||
is_period_dtype, | ||
is_bool, | ||
is_bool_dtype, | ||
is_signed_integer_dtype, | ||
is_unsigned_integer_dtype, | ||
is_integer_dtype, is_float_dtype, | ||
is_datetime64_any_dtype, | ||
is_datetime64tz_dtype, | ||
is_timedelta64_dtype, | ||
needs_i8_conversion, | ||
is_iterator, is_list_like, | ||
|
@@ -412,7 +414,7 @@ def _simple_new(cls, values, name=None, dtype=None, **kwargs): | |
values = np.array(values, copy=False) | ||
if is_object_dtype(values): | ||
values = cls(values, name=name, dtype=dtype, | ||
**kwargs)._values | ||
**kwargs)._ndarray_values | ||
|
||
result = object.__new__(cls) | ||
result._data = values | ||
|
@@ -594,6 +596,40 @@ def values(self): | |
""" return the underlying data as an ndarray """ | ||
return self._data.view(np.ndarray) | ||
|
||
@property | ||
def _values(self): | ||
# type: () -> Union[ExtensionArray, Index] | ||
# TODO: remove index types as they become is extension arrays | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. remove 'is' |
||
"""The best array representation. | ||
|
||
This is an ndarray, ExtensionArray, or Index subclass. This differs | ||
from ``_ndarray_values``, which always returns an ndarray. | ||
|
||
Both ``_values`` and ``_ndarray_values`` are consistent between | ||
``Series`` and ``Index``. | ||
|
||
It may differ from the public '.values' method. | ||
|
||
index | values | _values | _ndarray_values | | ||
----------------- | -------------- -| ----------- | --------------- | | ||
CategoricalIndex | Categorical | Categorical | codes | | ||
DatetimeIndex[tz] | ndarray[M8ns] | DTI[tz] | ndarray[M8ns] | | ||
|
||
For the following, the ``._values`` is currently ``ndarray[object]``, | ||
but will soon be an ``ExtensionArray`` | ||
|
||
index | values | _values | _ndarray_values | | ||
----------------- | --------------- | ------------ | --------------- | | ||
PeriodIndex | ndarray[object] | ndarray[obj] | ndarray[int] | | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For the (but not sure where this is currently actually used) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No I don't think so, since There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, but that is because we don't have a PeriodBlock that is backed by a PeriodIndex, as is the case for DatetimeTZ, so at the moment you cannot really have Period values in a series. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm confused what your proposing / saying then. Is it just that What we have now "works", but is internally inconsistent. I'm updating There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes. So regardless of what There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. in theory these should return an |
||
IntervalIndex | ndarray[object] | ndarray[obj] | ndarray[object] | | ||
|
||
See Also | ||
-------- | ||
values | ||
_ndarray_values | ||
""" | ||
return self.values | ||
|
||
def get_values(self): | ||
""" return the underlying data as an ndarray """ | ||
return self.values | ||
|
@@ -664,7 +700,7 @@ def ravel(self, order='C'): | |
-------- | ||
numpy.ndarray.ravel | ||
""" | ||
return self._values.ravel(order=order) | ||
return self._ndarray_values.ravel(order=order) | ||
|
||
# construction helpers | ||
@classmethod | ||
|
@@ -1597,7 +1633,7 @@ def _constructor(self): | |
@cache_readonly | ||
def _engine(self): | ||
# property, for now, slow to look up | ||
return self._engine_type(lambda: self._values, len(self)) | ||
return self._engine_type(lambda: self._ndarray_values, len(self)) | ||
|
||
def _validate_index_level(self, level): | ||
""" | ||
|
@@ -2228,27 +2264,37 @@ def union(self, other): | |
other = other.astype('O') | ||
return this.union(other) | ||
|
||
# TODO: setops-refactor, clean all this up | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this if/then needs to all be completely removed There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can't be removed until setops are fix more generally. Like I said, I have a PR started, but it's a bigger issue than what's on the critical path for ExtensionArray. Take There's no way to avoid the if statements until set ops are completely refactored. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok, let's do this next then. these if/then/else are really smelly. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if might be worth it to put this in a single method now (not sure if its possible) to isolate this code, maybe
|
||
if is_period_dtype(self) or is_datetime64tz_dtype(self): | ||
lvals = self._ndarray_values | ||
else: | ||
lvals = self._values | ||
if is_period_dtype(other) or is_datetime64tz_dtype(other): | ||
rvals = other._ndarray_values | ||
else: | ||
rvals = other._values | ||
|
||
if self.is_monotonic and other.is_monotonic: | ||
try: | ||
result = self._outer_indexer(self._values, other._values)[0] | ||
result = self._outer_indexer(lvals, rvals)[0] | ||
except TypeError: | ||
# incomparable objects | ||
result = list(self._values) | ||
result = list(lvals) | ||
|
||
# worth making this faster? a very unusual case | ||
value_set = set(self._values) | ||
result.extend([x for x in other._values if x not in value_set]) | ||
value_set = set(lvals) | ||
result.extend([x for x in rvals if x not in value_set]) | ||
else: | ||
indexer = self.get_indexer(other) | ||
indexer, = (indexer == -1).nonzero() | ||
|
||
if len(indexer) > 0: | ||
other_diff = algos.take_nd(other._values, indexer, | ||
other_diff = algos.take_nd(rvals, indexer, | ||
allow_fill=False) | ||
result = _concat._concat_compat((self._values, other_diff)) | ||
result = _concat._concat_compat((lvals, other_diff)) | ||
|
||
try: | ||
self._values[0] < other_diff[0] | ||
lvals[0] < other_diff[0] | ||
except TypeError as e: | ||
warnings.warn("%s, sort order is undefined for " | ||
"incomparable objects" % e, RuntimeWarning, | ||
|
@@ -2260,7 +2306,7 @@ def union(self, other): | |
result.sort() | ||
|
||
else: | ||
result = self._values | ||
result = lvals | ||
|
||
try: | ||
result = np.sort(result) | ||
|
@@ -2311,20 +2357,30 @@ def intersection(self, other): | |
other = other.astype('O') | ||
return this.intersection(other) | ||
|
||
# TODO: setops-refactor, clean all this up | ||
if is_period_dtype(self): | ||
lvals = self._ndarray_values | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same |
||
else: | ||
lvals = self._values | ||
if is_period_dtype(other): | ||
rvals = other._ndarray_values | ||
else: | ||
rvals = other._values | ||
|
||
if self.is_monotonic and other.is_monotonic: | ||
try: | ||
result = self._inner_indexer(self._values, other._values)[0] | ||
result = self._inner_indexer(lvals, rvals)[0] | ||
return self._wrap_union_result(other, result) | ||
except TypeError: | ||
pass | ||
|
||
try: | ||
indexer = Index(other._values).get_indexer(self._values) | ||
indexer = Index(rvals).get_indexer(lvals) | ||
indexer = indexer.take((indexer != -1).nonzero()[0]) | ||
except Exception: | ||
# duplicates | ||
indexer = algos.unique1d( | ||
Index(other._values).get_indexer_non_unique(self._values)[0]) | ||
Index(rvals).get_indexer_non_unique(lvals)[0]) | ||
indexer = indexer[indexer != -1] | ||
|
||
taken = other.take(indexer) | ||
|
@@ -2700,7 +2756,7 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None): | |
raise ValueError('limit argument only valid if doing pad, ' | ||
'backfill or nearest reindexing') | ||
|
||
indexer = self._engine.get_indexer(target._values) | ||
indexer = self._engine.get_indexer(target._ndarray_values) | ||
|
||
return _ensure_platform_int(indexer) | ||
|
||
|
@@ -2716,12 +2772,13 @@ def _get_fill_indexer(self, target, method, limit=None, tolerance=None): | |
if self.is_monotonic_increasing and target.is_monotonic_increasing: | ||
method = (self._engine.get_pad_indexer if method == 'pad' else | ||
self._engine.get_backfill_indexer) | ||
indexer = method(target._values, limit) | ||
indexer = method(target._ndarray_values, limit) | ||
else: | ||
indexer = self._get_fill_indexer_searchsorted(target, method, | ||
limit) | ||
if tolerance is not None: | ||
indexer = self._filter_indexer_tolerance(target._values, indexer, | ||
indexer = self._filter_indexer_tolerance(target._ndarray_values, | ||
indexer, | ||
tolerance) | ||
return indexer | ||
|
||
|
@@ -2812,7 +2869,7 @@ def get_indexer_non_unique(self, target): | |
self = Index(self.asi8) | ||
tgt_values = target.asi8 | ||
else: | ||
tgt_values = target._values | ||
tgt_values = target._ndarray_values | ||
|
||
indexer, missing = self._engine.get_indexer_non_unique(tgt_values) | ||
return _ensure_platform_int(indexer), missing | ||
|
@@ -3247,16 +3304,17 @@ def _join_multi(self, other, how, return_indexers=True): | |
def _join_non_unique(self, other, how='left', return_indexers=False): | ||
from pandas.core.reshape.merge import _get_join_indexers | ||
|
||
left_idx, right_idx = _get_join_indexers([self._values], | ||
[other._values], how=how, | ||
left_idx, right_idx = _get_join_indexers([self._ndarray_values], | ||
[other._ndarray_values], | ||
how=how, | ||
sort=True) | ||
|
||
left_idx = _ensure_platform_int(left_idx) | ||
right_idx = _ensure_platform_int(right_idx) | ||
|
||
join_index = np.asarray(self._values.take(left_idx)) | ||
join_index = np.asarray(self._ndarray_values.take(left_idx)) | ||
mask = left_idx == -1 | ||
np.putmask(join_index, mask, other._values.take(right_idx)) | ||
np.putmask(join_index, mask, other._ndarray_values.take(right_idx)) | ||
|
||
join_index = self._wrap_joined_index(join_index, other) | ||
|
||
|
@@ -3403,8 +3461,8 @@ def _join_monotonic(self, other, how='left', return_indexers=False): | |
else: | ||
return ret_index | ||
|
||
sv = self._values | ||
ov = other._values | ||
sv = self._ndarray_values | ||
ov = other._ndarray_values | ||
|
||
if self.is_unique and other.is_unique: | ||
# We can perform much better than the general case | ||
|
@@ -3756,7 +3814,7 @@ def insert(self, loc, item): | |
item = self._na_value | ||
|
||
_self = np.asarray(self) | ||
item = self._coerce_scalar_to_index(item)._values | ||
item = self._coerce_scalar_to_index(item)._ndarray_values | ||
idx = np.concatenate((_self[:loc], item, _self[loc:])) | ||
return self._shallow_copy_with_infer(idx) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could add section tags