REF: Internal / External values #19558
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -89,6 +89,25 @@ not check (or care) whether the levels themselves are sorted. Fortunately, the | |
constructors ``from_tuples`` and ``from_arrays`` ensure that this is true, but | ||
if you compute the levels and labels yourself, please be careful. | ||
|
||
Values | ||
~~~~~~ | ||
|
||
Pandas extends NumPy's type system with custom types, like ``Categorical`` or | ||
datetimes with a timezone, so we have multiple notions of "values". For 1-D | ||
containers (``Index`` classes and ``Series``) we have the following convention: | ||
|
||
* ``cls._ndarray_values`` is *always* a NumPy ``ndarray``. Ideally, | ||
``_ndarray_values`` is cheap to compute. For example, for a ``Categorical``, | ||
this returns the codes, not the array of objects. | ||
* ``cls._values`` is the "best possible" array. This could be an | ||
``ndarray``, ``ExtensionArray``, or an ``Index`` subclass (note: we're in the | ||
process of removing the index subclasses here so that it's always an | ||
``ndarray`` or ``ExtensionArray``). | ||
|
||
So, for example, ``Series[category]._values`` is a ``Categorical``, while | ||
``Series[category]._ndarray_values`` is the underlying codes. | ||
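The convention described above can be sketched with a categorical ``Series``. This is a minimal illustration, not the PR's implementation: ``_values`` is a private attribute, and because ``_ndarray_values`` may not exist in an installed pandas version, the public ``Series.cat.codes`` accessor is used here as a stand-in for it.

```python
import numpy as np
import pandas as pd

# A categorical Series with two categories, "a" and "b".
s = pd.Series(["a", "b", "a"], dtype="category")

# The "best possible" array: for a categorical Series this is a Categorical.
# Note: _values is private and shown only to mirror the text above.
best = s._values

# Stand-in for _ndarray_values (an assumption of this sketch): the integer
# codes, exposed publicly through the .cat accessor.
codes = s.cat.codes.to_numpy()

print(type(best).__name__)
print(type(codes).__name__, codes.dtype)
```

The codes are a plain NumPy array and are cheap to obtain, whereas materializing the labels as an object array would require a lookup per element.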
I think this section does not belong in the same document as "how to subclass", as this part is really internals for contributors to pandas. But that's for a separate PR (I can start with that), so for here the above is fine for me.

I think that "How to subclass", and the eventual "extending pandas with custom array types" would be better in developer.rst, which is labeled as "This section will focus on downstream applications of pandas.".

Ah, yes, I didn't see that the accessor documentation is actually already there (although, I personally don't find the parquet section that fitting in there, as it is not something typical you need to know when extending pandas. I can already start with moving it to the bottom of the file :-))
||
|
||
|
||
.. _ref-subclassing-pandas: | ||
|
||
Subclassing pandas Data Structures | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -266,3 +266,15 @@ def _can_hold_na(self): | |
Setting this to false will optimize some operations like fillna. | ||
""" | ||
return True | ||
|
||
@property | ||
def _ndarray_values(self): | ||
# type: () -> np.ndarray | ||
"""Internal pandas method for lossy conversion to a NumPy ndarray. | ||
|
||
This method is not part of the pandas interface. | ||
I find it a bit strange that we say this is not part of the interface, but still provide here a default implementation and say what it is. If it is not part of the interface, we could also define it on our own subclasses without defining it here.

This simplifies the implementation since a.) we don't have to have a
My preference is to leave it out of the interface until someone sees an actual need for it. I suspect this could come up if / when we start allowing custom indexes with their own indexing engines, but that seems like a ways down the road...

Yes, I understand those reasons, but that means it is part of the interface (in the sense that if somebody would be stupid to implement a
To give an example, for GeoPandas, I was wondering if I would overwrite this property to return my

I see your point. Your GeoPandas concern is a good one. But I'm not sure how to proceed :/ Do you have a preference for a.) Adding it to the interface

Maybe the best option is to leave it for now? :) Ideally, the Series machinery should not use
And this is already the case I think. From a quick scan, currently in the PR, I see

Your assessment looks correct. Nothing that's series specific currently uses
I can change

I agree with @jorisvandenbossche's assessment here. Maybe if we can ultimately remove this would be good, and simply dispatch to the array object. If Index is a proper EA then this would be possible.
||
|
||
The expectation is that this is cheap to compute, and is primarily | ||
used for interacting with our indexers. | ||
""" | ||
return np.array(self) |
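A rough sketch of the default shown in this hunk, and of a subclass overriding it, might look like the following. ``ToyArray`` and ``ToyCategorical`` are hypothetical stand-ins invented for illustration; they are not pandas classes and do not implement the ``ExtensionArray`` interface.

```python
import numpy as np


class ToyArray:
    """Hypothetical array-like container (not a pandas class)."""

    def __init__(self, values):
        self._data = list(values)

    def __array__(self, dtype=None, copy=None):
        # Lets np.array(self) convert this object to an ndarray.
        return np.asarray(self._data, dtype=dtype)

    @property
    def _ndarray_values(self):
        # Same fallback as the default in the diff: convert via np.array.
        return np.array(self)


class ToyCategorical(ToyArray):
    """Hypothetical subclass that stores integer codes for its labels."""

    def __init__(self, values):
        super().__init__(values)
        self.categories = sorted(set(self._data))
        self.codes = np.array(
            [self.categories.index(v) for v in self._data], dtype=np.int8
        )

    @property
    def _ndarray_values(self):
        # Cheap and lossy: return the codes rather than the labels.
        return self.codes


plain = ToyArray([1.5, 2.5])._ndarray_values
coded = ToyCategorical(["b", "a", "b"])._ndarray_values
```

The override trades information (the labels) for a cheap ndarray, which is the "lossy conversion" the docstring describes.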
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,7 +13,9 @@ | |
is_list_like, | ||
is_scalar, | ||
is_datetimelike, | ||
is_extension_type) | ||
is_categorical_dtype, | ||
is_extension_type, | ||
is_extension_array_dtype) | ||
|
||
from pandas.util._validators import validate_bool_kwarg | ||
|
||
|
@@ -710,7 +712,7 @@ def transpose(self, *args, **kwargs): | |
@property | ||
def shape(self): | ||
""" return a tuple of the shape of the underlying data """ | ||
return self._values.shape | ||
return self._ndarray_values.shape | ||
|
||
@property | ||
def ndim(self): | ||
|
@@ -738,22 +740,22 @@ def data(self): | |
@property | ||
def itemsize(self): | ||
""" return the size of the dtype of the item of the underlying data """ | ||
return self._values.itemsize | ||
return self._ndarray_values.itemsize | ||
|
||
@property | ||
def nbytes(self): | ||
""" return the number of bytes in the underlying data """ | ||
return self._values.nbytes | ||
return self._ndarray_values.nbytes | ||
Should

I think this caused issues for CI, but re-running tests now with this change.
||
|
||
@property | ||
def strides(self): | ||
""" return the strides of the underlying data """ | ||
return self._values.strides | ||
return self._ndarray_values.strides | ||
|
||
@property | ||
def size(self): | ||
""" return the number of elements in the underlying data """ | ||
return self._values.size | ||
return self._ndarray_values.size | ||
|
||
@property | ||
def flags(self): | ||
|
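To see why it matters which array backs ``shape``, ``nbytes`` and ``size``, the sketch below compares a ``Categorical`` with its codes. It uses only the public ``Categorical.codes`` and ``Categorical.nbytes`` attributes and assumes that ``Categorical.nbytes`` also counts the categories, as it does in released pandas versions.

```python
import pandas as pd

cat = pd.Categorical(["low", "high", "low", "low"])

# The codes are a 1-D int8 ndarray with one entry per element.
codes_shape = cat.codes.shape
codes_bytes = cat.codes.nbytes

# The Categorical itself reports the codes plus the stored categories.
total_bytes = cat.nbytes

print(codes_shape, codes_bytes, total_bytes)
```

Reading ``nbytes`` from the codes alone would therefore under-report memory for categorical data, which is one reason the choice between the two arrays is worth reviewing for this property.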
@@ -768,8 +770,17 @@ def base(self): | |
return self.values.base | ||
|
||
@property | ||
def _values(self): | ||
""" the internal implementation """ | ||
def _ndarray_values(self): | ||
"""The data as an ndarray, possibly losing information. | ||
|
||
The expectation is that this is cheap to compute, and is primarily | ||
used for interacting with our indexers. | ||
|
||
- categorical -> codes | ||
""" | ||
# type: () -> np.ndarray | ||
if is_extension_array_dtype(self): | ||
return self.values._ndarray_values | ||
return self.values | ||
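The dispatch in this hunk can be approximated with public API. In the sketch below, ``ndarray_values`` is a hypothetical helper, not a pandas function; it only special-cases categorical data and falls back to ``np.asarray`` for other extension dtypes, which is an assumption of this example rather than behaviour taken from the PR.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype


def ndarray_values(series):
    """Hypothetical helper: return an ndarray, possibly losing information."""
    if isinstance(series.dtype, pd.CategoricalDtype):
        # categorical -> codes
        return series.cat.codes.to_numpy()
    if is_extension_array_dtype(series.dtype):
        # Other extension dtypes: generic conversion (assumption of this sketch).
        return np.asarray(series.array)
    # Plain NumPy-backed data is returned as-is.
    return series.to_numpy()


ints = ndarray_values(pd.Series([1, 2, 3]))
cats = ndarray_values(pd.Series(["x", "y", "x"], dtype="category"))
```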
|
||
@property | ||
|
@@ -978,7 +989,9 @@ def value_counts(self, normalize=False, sort=True, ascending=False, | |
def unique(self): | ||
values = self._values | ||
|
||
# TODO: Make unique part of the ExtensionArray interface. | ||
if hasattr(values, 'unique'): | ||
|
||
result = values.unique() | ||
else: | ||
from pandas.core.algorithms import unique1d | ||
|
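The ``hasattr`` check in ``unique`` is a form of duck-typed dispatch. A minimal sketch of the same idea, using the public ``pd.unique`` in place of the internal ``unique1d`` (``unique_values`` is a hypothetical name), could be:

```python
import numpy as np
import pandas as pd


def unique_values(values):
    """Hypothetical helper mirroring the hasattr-based dispatch."""
    if hasattr(values, "unique"):
        # Arrays that know how to deduplicate themselves, e.g. Categorical.
        return values.unique()
    # Plain ndarrays have no .unique method, so use the generic routine.
    return pd.unique(values)


from_ndarray = unique_values(np.array([3, 1, 3]))
from_categorical = unique_values(pd.Categorical(["b", "a", "b"]))
```

Making ``unique`` part of the array interface, as the TODO suggests, would remove the need for the attribute check.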
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -480,20 +480,22 @@ def _concat_datetimetz(to_concat, name=None): | |
|
||
def _concat_index_same_dtype(indexes, klass=None): | ||
klass = klass if klass is not None else indexes[0].__class__ | ||
return klass(np.concatenate([x._values for x in indexes])) | ||
return klass(np.concatenate([x._ndarray_values for x in indexes])) | ||
This one is only used for numeric indices, so
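For numeric indexes, the concatenation above amounts to joining the underlying ndarrays and rewrapping them. A sketch using the public ``Index.to_numpy`` instead of the internal ``_ndarray_values`` (the function name is reused here purely for illustration):

```python
import numpy as np
import pandas as pd


def concat_index_same_dtype(indexes, klass=None):
    """Illustrative version built on public API only."""
    # Default to the class of the first index, as in the diff.
    klass = klass if klass is not None else type(indexes[0])
    return klass(np.concatenate([idx.to_numpy() for idx in indexes]))


combined = concat_index_same_dtype([pd.Index([1, 2]), pd.Index([3, 4])])
```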
||
|
||
|
||
def _concat_index_asobject(to_concat, name=None): | ||
""" | ||
concat all inputs as object. DatetimeIndex, TimedeltaIndex and | ||
PeriodIndex are converted to object dtype before concatenation | ||
""" | ||
from pandas import Index | ||
from pandas.core.arrays import ExtensionArray | ||
|
||
klasses = ABCDatetimeIndex, ABCTimedeltaIndex, ABCPeriodIndex | ||
klasses = (ABCDatetimeIndex, ABCTimedeltaIndex, ABCPeriodIndex, | ||
ExtensionArray) | ||
to_concat = [x.astype(object) if isinstance(x, klasses) else x | ||
for x in to_concat] | ||
|
||
from pandas import Index | ||
self = to_concat[0] | ||
attribs = self._get_attributes_dict() | ||
attribs['name'] = name | ||
|
you could add section tags