Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: Internal / External values #19558

Merged
merged 42 commits into from
Feb 13, 2018
Merged
Show file tree
Hide file tree
Changes from 34 commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
41f09d8
REF/Clean: Internal / External values
TomAugspurger Feb 3, 2018
29cfd7c
Move to index base
TomAugspurger Feb 6, 2018
3185f4e
Cleanup unique handling
TomAugspurger Feb 7, 2018
5a59591
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 7, 2018
476f75d
Simplify object concat
TomAugspurger Feb 7, 2018
b15ee5a
Use values for intersection
TomAugspurger Feb 7, 2018
659073f
hmm
TomAugspurger Feb 7, 2018
7accb67
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 8, 2018
9b8d2a5
Additional testing
TomAugspurger Feb 8, 2018
9fbac29
More tests
TomAugspurger Feb 8, 2018
55305dc
ndarray_values
TomAugspurger Feb 8, 2018
0e63708
API: Default ExtensionArray.astype
TomAugspurger Feb 8, 2018
fbbbc8a
Simplify concat_as_object
TomAugspurger Feb 8, 2018
46a0a49
Py2 compat
TomAugspurger Feb 8, 2018
2c4445a
Set-ops ugliness
TomAugspurger Feb 8, 2018
5612cda
better docstrings
TomAugspurger Feb 8, 2018
b012c19
tolist
TomAugspurger Feb 8, 2018
d49e6aa
linting
TomAugspurger Feb 8, 2018
d7d31ee
Moved dtypes
TomAugspurger Feb 9, 2018
7b89f1b
clean
TomAugspurger Feb 9, 2018
b0dbffd
cleanup
TomAugspurger Feb 9, 2018
66b936f
NumPy compat
TomAugspurger Feb 9, 2018
32ee0ef
Use base _values for CategoricalIndex
TomAugspurger Feb 9, 2018
a9882e2
Update dev docs
TomAugspurger Feb 9, 2018
f53652a
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 9, 2018
2425621
cleanup
TomAugspurger Feb 9, 2018
512fb89
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 9, 2018
170d0c7
Linting
TomAugspurger Feb 9, 2018
402620f
Precision in tests
TomAugspurger Feb 9, 2018
d9e8dd6
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 9, 2018
815d202
Push _ndarray_values to ExtensionArray
TomAugspurger Feb 11, 2018
a727b21
Clean up tolist
TomAugspurger Feb 11, 2018
f368c29
Move test locations
TomAugspurger Feb 11, 2018
d74c5c9
Fixed test
TomAugspurger Feb 12, 2018
8104ee5
REF: Update per comments
TomAugspurger Feb 12, 2018
f8e29b9
lint
TomAugspurger Feb 12, 2018
0cd9faa
REF: Use _values for size and shape
TomAugspurger Feb 12, 2018
8fcdb70
PERF: Implement size, shape for IntervalIndex
TomAugspurger Feb 12, 2018
34a6a22
PERF: Avoid materializing values for PeriodIndex shape, size
TomAugspurger Feb 12, 2018
c233c28
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 13, 2018
d6e8051
Cleanup
TomAugspurger Feb 13, 2018
3af8a21
Override nbytes
TomAugspurger Feb 13, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 19 additions & 0 deletions doc/source/internals.rst
Original file line number Diff line number Diff line change
Expand Up @@ -89,6 +89,25 @@ not check (or care) whether the levels themselves are sorted. Fortunately, the
constructors ``from_tuples`` and ``from_arrays`` ensure that this is true, but
if you compute the levels and labels yourself, please be careful.

Values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could add section tags

~~~~~~

Pandas extends NumPy's type system with custom types, like ``Categorical`` or
datetimes with a timezone, so we have multiple notions of "values". For 1-D
containers (``Index`` classes and ``Series``) we have the following convention:

* ``cls._ndarray_values`` is *always* a NumPy ``ndarray``. Ideally,
``_ndarray_values`` is cheap to compute. For example, for a ``Categorical``,
this returns the codes, not the array of objects.
* ``cls._values`` refers is the "best possible" array. This could be an
``ndarray``, ``ExtensionArray``, or in ``Index`` subclass (note: we're in the
process of removing the index subclasses here so that it's always an
``ndarray`` or ``ExtensionArray``).

So, for example, ``Series[category]._values`` is a ``Categorical``, while
``Series[category]._ndarray_values`` is the underlying codes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section does not belong in the same document as "how to subclass", as this part is really internals for contributors to pandas.
So I would maybe split this in two: external vs internal details, where the future documentation on how to use and define ExtensionArrays would also go into the external details part (maybe "Extending Pandas" would be a good title, with then information about the different strategies: accessor, extension array, subclassing, ..)

But that's for a separate PR (I can start with that), so for here the above is fine for me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal-internals.rst and external-internals.rst ;)

I think that "How to subclass", and the eventual "extending pandas with custom array types" would be better in developer.rst, which is labeled as "This section will focus on downstream applications of pandas.".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, I didn't see that the accessor documentation is actually already there (although, I personally don't find the parquet section that fitting in there, as it is not something typical you need to know when extending pandas. I can already start with moving it to the bottom of the file :-))



.. _ref-subclassing-pandas:

Subclassing pandas Data Structures
Expand Down
12 changes: 12 additions & 0 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -266,3 +266,15 @@ def _can_hold_na(self):
Setting this to false will optimize some operations like fillna.
"""
return True

@property
def _ndarray_values(self):
# type: () -> np.ndarray
"""Internal pandas method for lossy conversion to a NumPy ndarray.

This method is not part of the pandas interface.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it a bit strange that we say this is not part of the interface, but still provide here a default implementation and say what it is.

If it is not part of the interface, we could also define it on our own subclasses without defining it here.
But on the other hand, that will raise errors when an external extension array does not have this method. Eg some attributes on Series will call into this, which makes it somehow part of the interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it a bit strange that we say this is not part of the interface, but still provide here a default implementation and say what it is.

This simplifies the implementation since

a.) we don't have to have a PandasExtensionArray that subclasses ExtensionArray, just to define this method.
b.) we can safely call ._values anywhere in our code without checking whether it's an extension array or a pandas extension array.

My preference is to leave it out of the interface until someone sees an actual need for it. I suspect this could come up if / when we start allowing custom indexes with their own indexing engines, but that seems like a ways down the road...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand those reasons, but that means it is part of the interface (in the sense that if somebody would be stupid to implement a _ndarray_values property that returns something differently, it will break code?)
It's just a part of the interface that we ask not to implement?

To give an example, for GeoPandas, I was wondering if I would overwrite this property to return my self.data (the integer pointers) instead of a materialized array (which can be costly).
For things like Series.strides, Series.itemsize, .. this will be OK (and in principle no user should call this anyway .. so maybe I should not worry about it), but if it is used in other places as well that I am not really sure about, such an implementation might actually give a wrong result (but at the same time, if this is called in certain places for my column with an extension array, I want to know that, because materializing can be costly and I want to avoid this as much as possible).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. Your GeoPandas concern is a good one. But I'm not sure how to proceeded :/ Do you have a preference for

a.) Adding it to the interface
b.) Leaving it out of the interface, with a default implementation on EA
c.) Making a pandas-specific EA that defines _ndarray_values, so uses of _ndarray_values will need to check for that attr.

._values / ._ndarray_values are meant to be internal, and I don't want to limit future refactorings with backwards compatibility concerns. I'm also not even sure how to document it beyond what's already there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the best option is to leave it for now? :)

Ideally, the Series machinery should not use _ndarray_values too often, as this would require knowledge of what those values mean (codes, ordinals, ...), which Series does not need to know (if there is knowledge required, it should somehow dispatch the operation to the underlying (extension) array itself).

And this is already the case I think. From a quick scan, currently in the PR, I see ndarray_values is mainly used for:

  • indexing (values used in the engine) -> for now we don't support extension arrays to be the data for Index (apart from our own), so this is a worry for later if we want to tackle that
  • Index object operations (eg the setops) -> same as above
  • in some specific cases where we know the parent object (eg we know it is a PeriodIndex, we just want to get the ordinals)
  • for algos (eg take, unique) -> either they dispatch to EA (eg take) or for now the fallback is object array anyhow (eg value_counts), and in general this is something we still need to discuss how to handle in general (factorize, unique, ..)
  • some Series attributes (shape, itemsize, strides, size) -> shape can also be derived from _values directly (like nbytes) as it is part of the EA interface. And from the others, I think size would be nice to not have to materialize the EA (itemsize and strides I don't care about). Can we let this return len(self._values), which is part of the interface? (I am not sure to what extent .size and __len__ on numpy arrays are different / have different performance)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your assessment looks correct. Nothing that's series specific currently uses _ndarray_values, so it's really just code that is Index-specific or index / series agnostic.

I can change .shape and .size to use ._values instead of instead of ._ndarary values. I'll just have to override it in DatetimeIndex to avoid a recursion error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree with @jorisvandenbossche assessment here. maybe if we can ultimatley remove this would be good, and simply dispatch to the array object. If Index is a proper EA then this would be possible.


The expectation is that this is cheap to compute, and is primarily
used for interacting with our indexers.
"""
return np.array(self)
4 changes: 4 additions & 0 deletions pandas/core/arrays/categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -410,6 +410,10 @@ def dtype(self):
"""The :class:`~pandas.api.types.CategoricalDtype` for this instance"""
return self._dtype

@property
def _ndarray_values(self):
return self.codes

@property
def _constructor(self):
return Categorical
Expand Down
29 changes: 21 additions & 8 deletions pandas/core/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,9 @@
is_list_like,
is_scalar,
is_datetimelike,
is_extension_type)
is_categorical_dtype,
is_extension_type,
is_extension_array_dtype)

from pandas.util._validators import validate_bool_kwarg

Expand Down Expand Up @@ -710,7 +712,7 @@ def transpose(self, *args, **kwargs):
@property
def shape(self):
""" return a tuple of the shape of the underlying data """
return self._values.shape
return self._ndarray_values.shape

@property
def ndim(self):
Expand Down Expand Up @@ -738,22 +740,22 @@ def data(self):
@property
def itemsize(self):
""" return the size of the dtype of the item of the underlying data """
return self._values.itemsize
return self._ndarray_values.itemsize

@property
def nbytes(self):
""" return the number of bytes in the underlying data """
return self._values.nbytes
return self._ndarray_values.nbytes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should nbytes be called on self._values instead? (as we require a nbytes implementation on the Extension Array, and numpy arrays already have it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this caused issues for CI, but re-running tests now with this change.


@property
def strides(self):
""" return the strides of the underlying data """
return self._values.strides
return self._ndarray_values.strides

@property
def size(self):
""" return the number of elements in the underlying data """
return self._values.size
return self._ndarray_values.size

@property
def flags(self):
Expand All @@ -768,8 +770,17 @@ def base(self):
return self.values.base

@property
def _values(self):
""" the internal implementation """
def _ndarray_values(self):
"""The data as an ndarray, possibly losing information.

The expectation is that this is cheap to compute, and is primarily
used for interacting with our indexers.

- categorical -> codes
"""
# type: () -> np.ndarray
if is_extension_array_dtype(self):
return self.values._ndarray_values
return self.values

@property
Expand Down Expand Up @@ -978,7 +989,9 @@ def value_counts(self, normalize=False, sort=True, ascending=False,
def unique(self):
values = self._values

# TODO: Make unique part of the ExtensionArray interface.
if hasattr(values, 'unique'):

result = values.unique()
else:
from pandas.core.algorithms import unique1d
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/dtypes/cast.py
Original file line number Diff line number Diff line change
Expand Up @@ -927,7 +927,7 @@ def try_timedelta(v):
# will try first with a string & object conversion
from pandas import to_timedelta
try:
return to_timedelta(v)._values.reshape(shape)
return to_timedelta(v)._ndarray_values.reshape(shape)
except Exception:
return v.reshape(shape)

Expand Down
2 changes: 1 addition & 1 deletion pandas/core/dtypes/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -1709,7 +1709,7 @@ def is_extension_array_dtype(arr_or_dtype):
from pandas.core.arrays import ExtensionArray

# we want to unpack series, anything else?
if isinstance(arr_or_dtype, ABCSeries):
if isinstance(arr_or_dtype, (ABCIndexClass, ABCSeries)):
arr_or_dtype = arr_or_dtype._values
return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray))

Expand Down
8 changes: 5 additions & 3 deletions pandas/core/dtypes/concat.py
Original file line number Diff line number Diff line change
Expand Up @@ -480,20 +480,22 @@ def _concat_datetimetz(to_concat, name=None):

def _concat_index_same_dtype(indexes, klass=None):
klass = klass if klass is not None else indexes[0].__class__
return klass(np.concatenate([x._values for x in indexes]))
return klass(np.concatenate([x._ndarray_values for x in indexes]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is only used for numeric indices, so _values or _ndarray_values should not matter.
I was mainly thinking that for those cases where _ndarray_values actually differs from _values, like a categorical, the above code will not work anyway, as klass(codes) is not enough to reconstruct the categorical. So _values seems safer to me.



def _concat_index_asobject(to_concat, name=None):
"""
concat all inputs as object. DatetimeIndex, TimedeltaIndex and
PeriodIndex are converted to object dtype before concatenation
"""
from pandas import Index
from pandas.core.arrays import ExtensionArray

klasses = ABCDatetimeIndex, ABCTimedeltaIndex, ABCPeriodIndex
klasses = (ABCDatetimeIndex, ABCTimedeltaIndex, ABCPeriodIndex,
ExtensionArray)
to_concat = [x.astype(object) if isinstance(x, klasses) else x
for x in to_concat]

from pandas import Index
self = to_concat[0]
attribs = self._get_attributes_dict()
attribs['name'] = name
Expand Down
Loading