REF: Internal / External values #19558

Merged: 42 commits, Feb 13, 2018

Commits (42 total; the file diffs below show changes from 18 of them)
All commits are by TomAugspurger.

41f09d8  REF/Clean: Internal / External values (Feb 3, 2018)
29cfd7c  Move to index base (Feb 6, 2018)
3185f4e  Cleanup unique handling (Feb 7, 2018)
5a59591  Merge remote-tracking branch 'upstream/master' into index-values (Feb 7, 2018)
476f75d  Simplify object concat (Feb 7, 2018)
b15ee5a  Use values for intersection (Feb 7, 2018)
659073f  hmm (Feb 7, 2018)
7accb67  Merge remote-tracking branch 'upstream/master' into index-values (Feb 8, 2018)
9b8d2a5  Additional testing (Feb 8, 2018)
9fbac29  More tests (Feb 8, 2018)
55305dc  ndarray_values (Feb 8, 2018)
0e63708  API: Default ExtensionArray.astype (Feb 8, 2018)
fbbbc8a  Simplify concat_as_object (Feb 8, 2018)
46a0a49  Py2 compat (Feb 8, 2018)
2c4445a  Set-ops ugliness (Feb 8, 2018)
5612cda  better docstrings (Feb 8, 2018)
b012c19  tolist (Feb 8, 2018)
d49e6aa  linting (Feb 8, 2018)
d7d31ee  Moved dtypes (Feb 9, 2018)
7b89f1b  clean (Feb 9, 2018)
b0dbffd  cleanup (Feb 9, 2018)
66b936f  NumPy compat (Feb 9, 2018)
32ee0ef  Use base _values for CategoricalIndex (Feb 9, 2018)
a9882e2  Update dev docs (Feb 9, 2018)
f53652a  Merge remote-tracking branch 'upstream/master' into index-values (Feb 9, 2018)
2425621  cleanup (Feb 9, 2018)
512fb89  Merge remote-tracking branch 'upstream/master' into index-values (Feb 9, 2018)
170d0c7  Linting (Feb 9, 2018)
402620f  Precision in tests (Feb 9, 2018)
d9e8dd6  Merge remote-tracking branch 'upstream/master' into index-values (Feb 9, 2018)
815d202  Push _ndarray_values to ExtensionArray (Feb 11, 2018)
a727b21  Clean up tolist (Feb 11, 2018)
f368c29  Move test locations (Feb 11, 2018)
d74c5c9  Fixed test (Feb 12, 2018)
8104ee5  REF: Update per comments (Feb 12, 2018)
f8e29b9  lint (Feb 12, 2018)
0cd9faa  REF: Use _values for size and shape (Feb 12, 2018)
8fcdb70  PERF: Implement size, shape for IntervalIndex (Feb 12, 2018)
34a6a22  PERF: Avoid materializing values for PeriodIndex shape, size (Feb 12, 2018)
c233c28  Merge remote-tracking branch 'upstream/master' into index-values (Feb 13, 2018)
d6e8051  Cleanup (Feb 13, 2018)
3af8a21  Override nbytes (Feb 13, 2018)
15 changes: 15 additions & 0 deletions doc/source/internals.rst
@@ -89,6 +89,21 @@ not check (or care) whether the levels themselves are sorted. Fortunately, the
constructors ``from_tuples`` and ``from_arrays`` ensure that this is true, but
if you compute the levels and labels yourself, please be careful.

Values
~~~~~~

Contributor: you could add section tags

Pandas extends NumPy's type system in a few places, so we have multiple notions of "values" floating around.
Contributor: The first sentence is totally not clear to a new reader

For 1-D containers (``Index`` classes and ``Series``) we have the following convention:

* ``cls._ndarray_values`` is *always* an ``ndarray``
* ``cls._values`` is the "best possible" array. This could be an ``ndarray``, an ``ExtensionArray``, or
  an ``Index`` subclass (note: we're in the process of removing the index subclasses here so that it's
  always an ``ndarray`` or ``ExtensionArray``).

Contributor: you are using internal (pun intended) jargon here

So, for example, ``Series[category]._values`` is a ``Categorical``, while ``Series[category]._ndarray_values`` is
the underlying ndarray.
Contributor (author): Not sure what Series[categorical]._ndarray_values should be, categories or codes? PeriodIndex._ndarray_values is the ordinals, so maybe codes?

Contributor: yes
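As an illustration of the convention described above, here is a small hedged sketch. It assumes the private _values/_ndarray_values attributes introduced in this PR, with the categorical case resolving to the codes per the exchange above.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a'], dtype='category')

s._values           # the "best possible" array: a Categorical
s._ndarray_values   # always an ndarray; for category dtype, the integer codes,
                    # e.g. array([0, 1, 0], dtype=int8)
```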



.. _ref-subclassing-pandas:

Subclassing pandas Data Structures
30 changes: 30 additions & 0 deletions pandas/core/arrays/base.py
@@ -1,4 +1,6 @@
"""An interface for extending pandas with custom arrays."""
import numpy as np

from pandas.errors import AbstractMethodError

_not_implemented_message = "{} does not implement {}."
@@ -138,6 +140,34 @@ def nbytes(self):
# ------------------------------------------------------------------------
# Additional Methods
# ------------------------------------------------------------------------
def astype(self, dtype, copy=True):
"""Cast to a NumPy array with 'dtype'.

The default implementation only allows casting to 'object' dtype.

Parameters
----------
dtype : str or dtype
Typecode or data-type to which the array is cast.
copy : bool, default True
Whether to copy the data, even if not necessary. If False,
a copy is made only if the old dtype does not match the
new dtype.

Returns
-------
array : ndarray
NumPy ndarray with 'dtype' for its dtype.
"""
np_dtype = np.dtype(dtype)

if np_dtype != 'object':
msg = ("{} can only be coerced to 'object' dtype, "
"not '{}'.").format(type(self).__name__, dtype)
raise ValueError(msg)

return np.array(self, dtype=np_dtype, copy=copy)

def isna(self):
# type: () -> np.ndarray
"""Boolean NumPy array indicating if each value is missing.
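As a usage sketch of the default astype above, here is a hypothetical standalone container that mirrors the same logic (it is not a real pandas class; the ToyArray name and its methods are invented for illustration):

```python
import numpy as np


class ToyArray(object):
    """Hypothetical container mirroring the default ExtensionArray.astype logic."""

    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, item):
        return self._data[item]

    def astype(self, dtype, copy=True):
        np_dtype = np.dtype(dtype)
        if np_dtype != 'object':
            msg = ("{} can only be coerced to 'object' dtype, "
                   "not '{}'.").format(type(self).__name__, dtype)
            raise ValueError(msg)
        # np.array uses the sequence protocol (__len__/__getitem__) here.
        return np.array(self, dtype=np_dtype, copy=copy)


arr = ToyArray([1, 2, 3])
arr.astype(object)       # array([1, 2, 3], dtype=object)
# arr.astype('int64')    # raises ValueError, as in the default implementation
```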
37 changes: 28 additions & 9 deletions pandas/core/base.py
@@ -7,12 +7,14 @@
import numpy as np

from pandas.core.dtypes.missing import isna
from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries, ABCIndexClass
from pandas.core.dtypes.generic import (
ABCDataFrame, ABCSeries, ABCIndexClass, ABCDatetimeIndex)
from pandas.core.dtypes.common import (
is_object_dtype,
is_list_like,
is_scalar,
is_datetimelike,
is_categorical_dtype,
is_extension_type)

from pandas.util._validators import validate_bool_kwarg
@@ -710,7 +712,7 @@ def transpose(self, *args, **kwargs):
@property
def shape(self):
""" return a tuple of the shape of the underlying data """
return self._values.shape
return self._ndarray_values.shape

@property
def ndim(self):
@@ -738,22 +740,22 @@ def data(self):
@property
def itemsize(self):
""" return the size of the dtype of the item of the underlying data """
return self._values.itemsize
return self._ndarray_values.itemsize

@property
def nbytes(self):
""" return the number of bytes in the underlying data """
return self._values.nbytes
return self._ndarray_values.nbytes
Member: Should nbytes be called on self._values instead? (We require an nbytes implementation on the ExtensionArray, and NumPy arrays already have it.)

Contributor (author): I think this caused issues for CI, but re-running tests now with this change.
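For a rough sense of why the choice of array matters here, a hedged sketch (it assumes the _values/_ndarray_values attributes from this PR; for category dtype the two differ because the Categorical also stores its categories):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a'] * 1000, dtype='category')

s._ndarray_values.nbytes   # bytes of the int8 codes only (3000 here)
s._values.nbytes           # Categorical.nbytes: codes plus categories, slightly larger
```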


@property
def strides(self):
""" return the strides of the underlying data """
return self._values.strides
return self._ndarray_values.strides

@property
def size(self):
""" return the number of elements in the underlying data """
return self._values.size
return self._ndarray_values.size

@property
def flags(self):
Expand All @@ -768,8 +770,21 @@ def base(self):
return self.values.base

@property
def _values(self):
""" the internal implementation """
def _ndarray_values(self):
"""The data as an ndarray, possibly losing information.

The expectation is that this is cheap to compute.

- categorical -> codes

See '_values' for more.
"""
# type: () -> np.ndarray
from pandas.core.dtypes.common import is_categorical_dtype
Contributor: this does NOT belong here. you already have a sub-class EA for Categorical that can simply override this.

Member: This code is to return _ndarray_values for both Series and Index, and for Series there is in general no subclass that can override it. Of course we could say to put this logic in the Blocks, but that is then also not for Index, so I am not sure it is better.

This raises the question for me, though, of what this will return for external extension types. Since it is .values, I suppose this is a materialized array?

Contributor (author): I've moved it all the way to ExtensionArray, with the default being just returning np.array(self), same as before. It's not part of the interface though. I think this is what @jorisvandenbossche suggested here.


if is_categorical_dtype(self):
Contributor: why do we have this at all? e.g. np.array(ea) should just do the right thing, no? why are we adding something else? Note that I actually don't mind that we have an additional property that we use consistently; the question is more why __array__ does not just return _ndarray_values.

Member: Because __array__ returns the materialized array (so e.g. an array of strings if you have string categories), not the codes.

But I agree it points to something we should think about how to organize, as e.g. for periods there will also be a special case here in the future. So maybe we need a property on our own extension arrays that gives back this ndarray? (Which is not necessarily part of the external interface for extension arrays.)

return self._values.codes

return self.values

@property
Expand Down Expand Up @@ -819,8 +834,10 @@ def tolist(self):

if is_datetimelike(self):
Contributor: this should be overridden in EA, rather than specific dispatching via if/else here; IOW it should be part of the interface, or be defined as list(.values).

Member: I am not sure it is needed to add a tolist to the interface; I think in general it can rely on __iter__/__getitem__ (so list(self._values)). The only problem in that case is that we still need to distinguish between normal numeric types (where the above would return numpy scalars, not python scalars) and other types where this already gives the correct result.

return [com._maybe_box_datetimelike(x) for x in self._values]
elif is_categorical_dtype(self):
return self.values.tolist()
else:
return self._values.tolist()
return self._ndarray_values.tolist()

def __iter__(self):
"""
@@ -978,7 +995,9 @@ def value_counts(self, normalize=False, sort=True, ascending=False,
def unique(self):
values = self._values

# TODO: Make unique part of the ExtensionArray interface.
if hasattr(values, 'unique'):

result = values.unique()
else:
from pandas.core.algorithms import unique1d
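To make the dispatch in unique above concrete, a short illustrative walkthrough (it assumes a category-dtype Series; Categorical defines its own unique, while plain NumPy-backed values would fall through to unique1d):

```python
import pandas as pd
from pandas.core.algorithms import unique1d

s = pd.Series(['a', 'b', 'a'], dtype='category')
values = s._values                 # a Categorical

if hasattr(values, 'unique'):      # Categorical defines unique(), so this branch runs
    result = values.unique()       # Categorical(['a', 'b'])
else:                              # ndarray path for plain NumPy-backed data
    result = unique1d(values)
```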
8 changes: 5 additions & 3 deletions pandas/core/dtypes/concat.py
@@ -480,20 +480,22 @@ def _concat_datetimetz(to_concat, name=None):

def _concat_index_same_dtype(indexes, klass=None):
klass = klass if klass is not None else indexes[0].__class__
return klass(np.concatenate([x._values for x in indexes]))
return klass(np.concatenate([x._ndarray_values for x in indexes]))
Member: This one is only used for numeric indices, so _values or _ndarray_values should not matter. I was mainly thinking that for those cases where _ndarray_values actually differs from _values, like a categorical, the above code will not work anyway, as klass(codes) is not enough to reconstruct the categorical. So _values seems safer to me.
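A short sketch of the point above (illustrative only; it assumes CategoricalIndex._ndarray_values resolves to the codes under this PR):

```python
import pandas as pd

ci = pd.CategoricalIndex(['a', 'b', 'a'])

codes = ci._ndarray_values         # integer codes, e.g. array([0, 1, 0], dtype=int8)

# Rebuilding from the codes alone loses the original categories:
pd.CategoricalIndex(codes)         # categories become [0, 1], not ['a', 'b']

# Rebuilding from _values (the Categorical itself) round-trips correctly:
pd.CategoricalIndex(ci._values)    # CategoricalIndex(['a', 'b', 'a'], categories=['a', 'b'], ...)
```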



def _concat_index_asobject(to_concat, name=None):
"""
concat all inputs as object. DatetimeIndex, TimedeltaIndex and
PeriodIndex are converted to object dtype before concatenation
"""
from pandas import Index
from pandas.core.arrays import ExtensionArray

klasses = ABCDatetimeIndex, ABCTimedeltaIndex, ABCPeriodIndex
klasses = (ABCDatetimeIndex, ABCTimedeltaIndex, ABCPeriodIndex,
ExtensionArray)
to_concat = [x.astype(object) if isinstance(x, klasses) else x
for x in to_concat]

from pandas import Index
self = to_concat[0]
attribs = self._get_attributes_dict()
attribs['name'] = name