REF: Internal / External values #19558

Merged Feb 13, 2018 (42 commits)

Changes from 1 commit

Commits (42)
41f09d8  REF/Clean: Internal / External values (TomAugspurger, Feb 3, 2018)
29cfd7c  Move to index base (TomAugspurger, Feb 6, 2018)
3185f4e  Cleanup unique handling (TomAugspurger, Feb 7, 2018)
5a59591  Merge remote-tracking branch 'upstream/master' into index-values (TomAugspurger, Feb 7, 2018)
476f75d  Simplify object concat (TomAugspurger, Feb 7, 2018)
b15ee5a  Use values for intersection (TomAugspurger, Feb 7, 2018)
659073f  hmm (TomAugspurger, Feb 7, 2018)
7accb67  Merge remote-tracking branch 'upstream/master' into index-values (TomAugspurger, Feb 8, 2018)
9b8d2a5  Additional testing (TomAugspurger, Feb 8, 2018)
9fbac29  More tests (TomAugspurger, Feb 8, 2018)
55305dc  ndarray_values (TomAugspurger, Feb 8, 2018)
0e63708  API: Default ExtensionArray.astype (TomAugspurger, Feb 8, 2018)
fbbbc8a  Simplify concat_as_object (TomAugspurger, Feb 8, 2018)
46a0a49  Py2 compat (TomAugspurger, Feb 8, 2018)
2c4445a  Set-ops ugliness (TomAugspurger, Feb 8, 2018)
5612cda  better docstrings (TomAugspurger, Feb 8, 2018)
b012c19  tolist (TomAugspurger, Feb 8, 2018)
d49e6aa  linting (TomAugspurger, Feb 8, 2018)
d7d31ee  Moved dtypes (TomAugspurger, Feb 9, 2018)
7b89f1b  clean (TomAugspurger, Feb 9, 2018)
b0dbffd  cleanup (TomAugspurger, Feb 9, 2018)
66b936f  NumPy compat (TomAugspurger, Feb 9, 2018)
32ee0ef  Use base _values for CategoricalIndex (TomAugspurger, Feb 9, 2018)
a9882e2  Update dev docs (TomAugspurger, Feb 9, 2018)
f53652a  Merge remote-tracking branch 'upstream/master' into index-values (TomAugspurger, Feb 9, 2018)
2425621  cleanup (TomAugspurger, Feb 9, 2018)
512fb89  Merge remote-tracking branch 'upstream/master' into index-values (TomAugspurger, Feb 9, 2018)
170d0c7  Linting (TomAugspurger, Feb 9, 2018)
402620f  Precision in tests (TomAugspurger, Feb 9, 2018)
d9e8dd6  Merge remote-tracking branch 'upstream/master' into index-values (TomAugspurger, Feb 9, 2018)
815d202  Push _ndarray_values to ExtensionArray (TomAugspurger, Feb 11, 2018)
a727b21  Clean up tolist (TomAugspurger, Feb 11, 2018)
f368c29  Move test locations (TomAugspurger, Feb 11, 2018)
d74c5c9  Fixed test (TomAugspurger, Feb 12, 2018)
8104ee5  REF: Update per comments (TomAugspurger, Feb 12, 2018)
f8e29b9  lint (TomAugspurger, Feb 12, 2018)
0cd9faa  REF: Use _values for size and shape (TomAugspurger, Feb 12, 2018)
8fcdb70  PERF: Implement size, shape for IntervalIndex (TomAugspurger, Feb 12, 2018)
34a6a22  PERF: Avoid materializing values for PeriodIndex shape, size (TomAugspurger, Feb 12, 2018)
c233c28  Merge remote-tracking branch 'upstream/master' into index-values (TomAugspurger, Feb 13, 2018)
d6e8051  Cleanup (TomAugspurger, Feb 13, 2018)
3af8a21  Override nbytes (TomAugspurger, Feb 13, 2018)
doc/source/internals.rst (15 additions, 0 deletions)
@@ -89,6 +89,21 @@ not check (or care) whether the levels themselves are sorted. Fortunately, the
constructors ``from_tuples`` and ``from_arrays`` ensure that this is true, but
if you compute the levels and labels yourself, please be careful.

Values
~~~~~~

> Contributor: you could add section tags

Pandas extends NumPy's type system in a few places, so we have multiple notions of "values" floating around.

> Contributor: The first sentence is totally not clear to a new reader

For 1-D containers (``Index`` classes and ``Series``) we have the following convention:

* ``cls._ndarray_values`` is *always* an ``ndarray``
* ``cls._values`` is the "best possible" array. This could be an ``ndarray``, ``ExtensionArray``, or
  an ``Index`` subclass (note: we're in the process of removing the index subclasses here so that it's
  always an ``ndarray`` or ``ExtensionArray``).

> Contributor: you are using internal (pun intended) jargon here

So, for example, ``Series[category]._values`` is a ``Categorical``, while ``Series[category]._ndarray_values`` is
the underlying ndarray.

> Contributor (Author): Not sure what Series[categorical]._ndarray_values should be, categories or codes? PeriodIndex._ndarray_values is the ordinals, so maybe codes?

> Contributor: yes
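
For concreteness, a rough sketch of the convention described above (assuming a pandas build from this era; _values and _ndarray_values are private attributes and changed in later releases):

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

s._values          # the Categorical itself
s._ndarray_values  # an ndarray; per the thread above, the integer codes
```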



.. _ref-subclassing-pandas:

Subclassing pandas Data Structures
pandas/core/base.py (39 additions, 9 deletions)
@@ -7,7 +7,8 @@
import numpy as np

from pandas.core.dtypes.missing import isna
from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries, ABCIndexClass
from pandas.core.dtypes.generic import (
ABCDataFrame, ABCSeries, ABCIndexClass, ABCDatetimeIndex)
from pandas.core.dtypes.common import (
is_object_dtype,
is_list_like,
@@ -706,7 +707,7 @@ def transpose(self, *args, **kwargs):
@property
def shape(self):
""" return a tuple of the shape of the underlying data """
return self._values.shape
return self._ndarray_values.shape

@property
def ndim(self):
@@ -734,22 +735,22 @@ def data(self):
@property
def itemsize(self):
""" return the size of the dtype of the item of the underlying data """
return self._values.itemsize
return self._ndarray_values.itemsize

@property
def nbytes(self):
""" return the number of bytes in the underlying data """
return self._values.nbytes
return self._ndarray_values.nbytes

> Member: Should nbytes be called on self._values instead? (as we require a nbytes implementation on the ExtensionArray, and numpy arrays already have it)

> Contributor (Author): I think this caused issues for CI, but re-running tests now with this change.

@property
def strides(self):
""" return the strides of the underlying data """
return self._values.strides
return self._ndarray_values.strides

@property
def size(self):
""" return the number of elements in the underlying data """
return self._values.size
return self._ndarray_values.size

@property
def flags(self):
@@ -763,9 +764,34 @@ def base(self):
"""
return self.values.base

@property
def _ndarray_values(self):
"""The data as an ndarray. See '_values' for more."""
# type: () -> np.ndarray
return self.values

@property
def _values(self):
""" the internal implementation """
# type: () -> Union[ExtensionArray, Index]
# TODO: remove index types as they become extension arrays
""" The best array representation.

This is an ndarray, ExtensionArray, or Index subclass. This differs
from '._ndarray_values', which always returns an ndarray. It may differ
from the public '.values'.

index             | values          | _values
------------------|-----------------|------------
CategoricalIndex  | Categorical     | Categorical
DatetimeIndex[tz] | ndarray[M8ns]   | DTI[tz]
PeriodIndex       | ndarray[Period] | ndarray[Pd] (soon PeriodArray)
IntervalIndex     | ndarray[IV]     | ndarray[IV] (soon IntervalArray)

See Also
--------
values
_ndarray_values
"""
return self.values
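
Illustrating the table above with the tz-aware case (a sketch under the same era assumptions; these private attributes changed later):

```python
import pandas as pd

dti = pd.date_range("2018-01-01", periods=2, tz="US/Eastern")

dti.values           # ndarray[datetime64[ns]], converted to UTC (tz dropped)
dti._values          # the DatetimeIndex[tz] itself, per the table
dti._ndarray_values  # always a plain ndarray
```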

@property
@@ -816,7 +842,7 @@ def tolist(self):
if is_datetimelike(self):
return [com._maybe_box_datetimelike(x) for x in self._values]
else:
return self._values.tolist()
return self._ndarray_values.tolist()

> Contributor: this should be overridden in EA, rather than specific dispatching via if/else here, IOW it should be a part of the interface, or be defined as list(.values)

> Member: I am not sure it is needed to add a tolist to the interface, I think in general it can rely on __iter__/__getitem__ (so list(self._values)). The only problem in that case is that we still need to distinguish between normal numeric types (where the above would return numpy scalars and not python scalars) and other types where this already gives the correct result.
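
The scalar-boxing distinction behind that branch, in miniature (plain NumPy, illustrative only):

```python
import numpy as np

arr = np.array([1, 2, 3])
type(list(arr)[0])     # numpy.int64 -- iterating yields NumPy scalars
type(arr.tolist()[0])  # int -- ndarray.tolist() boxes to Python scalars
```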

def __iter__(self):
"""
@@ -973,8 +999,12 @@ def value_counts(self, normalize=False, sort=True, ascending=False,
@Appender(_shared_docs['unique'] % _indexops_doc_kwargs)
def unique(self):
values = self._values

if isinstance(values, ABCDatetimeIndex):
values = values._ndarray_values
# TODO: Make unique part of the ExtensionArray interface.
# else, this could be surprising.
if hasattr(values, 'unique'):
result = values.unique()
else:
from pandas.core.algorithms import unique1d

> Contributor: why are you special casing?
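
The duck-typed dispatch above, in isolation (a sketch; np.unique stands in for pandas' unique1d here, though unlike it np.unique also sorts):

```python
import numpy as np

def duck_unique(values):
    # Prefer a type-specific .unique() when the values define one
    # (e.g. Categorical's, which preserves the dtype)...
    if hasattr(values, 'unique'):
        return values.unique()
    # ...otherwise fall back to a generic ndarray routine.
    return np.unique(values)
```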
pandas/core/dtypes/concat.py (11 additions, 4 deletions)
@@ -480,7 +480,7 @@ def _concat_datetimetz(to_concat, name=None):

def _concat_index_same_dtype(indexes, klass=None):
klass = klass if klass is not None else indexes[0].__class__
return klass(np.concatenate([x._values for x in indexes]))
return klass(np.concatenate([x._ndarray_values for x in indexes]))

> Member: This one is only used for numeric indices, so _values or _ndarray_values should not matter. I was mainly thinking that for those cases where _ndarray_values actually differs from _values, like a categorical, the above code will not work anyway, as klass(codes) is not enough to reconstruct the categorical. So _values seems safer to me.
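
Why klass(codes) cannot round-trip a categorical, as the comment notes (a hypothetical: this helper is only ever called with numeric indexes):

```python
import pandas as pd

ci = pd.CategoricalIndex(["b", "a", "b"])
codes = ci._ndarray_values  # the codes, e.g. array([1, 0, 1], dtype=int8)
pd.CategoricalIndex(codes)  # categories become {0, 1}; the labels are lost
```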



def _concat_index_asobject(to_concat, name=None):
@@ -498,9 +498,16 @@
attribs = self._get_attributes_dict()
attribs['name'] = name

to_concat = [x._values if isinstance(x, Index) else x
for x in to_concat]
return self._shallow_copy_with_infer(np.concatenate(to_concat), **attribs)
arrays = []
for x in to_concat:
if is_categorical_dtype(x):
arrays.append(np.asarray(x, dtype=object))
elif isinstance(x, Index):
arrays.append(x._values)
else:
arrays.append(x)

return self._shallow_copy_with_infer(np.concatenate(arrays), **attribs)

> Contributor: why is this a special case?

> Member: @TomAugspurger don't we want np.asarray(x, dtype=object) for all extension dtypes in this case? So for Period and Interval, _values (used some lines below for Index types) currently still returns an object array, but I think the purpose is that those will also return their own Array class, so using np.asarray(x, dtype=object) will do the same now and is more future proof?
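
What the categorical branch appends (assumed behavior; np.asarray goes through the Categorical's array conversion):

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["a", "b", "a"])
np.asarray(cat, dtype=object)  # array(['a', 'b', 'a'], dtype=object)
# i.e. the materialized values as objects, not the integer codes
```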


def _concat_sparse(to_concat, axis=0, typs=None):
pandas/core/indexes/base.py (40 additions, 25 deletions)
@@ -392,7 +392,7 @@ def _simple_new(cls, values, name=None, dtype=None, **kwargs):
values = np.array(values, copy=False)
if is_object_dtype(values):
values = cls(values, name=name, dtype=dtype,
**kwargs)._values
**kwargs)._ndarray_values

result = object.__new__(cls)
result._data = values
@@ -644,7 +644,7 @@ def ravel(self, order='C'):
--------
numpy.ndarray.ravel
"""
return self._values.ravel(order=order)
return self._ndarray_values.ravel(order=order)

# construction helpers
@classmethod
@@ -1577,7 +1577,7 @@ def _constructor(self):
@cache_readonly
def _engine(self):
# property, for now, slow to look up
return self._engine_type(lambda: self._values, len(self))
return self._engine_type(lambda: self._ndarray_values, len(self))

def _validate_index_level(self, level):
"""
@@ -2208,27 +2208,37 @@ def union(self, other):
other = other.astype('O')
return this.union(other)

if is_categorical_dtype(self):
lvals = self.values
else:
lvals = self._ndarray_values

if is_categorical_dtype(other):
rvals = other.values
else:
rvals = other._ndarray_values

> Contributor: why are these special cases?

if self.is_monotonic and other.is_monotonic:
try:
result = self._outer_indexer(self._values, other._values)[0]
result = self._outer_indexer(lvals, rvals)[0]
except TypeError:
# incomparable objects
result = list(self._values)
result = list(lvals)

# worth making this faster? a very unusual case
value_set = set(self._values)
result.extend([x for x in other._values if x not in value_set])
value_set = set(lvals)
result.extend([x for x in rvals if x not in value_set])
else:
indexer = self.get_indexer(other)
indexer, = (indexer == -1).nonzero()

if len(indexer) > 0:
other_diff = algos.take_nd(other._values, indexer,
other_diff = algos.take_nd(rvals, indexer,
allow_fill=False)
result = _concat._concat_compat((self._values, other_diff))
result = _concat._concat_compat((lvals, other_diff))

try:
self._values[0] < other_diff[0]
lvals[0] < other_diff[0]
except TypeError as e:
warnings.warn("%s, sort order is undefined for "
"incomparable objects" % e, RuntimeWarning,
Expand All @@ -2240,7 +2250,7 @@ def union(self, other):
result.sort()

else:
result = self._values
result = lvals

try:
result = np.sort(result)
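
The set-based fallback for incomparable values, reduced to its core (a sketch, not the actual pandas code path):

```python
lvals = ["b", "a", "c"]
rvals = ["c", "d"]

value_set = set(lvals)
result = list(lvals) + [x for x in rvals if x not in value_set]
# ['b', 'a', 'c', 'd'] -- keep all of lvals, append unseen items of rvals
```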
@@ -2293,18 +2303,21 @@ def intersection(self, other):

if self.is_monotonic and other.is_monotonic:
try:
result = self._inner_indexer(self._values, other._values)[0]
result = self._inner_indexer(self._ndarray_values,
other._ndarray_values)[0]
return self._wrap_union_result(other, result)
except TypeError:
pass

try:
indexer = Index(other._values).get_indexer(self._values)
indexer = Index(other._ndarray_values).get_indexer(
self._ndarray_values)
indexer = indexer.take((indexer != -1).nonzero()[0])
except Exception:
# duplicates
indexer = algos.unique1d(
Index(other._values).get_indexer_non_unique(self._values)[0])
Index(other._ndarray_values).get_indexer_non_unique(
self._ndarray_values)[0])
indexer = indexer[indexer != -1]

taken = other.take(indexer)
@@ -2680,7 +2693,7 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):
raise ValueError('limit argument only valid if doing pad, '
'backfill or nearest reindexing')

indexer = self._engine.get_indexer(target._values)
indexer = self._engine.get_indexer(target._ndarray_values)

return _ensure_platform_int(indexer)

@@ -2696,12 +2709,13 @@ def _get_fill_indexer(self, target, method, limit=None, tolerance=None):
if self.is_monotonic_increasing and target.is_monotonic_increasing:
method = (self._engine.get_pad_indexer if method == 'pad' else
self._engine.get_backfill_indexer)
indexer = method(target._values, limit)
indexer = method(target._ndarray_values, limit)
else:
indexer = self._get_fill_indexer_searchsorted(target, method,
limit)
if tolerance is not None:
indexer = self._filter_indexer_tolerance(target._values, indexer,
indexer = self._filter_indexer_tolerance(target._ndarray_values,
indexer,
tolerance)
return indexer

@@ -2792,7 +2806,7 @@ def get_indexer_non_unique(self, target):
self = Index(self.asi8)
tgt_values = target.asi8
else:
tgt_values = target._values
tgt_values = target._ndarray_values

indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
return _ensure_platform_int(indexer), missing
@@ -3227,16 +3241,17 @@ def _join_multi(self, other, how, return_indexers=True):
def _join_non_unique(self, other, how='left', return_indexers=False):
from pandas.core.reshape.merge import _get_join_indexers

left_idx, right_idx = _get_join_indexers([self._values],
[other._values], how=how,
left_idx, right_idx = _get_join_indexers([self._ndarray_values],
[other._ndarray_values],
how=how,
sort=True)

left_idx = _ensure_platform_int(left_idx)
right_idx = _ensure_platform_int(right_idx)

join_index = np.asarray(self._values.take(left_idx))
join_index = np.asarray(self._ndarray_values.take(left_idx))
mask = left_idx == -1
np.putmask(join_index, mask, other._values.take(right_idx))
np.putmask(join_index, mask, other._ndarray_values.take(right_idx))

join_index = self._wrap_joined_index(join_index, other)

@@ -3383,8 +3398,8 @@ def _join_monotonic(self, other, how='left', return_indexers=False):
else:
return ret_index

sv = self._values
ov = other._values
sv = self._ndarray_values
ov = other._ndarray_values

if self.is_unique and other.is_unique:
# We can perform much better than the general case
@@ -3736,7 +3751,7 @@ def insert(self, loc, item):
item = self._na_value

_self = np.asarray(self)
item = self._coerce_scalar_to_index(item)._values
item = self._coerce_scalar_to_index(item)._ndarray_values
idx = np.concatenate((_self[:loc], item, _self[loc:]))
return self._shallow_copy_with_infer(idx)
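
The splice pattern used by insert(), in plain NumPy (sketch):

```python
import numpy as np

arr = np.array([1, 2, 4])
loc, item = 2, np.array([3])
np.concatenate((arr[:loc], item, arr[loc:]))  # array([1, 2, 3, 4])
```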

pandas/core/indexes/category.py (21 additions, 4 deletions)
@@ -227,7 +227,7 @@ def _is_dtype_compat(self, other):
"""
if is_categorical_dtype(other):
if isinstance(other, CategoricalIndex):
other = other._values
other = other.values
if not other.is_dtype_equal(self):
raise TypeError("categories must match existing categories "
"when appending")
@@ -293,6 +293,23 @@ def values(self):
""" return the underlying data, which is a Categorical """
return self._data

@property
def _values(self):
return self._data

> Member: Is there a difference between ._data and .values? As below you use .values for itemsize and nbytes, but both seem to point to the underlying Categorical.

> Contributor (Author): Probably just me being inconsistent.

@property
def _ndarray_values(self):
return self._data.codes

@property
def itemsize(self):
return self.values.itemsize

> Member: let's add a comment why you are overwriting this here (default is to use _ndarray_values, which are codes, but Categorical defines itemsize as the itemsize of the categories). Also, would it be more consistent to do self._values.itemsize? (not that it would differ in this case)

> Contributor (Author): More consistent with the parent's implementation, yes, but I'd prefer to use the public API where possible.

> Member (quoting the above): But the public API is not meant to be consistent, but to maintain backwards compatibility :-) (at least for now)

@property
def nbytes(self):
""" return the number of bytes in the underlying data """
return self.values.nbytes
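
A sketch of these overrides together (assumed era behavior, consistent with the review comments above; the exact byte accounting is an assumption):

```python
import pandas as pd

ci = pd.CategoricalIndex(["a", "b", "a"])

ci._values          # the underlying Categorical
ci._ndarray_values  # the codes, e.g. array([0, 1, 0], dtype=int8)
ci.itemsize         # itemsize of the categories, not of the int8 codes
ci.nbytes           # Categorical.nbytes, not just the codes' bytes
```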

def get_values(self):
""" return the underlying data as an ndarray """
return self._data.get_values()
@@ -386,8 +403,8 @@ def is_monotonic_decreasing(self):
def unique(self, level=None):
if level is not None:
self._validate_index_level(level)
result = base.IndexOpsMixin.unique(self)
# CategoricalIndex._shallow_copy uses keeps original categories
result = self.values.unique()
# CategoricalIndex._shallow_copy keeps original categories
# and ordered if not otherwise specified
return self._shallow_copy(result, categories=result.categories,
ordered=result.ordered)
@@ -762,7 +779,7 @@ def _evaluate_compare(self, other):

def _delegate_method(self, name, *args, **kwargs):
""" method delegation to the ._values """
method = getattr(self._values, name)
method = getattr(self.values, name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be unnecessary. Reverting.

if 'inplace' in kwargs:
raise ValueError("cannot use inplace with CategoricalIndex")
res = method(*args, **kwargs)
pandas/core/indexes/datetimelike.py (1 addition, 1 deletion)
@@ -389,7 +389,7 @@ def sort_values(self, return_indexer=False, ascending=True):
sorted_index = self.take(_as)
return sorted_index, _as
else:
sorted_values = np.sort(self._values)
sorted_values = np.sort(self._ndarray_values)
attribs = self._get_attributes_dict()
freq = attribs['freq']
