Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExtensionArray.take default implementation #20814

Merged
merged 32 commits into from
Apr 27, 2018
Merged
Show file tree
Hide file tree
Changes from 24 commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
fb3c234
[WIP]: ExtensionArray.take default implementation
TomAugspurger Apr 24, 2018
dacd98e
Moves
TomAugspurger Apr 24, 2018
0be9ec6
non-internals changes
TomAugspurger Apr 24, 2018
08f2479
Fixup JSON take
TomAugspurger Apr 25, 2018
eba137f
More internals hacking
TomAugspurger Apr 25, 2018
67ba9dd
more with default fill value
TomAugspurger Apr 25, 2018
37915e9
Merge remote-tracking branch 'upstream/master' into ea-take
TomAugspurger Apr 25, 2018
c721915
Removed default_fill_value
TomAugspurger Apr 25, 2018
125ca0b
Simplify
TomAugspurger Apr 25, 2018
b7ae0bc
revert combine change
TomAugspurger Apr 25, 2018
338566f
Upcasting
TomAugspurger Apr 25, 2018
31cd304
pep8
TomAugspurger Apr 25, 2018
05d8844
Updated docs
TomAugspurger Apr 25, 2018
69e7fe7
Updates
TomAugspurger Apr 25, 2018
449983b
Linting
TomAugspurger Apr 25, 2018
c449afd
Bounds checking
TomAugspurger Apr 25, 2018
82cad8b
Stale comment
TomAugspurger Apr 25, 2018
d5470a0
Fixed reorder
TomAugspurger Apr 25, 2018
bbcbf19
Cleanup
TomAugspurger Apr 25, 2018
1a4d987
Pass an array
TomAugspurger Apr 25, 2018
fc729d6
Fixed editor
TomAugspurger Apr 25, 2018
74b2c09
Added verisonadded
TomAugspurger Apr 26, 2018
5db6624
Doc and move tests
TomAugspurger Apr 26, 2018
741f284
Merge remote-tracking branch 'upstream/master' into ea-take
TomAugspurger Apr 26, 2018
fbc4425
Updates
TomAugspurger Apr 26, 2018
f3b91ca
doc fixup
TomAugspurger Apr 26, 2018
eecd632
Skip categorical take tests
TomAugspurger Apr 26, 2018
9a6c7d4
Doc updates
TomAugspurger Apr 27, 2018
7c4f625
Merge remote-tracking branch 'upstream/master' into ea-take
TomAugspurger Apr 27, 2018
eb43fa4
Really truly fix it hopefully.
TomAugspurger Apr 27, 2018
6858409
Added note
TomAugspurger Apr 27, 2018
ec0cecd
Updates
TomAugspurger Apr 27, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions pandas/api/extensions/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,5 +2,6 @@
from pandas.core.accessor import (register_dataframe_accessor, # noqa
register_index_accessor,
register_series_accessor)
from pandas.core.algorithms import take # noqa
from pandas.core.arrays.base import ExtensionArray # noqa
from pandas.core.dtypes.dtypes import ExtensionDtype # noqa
88 changes: 88 additions & 0 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -1448,6 +1448,94 @@ def func(arr, indexer, out, fill_value=np.nan):
return func


def take(arr, indexer, allow_fill=False, fill_value=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding keyword names, np.take and pd.Series.take use indices instead of indexer, so that might be a reason to use that as well (although in pandas we often speak about an indexer, although that is also a more general term than just integer indices)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change to indices.

"""Take elements from an array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick, but in the docstring guidelines we now say to start the text on the next line


.. versionadded:: 0.23.0

Parameters
----------
arr : sequence
Non array-likes (sequences without a dtype) are coereced
to an ndarray.
indexer : sequence of integers
Indices to be taken.
allow_fill : bool, default False
How to handle negative values in `indexer`.

* False: negative values in `indexer` indicate
slices from the right (the default)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "slices" -> "indexed from the right" ("slice" has also a different meaning in indexing).

I would maybe also mention this is the similar as numpy.take.


* True: negative values in `indexer` indicate
missing values. These values are set to `fill_value`. Any other
other negative values raise a ``ValueError``.

fill_value : any, optional
Fill value to use for NA-indicies when `allow_fill` is True.
This may be ``None``, in which case the default NA value for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"This may be None" -> "By default" ? (or at least mention "default" somewhere)

the type, ``self.dtype.na_value``, is used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, referring to self.dtype.na_value is not really correct, as that is EA specific. I think for this function, the default is NaN ? (since it relies on take_1d)


Returns
-------
ndarray or ExtensionArray
Same type as the input.

Raises
------
IndexError
When the indexer is out of bounds for the array.
ValueError
When the indexer contains negative values other than ``-1``
and `allow_fill` is True.

Notes
-----
When `allow_fill` is False, `indexer` may be whatever dimensionality
is accepted by NumPy for `arr`.

When `allow_fill` is True, `indexer` should be 1-D.

See Also
--------
numpy.take

Examples
--------
>>> from pandas.api.extensions import take

With the default ``allow_fill=False``, negative numbers indicate
slices from the right.

>>> take(np.array([10, 20, 30]), [0, 0, -1])
array([10, 10, 30])

Setting ``allow_fill=True`` will place `fill_value` in those positions.

>>> take(np.array([10, 20, 30]), [0, 0, -1], allow_fill=True)
array([10., 10., nan])

>>> take(np.array([10, 20, 30]), [0, 0, -1], allow_fill=True,
... fill_value=-10)
array([ 10, 10, -10])
"""
from pandas.core.indexing import validate_indices

if not is_array_like(arr):
arr = np.asarray(arr)

# Do we require int64 or intp here?
indexer = np.asarray(indexer, dtype='int')

if allow_fill:
# Pandas style, -1 means NA
validate_indices(indexer, len(arr))
result = take_1d(arr, indexer, allow_fill=True, fill_value=fill_value)
else:
# NumPy style
result = arr.take(indexer)
return result


def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None,
allow_fill=True):
"""
Expand Down
73 changes: 39 additions & 34 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -462,22 +462,29 @@ def factorize(self, na_sentinel=-1):
# ------------------------------------------------------------------------
# Indexing methods
# ------------------------------------------------------------------------
def take(self, indexer, allow_fill=True, fill_value=None):

def take(self, indexer, allow_fill=False, fill_value=None):
# type: (Sequence[int], bool, Optional[Any]) -> ExtensionArray
"""Take elements from an array.

Parameters
----------
indexer : sequence of integers
indices to be taken. -1 is used to indicate values
that are missing.
allow_fill : bool, default True
If False, indexer is assumed to contain no -1 values so no filling
will be done. This short-circuits computation of a mask. Result is
undefined if allow_fill == False and -1 is present in indexer.
fill_value : any, default None
Fill value to replace -1 values with. If applicable, this should
use the sentinel missing value for this type.
Indices to be taken. See Notes for how negative indicies
are handled.
allow_fill : bool, default False
How to handle negative values in `indexer`.

For False values (the default), negative values in `indexer`
indiciate slices from the right.

For True values, indicies where `indexer` is ``-1`` indicate
missing values. These values are set to `fill_value`. Any other
other negative value should raise a ``ValueError``.
fill_value : any, optional
Fill value to use for NA-indicies when `allow_fill` is True.
This may be ``None``, in which case the default NA value for
the type, ``self.dtype.na_value``, is used.

Returns
-------
Expand All @@ -487,42 +494,40 @@ def take(self, indexer, allow_fill=True, fill_value=None):
------
IndexError
When the indexer is out of bounds for the array.
ValueError
When the indexer contains negative values other than ``-1``
and `allow_fill` is True.

Notes
-----
This should follow pandas' semantics where -1 indicates missing values.
Positions where indexer is ``-1`` should be filled with the missing
value for this type.
This gives rise to the special case of a take on an empty
ExtensionArray that does not raises an IndexError straight away
when the `indexer` is all ``-1``.
ExtensionArray.take is called by ``Series.__getitem__``, ``.loc``,
``iloc``, when the indexer is a sequence of values. Additionally,
it's called by :meth:`Series.reindex`, or any other method
that causes realignemnt, with a `fill_value`.

This is called by ``Series.__getitem__``, ``.loc``, ``iloc``, when the
indexer is a sequence of values.
See Also
--------
numpy.take
pandas.api.extensions.take

Examples
--------
Suppose the extension array is backed by a NumPy array stored as
``self.data``. Then ``take`` may be written as
Here's an example implementation, which relies on casting the
extension array to object dtype. This uses the helper method
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally found the previous example that uses something like self.data more illustrating how it could look like (except the part on checking for empty, and that the masking would not be done manually anymore). But of course, the example that you did now has the advantage of being an actual working implementation (only one that you won't use in many cases).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on the new example? I added a comment about how to proceed if your array is backed by an ndarray.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a similar comment to the fill_value = self.dtype.na_value line ? (that there you can pass the low-level value that is used in the backing non-object ndarray instead self.dtype.na_value

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see you added a note about that in the normal comments below the docstring (but maybe still add a short note like "pass here the appropriate NA value for the physical storage, see below")

:func:`pandas.api.extensions.take`.

.. code-block:: python

def take(self, indexer, allow_fill=True, fill_value=None):
indexer = np.asarray(indexer)
mask = indexer == -1
def take(self, indexer, allow_fill=False, fill_value=None):
from pandas.core.algorithms import take

# take on empty array not handled as desired by numpy
# in case of -1 (all missing take)
if not len(self) and mask.all():
return type(self)([np.nan] * len(indexer))
data = self.astype(object)
if allow_fill and fill_value is None:
fill_value = self.dtype.na_value

result = self.data.take(indexer)
result[mask] = np.nan # NA for this type
return type(self)(result)

See Also
--------
numpy.take
result = take(data, indexer, fill_value=fill_value,
allow_fill=allow_fill)
return self._from_sequence(result)
"""
raise AbstractMethodError(self)

Expand Down
9 changes: 9 additions & 0 deletions pandas/core/dtypes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,10 @@ class _DtypeOpsMixin(object):
# classes will inherit from this Mixin. Once everything is compatible, this
# class's methods can be moved to ExtensionDtype and removed.

# na_value is the default NA value to use for this type. This is used in
# e.g. ExtensionArray.take.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the "used in EA.take" is misleading. In many cases, it will not be used in take, as this na_value is your "boxed" scalar NA type, not the underlying physical NA value.

na_value = np.nan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe even show an example (from JSON?)


def __eq__(self, other):
"""Check whether 'other' is equal to self.

Expand Down Expand Up @@ -92,6 +96,8 @@ def is_dtype(cls, dtype):
class ExtensionDtype(_DtypeOpsMixin):
"""A custom data type, to be paired with an ExtensionArray.

.. versionadded:: 0.23.0

Notes
-----
The interface includes the following abstract methods that must
Expand All @@ -101,6 +107,9 @@ class ExtensionDtype(_DtypeOpsMixin):
* name
* construct_from_string

The `na_value` class attribute can be used to set the default NA value
for this type. :attr:`numpy.nan` is used by default.

This class does not inherit from 'abc.ABCMeta' for performance reasons.
Methods and properties required by the interface raise
``pandas.errors.AbstractMethodError`` and no ``register`` method is
Expand Down
3 changes: 2 additions & 1 deletion pandas/core/dtypes/cast.py
Original file line number Diff line number Diff line change
Expand Up @@ -255,7 +255,6 @@ def changeit():


def maybe_promote(dtype, fill_value=np.nan):

# if we passed an array here, determine the fill value by dtype
if isinstance(fill_value, np.ndarray):
if issubclass(fill_value.dtype.type, (np.datetime64, np.timedelta64)):
Expand Down Expand Up @@ -294,6 +293,8 @@ def maybe_promote(dtype, fill_value=np.nan):
elif is_datetimetz(dtype):
if isna(fill_value):
fill_value = iNaT
elif is_extension_array_dtype(dtype) and isna(fill_value):
fill_value = dtype.na_value
elif is_float(fill_value):
if issubclass(dtype.type, np.bool_):
dtype = np.object_
Expand Down
2 changes: 2 additions & 0 deletions pandas/core/dtypes/missing.py
Original file line number Diff line number Diff line change
Expand Up @@ -502,6 +502,8 @@ def na_value_for_dtype(dtype, compat=True):
"""
dtype = pandas_dtype(dtype)

if is_extension_array_dtype(dtype):
return dtype.na_value
if (is_datetime64_dtype(dtype) or is_datetime64tz_dtype(dtype) or
is_timedelta64_dtype(dtype) or is_period_dtype(dtype)):
return NaT
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -3476,7 +3476,7 @@ def _reindex_index(self, new_index, method, copy, level, fill_value=np.nan,
allow_dups=False)

def _reindex_columns(self, new_columns, method, copy, level,
fill_value=np.nan, limit=None, tolerance=None):
fill_value=None, limit=None, tolerance=None):
new_columns, indexer = self.columns.reindex(new_columns, method=method,
level=level, limit=limit,
tolerance=tolerance)
Expand Down
8 changes: 4 additions & 4 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -3660,7 +3660,7 @@ def reindex(self, *args, **kwargs):
copy = kwargs.pop('copy', True)
limit = kwargs.pop('limit', None)
tolerance = kwargs.pop('tolerance', None)
fill_value = kwargs.pop('fill_value', np.nan)
fill_value = kwargs.pop('fill_value', None)

# Series.reindex doesn't use / need the axis kwarg
# We pop and ignore it here, to make writing Series/Frame generic code
Expand Down Expand Up @@ -3776,7 +3776,7 @@ def _reindex_multi(self, axes, copy, fill_value):

@Appender(_shared_docs['reindex_axis'] % _shared_doc_kwargs)
def reindex_axis(self, labels, axis=0, method=None, level=None, copy=True,
limit=None, fill_value=np.nan):
limit=None, fill_value=None):
msg = ("'.reindex_axis' is deprecated and will be removed in a future "
"version. Use '.reindex' instead.")
self._consolidate_inplace()
Expand All @@ -3790,7 +3790,7 @@ def reindex_axis(self, labels, axis=0, method=None, level=None, copy=True,
return self._reindex_with_indexers({axis: [new_index, indexer]},
fill_value=fill_value, copy=copy)

def _reindex_with_indexers(self, reindexers, fill_value=np.nan, copy=False,
def _reindex_with_indexers(self, reindexers, fill_value=None, copy=False,
allow_dups=False):
"""allow_dups indicates an internal call here """

Expand Down Expand Up @@ -7252,7 +7252,7 @@ def align(self, other, join='outer', axis=None, level=None, copy=True,
raise TypeError('unsupported type: %s' % type(other))

def _align_frame(self, other, join='outer', axis=None, level=None,
copy=True, fill_value=np.nan, method=None, limit=None,
copy=True, fill_value=None, method=None, limit=None,
fill_axis=0):
# defaults
join_index, join_columns = None, None
Expand Down
41 changes: 41 additions & 0 deletions pandas/core/indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -2417,12 +2417,53 @@ def maybe_convert_indices(indices, n):
mask = indices < 0
if mask.any():
indices[mask] += n

mask = (indices >= n) | (indices < 0)
if mask.any():
raise IndexError("indices are out-of-bounds")
return indices


def validate_indices(indices, n):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is similar functionality in this file (_iLocIndexer._is_valid_integer/_is_valid_list_like), but since that is for checking without the -1 specific behaviour, I don't think it is possible / easy to share the implementation (it would be two separate cases inside the function anyhow)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed, as opposed to maybe_convert_indices? (maybe refactor that a bit?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe_convert_indices converts indices, this one only checks bounds. For this use-case, it is not needed to convert negative into positive indices, as this is handles well by numpy (it's only when you want to use our own take_1d that you need to do such conversion)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then fix maybe_convert_indixes which duplicates this logic

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said, maybe_convert_indices converts the indices first and then checks bounds 0-1, this function only checks bounds but different bounds (-1 to n), so there is no real duplicate code. Combining both will only make the single function more complex to understand. Both serve a different purpose, and are used in different places in the code base. There is IMO really no reason to combine them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I initially started by changing maybe_convert_indices to handle both. But it required two new keywords to toggle checking for invalid negative indexers and whether to allow an all -1 take from an empty index. The only thing it ended up sharing was the integer too large check, which wasn't worthwhile.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so now we have 2 different code paths which do very similar type of validation, I would try to simplify Series._take and NDFrame._take to use these new helper functions, removing the need for maybe_convert_indices entirely.

Copy link
Contributor Author

@TomAugspurger TomAugspurger Apr 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They have different behaviors.

  1. maybe_convert_indices allows for any negative numbers, while validate_indices allows for just -1
  2. validate_indices allows for an array of -1 taken from an empty array, while maybe_convert_indices doesn't.

You'd need two keywords to control those behaviors, and the only thing that could be shared is a max_idx > n. Better to have two simpler functions here.

"""Perform bounds-checking for an indexer.

-1 is allowed for indicating missing values.

Parameters
----------
indices : ndarray
n : int
length of the array being indexed

Raises
------
ValueError

Examples
--------
>>> validate_indices([1, 2], 3)
# OK
>>> validate_indices([1, -2], 3)
ValueError
>>> validate_indices([1, 2, 3], 3)
IndexError
>>> validate_indices([-1, -1], 0)
# OK
>>> validate_indices([0, 1], 0)
IndexError
"""
if len(indices):
min_idx = indices.min()
if min_idx < -1:
msg = ("'indices' contains values less than allowed ({} < {})"
.format(min_idx, -1))
raise ValueError(msg)

max_idx = indices.max()
if max_idx >= n:
raise IndexError("indices are out-of-bounds")


def maybe_convert_ix(*args):
"""
We likely want to take the cross-product
Expand Down
Loading