-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ExtensionArray.take default implementation #20814
Changes from 24 commits
fb3c234
dacd98e
0be9ec6
08f2479
eba137f
67ba9dd
37915e9
c721915
125ca0b
b7ae0bc
338566f
31cd304
05d8844
69e7fe7
449983b
c449afd
82cad8b
d5470a0
bbcbf19
1a4d987
fc729d6
74b2c09
5db6624
741f284
fbc4425
f3b91ca
eecd632
9a6c7d4
7c4f625
eb43fa4
6858409
ec0cecd
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1448,6 +1448,94 @@ def func(arr, indexer, out, fill_value=np.nan): | |
return func | ||
|
||
|
||
def take(arr, indexer, allow_fill=False, fill_value=None): | ||
"""Take elements from an array. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nitpick, but in the docstring guidelines we now say to start the text on the next line |
||
|
||
.. versionadded:: 0.23.0 | ||
|
||
Parameters | ||
---------- | ||
arr : sequence | ||
Non array-likes (sequences without a dtype) are coereced | ||
to an ndarray. | ||
indexer : sequence of integers | ||
Indices to be taken. | ||
allow_fill : bool, default False | ||
How to handle negative values in `indexer`. | ||
|
||
* False: negative values in `indexer` indicate | ||
slices from the right (the default) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe "slices" -> "indexed from the right" ("slice" has also a different meaning in indexing). I would maybe also mention this is the similar as |
||
|
||
* True: negative values in `indexer` indicate | ||
missing values. These values are set to `fill_value`. Any other | ||
other negative values raise a ``ValueError``. | ||
|
||
fill_value : any, optional | ||
Fill value to use for NA-indicies when `allow_fill` is True. | ||
This may be ``None``, in which case the default NA value for | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "This may be None" -> "By default" ? (or at least mention "default" somewhere) |
||
the type, ``self.dtype.na_value``, is used. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In this case, referring to |
||
|
||
Returns | ||
------- | ||
ndarray or ExtensionArray | ||
Same type as the input. | ||
|
||
Raises | ||
------ | ||
IndexError | ||
When the indexer is out of bounds for the array. | ||
ValueError | ||
When the indexer contains negative values other than ``-1`` | ||
and `allow_fill` is True. | ||
|
||
Notes | ||
----- | ||
When `allow_fill` is False, `indexer` may be whatever dimensionality | ||
is accepted by NumPy for `arr`. | ||
|
||
When `allow_fill` is True, `indexer` should be 1-D. | ||
|
||
See Also | ||
-------- | ||
numpy.take | ||
|
||
Examples | ||
-------- | ||
>>> from pandas.api.extensions import take | ||
|
||
With the default ``allow_fill=False``, negative numbers indicate | ||
slices from the right. | ||
|
||
>>> take(np.array([10, 20, 30]), [0, 0, -1]) | ||
array([10, 10, 30]) | ||
|
||
Setting ``allow_fill=True`` will place `fill_value` in those positions. | ||
|
||
>>> take(np.array([10, 20, 30]), [0, 0, -1], allow_fill=True) | ||
array([10., 10., nan]) | ||
|
||
>>> take(np.array([10, 20, 30]), [0, 0, -1], allow_fill=True, | ||
... fill_value=-10) | ||
array([ 10, 10, -10]) | ||
""" | ||
from pandas.core.indexing import validate_indices | ||
|
||
if not is_array_like(arr): | ||
arr = np.asarray(arr) | ||
|
||
# Do we require int64 or intp here? | ||
indexer = np.asarray(indexer, dtype='int') | ||
|
||
if allow_fill: | ||
# Pandas style, -1 means NA | ||
validate_indices(indexer, len(arr)) | ||
result = take_1d(arr, indexer, allow_fill=True, fill_value=fill_value) | ||
else: | ||
# NumPy style | ||
result = arr.take(indexer) | ||
return result | ||
|
||
|
||
def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None, | ||
allow_fill=True): | ||
""" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -462,22 +462,29 @@ def factorize(self, na_sentinel=-1): | |
# ------------------------------------------------------------------------ | ||
# Indexing methods | ||
# ------------------------------------------------------------------------ | ||
def take(self, indexer, allow_fill=True, fill_value=None): | ||
|
||
def take(self, indexer, allow_fill=False, fill_value=None): | ||
# type: (Sequence[int], bool, Optional[Any]) -> ExtensionArray | ||
"""Take elements from an array. | ||
|
||
Parameters | ||
---------- | ||
indexer : sequence of integers | ||
indices to be taken. -1 is used to indicate values | ||
that are missing. | ||
allow_fill : bool, default True | ||
If False, indexer is assumed to contain no -1 values so no filling | ||
will be done. This short-circuits computation of a mask. Result is | ||
undefined if allow_fill == False and -1 is present in indexer. | ||
fill_value : any, default None | ||
Fill value to replace -1 values with. If applicable, this should | ||
use the sentinel missing value for this type. | ||
Indices to be taken. See Notes for how negative indicies | ||
are handled. | ||
allow_fill : bool, default False | ||
How to handle negative values in `indexer`. | ||
|
||
For False values (the default), negative values in `indexer` | ||
indiciate slices from the right. | ||
|
||
For True values, indicies where `indexer` is ``-1`` indicate | ||
missing values. These values are set to `fill_value`. Any other | ||
other negative value should raise a ``ValueError``. | ||
fill_value : any, optional | ||
Fill value to use for NA-indicies when `allow_fill` is True. | ||
This may be ``None``, in which case the default NA value for | ||
the type, ``self.dtype.na_value``, is used. | ||
|
||
Returns | ||
------- | ||
|
@@ -487,42 +494,40 @@ def take(self, indexer, allow_fill=True, fill_value=None): | |
------ | ||
IndexError | ||
When the indexer is out of bounds for the array. | ||
ValueError | ||
When the indexer contains negative values other than ``-1`` | ||
and `allow_fill` is True. | ||
|
||
Notes | ||
----- | ||
This should follow pandas' semantics where -1 indicates missing values. | ||
Positions where indexer is ``-1`` should be filled with the missing | ||
value for this type. | ||
This gives rise to the special case of a take on an empty | ||
ExtensionArray that does not raises an IndexError straight away | ||
when the `indexer` is all ``-1``. | ||
ExtensionArray.take is called by ``Series.__getitem__``, ``.loc``, | ||
``iloc``, when the indexer is a sequence of values. Additionally, | ||
it's called by :meth:`Series.reindex`, or any other method | ||
that causes realignemnt, with a `fill_value`. | ||
|
||
This is called by ``Series.__getitem__``, ``.loc``, ``iloc``, when the | ||
indexer is a sequence of values. | ||
See Also | ||
-------- | ||
numpy.take | ||
pandas.api.extensions.take | ||
|
||
Examples | ||
-------- | ||
Suppose the extension array is backed by a NumPy array stored as | ||
``self.data``. Then ``take`` may be written as | ||
Here's an example implementation, which relies on casting the | ||
extension array to object dtype. This uses the helper method | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I personally found the previous example that uses something like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Any thoughts on the new example? I added a comment about how to proceed if your array is backed by an ndarray. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe add a similar comment to the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, I see you added a note about that in the normal comments below the docstring (but maybe still add a short note like "pass here the appropriate NA value for the physical storage, see below") |
||
:func:`pandas.api.extensions.take`. | ||
|
||
.. code-block:: python | ||
|
||
def take(self, indexer, allow_fill=True, fill_value=None): | ||
indexer = np.asarray(indexer) | ||
mask = indexer == -1 | ||
def take(self, indexer, allow_fill=False, fill_value=None): | ||
from pandas.core.algorithms import take | ||
|
||
# take on empty array not handled as desired by numpy | ||
# in case of -1 (all missing take) | ||
if not len(self) and mask.all(): | ||
return type(self)([np.nan] * len(indexer)) | ||
data = self.astype(object) | ||
if allow_fill and fill_value is None: | ||
fill_value = self.dtype.na_value | ||
|
||
result = self.data.take(indexer) | ||
result[mask] = np.nan # NA for this type | ||
return type(self)(result) | ||
|
||
See Also | ||
-------- | ||
numpy.take | ||
result = take(data, indexer, fill_value=fill_value, | ||
allow_fill=allow_fill) | ||
return self._from_sequence(result) | ||
""" | ||
raise AbstractMethodError(self) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,6 +16,10 @@ class _DtypeOpsMixin(object): | |
# classes will inherit from this Mixin. Once everything is compatible, this | ||
# class's methods can be moved to ExtensionDtype and removed. | ||
|
||
# na_value is the default NA value to use for this type. This is used in | ||
# e.g. ExtensionArray.take. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the "used in EA.take" is misleading. In many cases, it will not be used in take, as this |
||
na_value = np.nan | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe even show an example (from JSON?) |
||
|
||
def __eq__(self, other): | ||
"""Check whether 'other' is equal to self. | ||
|
||
|
@@ -92,6 +96,8 @@ def is_dtype(cls, dtype): | |
class ExtensionDtype(_DtypeOpsMixin): | ||
"""A custom data type, to be paired with an ExtensionArray. | ||
|
||
.. versionadded:: 0.23.0 | ||
|
||
Notes | ||
----- | ||
The interface includes the following abstract methods that must | ||
|
@@ -101,6 +107,9 @@ class ExtensionDtype(_DtypeOpsMixin): | |
* name | ||
* construct_from_string | ||
|
||
The `na_value` class attribute can be used to set the default NA value | ||
for this type. :attr:`numpy.nan` is used by default. | ||
|
||
This class does not inherit from 'abc.ABCMeta' for performance reasons. | ||
Methods and properties required by the interface raise | ||
``pandas.errors.AbstractMethodError`` and no ``register`` method is | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2417,12 +2417,53 @@ def maybe_convert_indices(indices, n): | |
mask = indices < 0 | ||
if mask.any(): | ||
indices[mask] += n | ||
|
||
mask = (indices >= n) | (indices < 0) | ||
if mask.any(): | ||
raise IndexError("indices are out-of-bounds") | ||
return indices | ||
|
||
|
||
def validate_indices(indices, n): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is similar functionality in this file ( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why is this needed, as opposed to maybe_convert_indices? (maybe refactor that a bit?) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. then fix maybe_convert_indixes which duplicates this logic There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I said, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FWIW, I initially started by changing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so now we have 2 different code paths which do very similar type of validation, I would try to simplify There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. They have different behaviors.
You'd need two keywords to control those behaviors, and the only thing that could be shared is a |
||
"""Perform bounds-checking for an indexer. | ||
|
||
-1 is allowed for indicating missing values. | ||
|
||
Parameters | ||
---------- | ||
indices : ndarray | ||
n : int | ||
length of the array being indexed | ||
|
||
Raises | ||
------ | ||
ValueError | ||
|
||
Examples | ||
-------- | ||
>>> validate_indices([1, 2], 3) | ||
# OK | ||
>>> validate_indices([1, -2], 3) | ||
ValueError | ||
>>> validate_indices([1, 2, 3], 3) | ||
IndexError | ||
>>> validate_indices([-1, -1], 0) | ||
# OK | ||
>>> validate_indices([0, 1], 0) | ||
IndexError | ||
""" | ||
if len(indices): | ||
min_idx = indices.min() | ||
if min_idx < -1: | ||
msg = ("'indices' contains values less than allowed ({} < {})" | ||
.format(min_idx, -1)) | ||
raise ValueError(msg) | ||
|
||
max_idx = indices.max() | ||
if max_idx >= n: | ||
raise IndexError("indices are out-of-bounds") | ||
|
||
|
||
def maybe_convert_ix(*args): | ||
""" | ||
We likely want to take the cross-product | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regarding keyword names,
np.take
andpd.Series.take
useindices
instead ofindexer
, so that might be a reason to use that as well (although in pandas we often speak about an indexer, although that is also a more general term than just integer indices)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will change to indices.