-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ExtensionArray.take default implementation #20814
Changes from 27 commits
fb3c234
dacd98e
0be9ec6
08f2479
eba137f
67ba9dd
37915e9
c721915
125ca0b
b7ae0bc
338566f
31cd304
05d8844
69e7fe7
449983b
c449afd
82cad8b
d5470a0
bbcbf19
1a4d987
fc729d6
74b2c09
5db6624
741f284
fbc4425
f3b91ca
eecd632
9a6c7d4
7c4f625
eb43fa4
6858409
ec0cecd
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1448,6 +1448,96 @@ def func(arr, indexer, out, fill_value=np.nan): | |
return func | ||
|
||
|
||
def take(arr, indices, allow_fill=False, fill_value=None): | ||
""" | ||
Take elements from an array. | ||
|
||
.. versionadded:: 0.23.0 | ||
|
||
Parameters | ||
---------- | ||
arr : sequence | ||
Non array-likes (sequences without a dtype) are coereced | ||
to an ndarray. | ||
indices : sequence of integers | ||
Indices to be taken. | ||
allow_fill : bool, default False | ||
How to handle negative values in `indices`. | ||
|
||
* False: negative values in `indices` indicate positional indicies | ||
from the right (the default). This is similar to :func:`numpy.take`. | ||
|
||
* True: negative values in `indices` indicate | ||
missing values. These values are set to `fill_value`. Any other | ||
other negative values raise a ``ValueError``. | ||
|
||
fill_value : any, optional | ||
Fill value to use for NA-indicies when `allow_fill` is True. | ||
This may be ``None``, in which case the default NA value for | ||
the type is used. For ndarrays, :attr:`numpy.nan` is used. For | ||
ExtensionArrays, a different value may be used. | ||
|
||
Returns | ||
------- | ||
ndarray or ExtensionArray | ||
Same type as the input. | ||
|
||
Raises | ||
------ | ||
IndexError | ||
When `indices` is out of bounds for the array. | ||
ValueError | ||
When the indexer contains negative values other than ``-1`` | ||
and `allow_fill` is True. | ||
|
||
Notes | ||
----- | ||
When `allow_fill` is False, `indices` may be whatever dimensionality | ||
is accepted by NumPy for `arr`. | ||
|
||
When `allow_fill` is True, `indices` should be 1-D. | ||
|
||
See Also | ||
-------- | ||
numpy.take | ||
|
||
Examples | ||
-------- | ||
>>> from pandas.api.extensions import take | ||
|
||
With the default ``allow_fill=False``, negative numbers indicate | ||
positional indicies from the right. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. indicies -> indices |
||
|
||
>>> take(np.array([10, 20, 30]), [0, 0, -1]) | ||
array([10, 10, 30]) | ||
|
||
Setting ``allow_fill=True`` will place `fill_value` in those positions. | ||
|
||
>>> take(np.array([10, 20, 30]), [0, 0, -1], allow_fill=True) | ||
array([10., 10., nan]) | ||
|
||
>>> take(np.array([10, 20, 30]), [0, 0, -1], allow_fill=True, | ||
... fill_value=-10) | ||
array([ 10, 10, -10]) | ||
""" | ||
from pandas.core.indexing import validate_indices | ||
|
||
if not is_array_like(arr): | ||
arr = np.asarray(arr) | ||
|
||
# Do we require int64 or intp here? | ||
indices = np.asarray(indices, dtype='int') | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should be intp, which is what take accepts for indexers |
||
|
||
if allow_fill: | ||
# Pandas style, -1 means NA | ||
validate_indices(indices, len(arr)) | ||
result = take_1d(arr, indices, allow_fill=True, fill_value=fill_value) | ||
else: | ||
# NumPy style | ||
result = arr.take(indices) | ||
return result | ||
|
||
|
||
def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None, | ||
allow_fill=True): | ||
""" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -462,22 +462,35 @@ def factorize(self, na_sentinel=-1): | |
# ------------------------------------------------------------------------ | ||
# Indexing methods | ||
# ------------------------------------------------------------------------ | ||
def take(self, indexer, allow_fill=True, fill_value=None): | ||
|
||
def take(self, indices, allow_fill=False, fill_value=None): | ||
# type: (Sequence[int], bool, Optional[Any]) -> ExtensionArray | ||
"""Take elements from an array. | ||
|
||
Parameters | ||
---------- | ||
indexer : sequence of integers | ||
indices to be taken. -1 is used to indicate values | ||
that are missing. | ||
allow_fill : bool, default True | ||
If False, indexer is assumed to contain no -1 values so no filling | ||
will be done. This short-circuits computation of a mask. Result is | ||
undefined if allow_fill == False and -1 is present in indexer. | ||
fill_value : any, default None | ||
Fill value to replace -1 values with. If applicable, this should | ||
use the sentinel missing value for this type. | ||
indices : sequence of integers | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you share this doc? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we want people to look at the source code of this (as we explicitly say the code of this base class is kind of the documentation), so I would certainly try to keep this one here (not sure if it would be easy to reuse it for the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sharing docs between these two are a bit difficult because of circular import issues. The docstring would have to go in a third module, which isn't desirable. |
||
Indices to be taken. | ||
allow_fill : bool, default False | ||
How to handle negative values in `indices`. | ||
|
||
For False values (the default), negative values in `indices` | ||
indiciate positional indicies from the right. | ||
|
||
For True values, indicies where `indices` is ``-1`` indicate | ||
missing values. These values are set to `fill_value`. Any other | ||
other negative value should raise a ``ValueError``. | ||
|
||
fill_value : any, optional | ||
Fill value to use for NA-indicies when `allow_fill` is True. | ||
This may be ``None``, in which case the default NA value for | ||
the type, ``self.dtype.na_value``, is used. | ||
|
||
For many ExtensionArrays, there will be two representations of | ||
`fill_value`: a user-facing "boxed" scalar, and a low-level | ||
physical NA value. `fill_value` should be the user-facing version, | ||
and the implementation should handle translating that to the | ||
physical version for processing the take if nescessary. | ||
|
||
Returns | ||
------- | ||
|
@@ -486,43 +499,44 @@ def take(self, indexer, allow_fill=True, fill_value=None): | |
Raises | ||
------ | ||
IndexError | ||
When the indexer is out of bounds for the array. | ||
When the indices are out of bounds for the array. | ||
ValueError | ||
When `indices` contains negative values other than ``-1`` | ||
and `allow_fill` is True. | ||
|
||
Notes | ||
----- | ||
This should follow pandas' semantics where -1 indicates missing values. | ||
Positions where indexer is ``-1`` should be filled with the missing | ||
value for this type. | ||
This gives rise to the special case of a take on an empty | ||
ExtensionArray that does not raises an IndexError straight away | ||
when the `indexer` is all ``-1``. | ||
ExtensionArray.take is called by ``Series.__getitem__``, ``.loc``, | ||
``iloc``, when `indices` is a sequence of values. Additionally, | ||
it's called by :meth:`Series.reindex`, or any other method | ||
that causes realignemnt, with a `fill_value`. | ||
|
||
This is called by ``Series.__getitem__``, ``.loc``, ``iloc``, when the | ||
indexer is a sequence of values. | ||
See Also | ||
-------- | ||
numpy.take | ||
pandas.api.extensions.take | ||
|
||
Examples | ||
-------- | ||
Suppose the extension array is backed by a NumPy array stored as | ||
``self.data``. Then ``take`` may be written as | ||
Here's an example implementation, which relies on casting the | ||
extension array to object dtype. This uses the helper method | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I personally found the previous example that uses something like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Any thoughts on the new example? I added a comment about how to proceed if your array is backed by an ndarray. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe add a similar comment to the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, I see you added a note about that in the normal comments below the docstring (but maybe still add a short note like "pass here the appropriate NA value for the physical storage, see below") |
||
:func:`pandas.api.extensions.take`. | ||
|
||
.. code-block:: python | ||
|
||
def take(self, indexer, allow_fill=True, fill_value=None): | ||
indexer = np.asarray(indexer) | ||
mask = indexer == -1 | ||
def take(self, indices, allow_fill=False, fill_value=None): | ||
from pandas.core.algorithms import take | ||
|
||
# take on empty array not handled as desired by numpy | ||
# in case of -1 (all missing take) | ||
if not len(self) and mask.all(): | ||
return type(self)([np.nan] * len(indexer)) | ||
# If the ExtensionArray is backed by an ndarray, then | ||
# just pass that here instead of coercing to object. | ||
data = self.astype(object) | ||
|
||
result = self.data.take(indexer) | ||
result[mask] = np.nan # NA for this type | ||
return type(self)(result) | ||
if allow_fill and fill_value is None: | ||
fill_value = self.dtype.na_value | ||
|
||
See Also | ||
-------- | ||
numpy.take | ||
result = take(data, indices, fill_value=fill_value, | ||
allow_fill=allow_fill) | ||
return self._from_sequence(result) | ||
""" | ||
raise AbstractMethodError(self) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,6 +16,11 @@ class _DtypeOpsMixin(object): | |
# classes will inherit from this Mixin. Once everything is compatible, this | ||
# class's methods can be moved to ExtensionDtype and removed. | ||
|
||
# na_value is the default NA value to use for this type. This is used in | ||
# e.g. ExtensionArray.take. This should be the user-facing "boxed" version | ||
# of the NA value, not the physical NA vaalue for storage. | ||
na_value = np.nan | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe even show an example (from JSON?) |
||
|
||
def __eq__(self, other): | ||
"""Check whether 'other' is equal to self. | ||
|
||
|
@@ -92,6 +97,8 @@ def is_dtype(cls, dtype): | |
class ExtensionDtype(_DtypeOpsMixin): | ||
"""A custom data type, to be paired with an ExtensionArray. | ||
|
||
.. versionadded:: 0.23.0 | ||
|
||
Notes | ||
----- | ||
The interface includes the following abstract methods that must | ||
|
@@ -101,6 +108,9 @@ class ExtensionDtype(_DtypeOpsMixin): | |
* name | ||
* construct_from_string | ||
|
||
The `na_value` class attribute can be used to set the default NA value | ||
for this type. :attr:`numpy.nan` is used by default. | ||
|
||
This class does not inherit from 'abc.ABCMeta' for performance reasons. | ||
Methods and properties required by the interface raise | ||
``pandas.errors.AbstractMethodError`` and no ``register`` method is | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2417,12 +2417,53 @@ def maybe_convert_indices(indices, n): | |
mask = indices < 0 | ||
if mask.any(): | ||
indices[mask] += n | ||
|
||
mask = (indices >= n) | (indices < 0) | ||
if mask.any(): | ||
raise IndexError("indices are out-of-bounds") | ||
return indices | ||
|
||
|
||
def validate_indices(indices, n): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is similar functionality in this file ( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why is this needed, as opposed to maybe_convert_indices? (maybe refactor that a bit?) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. then fix maybe_convert_indixes which duplicates this logic There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I said, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FWIW, I initially started by changing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so now we have 2 different code paths which do very similar type of validation, I would try to simplify There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. They have different behaviors.
You'd need two keywords to control those behaviors, and the only thing that could be shared is a |
||
"""Perform bounds-checking for an indexer. | ||
|
||
-1 is allowed for indicating missing values. | ||
|
||
Parameters | ||
---------- | ||
indices : ndarray | ||
n : int | ||
length of the array being indexed | ||
|
||
Raises | ||
------ | ||
ValueError | ||
|
||
Examples | ||
-------- | ||
>>> validate_indices([1, 2], 3) | ||
# OK | ||
>>> validate_indices([1, -2], 3) | ||
ValueError | ||
>>> validate_indices([1, 2, 3], 3) | ||
IndexError | ||
>>> validate_indices([-1, -1], 0) | ||
# OK | ||
>>> validate_indices([0, 1], 0) | ||
IndexError | ||
""" | ||
if len(indices): | ||
min_idx = indices.min() | ||
if min_idx < -1: | ||
msg = ("'indices' contains values less than allowed ({} < {})" | ||
.format(min_idx, -1)) | ||
raise ValueError(msg) | ||
|
||
max_idx = indices.max() | ||
if max_idx >= n: | ||
raise IndexError("indices are out-of-bounds") | ||
|
||
|
||
def maybe_convert_ix(*args): | ||
""" | ||
We likely want to take the cross-product | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"This may be None" -> "By default" ? (or at least mention "default" somewhere)