-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ExtensionArray.take default implementation #20814
Changes from all commits
fb3c234
dacd98e
0be9ec6
08f2479
eba137f
67ba9dd
37915e9
c721915
125ca0b
b7ae0bc
338566f
31cd304
05d8844
69e7fe7
449983b
c449afd
82cad8b
d5470a0
bbcbf19
1a4d987
fc729d6
74b2c09
5db6624
741f284
fbc4425
f3b91ca
eecd632
9a6c7d4
7c4f625
eb43fa4
6858409
ec0cecd
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -462,22 +462,36 @@ def factorize(self, na_sentinel=-1): | |
# ------------------------------------------------------------------------ | ||
# Indexing methods | ||
# ------------------------------------------------------------------------ | ||
def take(self, indexer, allow_fill=True, fill_value=None): | ||
|
||
def take(self, indices, allow_fill=False, fill_value=None): | ||
# type: (Sequence[int], bool, Optional[Any]) -> ExtensionArray | ||
"""Take elements from an array. | ||
|
||
Parameters | ||
---------- | ||
indexer : sequence of integers | ||
indices to be taken. -1 is used to indicate values | ||
that are missing. | ||
allow_fill : bool, default True | ||
If False, indexer is assumed to contain no -1 values so no filling | ||
will be done. This short-circuits computation of a mask. Result is | ||
undefined if allow_fill == False and -1 is present in indexer. | ||
fill_value : any, default None | ||
Fill value to replace -1 values with. If applicable, this should | ||
use the sentinel missing value for this type. | ||
indices : sequence of integers | ||
Indices to be taken. | ||
allow_fill : bool, default False | ||
How to handle negative values in `indices`. | ||
|
||
* False: negative values in `indices` indicate positional indices | ||
from the right (the default). This is similar to | ||
:func:`numpy.take`. | ||
|
||
* True: negative values in `indices` indicate | ||
missing values. These values are set to `fill_value`. Any other | ||
other negative values raise a ``ValueError``. | ||
|
||
fill_value : any, optional | ||
Fill value to use for NA-indices when `allow_fill` is True. | ||
This may be ``None``, in which case the default NA value for | ||
the type, ``self.dtype.na_value``, is used. | ||
|
||
For many ExtensionArrays, there will be two representations of | ||
`fill_value`: a user-facing "boxed" scalar, and a low-level | ||
physical NA value. `fill_value` should be the user-facing version, | ||
and the implementation should handle translating that to the | ||
physical version for processing the take if nescessary. | ||
|
||
Returns | ||
------- | ||
|
@@ -486,44 +500,56 @@ def take(self, indexer, allow_fill=True, fill_value=None): | |
Raises | ||
------ | ||
IndexError | ||
When the indexer is out of bounds for the array. | ||
When the indices are out of bounds for the array. | ||
ValueError | ||
When `indices` contains negative values other than ``-1`` | ||
and `allow_fill` is True. | ||
|
||
Notes | ||
----- | ||
This should follow pandas' semantics where -1 indicates missing values. | ||
Positions where indexer is ``-1`` should be filled with the missing | ||
value for this type. | ||
This gives rise to the special case of a take on an empty | ||
ExtensionArray that does not raises an IndexError straight away | ||
when the `indexer` is all ``-1``. | ||
ExtensionArray.take is called by ``Series.__getitem__``, ``.loc``, | ||
``iloc``, when `indices` is a sequence of values. Additionally, | ||
it's called by :meth:`Series.reindex`, or any other method | ||
that causes realignemnt, with a `fill_value`. | ||
|
||
This is called by ``Series.__getitem__``, ``.loc``, ``iloc``, when the | ||
indexer is a sequence of values. | ||
See Also | ||
-------- | ||
numpy.take | ||
pandas.api.extensions.take | ||
|
||
Examples | ||
-------- | ||
Suppose the extension array is backed by a NumPy array stored as | ||
``self.data``. Then ``take`` may be written as | ||
Here's an example implementation, which relies on casting the | ||
extension array to object dtype. This uses the helper method | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I personally found the previous example that uses something like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Any thoughts on the new example? I added a comment about how to proceed if your array is backed by an ndarray. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe add a similar comment to the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, I see you added a note about that in the normal comments below the docstring (but maybe still add a short note like "pass here the appropriate NA value for the physical storage, see below") |
||
:func:`pandas.api.extensions.take`. | ||
|
||
.. code-block:: python | ||
|
||
def take(self, indexer, allow_fill=True, fill_value=None): | ||
indexer = np.asarray(indexer) | ||
mask = indexer == -1 | ||
def take(self, indices, allow_fill=False, fill_value=None): | ||
from pandas.core.algorithms import take | ||
|
||
# take on empty array not handled as desired by numpy | ||
# in case of -1 (all missing take) | ||
if not len(self) and mask.all(): | ||
return type(self)([np.nan] * len(indexer)) | ||
# If the ExtensionArray is backed by an ndarray, then | ||
# just pass that here instead of coercing to object. | ||
data = self.astype(object) | ||
|
||
result = self.data.take(indexer) | ||
result[mask] = np.nan # NA for this type | ||
return type(self)(result) | ||
if allow_fill and fill_value is None: | ||
fill_value = self.dtype.na_value | ||
|
||
See Also | ||
-------- | ||
numpy.take | ||
# fill value should always be translated from the scalar | ||
# type for the array, to the physical storage type for | ||
# the data, before passing to take. | ||
|
||
result = take(data, indices, fill_value=fill_value, | ||
allow_fill=allow_fill) | ||
return self._from_sequence(result) | ||
""" | ||
# Implementer note: The `fill_value` parameter should be a user-facing | ||
# value, an instance of self.dtype.type. When passed `fill_value=None`, | ||
# the default of `self.dtype.na_value` should be used. | ||
# This may differ from the physical storage type your ExtensionArray | ||
# uses. In this case, your implementation is responsible for casting | ||
# the user-facing type to the storage type, before using | ||
# pandas.api.extensions.take | ||
raise AbstractMethodError(self) | ||
|
||
def copy(self, deep=False): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2417,12 +2417,53 @@ def maybe_convert_indices(indices, n): | |
mask = indices < 0 | ||
if mask.any(): | ||
indices[mask] += n | ||
|
||
mask = (indices >= n) | (indices < 0) | ||
if mask.any(): | ||
raise IndexError("indices are out-of-bounds") | ||
return indices | ||
|
||
|
||
def validate_indices(indices, n): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is similar functionality in this file ( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why is this needed, as opposed to maybe_convert_indices? (maybe refactor that a bit?) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. then fix maybe_convert_indixes which duplicates this logic There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I said, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FWIW, I initially started by changing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so now we have 2 different code paths which do very similar type of validation, I would try to simplify There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. They have different behaviors.
You'd need two keywords to control those behaviors, and the only thing that could be shared is a |
||
"""Perform bounds-checking for an indexer. | ||
|
||
-1 is allowed for indicating missing values. | ||
|
||
Parameters | ||
---------- | ||
indices : ndarray | ||
n : int | ||
length of the array being indexed | ||
|
||
Raises | ||
------ | ||
ValueError | ||
|
||
Examples | ||
-------- | ||
>>> validate_indices([1, 2], 3) | ||
# OK | ||
>>> validate_indices([1, -2], 3) | ||
ValueError | ||
>>> validate_indices([1, 2, 3], 3) | ||
IndexError | ||
>>> validate_indices([-1, -1], 0) | ||
# OK | ||
>>> validate_indices([0, 1], 0) | ||
IndexError | ||
""" | ||
if len(indices): | ||
min_idx = indices.min() | ||
if min_idx < -1: | ||
msg = ("'indices' contains values less than allowed ({} < {})" | ||
.format(min_idx, -1)) | ||
raise ValueError(msg) | ||
|
||
max_idx = indices.max() | ||
if max_idx >= n: | ||
raise IndexError("indices are out-of-bounds") | ||
|
||
|
||
def maybe_convert_ix(*args): | ||
""" | ||
We likely want to take the cross-product | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you share this doc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we want people to look at the source code of this (as we explicitly say the code of this base class is kind of the documentation), so I would certainly try to keep this one here (not sure if it would be easy to reuse it for the
take
function)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sharing docs between these two are a bit difficult because of circular import issues. The docstring would have to go in a third module, which isn't desirable.