Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExtensionArray.take default implementation #20814

Merged
merged 32 commits into from
Apr 27, 2018
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
fb3c234
[WIP]: ExtensionArray.take default implementation
TomAugspurger Apr 24, 2018
dacd98e
Moves
TomAugspurger Apr 24, 2018
0be9ec6
non-internals changes
TomAugspurger Apr 24, 2018
08f2479
Fixup JSON take
TomAugspurger Apr 25, 2018
eba137f
More internals hacking
TomAugspurger Apr 25, 2018
67ba9dd
more with default fill value
TomAugspurger Apr 25, 2018
37915e9
Merge remote-tracking branch 'upstream/master' into ea-take
TomAugspurger Apr 25, 2018
c721915
Removed default_fill_value
TomAugspurger Apr 25, 2018
125ca0b
Simplify
TomAugspurger Apr 25, 2018
b7ae0bc
revert combine change
TomAugspurger Apr 25, 2018
338566f
Upcasting
TomAugspurger Apr 25, 2018
31cd304
pep8
TomAugspurger Apr 25, 2018
05d8844
Updated docs
TomAugspurger Apr 25, 2018
69e7fe7
Updates
TomAugspurger Apr 25, 2018
449983b
Linting
TomAugspurger Apr 25, 2018
c449afd
Bounds checking
TomAugspurger Apr 25, 2018
82cad8b
Stale comment
TomAugspurger Apr 25, 2018
d5470a0
Fixed reorder
TomAugspurger Apr 25, 2018
bbcbf19
Cleanup
TomAugspurger Apr 25, 2018
1a4d987
Pass an array
TomAugspurger Apr 25, 2018
fc729d6
Fixed editor
TomAugspurger Apr 25, 2018
74b2c09
Added verisonadded
TomAugspurger Apr 26, 2018
5db6624
Doc and move tests
TomAugspurger Apr 26, 2018
741f284
Merge remote-tracking branch 'upstream/master' into ea-take
TomAugspurger Apr 26, 2018
fbc4425
Updates
TomAugspurger Apr 26, 2018
f3b91ca
doc fixup
TomAugspurger Apr 26, 2018
eecd632
Skip categorical take tests
TomAugspurger Apr 26, 2018
9a6c7d4
Doc updates
TomAugspurger Apr 27, 2018
7c4f625
Merge remote-tracking branch 'upstream/master' into ea-take
TomAugspurger Apr 27, 2018
eb43fa4
Really truly fix it hopefully.
TomAugspurger Apr 27, 2018
6858409
Added note
TomAugspurger Apr 27, 2018
ec0cecd
Updates
TomAugspurger Apr 27, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
34 changes: 18 additions & 16 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -1448,8 +1448,9 @@ def func(arr, indexer, out, fill_value=np.nan):
return func


def take(arr, indexer, allow_fill=False, fill_value=None):
"""Take elements from an array.
def take(arr, indices, allow_fill=False, fill_value=None):
"""
Take elements from an array.

.. versionadded:: 0.23.0

Expand All @@ -1458,22 +1459,23 @@ def take(arr, indexer, allow_fill=False, fill_value=None):
arr : sequence
Non array-likes (sequences without a dtype) are coereced
to an ndarray.
indexer : sequence of integers
indices : sequence of integers
Indices to be taken.
allow_fill : bool, default False
How to handle negative values in `indexer`.
How to handle negative values in `indices`.

* False: negative values in `indexer` indicate
slices from the right (the default)
* False: negative values in `indices` indicate positional indicies
from the right (the default). This is similar to :func:`numpy.take`.

* True: negative values in `indexer` indicate
* True: negative values in `indices` indicate
missing values. These values are set to `fill_value`. Any other
other negative values raise a ``ValueError``.

fill_value : any, optional
Fill value to use for NA-indicies when `allow_fill` is True.
This may be ``None``, in which case the default NA value for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"This may be None" -> "By default" ? (or at least mention "default" somewhere)

the type, ``self.dtype.na_value``, is used.
the type is used. For ndarrays, :attr:`numpy.nan` is used. For
ExtensionArrays, a different value may be used.

Returns
-------
Expand All @@ -1483,17 +1485,17 @@ def take(arr, indexer, allow_fill=False, fill_value=None):
Raises
------
IndexError
When the indexer is out of bounds for the array.
When `indices` is out of bounds for the array.
ValueError
When the indexer contains negative values other than ``-1``
and `allow_fill` is True.

Notes
-----
When `allow_fill` is False, `indexer` may be whatever dimensionality
When `allow_fill` is False, `indices` may be whatever dimensionality
is accepted by NumPy for `arr`.

When `allow_fill` is True, `indexer` should be 1-D.
When `allow_fill` is True, `indices` should be 1-D.

See Also
--------
Expand All @@ -1504,7 +1506,7 @@ def take(arr, indexer, allow_fill=False, fill_value=None):
>>> from pandas.api.extensions import take

With the default ``allow_fill=False``, negative numbers indicate
slices from the right.
positional indicies from the right.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indicies -> indices


>>> take(np.array([10, 20, 30]), [0, 0, -1])
array([10, 10, 30])
Expand All @@ -1524,15 +1526,15 @@ def take(arr, indexer, allow_fill=False, fill_value=None):
arr = np.asarray(arr)

# Do we require int64 or intp here?
indexer = np.asarray(indexer, dtype='int')
indices = np.asarray(indices, dtype='int')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be intp, which is what take accepts for indexers


if allow_fill:
# Pandas style, -1 means NA
validate_indices(indexer, len(arr))
result = take_1d(arr, indexer, allow_fill=True, fill_value=fill_value)
validate_indices(indices, len(arr))
result = take_1d(arr, indices, allow_fill=True, fill_value=fill_value)
else:
# NumPy style
result = arr.take(indexer)
result = arr.take(indices)
return result


Expand Down
35 changes: 22 additions & 13 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -463,45 +463,51 @@ def factorize(self, na_sentinel=-1):
# Indexing methods
# ------------------------------------------------------------------------

def take(self, indexer, allow_fill=False, fill_value=None):
def take(self, indices, allow_fill=False, fill_value=None):
# type: (Sequence[int], bool, Optional[Any]) -> ExtensionArray
"""Take elements from an array.

Parameters
----------
indexer : sequence of integers
Indices to be taken. See Notes for how negative indicies
are handled.
indices : sequence of integers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you share this doc?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want people to look at the source code of this (as we explicitly say the code of this base class is kind of the documentation), so I would certainly try to keep this one here (not sure if it would be easy to reuse it for the take function)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sharing docs between these two are a bit difficult because of circular import issues. The docstring would have to go in a third module, which isn't desirable.

Indices to be taken.
allow_fill : bool, default False
How to handle negative values in `indexer`.
How to handle negative values in `indices`.

For False values (the default), negative values in `indexer`
indiciate slices from the right.
For False values (the default), negative values in `indices`
indiciate positional indicies from the right.

For True values, indicies where `indexer` is ``-1`` indicate
For True values, indicies where `indices` is ``-1`` indicate
missing values. These values are set to `fill_value`. Any other
other negative value should raise a ``ValueError``.

fill_value : any, optional
Fill value to use for NA-indicies when `allow_fill` is True.
This may be ``None``, in which case the default NA value for
the type, ``self.dtype.na_value``, is used.

For many ExtensionArrays, there will be two representations of
`fill_value`: a user-facing "boxed" scalar, and a low-level
physical NA value. `fill_value` should be the user-facing version,
and the implementation should handle translating that to the
physical version for processing the take if nescessary.

Returns
-------
ExtensionArray

Raises
------
IndexError
When the indexer is out of bounds for the array.
When the indices are out of bounds for the array.
ValueError
When the indexer contains negative values other than ``-1``
When `indices` contains negative values other than ``-1``
and `allow_fill` is True.

Notes
-----
ExtensionArray.take is called by ``Series.__getitem__``, ``.loc``,
``iloc``, when the indexer is a sequence of values. Additionally,
``iloc``, when `indices` is a sequence of values. Additionally,
it's called by :meth:`Series.reindex`, or any other method
that causes realignemnt, with a `fill_value`.

Expand All @@ -518,14 +524,17 @@ def take(self, indexer, allow_fill=False, fill_value=None):

.. code-block:: python

def take(self, indexer, allow_fill=False, fill_value=None):
def take(self, indices, allow_fill=False, fill_value=None):
from pandas.core.algorithms import take

# If the ExtensionArray is backed by an ndarray, then
# just pass that here instead of coercing to object.
data = self.astype(object)

if allow_fill and fill_value is None:
fill_value = self.dtype.na_value

result = take(data, indexer, fill_value=fill_value,
result = take(data, indices, fill_value=fill_value,
allow_fill=allow_fill)
return self._from_sequence(result)
"""
Expand Down
3 changes: 2 additions & 1 deletion pandas/core/dtypes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,8 @@ class _DtypeOpsMixin(object):
# class's methods can be moved to ExtensionDtype and removed.

# na_value is the default NA value to use for this type. This is used in
# e.g. ExtensionArray.take.
# e.g. ExtensionArray.take. This should be the user-facing "boxed" version
# of the NA value, not the physical NA vaalue for storage.
na_value = np.nan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe even show an example (from JSON?)


def __eq__(self, other):
Expand Down
3 changes: 0 additions & 3 deletions pandas/core/internals.py
Original file line number Diff line number Diff line change
Expand Up @@ -5405,9 +5405,6 @@ def concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy):

for placement, join_units in concat_plan:

# The issue: we have a join unit (or maybe several) that needs to be
# reindexed.

if len(join_units) == 1 and not join_units[0].indexers:
b = join_units[0].block
values = b.values
Expand Down
20 changes: 17 additions & 3 deletions pandas/tests/extension/base/getitem.py
Original file line number Diff line number Diff line change
Expand Up @@ -127,7 +127,11 @@ def test_take(self, data, na_value, na_cmp):
result = data.take([0, -1])
assert result.dtype == data.dtype
assert result[0] == data[0]
na_cmp(result[1], na_value)
assert result[1] == data[-1]

result = data.take([0, -1], allow_fill=True, fill_value=na_value)
assert result[0] == data[0]
assert na_cmp(result[1], na_value)

with tm.assert_raises_regex(IndexError, "out of bounds"):
data.take([len(data) + 1])
Expand All @@ -136,7 +140,7 @@ def test_take_empty(self, data, na_value, na_cmp):
empty = data[:0]

result = empty.take([-1], allow_fill=True)
na_cmp(result[0], na_value)
assert na_cmp(result[0], na_value)

with pytest.raises(IndexError):
empty.take([-1])
Expand Down Expand Up @@ -170,7 +174,6 @@ def test_take_out_of_bounds_raises(self, data, allow_fill):
with pytest.raises(IndexError):
arr.take(np.asarray([0, 3]), allow_fill=allow_fill)

@pytest.mark.xfail(reason="Series.take with extension array buggy for -1")
def test_take_series(self, data):
s = pd.Series(data)
result = s.take([0, -1])
Expand All @@ -196,3 +199,14 @@ def test_reindex(self, data, na_value):
expected = pd.Series(data._from_sequence([na_value, na_value]),
index=[n, n + 1])
self.assert_series_equal(result, expected)

def test_reindex_non_na_fill_value(self, data_missing):
valid = data_missing[1]
na = data_missing[0]

array = data_missing._from_sequence([na, valid])
ser = pd.Series(array)
result = ser.reindex([0, 1, 2], fill_value=valid)
expected = pd.Series(data_missing._from_sequence([na, valid, valid]))

self.assert_series_equal(result, expected)
8 changes: 8 additions & 0 deletions pandas/tests/extension/category/test_categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,14 @@ def test_take_non_na_fill_value(self):
def test_take_out_of_bounds_raises(self):
pass

@skip_take
def test_take_series(self):
pass

@skip_take
def test_reindex_non_na_fill_value(self):
pass

@pytest.mark.xfail(reason="Categorical.take buggy")
def test_take_empty(self):
pass
Expand Down
10 changes: 9 additions & 1 deletion pandas/tests/extension/decimal/test_decimal.py
Original file line number Diff line number Diff line change
Expand Up @@ -108,7 +108,15 @@ class TestReshaping(BaseDecimal, base.BaseReshapingTests):


class TestGetitem(BaseDecimal, base.BaseGetitemTests):
pass

def test_take_na_value_other_decimal(self):
arr = DecimalArray([decimal.Decimal('1.0'),
decimal.Decimal('2.0')])
result = arr.take([0, -1], allow_fill=True,
fill_value=decimal.Decimal('-1.0'))
expected = DecimalArray([decimal.Decimal('1.0'),
decimal.Decimal('-1.0')])
self.assert_extension_array_equal(result, expected)


class TestMissing(BaseDecimal, base.BaseMissingTests):
Expand Down
18 changes: 12 additions & 6 deletions pandas/tests/extension/json/array.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,15 @@
"""Test extension array for storing nested data in a pandas container.

The JSONArray stores lists of dictionaries. The storage mechanism is a list,
not an ndarray.

Note:

We currently store lists of UserDicts (Py3 only). Pandas has a few places
internally that specifically check for dicts, and does non-scalar things
in that case. We *want* the dictionaries to be treated as scalars, so we
hack around pandas by using UserDicts.
"""
import collections
import itertools
import numbers
Expand Down Expand Up @@ -125,12 +137,6 @@ def take(self, indexer, allow_fill=False, fill_value=None):

return self._from_sequence(output)

# def astype(self, dtype, copy=True):
# # NumPy has issues when all the dicts are the same length.
# # np.array([UserDict(...), UserDict(...)]) fails,
# # but np.array([{...}, {...}]) works, so cast.
# return np.array([dict(x) for x in self], dtype=dtype, copy=copy)

def copy(self, deep=False):
return type(self)(self.data[:])

Expand Down