From 644138fc8c88fc101250d38a0c562dde039f133c Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Fri, 27 Apr 2018 06:15:12 -0500 Subject: [PATCH 1/9] Squashed commit of the following: commit ec0cecd292947aa4d8416991e9f8920a4cd9a831 Author: Tom Augspurger Date: Fri Apr 27 06:02:48 2018 -0500 Updates commit 6858409f3394035483e327c6e0fca05c1a6285ff Author: Tom Augspurger Date: Fri Apr 27 05:48:59 2018 -0500 Added note commit eb43fa4159c2affaf57ad686c260b2d1899eeacd Author: Tom Augspurger Date: Thu Apr 26 20:47:35 2018 -0500 Really truly fix it hopefully. commit 7c4f625fdab30865e23570bf9e0a5450cf3e648e Merge: 9a6c7d44c 6cacdde56 Author: Tom Augspurger Date: Thu Apr 26 20:40:15 2018 -0500 Merge remote-tracking branch 'upstream/master' into ea-take commit 9a6c7d44ce9f474acb68d7b62efc138aedc3a100 Author: Tom Augspurger Date: Thu Apr 26 20:04:17 2018 -0500 Doc updates commit eecd632b16d475a5bbdca2013839298904056997 Author: Tom Augspurger Date: Thu Apr 26 15:00:00 2018 -0500 Skip categorical take tests commit f3b91ca70ce9caacae29f3ecd80cd121f5ee1a33 Author: Tom Augspurger Date: Thu Apr 26 13:43:26 2018 -0500 doc fixup commit fbc4425a09bb973fefae90b3da2e1a205129658e Author: Tom Augspurger Date: Thu Apr 26 13:37:45 2018 -0500 Updates * indexer -> indices * doc user-facing vs physical * assert na_cmps * test reindex w/ non-NA fill_value commit 741f284f1c320ef6bbbf8ddadc871322cf703909 Merge: 5db66248f 630ef1649 Author: Tom Augspurger Date: Thu Apr 26 07:18:32 2018 -0500 Merge remote-tracking branch 'upstream/master' into ea-take commit 5db66248f2c40d5de6ffe9a2c308830c4128079a Author: Tom Augspurger Date: Thu Apr 26 07:17:30 2018 -0500 Doc and move tests commit 74b2c09bfc410be7190bbf6c69cb2c19fee6fbdd Author: Tom Augspurger Date: Wed Apr 25 21:40:18 2018 -0500 Added verisonadded commit fc729d6dea77dee0d89148cdda738c0e9d8ab7c5 Author: Tom Augspurger Date: Wed Apr 25 15:51:27 2018 -0500 Fixed editor commit 1a4d9873629bb28e19a124e40b063879fc5e3ad0 Author: Tom Augspurger Date: Wed Apr 25 15:50:48 2018 -0500 Pass an array commit bbcbf19a560efed894670b0bc495a80283e003fc Author: Tom Augspurger Date: Wed Apr 25 15:07:28 2018 -0500 Cleanup commit d5470a0f9f4d2f3937f0e63256f0750a50276586 Author: Tom Augspurger Date: Wed Apr 25 15:02:26 2018 -0500 Fixed reorder commit 82cad8b1f369db80851c13fe856a30bdb8bd154c Author: Tom Augspurger Date: Wed Apr 25 15:00:43 2018 -0500 Stale comment commit c449afd2208451d67ce9c71c4b9d94448665c534 Author: Tom Augspurger Date: Wed Apr 25 14:48:33 2018 -0500 Bounds checking commit 449983b3686710886da78e2c2b77ce59be9a93d3 Author: Tom Augspurger Date: Wed Apr 25 12:55:31 2018 -0500 Linting commit 69e7fe76b540807bb973496ee5a2e3e636b90287 Author: Tom Augspurger Date: Wed Apr 25 12:40:20 2018 -0500 Updates 1. Reversed order of take keywords 2. Added to extensions API 3. Removed default implementation commit 05d884498a1b15f876e5166bc5208afd804b63c7 Author: Tom Augspurger Date: Wed Apr 25 09:59:33 2018 -0500 Updated docs commit 31cd304e2f61fad841f6d56135fd6bbb202c606f Author: Tom Augspurger Date: Wed Apr 25 09:43:45 2018 -0500 pep8 commit 338566faf627e1fabd638a65055f1804948dd021 Author: Tom Augspurger Date: Wed Apr 25 09:42:28 2018 -0500 Upcasting commit b7ae0bc2cadd821d97f21fc2eac99b321d9473e5 Author: Tom Augspurger Date: Wed Apr 25 09:06:59 2018 -0500 revert combine change commit 125ca0b7cf5796b806e76c2c38d780e55ea12176 Author: Tom Augspurger Date: Wed Apr 25 08:37:07 2018 -0500 Simplify Upcasting is still broken commit c721915114c5dd39cd7aec6c9dec762ca28abcac Author: Tom Augspurger Date: Wed Apr 25 07:50:54 2018 -0500 Removed default_fill_value commit 37915e9f31652904a3a3c9d91e2d4e394aecc14f Merge: 67ba9ddb0 60fe82c8a Author: Tom Augspurger Date: Wed Apr 25 07:44:15 2018 -0500 Merge remote-tracking branch 'upstream/master' into ea-take commit 67ba9ddb0c8dec7f88a7ce14387898b552afe355 Author: Tom Augspurger Date: Wed Apr 25 07:42:54 2018 -0500 more with default fill value commit eba137f8c7ae45288200566aebbf32229c69720c Author: Tom Augspurger Date: Wed Apr 25 05:59:58 2018 -0500 More internals hacking commit 08f24790daa8bbade46dca86dd2fb9ff0353118c Author: Tom Augspurger Date: Wed Apr 25 05:59:17 2018 -0500 Fixup JSON take commit 0be9ec6593f38b698af6475c668ecf4c709bbe20 Author: Tom Augspurger Date: Tue Apr 24 18:02:13 2018 -0500 non-internals changes commit dacd98e5f8ed0947bb2646499ac2e3cfd94e6177 Author: Tom Augspurger Date: Tue Apr 24 14:45:36 2018 -0500 Moves commit fb3c2343c88be05ff1db5ca0ddef274eb0fda6b3 Author: Tom Augspurger Date: Tue Apr 24 13:59:51 2018 -0500 [WIP]: ExtensionArray.take default implementation Implements a take interface that's compatible with NumPy and optionally pandas' NA semantics. ```python In [1]: import pandas as pd In [2]: from pandas.tests.extension.decimal.array import * In [3]: arr = DecimalArray(['1.1', '1.2', '1.3']) In [4]: arr.take([0, 1, -1]) Out[4]: DecimalArray(array(['1.1', '1.2', '1.3'], dtype=object)) In [5]: arr.take([0, 1, -1], fill_value=float('nan')) Out[5]: DecimalArray(array(['1.1', '1.2', Decimal('NaN')], dtype=object)) ``` Closes https://github.com/pandas-dev/pandas/issues/20640 --- pandas/api/extensions/__init__.py | 1 + pandas/core/algorithms.py | 90 ++++++++++++++++- pandas/core/arrays/base.py | 96 ++++++++++++------- pandas/core/dtypes/base.py | 11 +++ pandas/core/dtypes/cast.py | 3 +- pandas/core/dtypes/missing.py | 2 + pandas/core/frame.py | 2 +- pandas/core/generic.py | 8 +- pandas/core/indexing.py | 41 ++++++++ pandas/core/internals.py | 22 ++++- pandas/core/series.py | 3 +- pandas/tests/extension/base/getitem.py | 52 +++++++++- .../extension/category/test_categorical.py | 28 +++++- pandas/tests/extension/decimal/array.py | 28 +++--- .../tests/extension/decimal/test_decimal.py | 10 +- pandas/tests/extension/json/array.py | 57 ++++++++--- pandas/tests/extension/json/test_json.py | 15 ++- pandas/tests/indexing/test_indexing.py | 27 +++++- pandas/tests/test_take.py | 45 +++++++++ 19 files changed, 460 insertions(+), 81 deletions(-) diff --git a/pandas/api/extensions/__init__.py b/pandas/api/extensions/__init__.py index cc09204d992d8..3e6e192a3502c 100644 --- a/pandas/api/extensions/__init__.py +++ b/pandas/api/extensions/__init__.py @@ -2,5 +2,6 @@ from pandas.core.accessor import (register_dataframe_accessor, # noqa register_index_accessor, register_series_accessor) +from pandas.core.algorithms import take # noqa from pandas.core.arrays.base import ExtensionArray # noqa from pandas.core.dtypes.dtypes import ExtensionDtype # noqa diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py index cbc412d74d51d..a03d892432b51 100644 --- a/pandas/core/algorithms.py +++ b/pandas/core/algorithms.py @@ -1448,6 +1448,94 @@ def func(arr, indexer, out, fill_value=np.nan): return func +def take(arr, indices, allow_fill=False, fill_value=None): + """ + Take elements from an array. + + .. versionadded:: 0.23.0 + + Parameters + ---------- + arr : sequence + Non array-likes (sequences without a dtype) are coereced + to an ndarray. + indices : sequence of integers + Indices to be taken. + allow_fill : bool, default False + How to handle negative values in `indices`. + + * False: negative values in `indices` indicate positional indices + from the right (the default). This is similar to :func:`numpy.take`. + + * True: negative values in `indices` indicate + missing values. These values are set to `fill_value`. Any other + other negative values raise a ``ValueError``. + + fill_value : any, optional + Fill value to use for NA-indices when `allow_fill` is True. + This may be ``None``, in which case the default NA value for + the type (``self.dtype.na_value``) is used. + + Returns + ------- + ndarray or ExtensionArray + Same type as the input. + + Raises + ------ + IndexError + When `indices` is out of bounds for the array. + ValueError + When the indexer contains negative values other than ``-1`` + and `allow_fill` is True. + + Notes + ----- + When `allow_fill` is False, `indices` may be whatever dimensionality + is accepted by NumPy for `arr`. + + When `allow_fill` is True, `indices` should be 1-D. + + See Also + -------- + numpy.take + + Examples + -------- + >>> from pandas.api.extensions import take + + With the default ``allow_fill=False``, negative numbers indicate + positional indices from the right. + + >>> take(np.array([10, 20, 30]), [0, 0, -1]) + array([10, 10, 30]) + + Setting ``allow_fill=True`` will place `fill_value` in those positions. + + >>> take(np.array([10, 20, 30]), [0, 0, -1], allow_fill=True) + array([10., 10., nan]) + + >>> take(np.array([10, 20, 30]), [0, 0, -1], allow_fill=True, + ... fill_value=-10) + array([ 10, 10, -10]) + """ + from pandas.core.indexing import validate_indices + + if not is_array_like(arr): + arr = np.asarray(arr) + + indices = np.asarray(indices, dtype=np.intp) + + if allow_fill: + # Pandas style, -1 means NA + validate_indices(indices, len(arr)) + result = take_1d(arr, indices, allow_fill=True, fill_value=fill_value) + else: + # NumPy style + result = arr.take(indices) + return result + + def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None, allow_fill=True): """ @@ -1462,7 +1550,7 @@ def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None, Input array. indexer : ndarray 1-D array of indices to take, subarrays corresponding to -1 value - indicies are filed with fill_value + indices are filed with fill_value axis : int, default 0 Axis to take from out : ndarray or None, default None diff --git a/pandas/core/arrays/base.py b/pandas/core/arrays/base.py index f1a81b5eefddd..1922801c30719 100644 --- a/pandas/core/arrays/base.py +++ b/pandas/core/arrays/base.py @@ -462,22 +462,36 @@ def factorize(self, na_sentinel=-1): # ------------------------------------------------------------------------ # Indexing methods # ------------------------------------------------------------------------ - def take(self, indexer, allow_fill=True, fill_value=None): + + def take(self, indices, allow_fill=False, fill_value=None): # type: (Sequence[int], bool, Optional[Any]) -> ExtensionArray """Take elements from an array. Parameters ---------- - indexer : sequence of integers - indices to be taken. -1 is used to indicate values - that are missing. - allow_fill : bool, default True - If False, indexer is assumed to contain no -1 values so no filling - will be done. This short-circuits computation of a mask. Result is - undefined if allow_fill == False and -1 is present in indexer. - fill_value : any, default None - Fill value to replace -1 values with. If applicable, this should - use the sentinel missing value for this type. + indices : sequence of integers + Indices to be taken. + allow_fill : bool, default False + How to handle negative values in `indices`. + + * False: negative values in `indices` indicate positional indices + from the right (the default). This is similar to + :func:`numpy.take`. + + * True: negative values in `indices` indicate + missing values. These values are set to `fill_value`. Any other + other negative values raise a ``ValueError``. + + fill_value : any, optional + Fill value to use for NA-indices when `allow_fill` is True. + This may be ``None``, in which case the default NA value for + the type, ``self.dtype.na_value``, is used. + + For many ExtensionArrays, there will be two representations of + `fill_value`: a user-facing "boxed" scalar, and a low-level + physical NA value. `fill_value` should be the user-facing version, + and the implementation should handle translating that to the + physical version for processing the take if nescessary. Returns ------- @@ -486,44 +500,56 @@ def take(self, indexer, allow_fill=True, fill_value=None): Raises ------ IndexError - When the indexer is out of bounds for the array. + When the indices are out of bounds for the array. + ValueError + When `indices` contains negative values other than ``-1`` + and `allow_fill` is True. Notes ----- - This should follow pandas' semantics where -1 indicates missing values. - Positions where indexer is ``-1`` should be filled with the missing - value for this type. - This gives rise to the special case of a take on an empty - ExtensionArray that does not raises an IndexError straight away - when the `indexer` is all ``-1``. + ExtensionArray.take is called by ``Series.__getitem__``, ``.loc``, + ``iloc``, when `indices` is a sequence of values. Additionally, + it's called by :meth:`Series.reindex`, or any other method + that causes realignemnt, with a `fill_value`. - This is called by ``Series.__getitem__``, ``.loc``, ``iloc``, when the - indexer is a sequence of values. + See Also + -------- + numpy.take + pandas.api.extensions.take Examples -------- - Suppose the extension array is backed by a NumPy array stored as - ``self.data``. Then ``take`` may be written as + Here's an example implementation, which relies on casting the + extension array to object dtype. This uses the helper method + :func:`pandas.api.extensions.take`. .. code-block:: python - def take(self, indexer, allow_fill=True, fill_value=None): - indexer = np.asarray(indexer) - mask = indexer == -1 + def take(self, indices, allow_fill=False, fill_value=None): + from pandas.core.algorithms import take - # take on empty array not handled as desired by numpy - # in case of -1 (all missing take) - if not len(self) and mask.all(): - return type(self)([np.nan] * len(indexer)) + # If the ExtensionArray is backed by an ndarray, then + # just pass that here instead of coercing to object. + data = self.astype(object) - result = self.data.take(indexer) - result[mask] = np.nan # NA for this type - return type(self)(result) + if allow_fill and fill_value is None: + fill_value = self.dtype.na_value - See Also - -------- - numpy.take + # fill value should always be translated from the scalar + # type for the array, to the physical storage type for + # the data, before passing to take. + + result = take(data, indices, fill_value=fill_value, + allow_fill=allow_fill) + return self._from_sequence(result) """ + # Implementer note: The `fill_value` parameter should be a user-facing + # value, an instance of self.dtype.type. When passed `fill_value=None`, + # the default of `self.dtype.na_value` should be used. + # This may differ from the physical storage type your ExtensionArray + # uses. In this case, your implementation is responsible for casting + # the user-facing type to the storage type, before using + # pandas.api.extensions.take raise AbstractMethodError(self) def copy(self, deep=False): diff --git a/pandas/core/dtypes/base.py b/pandas/core/dtypes/base.py index 6dbed5f138d5d..49e98c16c716e 100644 --- a/pandas/core/dtypes/base.py +++ b/pandas/core/dtypes/base.py @@ -16,6 +16,12 @@ class _DtypeOpsMixin(object): # classes will inherit from this Mixin. Once everything is compatible, this # class's methods can be moved to ExtensionDtype and removed. + # na_value is the default NA value to use for this type. This is used in + # e.g. ExtensionArray.take. This should be the user-facing "boxed" version + # of the NA value, not the physical NA vaalue for storage. + # e.g. for JSONArray, this is an empty dictionary. + na_value = np.nan + def __eq__(self, other): """Check whether 'other' is equal to self. @@ -92,6 +98,8 @@ def is_dtype(cls, dtype): class ExtensionDtype(_DtypeOpsMixin): """A custom data type, to be paired with an ExtensionArray. + .. versionadded:: 0.23.0 + Notes ----- The interface includes the following abstract methods that must @@ -101,6 +109,9 @@ class ExtensionDtype(_DtypeOpsMixin): * name * construct_from_string + The `na_value` class attribute can be used to set the default NA value + for this type. :attr:`numpy.nan` is used by default. + This class does not inherit from 'abc.ABCMeta' for performance reasons. Methods and properties required by the interface raise ``pandas.errors.AbstractMethodError`` and no ``register`` method is diff --git a/pandas/core/dtypes/cast.py b/pandas/core/dtypes/cast.py index 75434fcc2b40d..e4ed6d544d42e 100644 --- a/pandas/core/dtypes/cast.py +++ b/pandas/core/dtypes/cast.py @@ -255,7 +255,6 @@ def changeit(): def maybe_promote(dtype, fill_value=np.nan): - # if we passed an array here, determine the fill value by dtype if isinstance(fill_value, np.ndarray): if issubclass(fill_value.dtype.type, (np.datetime64, np.timedelta64)): @@ -294,6 +293,8 @@ def maybe_promote(dtype, fill_value=np.nan): elif is_datetimetz(dtype): if isna(fill_value): fill_value = iNaT + elif is_extension_array_dtype(dtype) and isna(fill_value): + fill_value = dtype.na_value elif is_float(fill_value): if issubclass(dtype.type, np.bool_): dtype = np.object_ diff --git a/pandas/core/dtypes/missing.py b/pandas/core/dtypes/missing.py index 2c98cedd7d715..3b2336bf19547 100644 --- a/pandas/core/dtypes/missing.py +++ b/pandas/core/dtypes/missing.py @@ -502,6 +502,8 @@ def na_value_for_dtype(dtype, compat=True): """ dtype = pandas_dtype(dtype) + if is_extension_array_dtype(dtype): + return dtype.na_value if (is_datetime64_dtype(dtype) or is_datetime64tz_dtype(dtype) or is_timedelta64_dtype(dtype) or is_period_dtype(dtype)): return NaT diff --git a/pandas/core/frame.py b/pandas/core/frame.py index de6985ef3b4ea..82d5a0286b117 100644 --- a/pandas/core/frame.py +++ b/pandas/core/frame.py @@ -3476,7 +3476,7 @@ def _reindex_index(self, new_index, method, copy, level, fill_value=np.nan, allow_dups=False) def _reindex_columns(self, new_columns, method, copy, level, - fill_value=np.nan, limit=None, tolerance=None): + fill_value=None, limit=None, tolerance=None): new_columns, indexer = self.columns.reindex(new_columns, method=method, level=level, limit=limit, tolerance=tolerance) diff --git a/pandas/core/generic.py b/pandas/core/generic.py index 86342b6996abf..6d55f92167d3b 100644 --- a/pandas/core/generic.py +++ b/pandas/core/generic.py @@ -3660,7 +3660,7 @@ def reindex(self, *args, **kwargs): copy = kwargs.pop('copy', True) limit = kwargs.pop('limit', None) tolerance = kwargs.pop('tolerance', None) - fill_value = kwargs.pop('fill_value', np.nan) + fill_value = kwargs.pop('fill_value', None) # Series.reindex doesn't use / need the axis kwarg # We pop and ignore it here, to make writing Series/Frame generic code @@ -3776,7 +3776,7 @@ def _reindex_multi(self, axes, copy, fill_value): @Appender(_shared_docs['reindex_axis'] % _shared_doc_kwargs) def reindex_axis(self, labels, axis=0, method=None, level=None, copy=True, - limit=None, fill_value=np.nan): + limit=None, fill_value=None): msg = ("'.reindex_axis' is deprecated and will be removed in a future " "version. Use '.reindex' instead.") self._consolidate_inplace() @@ -3790,7 +3790,7 @@ def reindex_axis(self, labels, axis=0, method=None, level=None, copy=True, return self._reindex_with_indexers({axis: [new_index, indexer]}, fill_value=fill_value, copy=copy) - def _reindex_with_indexers(self, reindexers, fill_value=np.nan, copy=False, + def _reindex_with_indexers(self, reindexers, fill_value=None, copy=False, allow_dups=False): """allow_dups indicates an internal call here """ @@ -7252,7 +7252,7 @@ def align(self, other, join='outer', axis=None, level=None, copy=True, raise TypeError('unsupported type: %s' % type(other)) def _align_frame(self, other, join='outer', axis=None, level=None, - copy=True, fill_value=np.nan, method=None, limit=None, + copy=True, fill_value=None, method=None, limit=None, fill_axis=0): # defaults join_index, join_columns = None, None diff --git a/pandas/core/indexing.py b/pandas/core/indexing.py index 2eb52ecc6bcc7..fe6d6775c4e0b 100755 --- a/pandas/core/indexing.py +++ b/pandas/core/indexing.py @@ -2417,12 +2417,53 @@ def maybe_convert_indices(indices, n): mask = indices < 0 if mask.any(): indices[mask] += n + mask = (indices >= n) | (indices < 0) if mask.any(): raise IndexError("indices are out-of-bounds") return indices +def validate_indices(indices, n): + """Perform bounds-checking for an indexer. + + -1 is allowed for indicating missing values. + + Parameters + ---------- + indices : ndarray + n : int + length of the array being indexed + + Raises + ------ + ValueError + + Examples + -------- + >>> validate_indices([1, 2], 3) + # OK + >>> validate_indices([1, -2], 3) + ValueError + >>> validate_indices([1, 2, 3], 3) + IndexError + >>> validate_indices([-1, -1], 0) + # OK + >>> validate_indices([0, 1], 0) + IndexError + """ + if len(indices): + min_idx = indices.min() + if min_idx < -1: + msg = ("'indices' contains values less than allowed ({} < {})" + .format(min_idx, -1)) + raise ValueError(msg) + + max_idx = indices.max() + if max_idx >= n: + raise IndexError("indices are out-of-bounds") + + def maybe_convert_ix(*args): """ We likely want to take the cross-product diff --git a/pandas/core/internals.py b/pandas/core/internals.py index a266ea620bd9f..474894aba65df 100644 --- a/pandas/core/internals.py +++ b/pandas/core/internals.py @@ -1888,6 +1888,11 @@ def _holder(self): # For extension blocks, the holder is values-dependent. return type(self.values) + @property + def fill_value(self): + # Used in reindex_indexer + return self.values.dtype.na_value + @property def _can_hold_na(self): # The default ExtensionArray._can_hold_na is True @@ -1951,7 +1956,8 @@ def take_nd(self, indexer, axis=0, new_mgr_locs=None, fill_tuple=None): # axis doesn't matter; we are really a single-dim object # but are passed the axis depending on the calling routing # if its REALLY axis 0, then this will be a reindex and not a take - new_values = self.values.take(indexer, fill_value=fill_value) + new_values = self.values.take(indexer, fill_value=fill_value, + allow_fill=True) # if we are a 1-dim object, then always place at 0 if self.ndim == 1: @@ -5440,6 +5446,14 @@ def is_uniform_join_units(join_units): len(join_units) > 1) +def is_uniform_reindex(join_units): + return ( + # TODO: should this be ju.block._can_hold_na? + all(ju.block and ju.block.is_extension for ju in join_units) and + len(set(ju.block.dtype.name for ju in join_units)) == 1 + ) + + def get_empty_dtype_and_na(join_units): """ Return dtype and N/A values to use when concatenating specified units. @@ -5457,6 +5471,12 @@ def get_empty_dtype_and_na(join_units): if blk is None: return np.float64, np.nan + if is_uniform_reindex(join_units): + # XXX: integrate property + empty_dtype = join_units[0].block.dtype + upcasted_na = join_units[0].block.fill_value + return empty_dtype, upcasted_na + has_none_blocks = False dtypes = [None] * len(join_units) for i, unit in enumerate(join_units): diff --git a/pandas/core/series.py b/pandas/core/series.py index f2ee225f50514..7abd95c68ea2b 100644 --- a/pandas/core/series.py +++ b/pandas/core/series.py @@ -3216,7 +3216,8 @@ def _reindex_indexer(self, new_index, indexer, copy): return self.copy() return self - new_values = algorithms.take_1d(self._values, indexer) + new_values = algorithms.take_1d(self._values, indexer, + allow_fill=True, fill_value=None) return self._constructor(new_values, index=new_index) def _needs_reindex_multi(self, axes, method, level): diff --git a/pandas/tests/extension/base/getitem.py b/pandas/tests/extension/base/getitem.py index ac156900671a6..5c9ede1079079 100644 --- a/pandas/tests/extension/base/getitem.py +++ b/pandas/tests/extension/base/getitem.py @@ -127,20 +127,53 @@ def test_take(self, data, na_value, na_cmp): result = data.take([0, -1]) assert result.dtype == data.dtype assert result[0] == data[0] - na_cmp(result[1], na_value) + assert result[1] == data[-1] + + result = data.take([0, -1], allow_fill=True, fill_value=na_value) + assert result[0] == data[0] + assert na_cmp(result[1], na_value) with tm.assert_raises_regex(IndexError, "out of bounds"): data.take([len(data) + 1]) def test_take_empty(self, data, na_value, na_cmp): empty = data[:0] - result = empty.take([-1]) - na_cmp(result[0], na_value) + + result = empty.take([-1], allow_fill=True) + assert na_cmp(result[0], na_value) + + with pytest.raises(IndexError): + empty.take([-1]) with tm.assert_raises_regex(IndexError, "cannot do a non-empty take"): empty.take([0, 1]) - @pytest.mark.xfail(reason="Series.take with extension array buggy for -1") + def test_take_negative(self, data): + # https://github.com/pandas-dev/pandas/issues/20640 + n = len(data) + result = data.take([0, -n, n - 1, -1]) + expected = data.take([0, 0, n - 1, n - 1]) + self.assert_extension_array_equal(result, expected) + + def test_take_non_na_fill_value(self, data_missing): + fill_value = data_missing[1] # valid + na = data_missing[0] + + array = data_missing._from_sequence([na, fill_value, na]) + result = array.take([-1, 1], fill_value=fill_value, allow_fill=True) + expected = array.take([1, 1]) + self.assert_extension_array_equal(result, expected) + + def test_take_pandas_style_negative_raises(self, data, na_value): + with pytest.raises(ValueError): + data.take([0, -2], fill_value=na_value, allow_fill=True) + + @pytest.mark.parametrize('allow_fill', [True, False]) + def test_take_out_of_bounds_raises(self, data, allow_fill): + arr = data[:3] + with pytest.raises(IndexError): + arr.take(np.asarray([0, 3]), allow_fill=allow_fill) + def test_take_series(self, data): s = pd.Series(data) result = s.take([0, -1]) @@ -166,3 +199,14 @@ def test_reindex(self, data, na_value): expected = pd.Series(data._from_sequence([na_value, na_value]), index=[n, n + 1]) self.assert_series_equal(result, expected) + + def test_reindex_non_na_fill_value(self, data_missing): + valid = data_missing[1] + na = data_missing[0] + + array = data_missing._from_sequence([na, valid]) + ser = pd.Series(array) + result = ser.reindex([0, 1, 2], fill_value=valid) + expected = pd.Series(data_missing._from_sequence([na, valid, valid])) + + self.assert_series_equal(result, expected) diff --git a/pandas/tests/extension/category/test_categorical.py b/pandas/tests/extension/category/test_categorical.py index 6ebe700f13be0..c34339c99322d 100644 --- a/pandas/tests/extension/category/test_categorical.py +++ b/pandas/tests/extension/category/test_categorical.py @@ -81,6 +81,8 @@ def test_merge(self, data, na_value): class TestGetitem(base.BaseGetitemTests): + skip_take = pytest.mark.skip(reason="GH-20664.") + @pytest.mark.skip(reason="Backwards compatibility") def test_getitem_scalar(self): # CategoricalDtype.type isn't "correct" since it should @@ -88,11 +90,35 @@ def test_getitem_scalar(self): # to break things by changing. pass - @pytest.mark.xfail(reason="Categorical.take buggy") + @skip_take def test_take(self): # TODO remove this once Categorical.take is fixed pass + @skip_take + def test_take_negative(self): + pass + + @skip_take + def test_take_pandas_style_negative_raises(self): + pass + + @skip_take + def test_take_non_na_fill_value(self): + pass + + @skip_take + def test_take_out_of_bounds_raises(self): + pass + + @skip_take + def test_take_series(self): + pass + + @skip_take + def test_reindex_non_na_fill_value(self): + pass + @pytest.mark.xfail(reason="Categorical.take buggy") def test_take_empty(self): pass diff --git a/pandas/tests/extension/decimal/array.py b/pandas/tests/extension/decimal/array.py index 5d749126e0cec..e9431bd0c233c 100644 --- a/pandas/tests/extension/decimal/array.py +++ b/pandas/tests/extension/decimal/array.py @@ -8,12 +8,12 @@ import pandas as pd from pandas.core.arrays import ExtensionArray from pandas.core.dtypes.base import ExtensionDtype -from pandas.core.dtypes.common import _ensure_platform_int class DecimalDtype(ExtensionDtype): type = decimal.Decimal name = 'decimal' + na_value = decimal.Decimal('NaN') @classmethod def construct_from_string(cls, string): @@ -28,6 +28,7 @@ class DecimalArray(ExtensionArray): dtype = DecimalDtype() def __init__(self, values): + assert all(isinstance(v, decimal.Decimal) for v in values) values = np.asarray(values, dtype=object) self._data = values @@ -52,6 +53,17 @@ def __getitem__(self, item): else: return type(self)(self._data[item]) + def take(self, indexer, allow_fill=False, fill_value=None): + from pandas.api.extensions import take + + data = self._data + if allow_fill and fill_value is None: + fill_value = self.dtype.na_value + + result = take(data, indexer, fill_value=fill_value, + allow_fill=allow_fill) + return self._from_sequence(result) + def copy(self, deep=False): if deep: return type(self)(self._data.copy()) @@ -80,20 +92,6 @@ def nbytes(self): def isna(self): return np.array([x.is_nan() for x in self._data]) - def take(self, indexer, allow_fill=True, fill_value=None): - indexer = np.asarray(indexer) - mask = indexer == -1 - - # take on empty array not handled as desired by numpy in case of -1 - if not len(self) and mask.all(): - return type(self)([self._na_value] * len(indexer)) - - indexer = _ensure_platform_int(indexer) - out = self._data.take(indexer) - out[mask] = self._na_value - - return type(self)(out) - @property def _na_value(self): return decimal.Decimal('NaN') diff --git a/pandas/tests/extension/decimal/test_decimal.py b/pandas/tests/extension/decimal/test_decimal.py index 53d74cd6d38cb..1f8cf0264f62f 100644 --- a/pandas/tests/extension/decimal/test_decimal.py +++ b/pandas/tests/extension/decimal/test_decimal.py @@ -108,7 +108,15 @@ class TestReshaping(BaseDecimal, base.BaseReshapingTests): class TestGetitem(BaseDecimal, base.BaseGetitemTests): - pass + + def test_take_na_value_other_decimal(self): + arr = DecimalArray([decimal.Decimal('1.0'), + decimal.Decimal('2.0')]) + result = arr.take([0, -1], allow_fill=True, + fill_value=decimal.Decimal('-1.0')) + expected = DecimalArray([decimal.Decimal('1.0'), + decimal.Decimal('-1.0')]) + self.assert_extension_array_equal(result, expected) class TestMissing(BaseDecimal, base.BaseMissingTests): diff --git a/pandas/tests/extension/json/array.py b/pandas/tests/extension/json/array.py index 2e75bb3b8c326..88bb66f38b35c 100644 --- a/pandas/tests/extension/json/array.py +++ b/pandas/tests/extension/json/array.py @@ -1,3 +1,15 @@ +"""Test extension array for storing nested data in a pandas container. + +The JSONArray stores lists of dictionaries. The storage mechanism is a list, +not an ndarray. + +Note: + +We currently store lists of UserDicts (Py3 only). Pandas has a few places +internally that specifically check for dicts, and does non-scalar things +in that case. We *want* the dictionaries to be treated as scalars, so we +hack around pandas by using UserDicts. +""" import collections import itertools import numbers @@ -14,6 +26,11 @@ class JSONDtype(ExtensionDtype): type = collections.Mapping name = 'json' + try: + na_value = collections.UserDict() + except AttributeError: + # source compatibility with Py2. + na_value = {} @classmethod def construct_from_string(cls, string): @@ -91,15 +108,33 @@ def nbytes(self): return sys.getsizeof(self.data) def isna(self): - return np.array([x == self._na_value for x in self.data]) - - def take(self, indexer, allow_fill=True, fill_value=None): - try: - output = [self.data[loc] if loc != -1 else self._na_value - for loc in indexer] - except IndexError: - raise IndexError("Index is out of bounds or cannot do a " - "non-empty take from an empty array.") + return np.array([x == self.dtype.na_value for x in self.data]) + + def take(self, indexer, allow_fill=False, fill_value=None): + # re-implement here, since NumPy has trouble setting + # sized objects like UserDicts into scalar slots of + # an ndarary. + indexer = np.asarray(indexer) + msg = ("Index is out of bounds or cannot do a " + "non-empty take from an empty array.") + + if allow_fill: + if fill_value is None: + fill_value = self.dtype.na_value + # bounds check + if (indexer < -1).any(): + raise ValueError + try: + output = [self.data[loc] if loc != -1 else fill_value + for loc in indexer] + except IndexError: + raise IndexError(msg) + else: + try: + output = [self.data[loc] for loc in indexer] + except IndexError: + raise IndexError(msg) + return self._from_sequence(output) def copy(self, deep=False): @@ -118,10 +153,6 @@ def unique(self): dict(x) for x in list(set(tuple(d.items()) for d in self.data)) ]) - @property - def _na_value(self): - return {} - @classmethod def _concat_same_type(cls, to_concat): data = list(itertools.chain.from_iterable([x.data for x in to_concat])) diff --git a/pandas/tests/extension/json/test_json.py b/pandas/tests/extension/json/test_json.py index 0ef34c3b0f679..1fdb7298eefc4 100644 --- a/pandas/tests/extension/json/test_json.py +++ b/pandas/tests/extension/json/test_json.py @@ -21,6 +21,17 @@ def dtype(): @pytest.fixture def data(): """Length-100 PeriodArray for semantics test.""" + data = make_data() + + # Why the while loop? NumPy is unable to construct an ndarray from + # equal-length ndarrays. Many of our operations involve coercing the + # EA to an ndarray of objects. To avoid random test failures, we ensure + # that our data is coercable to an ndarray. Several tests deal with only + # the first two elements, so that's what we'll check. + + while len(data[0]) == len(data[1]): + data = make_data() + return JSONArray(make_data()) @@ -41,8 +52,8 @@ def data_missing_for_sorting(): @pytest.fixture -def na_value(): - return {} +def na_value(dtype): + return dtype.na_value @pytest.fixture diff --git a/pandas/tests/indexing/test_indexing.py b/pandas/tests/indexing/test_indexing.py index c66310d10ebdc..04225454f61f9 100644 --- a/pandas/tests/indexing/test_indexing.py +++ b/pandas/tests/indexing/test_indexing.py @@ -16,7 +16,8 @@ import numpy as np import pandas as pd -from pandas.core.indexing import _non_reducing_slice, _maybe_numeric_slice +from pandas.core.indexing import (_non_reducing_slice, _maybe_numeric_slice, + validate_indices) from pandas import NaT, DataFrame, Index, Series, MultiIndex import pandas.util.testing as tm @@ -994,3 +995,27 @@ def test_none_coercion_mixed_dtypes(self): datetime(2000, 1, 3)], 'd': [None, 'b', 'c']}) tm.assert_frame_equal(start_dataframe, exp) + + +def test_validate_indices_ok(): + indices = np.asarray([0, 1]) + validate_indices(indices, 2) + validate_indices(indices[:0], 0) + validate_indices(np.array([-1, -1]), 0) + + +def test_validate_indices_low(): + indices = np.asarray([0, -2]) + with tm.assert_raises_regex(ValueError, "'indices' contains"): + validate_indices(indices, 2) + + +def test_validate_indices_high(): + indices = np.asarray([0, 1, 2]) + with tm.assert_raises_regex(IndexError, "indices are out"): + validate_indices(indices, 2) + + +def test_validate_indices_empty(): + with tm.assert_raises_regex(IndexError, "indices are out"): + validate_indices(np.array([0, 1]), 0) diff --git a/pandas/tests/test_take.py b/pandas/tests/test_take.py index 7b97b0e975df3..2b78c91f9dac5 100644 --- a/pandas/tests/test_take.py +++ b/pandas/tests/test_take.py @@ -3,6 +3,7 @@ from datetime import datetime import numpy as np +import pytest from pandas.compat import long import pandas.core.algorithms as algos import pandas.util.testing as tm @@ -445,3 +446,47 @@ def test_2d_datetime64(self): expected = arr.take(indexer, axis=1) expected[:, [2, 4]] = datetime(2007, 1, 1) tm.assert_almost_equal(result, expected) + + +class TestExtensionTake(object): + # The take method found in pd.api.extensions + + def test_bounds_check_large(self): + arr = np.array([1, 2]) + with pytest.raises(IndexError): + algos.take(arr, [2, 3], allow_fill=True) + + with pytest.raises(IndexError): + algos.take(arr, [2, 3], allow_fill=False) + + def test_bounds_check_small(self): + arr = np.array([1, 2, 3], dtype=np.int64) + indexer = [0, -1, -2] + with pytest.raises(ValueError): + algos.take(arr, indexer, allow_fill=True) + + result = algos.take(arr, indexer) + expected = np.array([1, 3, 2], dtype=np.int64) + tm.assert_numpy_array_equal(result, expected) + + @pytest.mark.parametrize('allow_fill', [True, False]) + def test_take_empty(self, allow_fill): + arr = np.array([], dtype=np.int64) + # empty take is ok + result = algos.take(arr, [], allow_fill=allow_fill) + tm.assert_numpy_array_equal(arr, result) + + with pytest.raises(IndexError): + algos.take(arr, [0], allow_fill=allow_fill) + + def test_take_na_empty(self): + result = algos.take(np.array([]), [-1, -1], allow_fill=True, + fill_value=0.0) + expected = np.array([0., 0.]) + tm.assert_numpy_array_equal(result, expected) + + def test_take_coerces_list(self): + arr = [1, 2, 3] + result = algos.take(arr, [0, 0]) + expected = np.array([1, 1]) + tm.assert_numpy_array_equal(result, expected) From b874eed428dd6b63ae8c9f29c0924e41ca8134c7 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Thu, 26 Apr 2018 19:52:22 -0500 Subject: [PATCH 2/9] BUG/API: FutureWarning from Categorical.take indices We're changing how Categorical.take handles negative indices to be in line with Series and other EAs. --- doc/source/whatsnew/v0.23.0.txt | 1 + pandas/core/arrays/categorical.py | 34 +++++++++--- pandas/core/series.py | 9 +++- pandas/tests/categorical/test_algos.py | 53 +++++++++++++++++++ pandas/tests/series/indexing/test_indexing.py | 10 ++++ 5 files changed, 99 insertions(+), 8 deletions(-) diff --git a/doc/source/whatsnew/v0.23.0.txt b/doc/source/whatsnew/v0.23.0.txt index ffa4f1068f84d..fe8b300184cf2 100644 --- a/doc/source/whatsnew/v0.23.0.txt +++ b/doc/source/whatsnew/v0.23.0.txt @@ -856,6 +856,7 @@ Other API Changes - Constructing a Series from a list of length 1 no longer broadcasts this list when a longer index is specified (:issue:`19714`, :issue:`20391`). - :func:`DataFrame.to_dict` with ``orient='index'`` no longer casts int columns to float for a DataFrame with only int and float columns (:issue:`18580`) - A user-defined-function that is passed to :func:`Series.rolling().aggregate() `, :func:`DataFrame.rolling().aggregate() `, or its expanding cousins, will now *always* be passed a ``Series``, rather than a ``np.array``; ``.apply()`` only has the ``raw`` keyword, see :ref:`here `. This is consistent with the signatures of ``.aggregate()`` across pandas (:issue:`20584`) +- The previous default behavior of negative indices in ``Categorical.take`` is deprecated. In a future version it will change from meaning missing values to meaning positional indicies from the right (:issue:`20664`) .. _whatsnew_0230.deprecations: diff --git a/pandas/core/arrays/categorical.py b/pandas/core/arrays/categorical.py index 7f0d54de9def8..8959741f3cb91 100644 --- a/pandas/core/arrays/categorical.py +++ b/pandas/core/arrays/categorical.py @@ -2,6 +2,7 @@ import numpy as np from warnings import warn +import textwrap import types from pandas import compat @@ -29,7 +30,7 @@ is_scalar, is_dict_like) -from pandas.core.algorithms import factorize, take_1d, unique1d +from pandas.core.algorithms import factorize, take_1d, unique1d, take from pandas.core.accessor import PandasDelegate from pandas.core.base import (PandasObject, NoNewAttributesMixin, _shared_docs) @@ -48,6 +49,17 @@ from .base import ExtensionArray +_take_msg = textwrap.dedent("""\ + Interpreting negative values in 'indexer' as missing values. + In the future, this will change to meaning positional indicies + from the right. + + Use 'allow_fill=True' to retain the previous behavior and silence this + warning. + + Use 'allow_fill=False' to accept the new behavior.""") + + def _cat_compare_op(op): def f(self, other): # On python2, you can usually compare any type to any type, and @@ -1732,17 +1744,25 @@ def fillna(self, value=None, method=None, limit=None): return self._constructor(values, categories=self.categories, ordered=self.ordered, fastpath=True) - def take_nd(self, indexer, allow_fill=True, fill_value=None): - """ Take the codes by the indexer, fill with the fill_value. + def take_nd(self, indexer, allow_fill=None, fill_value=None): + """ + Return the indices For internal compatibility with numpy arrays. """ + indexer = np.asarray(indexer, dtype=np.intp) + if allow_fill is None: + if (indexer < 0).any(): + warn(_take_msg, FutureWarning, stacklevel=2) + allow_fill = True - # filling must always be None/nan here - # but is passed thru internally - assert isna(fill_value) + if fill_value is None or isna(fill_value): + # For backwards compatability, we have to override + # any na values for `fill_value` + fill_value = -1 - codes = take_1d(self._codes, indexer, allow_fill=True, fill_value=-1) + codes = take(self._codes, indexer, allow_fill=allow_fill, + fill_value=fill_value) result = self._constructor(codes, categories=self.categories, ordered=self.ordered, fastpath=True) return result diff --git a/pandas/core/series.py b/pandas/core/series.py index 7abd95c68ea2b..a14f3299e11e9 100644 --- a/pandas/core/series.py +++ b/pandas/core/series.py @@ -3499,7 +3499,14 @@ def _take(self, indices, axis=0, convert=True, is_copy=False): indices = _ensure_platform_int(indices) new_index = self.index.take(indices) - new_values = self._values.take(indices) + + if is_categorical_dtype(self): + # https://github.com/pandas-dev/pandas/issues/20664 + # TODO: remove when the default Categorical.take behavior changes + kwargs = {'allow_fill': False} + else: + kwargs = {} + new_values = self._values.take(indices, **kwargs) result = (self._constructor(new_values, index=new_index, fastpath=True).__finalize__(self)) diff --git a/pandas/tests/categorical/test_algos.py b/pandas/tests/categorical/test_algos.py index 1c68377786dd4..14bc264c276ba 100644 --- a/pandas/tests/categorical/test_algos.py +++ b/pandas/tests/categorical/test_algos.py @@ -5,6 +5,17 @@ import pandas.util.testing as tm +@pytest.fixture(params=[True, False]) +def allow_fill(request): + """Boolean 'allow_fill' parameter for Categorical.take""" + return request.param + + +@pytest.fixture(params=[True, False]) +def ordered(request): + """Boolean 'ordered' parameter for Categorical.""" + return request.param + @pytest.mark.parametrize('ordered', [True, False]) @pytest.mark.parametrize('categories', [ ['b', 'a', 'c'], @@ -69,3 +80,45 @@ def test_isin_empty(empty): result = s.isin(empty) tm.assert_numpy_array_equal(expected, result) + + +class TestTake(object): + # https://github.com/pandas-dev/pandas/issues/20664 + + def test_take_warns(self): + cat = pd.Categorical(['a', 'b']) + with tm.assert_produces_warning(FutureWarning): + cat.take([0, -1]) + + def test_take_positive_no_warning(self): + cat = pd.Categorical(['a', 'b']) + with tm.assert_produces_warning(None): + cat.take([0, 0]) + + def test_take_bounds(self, allow_fill): + # https://github.com/pandas-dev/pandas/issues/20664 + cat = pd.Categorical(['a', 'b', 'a']) + with pytest.raises(IndexError): + cat.take([4, 5], allow_fill=allow_fill) + + def test_take_empty(self, allow_fill): + # https://github.com/pandas-dev/pandas/issues/20664 + cat = pd.Categorical([], categories=['a', 'b']) + with pytest.raises(IndexError): + cat.take([0], allow_fill=allow_fill) + + def test_positional_take(self, ordered): + cat = pd.Categorical(['a', 'a', 'b', 'b'], categories=['b', 'a'], + ordered=ordered) + result = cat.take([0, 1, 2], allow_fill=False) + expected = pd.Categorical(['a', 'a', 'b'], categories=cat.categories, + ordered=ordered) + tm.assert_categorical_equal(result, expected) + + def test_positional_take_unobserved(self, ordered): + cat = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'], + ordered=ordered) + result = cat.take([1, 0], allow_fill=False) + expected = pd.Categorical(['b', 'a'], categories=cat.categories, + ordered=ordered) + tm.assert_categorical_equal(result, expected) diff --git a/pandas/tests/series/indexing/test_indexing.py b/pandas/tests/series/indexing/test_indexing.py index 5cc1a8ff1c451..8571fbc10e9bb 100644 --- a/pandas/tests/series/indexing/test_indexing.py +++ b/pandas/tests/series/indexing/test_indexing.py @@ -753,6 +753,16 @@ def test_take(): s.take([-1, 3, 4], convert=False) +def test_take_categorical(): + # https://github.com/pandas-dev/pandas/issues/20664 + s = Series(pd.Categorical(['a', 'b', 'c'])) + result = s.take([-2, -2, 0]) + expected = Series(pd.Categorical(['b', 'b', 'a'], + categories=['a', 'b', 'c']), + index=[1, 1, 0]) + assert_series_equal(result, expected) + + def test_head_tail(test_data): assert_series_equal(test_data.series.head(), test_data.series[:5]) assert_series_equal(test_data.series.head(0), test_data.series[0:0]) From 07a561cb0288876b422a66095329e5b7f4786c01 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Fri, 27 Apr 2018 08:16:41 -0500 Subject: [PATCH 3/9] Linting --- pandas/tests/categorical/test_algos.py | 1 + 1 file changed, 1 insertion(+) diff --git a/pandas/tests/categorical/test_algos.py b/pandas/tests/categorical/test_algos.py index 14bc264c276ba..c3514768d3b6b 100644 --- a/pandas/tests/categorical/test_algos.py +++ b/pandas/tests/categorical/test_algos.py @@ -16,6 +16,7 @@ def ordered(request): """Boolean 'ordered' parameter for Categorical.""" return request.param + @pytest.mark.parametrize('ordered', [True, False]) @pytest.mark.parametrize('categories', [ ['b', 'a', 'c'], From 3a98ede1357c131b061d9f008cc20a1964e80d55 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Fri, 27 Apr 2018 10:51:57 -0500 Subject: [PATCH 4/9] Updates --- doc/source/whatsnew/v0.23.0.txt | 3 +- pandas/core/arrays/categorical.py | 32 ++++++++++++++++--- .../extension/category/test_categorical.py | 2 +- 3 files changed, 31 insertions(+), 6 deletions(-) diff --git a/doc/source/whatsnew/v0.23.0.txt b/doc/source/whatsnew/v0.23.0.txt index fe8b300184cf2..5634eeed016bc 100644 --- a/doc/source/whatsnew/v0.23.0.txt +++ b/doc/source/whatsnew/v0.23.0.txt @@ -856,7 +856,7 @@ Other API Changes - Constructing a Series from a list of length 1 no longer broadcasts this list when a longer index is specified (:issue:`19714`, :issue:`20391`). - :func:`DataFrame.to_dict` with ``orient='index'`` no longer casts int columns to float for a DataFrame with only int and float columns (:issue:`18580`) - A user-defined-function that is passed to :func:`Series.rolling().aggregate() `, :func:`DataFrame.rolling().aggregate() `, or its expanding cousins, will now *always* be passed a ``Series``, rather than a ``np.array``; ``.apply()`` only has the ``raw`` keyword, see :ref:`here `. This is consistent with the signatures of ``.aggregate()`` across pandas (:issue:`20584`) -- The previous default behavior of negative indices in ``Categorical.take`` is deprecated. In a future version it will change from meaning missing values to meaning positional indicies from the right (:issue:`20664`) +- The previous default behavior of negative indices in ``Categorical.take`` is deprecated. In a future version it will change from meaning missing values to meaning positional indices from the right. The future behavior is consistent with :meth:`Series.take` (:issue:`20664`). .. _whatsnew_0230.deprecations: @@ -1025,6 +1025,7 @@ Categorical - Bug in ``Categorical.__iter__`` not converting to Python types (:issue:`19909`) - Bug in :func:`pandas.factorize` returning the unique codes for the ``uniques``. This now returns a ``Categorical`` with the same dtype as the input (:issue:`19721`) - Bug in :func:`pandas.factorize` including an item for missing values in the ``uniques`` return value (:issue:`19721`) +- Bug in :meth:`Series.take` with categorical data interpreting ``-1`` in `indicies` as missing value markers, rather than the last element of the Series (:issue:`20664`) Datetimelike ^^^^^^^^^^^^ diff --git a/pandas/core/arrays/categorical.py b/pandas/core/arrays/categorical.py index 8959741f3cb91..106eaaec39c7d 100644 --- a/pandas/core/arrays/categorical.py +++ b/pandas/core/arrays/categorical.py @@ -1746,9 +1746,33 @@ def fillna(self, value=None, method=None, limit=None): def take_nd(self, indexer, allow_fill=None, fill_value=None): """ - Return the indices + Take elements from the Categorical. - For internal compatibility with numpy arrays. + Parameters + ---------- + indexer : sequence of integers + allow_fill : bool, default None. + How to handle negative values in `indexer`. + + * False: negative values in `indices` indicate positional indices + from the right. This is similar to + :func:`numpy.take`. + + * True: negative values in `indices` indicate missing values + (the default). These values are set to `fill_value`. Any other + other negative values raise a ``ValueError``. + + .. versionchanged:: 0.23.0 + + Deprecated the default value of `allow_fill`. The deprecated + default is ``True``. In the future, this will change to + ``False``. + + Returns + ------- + Categorical + This Categorical will have the same categories and ordered as + `self`. """ indexer = np.asarray(indexer, dtype=np.intp) if allow_fill is None: @@ -1757,8 +1781,8 @@ def take_nd(self, indexer, allow_fill=None, fill_value=None): allow_fill = True if fill_value is None or isna(fill_value): - # For backwards compatability, we have to override - # any na values for `fill_value` + # The isna(fill_value) is included for backwards compatability. + # Categorical overrides any NA value with -1. fill_value = -1 codes = take(self._codes, indexer, allow_fill=allow_fill, diff --git a/pandas/tests/extension/category/test_categorical.py b/pandas/tests/extension/category/test_categorical.py index c34339c99322d..8685819f369b9 100644 --- a/pandas/tests/extension/category/test_categorical.py +++ b/pandas/tests/extension/category/test_categorical.py @@ -111,7 +111,7 @@ def test_take_non_na_fill_value(self): def test_take_out_of_bounds_raises(self): pass - @skip_take + @pytest.mark.skip(reason="GH-20747. Unobserved categories.") def test_take_series(self): pass From 69701f2bd0d384abc465cccfbdc315a8af216cd3 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Fri, 27 Apr 2018 13:04:45 -0500 Subject: [PATCH 5/9] Clarified comment --- pandas/core/arrays/categorical.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/pandas/core/arrays/categorical.py b/pandas/core/arrays/categorical.py index 106eaaec39c7d..517c21cc1bc3a 100644 --- a/pandas/core/arrays/categorical.py +++ b/pandas/core/arrays/categorical.py @@ -1780,9 +1780,9 @@ def take_nd(self, indexer, allow_fill=None, fill_value=None): warn(_take_msg, FutureWarning, stacklevel=2) allow_fill = True - if fill_value is None or isna(fill_value): - # The isna(fill_value) is included for backwards compatability. - # Categorical overrides any NA value with -1. + if isna(fill_value): + # For categorical, any NA value is considered a user-facing + # NA value. Our storage NA value is -1. fill_value = -1 codes = take(self._codes, indexer, allow_fill=allow_fill, From d1c7e388f22fc83fa515d1b3f21bffb1175b0f4c Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Fri, 27 Apr 2018 20:11:57 -0500 Subject: [PATCH 6/9] Actually use data --- pandas/tests/extension/json/test_json.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pandas/tests/extension/json/test_json.py b/pandas/tests/extension/json/test_json.py index 1fdb7298eefc4..b7ac8033f3f6d 100644 --- a/pandas/tests/extension/json/test_json.py +++ b/pandas/tests/extension/json/test_json.py @@ -32,7 +32,7 @@ def data(): while len(data[0]) == len(data[1]): data = make_data() - return JSONArray(make_data()) + return JSONArray(data) @pytest.fixture From 6056f7ac905a3972166f0bf89f0d55026bc59d8c Mon Sep 17 00:00:00 2001 From: Joris Van den Bossche Date: Sat, 28 Apr 2018 11:26:44 +0200 Subject: [PATCH 7/9] typo --- doc/source/whatsnew/v0.23.0.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/whatsnew/v0.23.0.txt b/doc/source/whatsnew/v0.23.0.txt index 80338cffd584f..aacc7852027a1 100644 --- a/doc/source/whatsnew/v0.23.0.txt +++ b/doc/source/whatsnew/v0.23.0.txt @@ -1025,7 +1025,7 @@ Categorical - Bug in ``Categorical.__iter__`` not converting to Python types (:issue:`19909`) - Bug in :func:`pandas.factorize` returning the unique codes for the ``uniques``. This now returns a ``Categorical`` with the same dtype as the input (:issue:`19721`) - Bug in :func:`pandas.factorize` including an item for missing values in the ``uniques`` return value (:issue:`19721`) -- Bug in :meth:`Series.take` with categorical data interpreting ``-1`` in `indicies` as missing value markers, rather than the last element of the Series (:issue:`20664`) +- Bug in :meth:`Series.take` with categorical data interpreting ``-1`` in `indices` as missing value markers, rather than the last element of the Series (:issue:`20664`) Datetimelike ^^^^^^^^^^^^ From f7e0e6dab26eef97be564fef3507925f5d1b5d4e Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Sat, 28 Apr 2018 13:55:27 -0500 Subject: [PATCH 8/9] update --- pandas/tests/categorical/conftest.py | 13 +++++++++++++ pandas/tests/categorical/test_algos.py | 12 ------------ 2 files changed, 13 insertions(+), 12 deletions(-) create mode 100644 pandas/tests/categorical/conftest.py diff --git a/pandas/tests/categorical/conftest.py b/pandas/tests/categorical/conftest.py new file mode 100644 index 0000000000000..274389d484995 --- /dev/null +++ b/pandas/tests/categorical/conftest.py @@ -0,0 +1,13 @@ +import pytest + + +@pytest.fixture(params=[True, False]) +def allow_fill(request): + """Boolean 'allow_fill' parameter for Categorical.take""" + return request.param + + +@pytest.fixture(params=[True, False]) +def ordered(request): + """Boolean 'ordered' parameter for Categorical.""" + return request.param diff --git a/pandas/tests/categorical/test_algos.py b/pandas/tests/categorical/test_algos.py index c3514768d3b6b..dcf2081ae32fe 100644 --- a/pandas/tests/categorical/test_algos.py +++ b/pandas/tests/categorical/test_algos.py @@ -5,18 +5,6 @@ import pandas.util.testing as tm -@pytest.fixture(params=[True, False]) -def allow_fill(request): - """Boolean 'allow_fill' parameter for Categorical.take""" - return request.param - - -@pytest.fixture(params=[True, False]) -def ordered(request): - """Boolean 'ordered' parameter for Categorical.""" - return request.param - - @pytest.mark.parametrize('ordered', [True, False]) @pytest.mark.parametrize('categories', [ ['b', 'a', 'c'], From 12485c525cd18f967c473ccb5f99c887e4f972b2 Mon Sep 17 00:00:00 2001 From: Tom Augspurger Date: Mon, 30 Apr 2018 06:17:37 -0500 Subject: [PATCH 9/9] Moved to deprecations --- doc/source/whatsnew/v0.23.0.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/source/whatsnew/v0.23.0.txt b/doc/source/whatsnew/v0.23.0.txt index aacc7852027a1..2440a8d586c96 100644 --- a/doc/source/whatsnew/v0.23.0.txt +++ b/doc/source/whatsnew/v0.23.0.txt @@ -856,7 +856,6 @@ Other API Changes - Constructing a Series from a list of length 1 no longer broadcasts this list when a longer index is specified (:issue:`19714`, :issue:`20391`). - :func:`DataFrame.to_dict` with ``orient='index'`` no longer casts int columns to float for a DataFrame with only int and float columns (:issue:`18580`) - A user-defined-function that is passed to :func:`Series.rolling().aggregate() `, :func:`DataFrame.rolling().aggregate() `, or its expanding cousins, will now *always* be passed a ``Series``, rather than a ``np.array``; ``.apply()`` only has the ``raw`` keyword, see :ref:`here `. This is consistent with the signatures of ``.aggregate()`` across pandas (:issue:`20584`) -- The previous default behavior of negative indices in ``Categorical.take`` is deprecated. In a future version it will change from meaning missing values to meaning positional indices from the right. The future behavior is consistent with :meth:`Series.take` (:issue:`20664`). .. _whatsnew_0230.deprecations: @@ -892,6 +891,7 @@ Deprecations removed in a future version (:issue:`20419`). - ``DatetimeIndex.offset`` is deprecated. Use ``DatetimeIndex.freq`` instead (:issue:`20716`) - ``Index.get_duplicates()`` is deprecated and will be removed in a future version (:issue:`20239`) +- The previous default behavior of negative indices in ``Categorical.take`` is deprecated. In a future version it will change from meaning missing values to meaning positional indices from the right. The future behavior is consistent with :meth:`Series.take` (:issue:`20664`). .. _whatsnew_0230.prior_deprecations: