Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: take interface for (Extension)Array-likes #20640

Closed
jorisvandenbossche opened this issue Apr 9, 2018 · 27 comments · Fixed by #20814
Closed

API: take interface for (Extension)Array-likes #20640

jorisvandenbossche opened this issue Apr 9, 2018 · 27 comments · Fixed by #20814
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff ExtensionArray Extending pandas with custom dtypes or arrays.
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Triggered by #20582, I was looking at the take implementation in ExtensionArray and Categorical (which is already an ExtensionArray subclass) and in the rest of pandas:

  • ExtensionArray.take currently uses the "internal pandas"-like behaviour for take: -1 is an indicator for missing value (the behaviour we need for reindexing etc)
  • Series.take actually uses the numpy behaviour, where negative values (including -1) start counting from the end of the array-like.

To illustrate the difference with a small example:

In [9]: pd.Categorical(['a', 'b', 'c']).take([0, -1])
Out[9]: 
[a, NaN]
Categories (3, object): [a, b, c]

In [10]: pd.Series(['a', 'b', 'c']).take([0, -1])
Out[10]: 
0    a
1    c
dtype: object

This difference is a bit unfortunate IMO. If ExtensionArray.take is a public method (which it is right now), it would be nice if it has consistent behaviour with Series.take.
If we agree on that, I was thinking about following options:

  • make ExtensionArray.take private for now (eg require a _take method for the interface) and keep the "internal pandas"-like behaviour
  • make ExtensionArray.take default behaviour consistent with Series.take, but still have the allow_fill/fill_value arguments so that when they are specified it has the "internal pandas"-like behavour (so that internal code that expects this behaviour which already passes those keywords keeps working)
@jorisvandenbossche jorisvandenbossche added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff ExtensionArray Extending pandas with custom dtypes or arrays. labels Apr 9, 2018
@jorisvandenbossche jorisvandenbossche added this to the 0.23.0 milestone Apr 9, 2018
@jreback
Copy link
Contributor

jreback commented Apr 9, 2018

.take() was originally a numpy method so -1 indicates a position rather than a missing value. I don't think we can easily change this, so would make EA consistent here.

@jorisvandenbossche
Copy link
Member Author

@jreback argued (for good reason) in #20582 that having to deal with bounds checking / correct behaviour for empty arrays in each ExtensionArray is rather error prone and complex code in each array implementation.

So it would be good that we have in pandas a take implementation that does this for you (and can also be used in our own ExtensionArrays, to eg fix #20664).
The current take_1d however does currently no bounds checks whatsoever.

take_1d is internally used in many cases where the indexer passed to take_1d is the result of eg get_indexer, so this ensures that the indexer is correct, and for those no out-of-bounds checks are needed.

If we want to expose our take implementation to extension array authors, I would personally create a new take functions which just calls the existing take_1d, but additionally does first bounds checking.
That way, we keep on using the faster take_1d functions internally where we know bounds checks are not needed, and we can make the new take method public for external usage as well.

@TomAugspurger
Copy link
Contributor

I would personally create a new take functions which just calls the existing take_1d, but additionally does first bounds checking.

+1.

@TomAugspurger
Copy link
Contributor

Only other thing to add is that

  • I don't want to break API on Categorical.take
  • I want consistency between Categorical.take / _take and ExtensionArray.take / _take. I care less about consistency with Series than I do within EAs, though it'd be nice to have consistency here.

If we need, then we can make ExtensionArray._take the official method that's part of the interface.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Apr 12, 2018

I don't want to break API on Categorical.take

Do you think there are many people relying on the fact that -1 returns NaN for Categorical.take?
We can deprecate instead of just breaking it.
We in any case need to somehow fix Series.take, because now it dispatches to ExtensionArray.take, which gives wrong results for Series (because it has this difference in behaviour for -1).

@TomAugspurger
Copy link
Contributor

Do you think there are many people relying on the fact that -1 returns NaN for Categorical.take?

Probably not :)

We in any case need to somehow fix Series.take, because now it dispatches to ExtensionArray.take

Hmm, so you're saying that (on master) we've already changed Series[Categorical].take? I didn't realize that... OK, that makes me less certain on what do do here (making a _take vs. just changing ExtensionArray/Categorical.take. I'd support either I think.).

@jorisvandenbossche
Copy link
Member Author

Hmm, so you're saying that (on master) we've already changed Series[Categorical].take

No, we didn't change this on master. It's an already existing bug (but only from 0.22, at least in 0.20 it was still working correctly). It's just that if we keep the difference in behaviour between Array.take and Series.take regarding -1, we should not just dispatch from Series.take to Array.take, but deal with this difference in behaviour.
But for me that's a reason to actually align the default behaviour between Series and Array (Array.take still needs the filling behaviour with -1 which Series.take does not have, but it does not need to be the default).

@jorisvandenbossche
Copy link
Member Author

@jreback @TomAugspurger more comments on this one?

What I am now thinking might be best:

  • Make all our take implementation follow numpy semantics by default (meaning -1 gets converted to a positive indexer and does not indicate missingness). Consequences:
  • Provide a top-level take function
    • This would be a wrapper around take_1d, but does conversion of negative to positive indices by default, does bounds checking, ..
    • This would be exposed as a general public take function (similar to factorize and unique) and can be used by extension array authors to apply it on their underlying values if possible.

@jorisvandenbossche
Copy link
Member Author

The current example implementation (in the docstring) of:

           def take(self, indexer, allow_fill=True, fill_value=None):
               indexer = np.asarray(indexer)
               mask = indexer == -1

               # take on empty array not handled as desired by numpy
               # in case of -1 (all missing take)
               if not len(self) and mask.all():
                   return type(self)([np.nan] * len(indexer))

               result = self.data.take(indexer)
               result[mask] = np.nan  # NA for this type
               return type(self)(result)

would then become something like (not fully sure about the fill_value handling, need to check that):

           def take(self, indexer, allow_fill=True, fill_value=None):
               if fill_value is None:
                   fill_value = np.nan  # NA for this type
               result = pd.take(self.data, indexer, allow_fill=allow_fill, fill_value=fill_value)
               return type(self)(result)

@jreback
Copy link
Contributor

jreback commented Apr 16, 2018

I don't think we need to elevate .take to a top-level. Its just another way to do indexing, which we already have many, and its not settable, so not terrible useful. Why have another advertised method of doing the same thing that (.iloc) can do.

Also a bit -1 (pun) on changing the semantics for this; honestly I don't think the negative indexing is useful and the filling missing values IS useful (but numpy has trouble with this as it will never change dtypes; we just make this work).

Instead I like your 2nd option, but expose API things that we want EA authors (and internal routines) to use that are not quite first-class top-level things (e.g. .factorize() and .unique() are useful generally in there own right).

maybe in: pandas.core.arrays.api ?

@jorisvandenbossche
Copy link
Member Author

Instead I like your 2nd option

Maybe it was not clear, but I didn't put it as two options that we have to choose from, but as two parts of a single proposal. Although they can be discussed / implemented separately, but it's not either one of the other.
Because also for the second bullet point (exposed take function), we need to decide on the semantics (first bullet point).

Also a bit -1 (pun) on changing the semantics for this;

Above (#20640 (comment)) you said you were ok with making EA consistent with Series semantics. Which means changing the semantics for EA in master (but EA itself is not yet in released version, so that is no problem. Only Categorical.take is affected).

In any case, the "most" public take, Series.take, already uses numpy semantics. And in the case of Series.take, having -1 mean "missing" does not make really sense, as you then don't have a value for in the index (unless you would also inject missing data in the index, but I don't think we should do that by default).

Currently we have different take semantics for different take methods. If we want to get them consistent, some will need to change.

I don't think we need to elevate .take to a top-level. Its just another way to do indexing, which we already have many, and its not settable, so not terrible useful. Why have another advertised method of doing the same thing that (.iloc) can do.

The "top-level" was maybe a bit a distraction. Let's agree on the idea of an exposed take function (I put 'top-level' in my proposal above, but should just have said 'public', as it can be top-level or also in eg pandas.api somewhere).
The exact location where it is exposed, we can discuss later (it's not the most important thing to decide first), but if we do it, it will be a new, officially public (and thus advertised) method. If we want EA authors to be able to use this, I think this is the only way. We can also decide that EA authors need to implement this themselves (as they need to do now).

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 16, 2018 via email

@jorisvandenbossche
Copy link
Member Author

@jreback @TomAugspurger we forgot to discuss this on the dev hangout with all the subclassing/composition discussion, but any input on this?
I think we should try to decide on this API before 0.23.0.

@TomAugspurger
Copy link
Contributor

Agreed for 0.23.0

I think we should follow ndarray.take's semantics here.

@jorisvandenbossche do you want me to work on this?

@jreback
Copy link
Contributor

jreback commented Apr 24, 2018

yeah ok with fixing take as suggested

@jorisvandenbossche
Copy link
Member Author

@TomAugspurger if you have time, that would be welcome (I have only limited time today and tomorrow)

@jorisvandenbossche
Copy link
Member Author

One additional aspect I was tinkering on (but didn't raise yet) is the allow_fill=True, fill_value=None API.

I don't really "like" it, so was trying to think if there would be a nicer way to do it (since this will be new API, we don't necessarily need to follow the internal take_1d spelling).

  • Would it be possible to only have one keyword?
    • Because both can be somewhat redundant. In most cases, you either need to specify both or none (since the allow_fill default will be False), which doesn't feel as the best API design.
    • It might not be possible, because a fill_value of None is actually a valid argument (as fill value), so cannot be used to indicate that there should be no filling (unless we use a custom singleton instead of None for the default)
  • The naming: I personally find fill_value a bit confusing, and I seem to remember we had a similar discussion on ExtensionArray related PR about fill_value vs na_value
    • On the other hand, related methods like reindex also use fill_value

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 24, 2018 via email

@jorisvandenbossche
Copy link
Member Author

Yep, that's what I meant with a custom singleton. I think I would be in favor of that.

For other na_values, we set -1 to be NA. Should we raise for other negative indexers?

Yes, I think so. In case of filling, only -1 should be allowed as negative value.

@TomAugspurger
Copy link
Contributor

One slight issue with fill_value. Let's consider DecimalArray. We'd like its fill value be Decimal('NaN'), instead of np.nan. How would we get that information to DecimalArray.take?

Currently Series[DecimalArray].reindex() passes through fill_value=np.nan, since that's the default in algorithms.take_nd. Are we OK with

  • specifying no default for take_nd's fill_value.
  • only passing through take_nd's fill_value to ExtensionArray.take if it isn't the default?

I'll see if that breaks anything.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 24, 2018

To clarify, that would allow EAs to define a default fill_value as a class attribute. We could make the default np.nan.

That default value would probably go on the type?

@TomAugspurger
Copy link
Contributor

Hmm, this doesn't look too promising.

We may just have to tell EA-authors that when you're expected to do pandas-style take, you're getting np.nan for fill_value. It's up to you to interpret that how you want (e.g. transforming np.nan to Decimal('NaN').)

@jorisvandenbossche
Copy link
Member Author

Currently Series[DecimalArray].reindex() passes through fill_value=np.nan, since that's the default in algorithms.take_nd

I don't know if it is relevant, but reindex does already do the "correct" thing although its default fill_value is not honored:

In [23]: from pandas.tests.extension.decimal.array import DecimalArray, make_data

In [24]: arr = DecimalArray(make_data())

In [27]: s = pd.Series(arr[:3])

In [29]: s.reindex([2, 3])
Out[29]: 
2    0.31849019906500730670018128876108676195144653...
3                                                  NaN
dtype: decimal

In [30]: s.reindex([2, 3]).values
Out[30]: 
DecimalArray(array([Decimal('0.318490199065007306700181288761086761951446533203125'),
       Decimal('NaN')], dtype=object))

I suppose this is because currenlty DecimalArray.take does not honor the fill_value keyword.

Hmm, this doesn't look too promising.

We may just have to tell EA-authors that when you're expected to do pandas-style take, you're getting np.nan for fill_value. It's up to you to interpret that how you want (e.g. transforming np.nan to Decimal('NaN').)

That might indicate that the current API of allow_fill=True/False, fill_value=.. is actually not that bad? As then pandas can say: we want pandas-style take with filling without specifying a fill_value (which would then mean that they can use their default).

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 24, 2018

Here's the new implementation for DecimalArray.take

    def take(self, indexer, fill_value=_no_default):
        indexer = np.asarray(indexer)
        if fill_value is _no_default:
            # NumPy style
            result = self.data.take(indexer)
        else:
            if isinstance(fill_value, numbers.Real) and np.isnan(fill_value):
                # convert the default np.nan to NaN.
                fill_value = decimal.Decimal("NaN")
            mask = indexer == -1

            # take on empty array not handled as desired by numpy
            # in case of -1 (all missing take)
            if not len(self) and mask.all():
                return type(self)([fill_value] * len(indexer))

            result = self.data.take(indexer)
            result[mask] = fill_value

        return type(self)(result)

The bit I don't like is the convert np.nan to NaN. It means we can't do decimal_array.take(indexer, fill_value=np.nan), and actually get np.nan as the fill value (if that's a thing you want to do)...

But this does satisfy the requirements I think.

In [14]: s = pd.Series(DecimalArray([1, 2, 3]))

In [15]: s.take([0, 1, -1])
Out[15]:
0    1
1    2
2    3
dtype: decimal

In [16]: s.reindex([0, 1, -1])
Out[16]:
 0      1
 1      2
-1    NaN
dtype: decimal

I'll put up a WIP PR so it's easier to discuss.

@jorisvandenbossche
Copy link
Member Author

IMO we should not put this in DecimalArray.take (which means: require extension authors to do something similar), but move that code (more or less) to a public take (but that can be longer term improvement of course. Most important on short term is public API on ExtensionArrays itself).
Eg in the case of geopandas, I would like to use such a function on my integer array with fill_value of 0.

@jorisvandenbossche
Copy link
Member Author

The bit I don't like is the convert np.nan to NaN

Is that because pandas will typically call EA.take with a fill_value=np.nan ?
In which case almost every EA will need to just ignore that default? (or convert it like you do above)

@TomAugspurger
Copy link
Contributor

IMO we should not put this in DecimalArray.take

Agreed. I'm trying to scope out how things should work. Then I'll work on a new take method in pandas that EA authors can use to make the implementation easier.

Is that because pandas will typically call EA.take with a fill_value=np.nan

Precisely. Specifically algos.take_nd is calling EA.take with whatever is passed to it (if anything). I briefly looked at changing the default for algos.take_nd, but things became messy.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Apr 24, 2018
Implements a take interface that's compatible with NumPy and optionally pandas'
NA semantics.

```python
In [1]: import pandas as pd

In [2]: from pandas.tests.extension.decimal.array import *

In [3]: arr = DecimalArray(['1.1', '1.2', '1.3'])

In [4]: arr.take([0, 1, -1])
Out[4]: DecimalArray(array(['1.1', '1.2', '1.3'], dtype=object))

In [5]: arr.take([0, 1, -1], fill_value=float('nan'))
Out[5]: DecimalArray(array(['1.1', '1.2', Decimal('NaN')], dtype=object))
```

Closes pandas-dev#20640
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Apr 27, 2018
commit ec0cecd
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Fri Apr 27 06:02:48 2018 -0500

    Updates

commit 6858409
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Fri Apr 27 05:48:59 2018 -0500

    Added note

commit eb43fa4
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Apr 26 20:47:35 2018 -0500

    Really truly fix it hopefully.

commit 7c4f625
Merge: 9a6c7d4 6cacdde
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Apr 26 20:40:15 2018 -0500

    Merge remote-tracking branch 'upstream/master' into ea-take

commit 9a6c7d4
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Apr 26 20:04:17 2018 -0500

    Doc updates

commit eecd632
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Apr 26 15:00:00 2018 -0500

    Skip categorical take tests

commit f3b91ca
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Apr 26 13:43:26 2018 -0500

    doc fixup

commit fbc4425
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Apr 26 13:37:45 2018 -0500

    Updates

    * indexer -> indices
    * doc user-facing vs physical
    * assert na_cmps
    * test reindex w/ non-NA fill_value

commit 741f284
Merge: 5db6624 630ef16
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Apr 26 07:18:32 2018 -0500

    Merge remote-tracking branch 'upstream/master' into ea-take

commit 5db6624
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Apr 26 07:17:30 2018 -0500

    Doc and move tests

commit 74b2c09
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 21:40:18 2018 -0500

    Added verisonadded

commit fc729d6
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 15:51:27 2018 -0500

    Fixed editor

commit 1a4d987
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 15:50:48 2018 -0500

    Pass an array

commit bbcbf19
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 15:07:28 2018 -0500

    Cleanup

commit d5470a0
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 15:02:26 2018 -0500

    Fixed reorder

commit 82cad8b
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 15:00:43 2018 -0500

    Stale comment

commit c449afd
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 14:48:33 2018 -0500

    Bounds checking

commit 449983b
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 12:55:31 2018 -0500

    Linting

commit 69e7fe7
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 12:40:20 2018 -0500

    Updates

    1. Reversed order of take keywords
    2. Added to extensions API
    3. Removed default implementation

commit 05d8844
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 09:59:33 2018 -0500

    Updated docs

commit 31cd304
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 09:43:45 2018 -0500

    pep8

commit 338566f
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 09:42:28 2018 -0500

    Upcasting

commit b7ae0bc
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 09:06:59 2018 -0500

    revert combine change

commit 125ca0b
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 08:37:07 2018 -0500

    Simplify

    Upcasting is still broken

commit c721915
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 07:50:54 2018 -0500

    Removed default_fill_value

commit 37915e9
Merge: 67ba9dd 60fe82c
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 07:44:15 2018 -0500

    Merge remote-tracking branch 'upstream/master' into ea-take

commit 67ba9dd
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 07:42:54 2018 -0500

    more with default fill value

commit eba137f
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 05:59:58 2018 -0500

    More internals hacking

commit 08f2479
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Wed Apr 25 05:59:17 2018 -0500

    Fixup JSON take

commit 0be9ec6
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Tue Apr 24 18:02:13 2018 -0500

    non-internals changes

commit dacd98e
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Tue Apr 24 14:45:36 2018 -0500

    Moves

commit fb3c234
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Tue Apr 24 13:59:51 2018 -0500

    [WIP]: ExtensionArray.take default implementation

    Implements a take interface that's compatible with NumPy and optionally pandas'
    NA semantics.

    ```python
    In [1]: import pandas as pd

    In [2]: from pandas.tests.extension.decimal.array import *

    In [3]: arr = DecimalArray(['1.1', '1.2', '1.3'])

    In [4]: arr.take([0, 1, -1])
    Out[4]: DecimalArray(array(['1.1', '1.2', '1.3'], dtype=object))

    In [5]: arr.take([0, 1, -1], fill_value=float('nan'))
    Out[5]: DecimalArray(array(['1.1', '1.2', Decimal('NaN')], dtype=object))
    ```

    Closes pandas-dev#20640
jorisvandenbossche pushed a commit that referenced this issue Apr 27, 2018
Implements a take interface that's compatible with NumPy and optionally pandas'
NA semantics.

Closes #20640
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants