API: take interface for (Extension)Array-likes #20640
Comments
|
@jreback argued (for good reason) in #20582 that having to deal with bounds checking / correct behaviour for empty arrays in each ExtensionArray is rather error prone and complex code in each array implementation. So it would be good that we have in pandas a
If we want to expose our |
+1. |
Only other thing to add is that
If we need, then we can make |
Do you think there are many people relying on the fact that -1 returns NaN for |
Probably not :)
Hmm, so you're saying that (on master) we've already changed |
No, we didn't change this on master. It's an already existing bug (but only from 0.22, at least in 0.20 it was still working correctly). It's just that if we keep the difference in behaviour between Array.take and Series.take regarding -1, we should not just dispatch from Series.take to Array.take, but deal with this difference in behaviour. |
@jreback @TomAugspurger more comments on this one? What I am now thinking might be best:
|
The current example implementation (in the docstring) of:
would then become something like (not fully sure about the
|
I don't think we need to elevate `.take` to a top-level. It's just another way to do indexing, of which we already have many, and it's not settable, so not terribly useful. Why have another advertised method of doing the same thing that `.iloc` can do?
Also a bit -1 (pun) on changing the semantics for this; honestly I don't think the negative indexing is useful, and the filling of missing values IS useful (but numpy has trouble with this as it will never change dtypes; we just make this work). Instead I like your 2nd option, but expose API things that we want EA authors (and internal routines) to use that are not quite first-class top-level things (e.g. maybe in: |
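Maybe it was not clear, but I didn't put it as two options that we have to choose from, but as two parts of a single proposal. They can be discussed / implemented separately, but it's not either one or the other. Because also for the second bullet point (exposed `take` function), we need to decide on the semantics (first bullet point).
Above (#20640 (comment)) you said you were ok with making EA consistent with Series semantics. That means changing the semantics for EA in master (but EA itself is not yet in a released version, so that is no problem; only `Categorical.take` is affected). In any case, the "most" public `take`, `Series.take`, already uses numpy semantics. And in the case of `Series.take`, having -1 mean "missing" does not really make sense, as you then don't have a value for it in the index (unless you would also inject missing data in the index, but I don't think we should do that by default). Currently we have different `take` semantics for different `take` methods. If we want to get them consistent, some will need to change.
The "top-level" was maybe a bit of a distraction. Let's agree on the idea of an exposed `take` function (I put 'top-level' in my proposal above, but should just have said 'public', as it can be top-level or also in e.g. `pandas.api` somewhere). The exact location where it is exposed we can discuss later (it's not the most important thing to decide first), but if we do it, it will be a new, officially public (and thus advertised) method. If we want EA authors to be able to use this, I think this is the only way. We can also decide that EA authors need to implement this themselves (as they need to do now). |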
+1 for aligning EA.take with Series & ndarray.take
Instead I like your 2nd option
Maybe it was not clear, but I didn't put it as two options that we have to choose from, but as two parts of a single proposal. They can be discussed / implemented separately, but it's not either one or the other.
Because also for the second bullet point (exposed take function), we need to decide on the semantics (first bullet point).
Also a bit -1 (pun) on changing the semantics for this;
Above (#20640 (comment)) you said you were ok with making EA consistent with Series semantics. That means changing the semantics for EA in master (but EA itself is not yet in a released version, so that is no problem; only Categorical.take is affected).
In any case, the "most" public take, Series.take, already uses numpy semantics. And in the case of Series.take, having -1 mean "missing" does not really make sense, as you then don't have a value for it in the index (unless you would also inject missing data in the index, but I don't think we should do that by default).
Currently we have different take semantics for different take methods. If we want to get them consistent, some will need to change.
I don't think we need to elevate .take to a top-level. It's just another way to do indexing, of which we already have many, and it's not settable, so not terribly useful. Why have another advertised method of doing the same thing that (.iloc) can do?
The "top-level" was maybe a bit a distraction. Let's agree on the idea of an exposed take function (I put 'top-level' in my proposal above, but should just have said 'public', as it can be top-level or also in eg pandas.api somewhere).
The exact location where it is exposed, we can discuss later (it's not the most important thing to decide first), but if we do it, it will be a new, officially public (and thus advertised) method. If we want EA authors to be able to use this, I think this is the only way. We can also decide that EA authors need to implement this themselves (as they need to do now).
|
@jreback @TomAugspurger we forgot to discuss this on the dev hangout with all the subclassing/composition discussion, but any input on this? |
Agreed for 0.23.0 I think we should follow ndarray.take's semantics here. @jorisvandenbossche do you want me to work on this? |
yeah ok with fixing take as suggested |
@TomAugspurger if you have time, that would be welcome (I have only limited time today and tomorrow) |
One additional aspect I was tinkering on (but didn't raise yet) is the I don't really "like" it, so was trying to think if there would be a nicer way to do it (since this will be new API, we don't necessarily need to follow the internal
|
I suppose we could do
```python
# Sentinel that distinguishes "argument not passed" from None / NaN.
no_default = object()


def take(self, indexer, fill_value=no_default):  # keyword could also be `na_value`
    pass
```
When na_value / fill_value (whatever we call it) is no_default, we use Python /
NumPy semantics of indexing from the right.
For any other value, we treat -1 as NA. Should we raise for other negative
indexers?
|
Yep, that's what I meant with a custom singleton. I think I would be in favor of that.
Yes, I think so. In case of filling, only -1 should be allowed as negative value. |
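The sentinel idea and the "only -1 when filling" rule discussed above can be sketched roughly as follows. This is an illustration under assumptions, not the code that ended up in pandas: the free function `take`, the `_no_default` name, the chosen exception types, and the use of an object-dtype NumPy array as storage are all hypothetical.

```python
import numpy as np

_no_default = object()  # hypothetical sentinel: "no fill requested"


def take(values, indices, fill_value=_no_default):
    """Sketch: NumPy semantics by default, -1 means missing when filling."""
    values = np.asarray(values)
    indices = np.asarray(indices, dtype=np.intp)

    if fill_value is _no_default:
        # NumPy style: negative indices count from the end.
        return values.take(indices)

    # Filling requested: -1 is the only negative index that is allowed.
    if (indices < -1).any():
        raise ValueError("only -1 is allowed as a negative index when filling")

    mask = indices == -1
    if len(values) == 0:
        # NumPy cannot take from an empty array, so handle this case by hand.
        if not mask.all():
            raise IndexError("cannot take from an empty array")
        return np.full(len(indices), fill_value, dtype=object)

    # object dtype so that any fill value can be stored
    result = values.take(indices).astype(object)
    result[mask] = fill_value
    return result
```

With this sketch, `take([1, 2, 3], [0, -1])` returns the last element for `-1`, while `take([1, 2, 3], [0, -1], fill_value=None)` puts `None` in that slot, and an index of `-2` raises once a fill value is passed.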
One slight issue with Currently
I'll see if that breaks anything. |
To clarify, that would allow EAs to define a default That default value would probably go on the type? |
Hmm, this doesn't look too promising. We may just have to tell EA-authors that when you're expected to do pandas-style |
I don't know if it is relevant, but
I suppose this is because currently
That might indicate that the current API of |
Here's the new implementation for `take`:

```python
def take(self, indexer, fill_value=_no_default):
    indexer = np.asarray(indexer)
    if fill_value is _no_default:
        # NumPy style
        result = self.data.take(indexer)
    else:
        if isinstance(fill_value, numbers.Real) and np.isnan(fill_value):
            # convert the default np.nan to NaN.
            fill_value = decimal.Decimal("NaN")
        mask = indexer == -1
        # take on empty array not handled as desired by numpy
        # in case of -1 (all missing take)
        if not len(self) and mask.all():
            return type(self)([fill_value] * len(indexer))
        result = self.data.take(indexer)
        result[mask] = fill_value
    return type(self)(result)
```

The bit I don't like is the … But this does satisfy the requirements, I think.

```python
In [14]: s = pd.Series(DecimalArray([1, 2, 3]))

In [15]: s.take([0, 1, -1])
Out[15]:
0    1
1    2
2    3
dtype: decimal

In [16]: s.reindex([0, 1, -1])
Out[16]:
 0      1
 1      2
-1    NaN
dtype: decimal
```

I'll put up a WIP PR so it's easier to discuss. |
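The empty-array special case in the implementation above exists because NumPy refuses a non-empty take from an empty array, even when every index is -1. A small check of that behaviour, assuming only NumPy (the variable names are illustrative):

```python
import numpy as np

empty = np.array([], dtype=object)

# A non-empty take from an empty array raises, even if all indices are -1.
try:
    empty.take([-1, -1])
    raised = False
except IndexError:
    raised = True

# Taking nothing from an empty array is fine.
nothing = empty.take(np.array([], dtype=np.intp))

print(raised)        # True
print(len(nothing))  # 0
```

This is why an "all missing" take on an empty array has to be built by hand from the fill value rather than delegated to `ndarray.take`.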
IMO we should not put this in |
Is that because pandas will typically call |
Agreed. I'm trying to scope out how things should work. Then I'll work on a new
Precisely. Specifically |
Implements a take interface that's compatible with NumPy and optionally pandas' NA semantics.

```python
In [1]: import pandas as pd

In [2]: from pandas.tests.extension.decimal.array import *

In [3]: arr = DecimalArray(['1.1', '1.2', '1.3'])

In [4]: arr.take([0, 1, -1])
Out[4]: DecimalArray(array(['1.1', '1.2', '1.3'], dtype=object))

In [5]: arr.take([0, 1, -1], fill_value=float('nan'))
Out[5]: DecimalArray(array(['1.1', '1.2', Decimal('NaN')], dtype=object))
```

Closes pandas-dev#20640
Triggered by #20582, I was looking at the `take` implementation in ExtensionArray and Categorical (which is already an ExtensionArray subclass) and in the rest of pandas:

- `ExtensionArray.take` currently uses the "internal pandas"-like behaviour for take: `-1` is an indicator for a missing value (the behaviour we need for reindexing etc.).
- `Series.take` actually uses the numpy behaviour, where negative values (including `-1`) start counting from the end of the array-like.

To illustrate the difference with a small example:
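A minimal sketch of the two behaviours, using a plain NumPy array in place of an ExtensionArray. The helper name `take_with_missing` is hypothetical and only mimics the internal convention described above; it is not a pandas function.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0])
indices = np.array([0, 1, -1])

# NumPy semantics (what Series.take uses): -1 counts from the end.
numpy_style = values.take(indices)


def take_with_missing(values, indices, fill_value=np.nan):
    """Hypothetical helper: treat -1 as 'missing' and fill that slot."""
    indices = np.asarray(indices)
    mask = indices == -1
    result = values.take(indices).astype(float)
    result[mask] = fill_value
    return result


internal_style = take_with_missing(values, indices)

print(numpy_style)     # [10. 20. 30.]
print(internal_style)  # [10. 20. nan]
```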
This difference is a bit unfortunate IMO. If `ExtensionArray.take` is a public method (which it is right now), it would be nice if it had consistent behaviour with `Series.take`.

If we agree on that, I was thinking about the following options:

- Make `ExtensionArray.take` private for now (e.g. require a `_take` method for the interface) and keep the "internal pandas"-like behaviour.
- Make the `ExtensionArray.take` default behaviour consistent with `Series.take`, but still have the `allow_fill`/`fill_value` arguments so that when they are specified it has the "internal pandas"-like behaviour (so that internal code that expects this behaviour, and already passes those keywords, keeps working).