ENH: Add argmax and argmin to ExtensionArray #27801
    pass
else:
    ser = pd.Series(data)
    with pytest.raises(TypeError):
Error message?
Thanks for your review.
I don't think it makes sense to implement `argmax`, `argmin`, `max` and `min` for `ArrowBoolArray`, which has only two values. Calling those methods will raise `NotImplementedError`. Getting, but not calling, the `max` and `min` attributes with `getattr` doesn't raise any error.
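The difference between looking up the attribute and calling it can be sketched as follows. `TwoValueArray` is a hypothetical stand-in, not the real `ArrowBoolArray` from the pandas test suite; it only assumes that the reductions exist as methods and raise when called.

```python
class TwoValueArray:
    """Hypothetical stand-in for an array type that does not support reductions."""

    def max(self):
        # The method exists, but calling it fails.
        raise NotImplementedError

    def min(self):
        raise NotImplementedError


arr = TwoValueArray()

# Attribute lookup succeeds: this only returns the bound method.
bound_max = getattr(arr, "max")

# The error only appears once the method is actually called.
try:
    bound_max()
    call_raised = False
except NotImplementedError:
    call_raised = True
```

So a test that only does `getattr(data, "max")` would pass without ever exercising the failing code path.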
I think @gfyoung was asking that you assert something about the error message with the `match=` keyword.
Those functions just raise `TypeError` without any error message.
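For context, a minimal sketch of what asserting on the message looks like, and of what is left to assert when the exception carries no message. The two functions and the message text are hypothetical; they are not the functions in this PR.

```python
import pytest


def reduce_with_message():
    # Hypothetical message, only for illustration.
    raise TypeError("cannot perform argmax with type bool")


def reduce_without_message():
    # A bare exception: its string representation is empty.
    raise TypeError


# match= is applied as a regular-expression search on str(exception).
with pytest.raises(TypeError, match="cannot perform argmax"):
    reduce_with_message()

# Without a message, only the exception type can be checked.
with pytest.raises(TypeError) as excinfo:
    reduce_without_message()

empty_message = str(excinfo.value)
```

With a bare `TypeError` there is no text for `match=` to check, so the functions would first need to raise with a message before the test can assert on it.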
def test_min(self):
    # GH 24382
    pass
How come these are empty?
The methods added in this PR ignore `nan`, i.e., they behave like `skipna=True`. The existing `categorical.min` returns `nan` if the categorical contains any `nan`. This behavior is expected in `test_min_max` (tests/arrays/categorical/test_analytics.py).
If we want `min` and `max` of the generic EA to ignore `nan`, a future PR is required to add a `skipna` parameter to `categorical.min` and `categorical.max`.
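The two behaviours under discussion can be illustrated with plain NumPy floats. This is only an analogy for "propagate missing values" versus "skip missing values"; it does not use `Categorical` or any code from this PR.

```python
import numpy as np

values = np.array([2.0, np.nan, 1.0])

# Propagating behaviour: any NaN makes the result NaN.
propagated_min = np.min(values)

# Skipping behaviour: NaN entries are ignored.
skipped_min = np.nanmin(values)
skipped_argmin = int(np.nanargmin(values))
```

Here `propagated_min` is NaN while `skipped_min` is `1.0` at position `2`, which is the kind of mismatch the empty tests were sidestepping.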
We don't want a release where these become out of sync, so perhaps a PR implementing `skipna=True/False` for Categorical first makes sense.
Can you merge master and move the release note to 1.0?
@makbigc can you rebase? I think getting these implemented could help with some of the groupby cleanup I've been working on.
@makbigc can you rebase?
Sorry for the late reply. I will work on it.
doc/source/whatsnew/v0.25.1.rst (outdated)
@@ -6,6 +6,7 @@ What's new in 0.25.1 (July XX, 2019)
Enhancements
~~~~~~~~~~~~

- Add :meth:`ExtensionArray.argmax`, :meth:`ExtensionArray.max`, :meth:`ExtensionArray.argmin` and :meth:`ExtensionArray.min` (:issue:`24382`)
Can you move this to 1.0.0 now?
This would definitely be a neat feature. @makbigc are you still wanting to work on this?
@makbigc can you merge master?
I don't know what linting error I have made.
@makbigc mind merging master? I will then take a look at any test failures.
@makbigc sorry to chase you up, just wanted to ask: are you still working on this? Thanks :)
OK, I pushed some updates to this PR:
So all please take a new look! I didn't yet hook this into What's also still missing is a |
My understanding based on other threads is that we shouldn't be using values_for(argsort|factorize) for anything other than (argsort|factorize)
@jbrockmendel I think Before we better define the expected semantics of
I don't understand this distinction. My understanding based on what you've said elsewhere is that we can't count on _values_for_argsort to be implemented, only argsort
That was indeed the original idea: as EA author you can either implement Now, what became apparent in the recent discussions regarding those
@jbrockmendel any response on the above? Do you understand why / are you OK with using
OK with it
And more specific comments on the PR?
@@ -319,6 +319,33 @@ def nargsort(
    return indexer


def nargminmax(values, method: str):
I don't think these should be in `base.py`, but rather in `array_ops`.
The reason I put them here is because I think it makes sense to keep it close to `nargsort`, since the code is very similar (using the same approach with the `idx`/`non_nan_idx` etc).
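The `idx`/`non_nan_idx` approach mentioned here can be sketched on a plain float array. This is an illustration under simplifying assumptions, not the code in the PR: `nargminmax_sketch` is a hypothetical name, and it uses `np.isnan` on an ndarray where the PR uses `isna` and `_values_for_argsort` on an ExtensionArray.

```python
import numpy as np


def nargminmax_sketch(values: np.ndarray, method: str) -> int:
    """Position of the min/max among non-missing values (illustrative only)."""
    if method not in {"argmax", "argmin"}:
        raise ValueError(f"unknown method: {method!r}")
    func = np.argmax if method == "argmax" else np.argmin

    mask = np.isnan(values)        # True where a value is missing
    idx = np.arange(len(values))   # positions in the original array
    non_nans = values[~mask]       # values that take part in the comparison
    non_nan_idx = idx[~mask]       # their original positions

    # func returns a position within non_nans; map it back to the original array.
    return int(non_nan_idx[func(non_nans)])


data = np.array([np.nan, 3.0, 1.0, np.nan, 5.0])
argmin_pos = nargminmax_sketch(data, "argmin")
argmax_pos = nargminmax_sketch(data, "argmax")
```

With an all-missing input, `non_nans` is empty and NumPy's `argmin`/`argmax` raise `ValueError`; the sketch does not handle that case.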
> I don't think these should be in base.py

BTW, this is not `base.py`, but `core/sorting.py`, which groups a whole bunch of functionality related to sortable values.
(It might make sense to move `sorting.py` into the `array_ops` submodule, but I would do that as a separate move.)
k that makes sense (and let’s move sorting.py) as a followon
@@ -319,6 +319,33 @@ def nargsort(
    return indexer


def nargminmax(values, method: str):
annotate values as EA?
    func = np.argmax if method == "argmax" else np.argmin

    mask = np.asarray(isna(values))
    values = values._values_for_argsort()
I'm not wild about relying on non-public attrs here. Could we have the EA method pass `values` and `mask`?
`_values_for_argsort` is a "public developer" API (it's part of the EA interface), that's the entire point of it.
I know we have had the discussion about the point of `_values_for_argsort`, and in principle we could also do without. But at this point, we have that method, it is used for `argsort` as well, so I think it is most logical that I use it here. And we can continue that general discussion about `_values_for_argsort` elsewhere.
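One way to read the two positions in this thread is sketched below: the array exposes a sortable proxy through `_values_for_argsort`, and a shared helper receives only `values` and `mask`, as suggested in the review comment. Both `LengthOrderedStrings` and `argminmax_from_parts` are hypothetical names for illustration; neither is part of pandas, and the toy class does not subclass `ExtensionArray`.

```python
import numpy as np


def argminmax_from_parts(values: np.ndarray, mask: np.ndarray, method: str) -> int:
    """Hypothetical helper that receives the sortable values and the missing mask."""
    func = np.argmax if method == "argmax" else np.argmin
    positions = np.arange(len(values))[~mask]
    return int(positions[func(values[~mask])])


class LengthOrderedStrings:
    """Toy container (not a real ExtensionArray) that orders strings by length."""

    def __init__(self, items):
        self._items = list(items)  # None marks a missing entry

    def isna(self) -> np.ndarray:
        return np.array([item is None for item in self._items], dtype=bool)

    def _values_for_argsort(self) -> np.ndarray:
        # Sortable proxy for the stored data; missing entries get a placeholder.
        return np.array([-1 if item is None else len(item) for item in self._items])

    def argmax(self) -> int:
        return argminmax_from_parts(self._values_for_argsort(), self.isna(), "argmax")

    def argmin(self) -> int:
        return argminmax_from_parts(self._values_for_argsort(), self.isna(), "argmin")


arr = LengthOrderedStrings(["pear", None, "fig", "banana"])
longest = arr.argmax()
shortest = arr.argmin()
```

In this arrangement the helper never touches an underscore-prefixed attribute itself, while the array still decides what "sortable values" means for its own data.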
With the EA methods in place, should we be dispatching to it from Series/DataFrame etc?
Yes, see my "I didn't yet hook this into Series.argmin/argmax" above (#27801 (comment)). There are some open questions regarding behaviour: #33941 (I suppose we are fine with new extension dtypes deviating from the current Series behaviour, but I'm not fully sure for longer-existing ones (categorical, datetimetz etc)).
One question inline.
Can you add this to the `Methods` in the ExtensionArray docstring?
lgtm
@makbigc can you merge master?
Going to merge this once CI passes (after the update with master).
@makbigc thanks for starting this PR, and sorry again the review / discussion process didn't go that smoothly initially.
Opened #35178 as follow-up issue to actually use this in the Series methods.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff