-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix series.round() handling of EA #26817
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
always add tests first.
Codecov Report
@@ Coverage Diff @@
## master #26817 +/- ##
===========================================
- Coverage 91.86% 41.12% -50.74%
===========================================
Files 179 179
Lines 50707 50707
===========================================
- Hits 46583 20855 -25728
- Misses 4124 29852 +25728
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26817 +/- ##
==========================================
- Coverage 91.87% 91.86% -0.01%
==========================================
Files 180 180
Lines 50712 50721 +9
==========================================
+ Hits 46590 46595 +5
- Misses 4122 4126 +4
Continue to review full report at Codecov.
|
For tests, I am not fully sure what is the best approach. You could add a |
I made some progress with this but, as discussed in #26730, introducing |
Can you give a bit more details? Trying to understand: for reductions, we currently define BTW, feel free to already push some WIP code, or point to your attempts of a pint array with |
The failing test now is In general, for each function in numpy, I have to either implement the function myself, or do any casts necessary and delegate to numpy.
It feels incredibly wrong to have to write the logic to handle this, just to override the implementation of cc @shoyer, is this really what NEP-18 expects devs to do? |
Here's another quandry: That's crazy. |
There has recently been a lot of discussion on the numpy mailing list about exactly this (a way to only implement part of the functions, and fall back to the "raw" numpy implementation otherwise). At some point there was a proposal to have a @shoyer your update to the NEP says "We considered three possible ways to resolve this issue, but none were entirely satisfactory". Do you have any recommendation to do this nevertheless? (there is a private |
@pilkibun From the mailing list discussion, it seems there is a |
I added an unpleasent traversal of all the numpy function arguments, coercing any EA via If acceptable, it's generic enough to perhaps consider including as a default implementation in the ExtensionArray base class. |
Other than the checks build, I think the test suite passes now. The test I added is numpy ver-limited to the np_dev build, because only 1.17 has That said, I don't find this approach very felicitous, because:
So I'm unhappy with this, even though it seemingly passes the tests. I don't know how to move forward with this. Thoughts? |
I updates the first post with some of the issued I encountered writing this, I think their typical problems likely to be encountered by any EA authors, so may be useful for refining the EA base classes/examples/documentation. |
unlikely to merge this with all of these work-arounds. This needs either to be a somewhat specific solution or a well designed and integrated soln. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be easier to review if you revert the unrelated changes.
@@ -323,7 +323,7 @@ def test_repeat(self, data, repeats, as_series, use_numpy): | |||
self.assert_equal(result, expected) | |||
|
|||
@pytest.mark.parametrize('repeats, kwargs, error, msg', [ | |||
(2, dict(axis=1), ValueError, "'axis"), | |||
(2, dict(axis=1), ValueError, "'?axis"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test seemed to raise with right error message, but without the quote. I'll revert and see if the CI fails again.
self._dtype = DecimalDtype(context) | ||
|
||
# aliases for common attribute names, to ensure pandas supports these |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this related to the rounding changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This issue helped me write a bug while working on the rounding changes, and I fixed it to protect myself. does that count as related?
@@ -153,6 +169,40 @@ def _reduce(self, name, skipna=True, **kwargs): | |||
"the {} operation".format(name)) | |||
return op(axis=0) | |||
|
|||
# numpy experimental NEP-18 (opt-in numpy 1.16, enabled in in 1.17) | |||
def __array_function__(self, func, types, args, kwargs): | |||
def coerce_EA(coll): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem like a good idea. For now, just use np.asarray
. I think we have a separate issue about converting EAs to preferred ndarray representation, but __array__
is the best we have for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've already said I'm not thrilled with this approach, but I'm not sure what you're suggesting.
How can you support pd.concat([EA,EA])
without resorting to something like this?
__array__
in relation to what? I've already discussed how implementing NEP-18 on a class, makes numpy ignore __array__
, since point 4 in he first post.
@TomAugspurger , @jreback If NEP-18 is used, something like this seems necessary. "Just use Here are the cases again:
standard pandas series know how to do all these things, and EA should be able to do so too, if the EA author needs them to. We could dispense with NEP18, and implement |
@pilkibun in regards to #26817 (comment)
this is not a valid function
this is not a problem at all as pandas can easily correctly dispatch on this w/o any numpy action at all
this is already handled by _reduce operators
is the only valid case AFAIK to actually use Let's NOT try to solve the world in this PR. Please please focus on a single one of these and don't try to change the others. These are each complicated enough and don't need to be intermixed. |
@pilkibun workflow related comment: can you just push new commits instead of force pushes? That makes it a lot easier to see what changed. |
that's a type, it is about |
@pilkibun can you try to explain how |
@pilkibun could you add back all the changes related to NEP18 you had? Having it removed from the history of this branch makes it very difficult to discuss the comments that @jreback and @TomAugspurger had. |
Reset as best I could. sorry for the pain. No more. |
@jreback, That sounds like you're being really pragmatic and down to earth, while I'm wasting time playing with nonsense. I don't think that's fair. If you actually read #26817 (comment) (please do), you'd see I opened by saying
So, suggesting the solution is to "just solve round()" is disingeneous. The real discussion is about how to solve the larger issue, with So Please don't oversimplify complicated things just for a rejoinder which doesn't help the discussion. |
@pilkibun I never said any of that, please do NOT put words into my statements. I AM being practical. This will not be merged as is. You need to separate things into well constructed and documented components. We are certainly willing to take tactical patches while awaiting towards a strategic fix. However, you are mixing up too many different items here. It is best to stick to a single well formed patch. |
@pilkibun I can't even begin to have a discussion about this code change. You have SO much going on that its not even worth discussing in this state except to say: simplify what you are changing. Please don't waste my time. I review many many PRs. The need to be focused and well formed. This is is fine if its 'pie-in-the' sky. |
Gladly. Here and in other discussion I've found you to be rude, arrogant, dismissive, condescending, I have had many missteps, but then again its a large project, the code is new to me, and I'm working on something which hasn't been fully fleshed out. Joris and tom make me want to do better when I fumble, while you make me want to walk away. I think the solution is simple. Save yourself precious time, and me the aggravation, and leave us to work in peace. I'm sure the two of them are more than capable of guiding this towards an acceptable result, or rejecting it if it isn't good enough. |
I'd like to get back to discussing implementing ops for EA, if possible, instead of you dragging me into an argument which will drain everyone of good will. |
Back to the issue @jorisvandenbossche, I can't replicate the pd.concat issue. I must have encountered it during some changes. It does seem to work even without NEP18. I guess I had trouble with
|
Let's put this conversation on hold for a bit, to give everyone some time away from it. I'll revisit it in 24h. |
I opened #26900 and #26901 a couple of hours ago, to eliminate some of the noise here, but then @jorisvandenbossche , you were unhappy about losing the discussion commits. How do you want me to handle that? |
Welcome back, I hope everyone had a productive break. I'm going to try to I think the root cause of the conflict were different expectations between the Typically, pandas uses the issue for design, and the PR for implementation. This So, let's attempt a summary at what happened: The precipitating comment seems to be
From there, things escalated. In #26817 (comment), Jeff In #26817 (comment), by
There were also a couple slightly incendiary (can something be slightly In #26817 (comment), I And things went further downhill from there as patience frayed all around. But I Summary over; now where do we stand on the meta-issue? A couple of general First, online communication is difficult. We have a diverse set of backgrounds Second, expectation setting seems crucial. As trivial as it sounds, I think much Related to both, I think Brett Cannon's thoughts on this are worth reading: And as a concrete example of how things could have gone differently, the comment
could be rephrased as
I think that better communicates what I think was the intended message. But the @jreback @pilkibun do you have thoughts? Perhaps reactions to the general Reminder: let's have a meeting of minds on the meta-issue before returning to |
git diff upstream/master -u -- "*.py" | flake8 --diff
Note that the change to a
np.round(self.array)
call was chosen so that__array__
and return a numeric ndarrayEA authors to implement round functionality via the
__array_function__
protocol.Update:
This has become a test case on supporting a wider range of numpy functions for EA.
Here a some of the issues encounterd, several are indicative of the issues EA authors will encounter when trying to do the same:
object
dtype, so results need to becast to the extension type. Example:
np.repeat
pd/np.concatenate
pd/np.concatenate
__array__()
even if available, so there's no transparent casting to object.5.
ExtensionArray
ABC doesn't offer a public API for requesting the backing numpy array of an EA, explicitly asking for numeric version (if applicable), or object dtype. hgrecco/pint-pandas implements a_ndarray_values
method forPintArray
, because you always need one (just as Series has.values
and now.array
), but it's not part of the EA base class and, for mixed-dtype cases, you need such a method when you dealing with a foreign EA.Decimal
EA example defines a whole slew of synonyms forself._data
,data
,_values
,_items
for "pandas compatibility", these attributes have ill-defined or undefined semantics, leading to the same confusing situation () ofSeries
having.values
_data
and ,.get_values
,.data
, and now.array
. Seems like repeating old mistakes again.self.foo=self._foo = values
, because then assigning to one does not update the other, and its not clear where the "actual" data is. Instead should use properties which lookup from the single, true, attribute which holds the data. I had a bug related to this because I set_data
on a copy instead of calling a constructor. safety first.transform
like method provided for the user to do that without reaching in and touching private attributes of the EA. Example, aPintArray
with units, which you want toround()
and convert to internal dtype. The ExtensionType doesn't change, but the underlying numeric subtype does. For examplepint[grams] (with float64 magnitudes) -> pint[grams] (with int64 magnitudes)
.Update:
__array_ufunc__
for opsnp.floor
, 'array_functionfor ops like
round, and
_reducefor ops like sum/min/max, which pandas doesn't call numpy fo (instead calling
pd.core.nanops`), and has incompatible signatures with.