Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: ExtensionArray support for objects with _can_hold_na=False and relational operators #20801

Closed
wants to merge 3 commits into from

Conversation

Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Apr 23, 2018

closes #20659
closes #20761

  • tests added / passed
    • New tests/extensions/relobject tests arrays that cannot hold NaN, and has relational operators defined
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
    • Not added as this is part of the ExtensionArray support coming in the next release, which already has a whatsnew entry

NOTE: I expect that the pandas main developers will object to a lot of what is done here! So let's be open on the discussion.

Goals of this pull request:

  1. If there is an ExtensionArray, allow the relational operators to delegate to the base type
  2. Modify ExtensionArray tests to handle the cases of _can_hold_na==False as best as possible.
  3. An eventual goal is that if I have a RelObjectArray that consists of objects that have operators such as __le__ defined that return objects, and a and b are Series containing that type of array, then a <= b returns a Series of objects rather than booleans.

Things I'm unsure of:

  • In core/ops.py, I am using inspect to determine whether the user has implemented the relational operators for the subclassed ExtensionArray type as well as for the base type of elements in the array. I'm not sure if this is best practice or not.
  • In my example that I'm separately working on, the relational operators (including __eq__) return objects as opposed to booleans, and some (intentionally) throw exceptions. So I tried to catch these various cases. Not having __lt__ defined means that sorting is undefined, and not having __eq__ return a boolean messes up some things related to groupby. For my application, that is OK.

@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 23, 2018
@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Apr 23, 2018
@TomAugspurger
Copy link
Contributor

I am using inspect to determine whether the user has implemented the relational operators for the subclassed ExtensionArray type as well as for the base type of elements in the array. I'm not sure if this is best practice or not.

One issue here is for __eq__ and __neq__. All EAs will implement those by virtue of inheriting from object, even if the author doesn't implement it "correctly". We could have EA authors opt into this by defining a class attributes like _equatable, _orderable, etc.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NA stuff looks good. Small question about the test changes.

Taking a look at the ops stuff in a bit.

* _formatting_values
* _can_hold_na

Some methods require casting the ExtensionArray to an ndarray of Python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge snafu?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I reordered _formatting_values and _can_hold_na, since one is a method and the other is an attribute.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant this section specifically. This block is repeated starting on line 57. Or perhaps I'm missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really weird. The copy on my machine is clean and fine, but the copy on GitHub has the repeat. Maybe because I resolved conflicts with master using GitHub interface. UGH.

# type: () -> bool
"""Whether your array can hold missing values. True by default.
_can_hold_na = True
"""Whether your array can hold missing values. True by default.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what our recommended style is here. You can probably just change it to a comment.

@@ -82,8 +82,9 @@ def test_getitem_scalar(self, data):
assert isinstance(result, data.dtype.type)

def test_getitem_scalar_na(self, data_missing, na_cmp, na_value):
result = data_missing[0]
assert na_cmp(result, na_value)
if data_missing._can_hold_na:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these changes are a bit unfortunate...

Could we instead have the data_missing fixture raise pytest.skip when necessary? I suspect this would have to be done by the EA author in their code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could make that change to avoid any tests where data_missing is used, and the EA author would have to make that change if they wanted _can_hold_na=False.



@pytest.fixture
def data_missing_for_sorting():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I clearly didn't have arrays that don't support NA in mind when I wrote this :)

Could you add this note to the data_missing_for_sorting docstring (in tests/extensions/conftest.py).?

@jreback jreback removed this from the 0.23.0 milestone Apr 24, 2018
@jreback
Copy link
Contributor

jreback commented Apr 24, 2018

this won't be for 0.23, you are basically defining the ops protocol for abstraction, which is needed, but also needs substantial testing in the current EA classes.

@TomAugspurger
Copy link
Contributor

this won't be for 0.23

Yeah, I think I agree given the deadline we set (RC was supposed to be yesterday). I could see this taking a bit of time to get right.

@Dr-Irv could you split the changes for _can_hold_na out to a separate PR? That part seems ready to go.l

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Apr 24, 2018

@TomAugspurger Yes, I can split the changes out for the _can_hold_na part. And I see that the operator stuff I tried to do is part of a bigger discussion. Should I create a new test case for the _can_hold_na part?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 24, 2018 via email