Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: improve efficiency of BaseMaskedArray.__setitem__ #44186

Closed
wants to merge 1 commit into from

Conversation

alexreg
Copy link
Contributor

@alexreg alexreg commented Oct 26, 2021

This somewhat deals with #44172, though that won't be fully resolved until 2D ExtensionArrays are supported (per the comments there).

CC @jbrockmendel

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also need an asv for this

@@ -364,6 +365,17 @@ def map_string(s):

_HANDLED_TYPES = (np.ndarray, numbers.Number, bool, np.bool_)

def __setitem__(self, key, value):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type the args & return

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@@ -364,6 +365,17 @@ def map_string(s):

_HANDLED_TYPES = (np.ndarray, numbers.Number, bool, np.bool_)

def __setitem__(self, key, value):
if lib.is_bool(value):
key = check_array_indexer(self, key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can move this to the base class instead of duplicating

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How so? I need to do is_bool and is_int and is_float depending on the subclass. So unless we want to introduce a new method...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would basically constitute _validate_setitem_value. @alexreg you said you tried this and it added complexity and hurt performance. can you describe what you tried?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I tried copying bits from existing _validate_setitem_value methods, but that may have been the wrong way to go about it (or at the least overkill). Let me try again and see...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I pushed again. How about that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much better. still needs tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, no doubt... in honesty though, I think I'd just be flailing about helplessly trying to write tests for this. I wouldn't know what input data to use and how to set that up properly, for one. Not to mention what precisely to test. If anyone else wants to come in and add tests for this, that would be a big help.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at the test_setitem_foo tests in tests.arrays.test_datetimelike. Something similar in either tests.arrays.masked_shared or tests.arrays.(integer|boolean|floating).test_indexing

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Oct 28, 2021
@alexreg alexreg force-pushed the fix-boolean-apply-perf branch 2 times, most recently from d468b36 to fd645db Compare October 28, 2021 23:24
@jreback
Copy link
Contributor

jreback commented Nov 28, 2021

@alexreg status here? pls merge master as well

This somewhat deals with pandas-dev#44172, though that won't be fully resolved until 2D `ExtensionArray`s are supported (per the comments there).
@alexreg
Copy link
Contributor Author

alexreg commented Nov 28, 2021

@jreback Sorry, haven't had much time lately. It would probably take me a while to properly wrap my head around the testing framework and write good tests. Ideally someone else could step in with that, but if not, maybe I'll get to it in a month or two.

I rebased on current master, however, and pushed that. (Nice and easy.)

@jbrockmendel
Copy link
Member

Ideally someone else could step in with that

None of us are going to write tests for you, but we will help you through getting the hang of writing them.

def test_descriptive_name():
    obj = pd.whatever()
    with pytest.raises(something, match=expected_exception_message):
         obj[0] = some_invalid_value_that_doesnt_raise_now_but_does_with_your_bugfix

@jreback jreback added this to the 1.5 milestone Jan 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants