Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ArrayManager] Array version of putmask logic #44396

Closed

Conversation

jorisvandenbossche
Copy link
Member

xref #39146

Trying to remove the usage of apply_with_block for putmask.
(I also need to this for the Copy-on-Write branch to be able to perform the copy on write in putmask)

@jorisvandenbossche jorisvandenbossche added ArrayManager Internals Related to non-user accessible pandas implementation labels Nov 11, 2021
@jbrockmendel
Copy link
Member

can the fallback logic stay in internals? maybe internals.methods? there are a few funcs in internals.blocks that might go there too.

mask=mask,
new=new,
)
kwargs = {"mask": mask, "new": new}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not feasible to go through 'apply'?

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Nov 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. So I forgot that ArrayManager.apply indeed already has this "align" logic as well. However, that is not used at the moment (only putmask and where define align keys, and those both still go through apply_with_block at the moment; this is also confirmed by codecov that it is unused).

But then I would rather remove it from apply, instead of going through apply, which will also prevent duplication. The reason for doing it in putmask: 1) apply is already quite complex (eg also dealing with ignoring failures), so being able to remove the "alignment" handling there would be nice (can still share this later with where), and 2) putmask can update the existing arrays, while apply always constructs a new manager, and 3) for the CoW branch I need to add copy-on-write logic to putmask, which is not needed for apply in general.

@jorisvandenbossche
Copy link
Member Author

can the fallback logic stay in internals? maybe internals.methods?

Yes, either way is fine for me. Happy to move it into internals.

from pandas.core.arrays._mixins import NDArrayBackedExtensionArray


def putmask_flexible(array: np.ndarray | ExtensionArray, mask, new):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is all of this logic from? this doesn't look like 'simple' wrappers to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's logic that for the BlockManager lives on the blocks (the Block.putmask version also calls those putmask_without_repeat and putmask_smart helpers from core.array_algos.putmask

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great can u then remove from where it's used now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BlockManager is still used ..
I know this gives some duplication, but the parts that can be shared are already in pandas.core.array_algos.putmask. Both the Block version as this version are using those functions.

I can try to see if there are some more parts that could be moved into array_algos.putmask

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so I could easily remove most of the logic of ExtensionBlock.putmask and reuse the new helper function. For base Block.putmask that will be more difficult, as that directly involves splitting blocks if needed etc (so Block-specific logic)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me put this another way. you are adding a ton of code to methods, is this generally useful? can we reuse elsewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are adding a ton of code to methods, is this generally useful? can we reuse elsewhere?

No, it is only for use within array_manager.py, but I put it in a separate internals/methods.py file to not make array_manager.py any longer (and on request of Brock I also didn't put it in /core/array_algos/putmask.py because this is the custom putmask logic for internals (with fallback logic etc that we need for Series/DataFrame), and not a general putmask algo)

@@ -190,7 +191,7 @@ def __repr__(self) -> str:
def apply(
self: T,
f,
align_keys: list[str] | None = None,
align_keys: list[str] | None = None, # not used for ArrayManager
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not worth having a separate one for AM?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean exactly? Separate signature for ArrayManager?
I initially remove this from the signature (since it is not used), but we have this method as an abstract method on the base class manager, and thus it gives typing errors if I remove it here.

from pandas.core.arrays._mixins import NDArrayBackedExtensionArray


def putmask_flexible(array: np.ndarray | ExtensionArray, mask, new):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me put this another way. you are adding a ton of code to methods, is this generally useful? can we reuse elsewhere?

@jbrockmendel
Copy link
Member

(I also need to this for the Copy-on-Write branch to be able to perform the copy on write in putmask)

Can you elaborate on this a bit? If we're going to implement CoW for BM anyway, some of the duplication might be avoidable.

@jorisvandenbossche
Copy link
Member Author

(I also need to this for the Copy-on-Write branch to be able to perform the copy on write in putmask)
Can you elaborate on this a bit?

See the 3 lines at https://github.com/pandas-dev/pandas/pull/41878/files#diff-b8c89656832340e9986eeee40df66eb4282555045a257e7a489d98b21b0adfbeR390-R393 that I currently added in that PR. Since putmask is used in indexing, I need to check for a copy-on-write (and this is not needed for apply in general).

If we're going to implement CoW for BM anyway, some of the duplication might be avoidable.

I think that in general the CoW detection on BlockManager will be a separate implementation (but of course difficult to say before actually trying). For example the simplest way is to do this on a block-by-block basis (which would already give different code as the 3 lines I referenced).

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 27, 2021
@jbrockmendel
Copy link
Member

I increasingly think that this type of change should be left until right before BlockManager is ripped out. Until then, this just duplicates code and makes it easy for behavior to accidentally drift apart.

@jreback
Copy link
Contributor

jreback commented Jan 16, 2022

can you merge main

@simonjayhawkins
Copy link
Member

closing as stale

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Related to non-user accessible pandas implementation Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants