Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Practical steps towards a simplified BlockManager #34669

Open
jorisvandenbossche opened this issue Jun 9, 2020 · 7 comments
Open

Practical steps towards a simplified BlockManager #34669

jorisvandenbossche opened this issue Jun 9, 2020 · 7 comments
Labels
Closing Candidate May be closeable, needs more eyeballs Internals Related to non-user accessible pandas implementation Needs Discussion Requires discussion from core team before further action

Comments

@jorisvandenbossche
Copy link
Member

In the mailing list discussion on a simplified (non-consolidating) BlockManager, @jbrockmendel brought up the relevant question of how we could get there (incrementally?): https://mail.python.org/pipermail/pandas-dev/2020-May/001223.html

Since that is a more practical discussion, moving that to an issue here.

Longer term, there are multiple options of how such a blockmanager could be enabled by the user (listing 2 here, but there are probably other options as well):

  • Go "all in" on extension arrays. We briefly discussed before the idea of using (nullable) extension dtypes for all dtypes by default in pandas 2.0. If we strive towards that, and assuming we keep the current 1D-restriction on ExtensionBlock, then we would "automatically" get a BlockManager with 1D blocks. And we could then focus on optimizing code paths specifically for the case of having all 1D ExtensionBlocks.
  • A "consolidation policy" option similarly as in the branch discussed in Expose the blocks API and disable automatic consolidation #10556. Right now, that branch still uses 2D blocks (but separate 2D blocks of shape (1, n) per column) and not actually 1D blocks. So that might not be fully ideal. We could add 1D versions of our numeric blocks as well, but that would probably add a lot of complexity, although
    temporary, to the Blocks, so maybe not an ideal path forward.

For the "all extension arrays" option, we could also use light-weight EAs to store numpy arrays (like the "PandasArrays" we have now, only then actually using it), as long as we don't yet have actual extension arrays for all dtypes.
For both, we could have a constructor keyword to enabled it, and/or a global config option to enable it.

Now, on the shorter term, there are probably some work items we could tackle to reduce the API surface / usage of blocks outside of the internals, to make things like the above more realistic to implement (and make it easier to experiment with alternative BlockManager implementations):

  • Continue removing cases where pandas code outside of the internals work directly on blocks (@jbrockmendel you already did a lot of work on this recently, do you have an idea / overview of which are still remaining and the potential stumble blocks to further remove those cases?)
  • Remove any "index label" related code in the internals, as the BlockManager in principle only needs to concern about integer locations. Although, @jbrockmendel, with your work in CLN: remove BlockManager.get #33052, REF: remove BlockManager.set #33347,
    REF: BlockManager.delete -> idelete #33332 (removal of BlockManager.get/set/delete) this might actually already be done?
  • Limit the API surface of the BlockManager. It might be good to get an idea of which methods on the BlockManager are used outside of the internals, clearly list those (and make others private), and see whether we can reduce this list.
    (Maybe actually reducing might not be possible, but the exercise can still be useful to get an explicit overview of what is needed to implement a BlockManager).

It might be that there are already more concrete open issues about some of those aspects.

cc @pandas-dev/pandas-core

@jorisvandenbossche jorisvandenbossche added the Needs Discussion Requires discussion from core team before further action label Jun 9, 2020
@jbrockmendel
Copy link
Member

Continue removing cases where pandas code outside of the internals work directly on blocks [...] do you have an idea / overview of which are still remaining and the potential stumble blocks to further remove those cases

This is now easy to grep for because it is ._mgr instead of ._data. The places where I'd really like to avoid this are, in roughly the order in which they are code-smelly:

  • io.formats.csv L337
  • io.pytables L3877
  • core.groupby.generic._cython_agg_blocks
  • _libs.reduction (a whole other can of worms)
  • core.resample._take_new_index (just calls obj._mgr.reindex_indexer, so not that big a deal, but id really prefer if we could use a DataFrame method here)

Remove any "index label" related code in the internals, as the BlockManager in principle only needs to concern about integer locations. [...] this might actually already be done?

I lost momentum on this, would be happy to see it picked back up. IIRC BlockManager.insert was the next one I had planned to address but there wasn't a nice way to change it without a dramatic overhaul.

Limit the API surface of the BlockManager. It might be good to get an idea of which methods on the BlockManager are used outside of the internals, clearly list those (and make others private), and see whether we can reduce this list.

+1

@mroeschke
Copy link
Member

do you have an idea / overview of which are still remaining and the potential stumble blocks to further remove those cases

rolling operations also operate block-wise, but removal would be fairly straightforward

@jbrockmendel
Copy link
Member

rolling operations also operate block-wise

where is that implemented? Does it go through BlockManager.apply?

@mroeschke
Copy link
Member

In rolling: https://github.com/pandas-dev/pandas/blob/master/pandas/core/window/rolling.py#L232

Which ultimately hits: https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L5479

Which would be neat to get rid of since I think this is the only usage of this method in core/

@jbrockmendel
Copy link
Member

thanks, ill take a look

@mroeschke
Copy link
Member

mroeschke commented Jun 10, 2020

I can volunteer to change rolling to column-wise

@jbrockmendel
Copy link
Member

I can volunteer to change rolling to column-wise

I think the thing to do is go through BlockManager.apply, so as to be agnostic to the ongoing 1D/2D issue

@mroeschke mroeschke added the Internals Related to non-user accessible pandas implementation label Aug 7, 2021
@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Dec 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Closing Candidate May be closeable, needs more eyeballs Internals Related to non-user accessible pandas implementation Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

4 participants