Practical steps towards a simplified BlockManager #34669

jorisvandenbossche · 2020-06-09T15:36:41Z

In the mailing list discussion on a simplified (non-consolidating) BlockManager, @jbrockmendel brought up the relevant question of how we could get there (incrementally?): https://mail.python.org/pipermail/pandas-dev/2020-May/001223.html

Since that is a more practical discussion, moving that to an issue here.

Longer term, there are multiple options of how such a blockmanager could be enabled by the user (listing 2 here, but there are probably other options as well):

Go "all in" on extension arrays. We briefly discussed before the idea of using (nullable) extension dtypes for all dtypes by default in pandas 2.0. If we strive towards that, and assuming we keep the current 1D-restriction on ExtensionBlock, then we would "automatically" get a BlockManager with 1D blocks. And we could then focus on optimizing code paths specifically for the case of having all 1D ExtensionBlocks.
A "consolidation policy" option similarly as in the branch discussed in Expose the blocks API and disable automatic consolidation #10556. Right now, that branch still uses 2D blocks (but separate 2D blocks of shape (1, n) per column) and not actually 1D blocks. So that might not be fully ideal. We could add 1D versions of our numeric blocks as well, but that would probably add a lot of complexity, although
temporary, to the Blocks, so maybe not an ideal path forward.

For the "all extension arrays" option, we could also use light-weight EAs to store numpy arrays (like the "PandasArrays" we have now, only then actually using it), as long as we don't yet have actual extension arrays for all dtypes.
For both, we could have a constructor keyword to enabled it, and/or a global config option to enable it.

Now, on the shorter term, there are probably some work items we could tackle to reduce the API surface / usage of blocks outside of the internals, to make things like the above more realistic to implement (and make it easier to experiment with alternative BlockManager implementations):

Continue removing cases where pandas code outside of the internals work directly on blocks (@jbrockmendel you already did a lot of work on this recently, do you have an idea / overview of which are still remaining and the potential stumble blocks to further remove those cases?)
Remove any "index label" related code in the internals, as the BlockManager in principle only needs to concern about integer locations. Although, @jbrockmendel, with your work in CLN: remove BlockManager.get #33052, REF: remove BlockManager.set #33347,
REF: BlockManager.delete -> idelete #33332 (removal of BlockManager.get/set/delete) this might actually already be done?
Limit the API surface of the BlockManager. It might be good to get an idea of which methods on the BlockManager are used outside of the internals, clearly list those (and make others private), and see whether we can reduce this list.
(Maybe actually reducing might not be possible, but the exercise can still be useful to get an explicit overview of what is needed to implement a BlockManager).

It might be that there are already more concrete open issues about some of those aspects.

cc @pandas-dev/pandas-core

jbrockmendel · 2020-06-09T15:56:55Z

Continue removing cases where pandas code outside of the internals work directly on blocks [...] do you have an idea / overview of which are still remaining and the potential stumble blocks to further remove those cases

This is now easy to grep for because it is ._mgr instead of ._data. The places where I'd really like to avoid this are, in roughly the order in which they are code-smelly:

io.formats.csv L337
io.pytables L3877
core.groupby.generic._cython_agg_blocks
_libs.reduction (a whole other can of worms)
core.resample._take_new_index (just calls obj._mgr.reindex_indexer, so not that big a deal, but id really prefer if we could use a DataFrame method here)

Remove any "index label" related code in the internals, as the BlockManager in principle only needs to concern about integer locations. [...] this might actually already be done?

I lost momentum on this, would be happy to see it picked back up. IIRC BlockManager.insert was the next one I had planned to address but there wasn't a nice way to change it without a dramatic overhaul.

Limit the API surface of the BlockManager. It might be good to get an idea of which methods on the BlockManager are used outside of the internals, clearly list those (and make others private), and see whether we can reduce this list.

+1

mroeschke · 2020-06-09T16:39:56Z

do you have an idea / overview of which are still remaining and the potential stumble blocks to further remove those cases

rolling operations also operate block-wise, but removal would be fairly straightforward

jbrockmendel · 2020-06-10T20:03:28Z

rolling operations also operate block-wise

where is that implemented? Does it go through BlockManager.apply?

mroeschke · 2020-06-10T20:13:49Z

In rolling: https://github.com/pandas-dev/pandas/blob/master/pandas/core/window/rolling.py#L232

Which ultimately hits: https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L5479

Which would be neat to get rid of since I think this is the only usage of this method in core/

jbrockmendel · 2020-06-10T21:42:44Z

thanks, ill take a look

mroeschke · 2020-06-10T22:21:39Z

I can volunteer to change rolling to column-wise

jbrockmendel · 2020-06-10T22:33:10Z

I can volunteer to change rolling to column-wise

I think the thing to do is go through BlockManager.apply, so as to be agnostic to the ongoing 1D/2D issue

jorisvandenbossche added the Needs Discussion Requires discussion from core team before further action label Jun 9, 2020

jorisvandenbossche mentioned this issue Aug 31, 2020

POC: ArrayManager -- array-based data manager for columnar store #36010

Merged

jorisvandenbossche mentioned this issue Jan 13, 2021

Refactor - ArrayManager overview issue #39146

Closed

11 tasks

mroeschke added the Internals Related to non-user accessible pandas implementation label Aug 7, 2021

jbrockmendel added the ArrayManager label May 3, 2023

rhshadrach removed the ArrayManager label Oct 7, 2023

jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Dec 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Practical steps towards a simplified BlockManager #34669

Practical steps towards a simplified BlockManager #34669

jorisvandenbossche commented Jun 9, 2020

jbrockmendel commented Jun 9, 2020

mroeschke commented Jun 9, 2020

jbrockmendel commented Jun 10, 2020

mroeschke commented Jun 10, 2020

jbrockmendel commented Jun 10, 2020

mroeschke commented Jun 10, 2020 •

edited

Loading

jbrockmendel commented Jun 10, 2020

Practical steps towards a simplified BlockManager #34669

Practical steps towards a simplified BlockManager #34669

Comments

jorisvandenbossche commented Jun 9, 2020

jbrockmendel commented Jun 9, 2020

mroeschke commented Jun 9, 2020

jbrockmendel commented Jun 10, 2020

mroeschke commented Jun 10, 2020

jbrockmendel commented Jun 10, 2020

mroeschke commented Jun 10, 2020 • edited Loading

jbrockmendel commented Jun 10, 2020

mroeschke commented Jun 10, 2020 •

edited

Loading