Expose the blocks API and disable automatic consolidation #10556

Closed
shoyer opened this issue Jul 12, 2015 · 24 comments
Labels
API Design Closing Candidate May be closeable, needs more eyeballs Enhancement Internals Related to non-user accessible pandas implementation Needs Discussion Requires discussion from core team before further action

Comments

@shoyer
Member

shoyer commented Jul 12, 2015

In my discussions with Jonathan and others at the SciPy sprints, we agreed that it would be really nice to expose some minimal tools for manipulating and viewing the internal pandas blocks system. For example, it should be possible to:

  1. manually consolidate blocks
  2. view a representation of the internal blocking of a dataframe (via matplotlib?)

It's not so much that we want to create and use blocks directly, but that we want to make it easier to understand the internal data model and make performance more predictable.

At the same time, we would like to disable automatic consolidation of blocks in the DataFrame constructor and when inserting new columns. Consolidation is certainly a useful feature, but it is currently not always possible to even predict when it will happen.

Most users never notice or care about consolidation. Power users (concerned about memory or performance) are at least as likely to find it frustrating as helpful, so we should make this something that they can trigger explicitly (as part of the blocks API). This would make it possible to create dataframes while guaranteeing that none of the data is copied (#9216).

cc @jonathanrocher @sinhrks @jreback @cpcloud @TomAugspurger @ARF1 @quicknir
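
As a rough illustration of what such a view might expose: the block layout can already be inspected through the private block manager (._data in 2015-era pandas, ._mgr in later versions). Both attributes are internal and subject to change, so this is only a sketch of the idea, not a proposed API:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(3), "b": np.arange(3), "c": np.arange(3.0)})

# The private manager groups same-dtype columns into 2D blocks: here the two
# int64 columns typically share one block and the float64 column gets its own.
mgr = df._mgr if hasattr(df, "_mgr") else df._data
for block in mgr.blocks:
    print(block.dtype, block.shape)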

@jreback jreback added API Design Internals Related to non-user accessible pandas implementation Needs Discussion Requires discussion from core team before further action labels Jul 13, 2015
@jonathanrocher

Yes, I am definitely a big +1 on both of these ideas. I need to understand the current state better before experimenting with such a tool. Just to clarify what @shoyer means by the blocks API: we discussed exposing these view and consolidation methods via a blocks attribute, e.g. df.blocks.view(), df.blocks.consolidate(), or something like that.

@jreback
Contributor

jreback commented Jul 14, 2015

well

df.as_blocks(copy=False) (copy flag added in master recently) gives you a view right now

and as discussed, I would like to see why this should actually be exposed to the user. What is the purpose? A good use case would go a long way. Showing an implementation is only guaranteed to cause issues down the road when one wants to change it (and cannot, because it's exposed)
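
For reference, a minimal sketch of that call as it existed at the time (as_blocks returned a dict keyed by dtype string; it was deprecated and removed in later pandas versions, so this only runs on pandas of that era):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(3), "b": np.arange(3.0)})

# returns something like {'int64': <DataFrame>, 'float64': <DataFrame>},
# one frame per dtype-block; copy=False hands back views where possible
blocks = df.as_blocks(copy=False)
print({dtype: frame.shape for dtype, frame in blocks.items()})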

@shoyer
Member Author

shoyer commented Jul 14, 2015

I don't think we should expose the full blocks API. Rather, we should expose a limited set of functionality suitable for basic understanding and manipulation of blocks (mostly .consolidate_blocks()), so we can have the option of disabling automatic consolidation. If we ever change the underlying implementation, we can just make this method a no-op.

For viewing block information, we might expose a tuple, similar to the .chunks attribute on dask arrays.
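
For comparison, the dask attribute being referenced is a plain tuple describing the partitioning, not a handle on the partitions themselves (assuming dask.array is installed):

import dask.array as da

x = da.ones((100000, 1000), chunks=(50000, 250))
# a tuple of per-axis chunk sizes -- read-only metadata about the layout
print(x.chunks)   # ((50000, 50000), (250, 250, 250, 250))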

@sinhrks
Member

sinhrks commented Jul 15, 2015

I assume exposing some API is meant to let power users experiment, rather than for daily usage? I think adding more detail to internals.rst would be very nice (#4082), but I'm not sure what level of "public" is actually required (guaranteeing backward compat, adding it to API.rst, etc.).

+1 for optionally disabling auto-consolidation; it should allow pandas to handle larger data.

@jreback
Contributor

jreback commented Jul 15, 2015

I still have not seen any reason to actually expose anything about the internals as an API.

It seems that one might want to control how the internal engine works, e.g. simply have an option on construction that allows a potentially different internal representation.

No consolidation is a nice idea on the surface, and there may be a reason to allow this and/or switch it as a default. HOWEVER, I would like to see someone do a performance study to see if it actually makes a difference in a non-trivial case.

Turning off auto-consolidation is actually trivial, so if someone would like to step up, all ears.

@quicknir

@shoyer Thanks for tagging me on this. I am a strong +1 on this. As an example of a use case, ARF1 showed in #9216 how one could use the block manager interface to create an empty DataFrame nearly 100 times faster than the naive approach. It would be nice if one could do this without accessing _implementation methods that are subject to change without warning.

When developing a C/C++ library that interfaces with pandas, memory layout is critical. A larger public interface that allows better guarantees and more explicit control of memory layout would be very helpful. I can give other examples in pandas where similar issues arise.
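
To illustrate the flavour of that use case (this is not the block-manager trick from #9216 itself, just the general point that avoiding per-column allocation and casting matters; sizes here are hypothetical and timings are machine-dependent):

import numpy as np
import pandas as pd

n, cols = 1_000_000, list("ABCDE")

# naive: build an all-NaN frame and let pandas allocate/cast per column
# %timeit pd.DataFrame(index=range(n), columns=cols, dtype="float64")

# preallocated: hand pandas a single uninitialised 2D block up front
# %timeit pd.DataFrame(np.empty((n, len(cols))), columns=cols)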

@jreback jreback added this to the Next Major Release milestone Aug 23, 2015
@jreback
Contributor

jreback commented Sep 27, 2015

jreback@44c280b

is a branch that adds a policy parameter to control how consolidation works.

It's not completely automatic at this point (e.g. you have to pass policy='split' or policy='column' to get this behavior), though this could become an option (support.consolidation). Further, for a dict we can simply default to column, I think.

note that:

  • column -> don't consolidate; IOW, this is a 'column' repr internally. This won't 'split' a passed-in ndarray though (e.g. it's still a single block to begin with). This could eventually be the default.
  • split -> force-split a passed ndarray. This can be costly, but with the upside that you get proper memory layout. This has to be a user decision as it's not always necessary / wanted.
  • needs some more tests and such
  • needs some perf testing
  • need to have various ops (e.g. concat), deal with the policy properly (may be non-trivial)

@jreback
Contributor

jreback commented Sep 27, 2015

Here are some perf considerations

In [3]: np.random.seed(1234)

# blocks
In [4]: df_blocks = DataFrame(np.random.randn(100000,1000))

# create a single ndarray, then split it
In [5]: df_split = DataFrame(np.random.randn(100000,1000),policy='split')

# create a dict of ndarrays, no need to split
In [10]: df_dsplit = DataFrame(dict([ (i,np.random.randn(100000)) for i in range(1000) ]),policy='split')

Creations

# block
In [7]: %timeit DataFrame(np.random.randn(100000,1000))
1 loops, best of 3: 3.79 s per loop

# ndarray-split
In [6]: %timeit DataFrame(np.random.randn(100000,1000),policy='split')
1 loops, best of 3: 5.93 s per loop

# dict-split
In [14]: %timeit DataFrame(dict([ (i,np.random.randn(100000)) for i in range(1000) ]),policy='split')
1 loops, best of 3: 3.93 s per loop

row-wise ops

# blocked
In [8]: %timeit df_blocks.sum()
1 loops, best of 3: 1.55 s per loop

# ndarray-split
In [9]: %timeit df_split.sum()
1 loops, best of 3: 607 ms per loop

# dict-split
In [15]: %timeit df_dsplit.sum()
1 loops, best of 3: 617 ms per loop

So you may be surprised by the speed of the splits. This is because the split columns are already C-contiguous, while the block memory is NOT: it's in C order, but not laid out well for a row-wise op.
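
The cache-layout effect being described can be reproduced with plain NumPy (a rough sketch only; the real block manager additionally stores its 2D blocks transposed, and timings are machine-dependent):

import numpy as np

arr_c = np.random.randn(100000, 1000)   # C-contiguous: each row is contiguous
arr_f = np.asfortranarray(arr_c)        # F-contiguous: each column is contiguous

# reducing along the axis that is contiguous in memory is the cache-friendly case
# %timeit arr_c.sum(axis=1)   # sums contiguous rows
# %timeit arr_f.sum(axis=1)   # same result, but strided access per row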

@shoyer
Member Author

shoyer commented Sep 28, 2015

For the creation benchmarks, it would be helpful to put generating the random values outside the loop. This does look very encouraging, though.

jreback added a commit to jreback/pandas that referenced this issue Oct 11, 2015
- closes pandas-dev#10556, add policy argument to constructors
- closes pandas-dev#9216, all passing of dict with view directly to the API
- closes pandas-dev#5902
- closes pandas-dev#8571 by defining __copy__/__deepcopy__
@jreback jreback removed the Prio-high label Mar 29, 2017
@jreback jreback modified the milestones: 2.0, Next Major Release Mar 29, 2017
@jbrockmendel
Member

Would this make the "dict of Series" description of a DataFrame more reliable? The sometimes-vexing behavior I'm thinking of is

>>> arr = np.random.randn(1000, 5)
>>> df = pd.DataFrame(arr, columns=list('ABCDE'))
>>> A = df['A']
>>> G = 2*A
>>> G.name = 'G'
>>> df['G'] = G
>>> df['G'] is G
False
>>> df['A'] is A
False

Having id(df['A']) be consistent would make keeping track of _metadata simpler in some cases.
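
A related way to check what is actually shared, assuming np.shares_memory is an acceptable proxy for "same underlying data" (the result of the second check depends on the pandas version and its copy semantics):

import numpy as np
import pandas as pd

arr = np.random.randn(1000, 5)
df = pd.DataFrame(arr, columns=list("ABCDE"))

A = df["A"]
print(df["A"] is A)                               # False: __getitem__ builds a new Series each time
print(np.shares_memory(df["A"].to_numpy(), arr))  # often True: the data may still live in the block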

@jreback
Contributor

jreback commented May 1, 2017

this issue is not going to happen for pandas 1.x. In pandas 2.x this will be moot, as there will be no more blocks.

@jreback jreback closed this as completed May 1, 2017
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue Jan 14, 2018
- closes pandas-dev#10556, add policy argument to constructors
- closes pandas-dev#9216, all passing of dict with view directly to the API
- closes pandas-dev#5902
- closes pandas-dev#8571 by defining __copy__/__deepcopy__
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue Jan 16, 2018
- closes pandas-dev#10556, add policy argument to constructors
- closes pandas-dev#9216, all passing of dict with view directly to the API
- closes pandas-dev#5902
@jorisvandenbossche
Member

jorisvandenbossche commented Jan 16, 2018

I rebased @jreback's commit to play a bit with it and do some more timings. Using the same example as above, but now excluding the array creation and covering more operations:

In [2]: np.random.seed(1234)
   ...: array = np.random.randn(100000,1000)
   ...: array_dict = dict([ (i,np.random.randn(100000)) for i in range(1000) ])

In [3]: # blocks
   ...: df_blocks = DataFrame(array)
   ...: 
   ...: # create a single ndarray, then split it
   ...: df_split = DataFrame(array, policy='split')
   ...: 
   ...: # create a dict of ndarrays, no need to split
   ...: df_dsplit = DataFrame(array_dict, policy='split')

Creation: splitting the array is much slower, but when you already have a dict of arrays it is logically faster to not consolidate them:

In [4]: %timeit DataFrame(array)
102 µs ± 932 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit DataFrame(array, policy='split')
3.01 s ± 93.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit DataFrame(array_dict, policy='split')
29.5 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit DataFrame(array_dict, policy='block')
627 ms ± 4.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Operations column-wise:

In [8]: # blocked
   ...: %timeit df_blocks.sum()
994 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: # splitted
    ...: %timeit df_split.sum()
1.99 s ± 84.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Operation row-wise:

In [11]: # blocked
    ...: %timeit df_blocks.sum(axis=1)
889 ms ± 29.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: # splitted
    ...: %timeit df_split.sum(axis=1)
1.85 s ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Slicing (much slower!):

In [21]: %timeit df_blocks.iloc[0:1000]
106 µs ± 600 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [22]: %timeit df_split.iloc[0:1000]
16.9 ms ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So in general, as expected, it gives a slowdown on the above operations (of course this is the specific case of a lot of all-float columns; only iterating over the rows with e.g. itertuples was faster with the split version). But the slicing operation in particular is much, much slower.

So in any case, if we would like to further explore this, I think it will certainly mean having to refactor the BlockManager to make operations that have to loop over the columns faster.
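
The slicing slowdown has a simple model behind it: a row slice of one consolidated 2D block is a single view, while a row slice of per-column storage has to touch every column. A plain-NumPy sketch of that difference (not the pandas code path itself):

import numpy as np

arr = np.random.randn(100000, 1000)                        # one consolidated block
cols = {i: arr[:, i].copy() for i in range(arr.shape[1])}  # one array per column

# %timeit arr[0:1000]                                # a single strided view
# %timeit {k: v[0:1000] for k, v in cols.items()}    # 1000 separate slices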

@wesm
Member

wesm commented Dec 18, 2019

I've run up against this issue myself (like fighting with my own ghost) in the context of ARROW-3789 as there are use cases where users are converting very large (tens of gigabytes) Arrow tables and wish to avoid excess allocations / memory copying. In certain narrow use cases (i.e. no strings, no nulls), it is actually possible to do fully zero-copy construction of DataFrame from Arrow memory.

This brings up the spectre of unpredictable consolidation. I don't know if any mitigations have been implemented in the last 4 years, but I'm interested to hear if there are ideas.

@ARF1

ARF1 commented Dec 20, 2019

@wesm I had a similar problem with performant bcolz integration a few years back. Would this approach help in the meantime? While this is not zero-copy, in my case the memory copy had a negligible performance impact compared to the pandas blocks construction.

This worked for my fairly limited use case then. No thorough testing was done. Also, things may have changed since then. I have not used this in years.

@wesm
Member

wesm commented Dec 24, 2019

I've seen use cases where the memory doubling will cause OOM, so zero-copy / no consolidation is essential

wesm added a commit to apache/arrow that referenced this issue Jan 15, 2020
…eries/DataFrame. Zero copy optimizations for DataFrame, add split_blocks and self_destruct options

The primary goal of this patch is to provide a way for some users to avoid memory doubling with converting from Arrow to pandas.

This took me entirely too much time to get right, but partly I was attempting to disentangle some of the technical debt and overdue refactoring in arrow_to_pandas.cc.

Summary of what's here:

- Refactor ChunkedArray->Series and Table->DataFrame conversion paths to use the exact same code rather than two implementations of the same thing with slightly different behavior. The `ArrowDeserializer` helper class is now gone
- Do zero-copy construction of internal DataFrame blocks for the case of a contiguous non-nullable array and a block with only 1 column represented
- Add `split_blocks` option to `to_pandas` which constructs one block per DataFrame column, resulting in more zero-copy opportunities. Note that pandas's internal "consolidation" can still cause memory doubling (see discussion about this in pandas-dev/pandas#10556)
- Add `self_destruct` option to `to_pandas` which releases the Table's internal buffers as soon as they are converted to the required pandas structure. This allows memory to be reclaimed by the OS as conversion is taking place rather than having a forced memory-doubling and then post-facto reclamation (which has been causing OOM for some users)

The most conservative invocation of `to_pandas` now would be `table.to_pandas(use_threads=False, split_blocks=True, self_destruct=True)`

Note that the self-destruct option makes the `Table` object unsafe for further use. This is a bit dissatisfying but I wasn't sure how else to provide this capability.

Closes #6067 from wesm/ARROW-3789 and squashes the following commits:

3b42602 <Wes McKinney> Code review comments
8f39cce <Wes McKinney> Add some documentation. Try fixing MSVC warnings
c22d280 <Wes McKinney> Fix one MSVC cast warning
4306803 <Wes McKinney> Add "split blocks" and "self destruct" options to Table.to_pandas, with zero-copy operations for improved memory use when converting from Arrow to pandas

Authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
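
For reference, the resulting invocation on the pyarrow side looks roughly like this (split_blocks and self_destruct are the options added by that patch; exact behaviour may vary by pyarrow version):

import pyarrow as pa

table = pa.table({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# one pandas block per column plus eager release of Arrow buffers;
# after self_destruct the Table must not be used again
df = table.to_pandas(use_threads=False, split_blocks=True, self_destruct=True)
print(df)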
@jorisvandenbossche
Member

I don't know if there have been any mitigations implemented in the last 4 years about this but I'm interested to see if there are ideas

Personally, I don't think getting such a feature into pandas would be that hard. The commit linked above is not that complicated. And now that we have the new metadata machinery (the "attrs" attribute), we could maybe use that to attach the "consolidation policy" as metadata to the DataFrame.

If there is some agreement in that we still would want to merge something like this, I could try to revive that branch.


Generally speaking, I think the long-term path forward is the 1D ExtensionArray-based columns, which have the direct consequence of not getting consolidated (because they are 1D), moving towards a "collection of 1D arrays" model for the DataFrame. We already have a few such dtypes now, and I think the plan is to keep adding more.
But of course, even if we get there eventually, this is long-term work (certainly until those become the default). So personally, I think it might still be worth considering the "no consolidation policy" for the shorter term.
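
A small illustration of that point, peeking at the private manager purely for inspection (requires a pandas version with the nullable extension dtypes; exact block counts are an implementation detail):

import pandas as pd

# two numpy-dtype float columns are typically consolidated into one 2D block ...
df_np = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
print(len(df_np._mgr.blocks))   # typically 1

# ... while extension dtypes stay as 1D arrays, one block per column
df_ea = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, dtype="Float64")
print(len(df_ea._mgr.blocks))   # typically 2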

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue Mar 17, 2020
- closes pandas-dev#10556, add policy argument to constructors
- closes pandas-dev#9216, all passing of dict with view directly to the API
- closes pandas-dev#5902
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue Mar 17, 2020
- closes pandas-dev#10556, add policy argument to constructors
- closes pandas-dev#9216, all passing of dict with view directly to the API
- closes pandas-dev#5902
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue May 25, 2020
- closes pandas-dev#10556, add policy argument to constructors
- closes pandas-dev#9216, all passing of dict with view directly to the API
- closes pandas-dev#5902
@jorisvandenbossche
Member

I have updated the branch adding a consolidation policy once again (starting from @TomAugspurger's recent rebase, see the linked commit by github above).

And with that, I experimented a bit comparing a few operations with a consolidated DataFrame vs non-consolidated DataFrame: see notebook at https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c

Of course, for several operations, it is faster to operate on a 2D array compared to operating on many 1D arrays. However, with a decent-sized dataframe (I used 100_000 rows x 100 columns here), the difference is generally not that big (for this size of dataframe, the actual operation dominates the overhead of doing things column-wise).
And there were actually also a few operations that are faster column-wise (although block-wise could quite likely be made just as fast, with some time investment).

My general sense (of course, based on very limited tests) is that the slowdown with non-consolidated dataframes for typical operations is not gigantic, and with some effort optimizing the BlockManager for this use case (right now there is often a lot of overhead in finding the correct block type, reconstructing blocks, etc, which is something we should be able to optimize for non-consolidated dataframes) we should be able to remove a large part of this slowdown.
And of course, there are also operations that will actually benefit from non-consolidated dataframes (especially where copies can be avoided this way).

@jreback
Contributor

jreback commented May 25, 2020

@jorisvandenbossche you are missing a big issue here

just operating column wise is not a good test

you need to actually copy the original frame into single columns where the memory gets realigned; taking a view does nothing here, which is effectively what you are doing

you will see massive speed-ups if you get proper cache access for some types of operations

please see my original PR on non-consolidation (a while back)

@jorisvandenbossche
Member

@jreback I am not really sure what you mean. Can you clarify a bit?

just operating column wise is not a good test

Why not? That's what we would do if we didn't have consolidation?
Of course, it's still using 2D blocks to store a 1D array, which is not ideal (so we should be able to optimize this if we actually have 1D blocks), but already doing this should give an idea about the overhead of doing things column-wise.

you need to actually copy the original frame into single columns where the memory gets realigned; taking a view does nothing here which is effectively what you are doing

Can you explain what you mean? (maybe a code snippet what I should do)
You mean that df2 = pd.DataFrame(arr, policy="split") is not actually properly splitting the array?
(as far as I can see, it is actually taking a copy, as the df2 is no longer a view on the original array)

please see me original PR in non consolidation (a while back)

Do you have a link? (I was not aware of a PR about this)

@jreback
Contributor

jreback commented May 25, 2020

see above : #10556 (comment)

you need to make really sure that ops are actually working with cache-aligned data and not just with a view from a 2D array, which is not aligned at all; otherwise your timings are not real

@jorisvandenbossche
Member

see above : #10556 (comment)

I am still not fully clear what you are pointing to. The linked comment explains the added features in your branch (the different policy options, mentioning that policy="split" results in "proper memory layout"), but it is this branch (well, your old branch but rebased on current master) that I have been using for running the notebook.

you need to make really sure that ops are actually working with cache aligned data and not with just a view from a 2D array which is not aligned at all, otherwise your timings are not real

Yes, but as far as I can see, this is the case (the Block's values in the "split" case are properly laid out, not a strided view of the original).
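
That claim can be checked directly on the experimental branch (the policy keyword only exists there, not in released pandas; the inspection again goes through the private manager):

import numpy as np
import pandas as pd

arr = np.random.randn(100000, 1000)
df_split = pd.DataFrame(arr, policy="split")   # 'policy' exists only on that branch

for block in list(df_split._data.blocks)[:3]:
    vals = block.values
    # each split block should be a real C-contiguous copy, not a strided view of `arr`
    print(vals.shape, vals.flags["C_CONTIGUOUS"], np.shares_memory(vals, arr))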

@jbrockmendel
Member

jbrockmendel commented May 25, 2020 via email

@jbrockmendel
Member

We now fully avoid silent consolidation. Is this still active?

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Jan 23, 2023
@mroeschke
Member

This discussion doesn't seem relevant anymore with the removal of silent consolidation, so closing.
