Expose the blocks API and disable automatic consolidation #10556

Closed
shoyer opened this issue Jul 12, 2015 · 24 comments
Labels
API Design Closing Candidate May be closeable, needs more eyeballs Enhancement Internals Related to non-user accessible pandas implementation Needs Discussion Requires discussion from core team before further action

Comments

@shoyer
Member

shoyer commented Jul 12, 2015

In my discussions with Jonathan and others at the SciPy sprints, we agreed that it would be really nice to expose some minimal tools for manipulating and viewing the internal pandas blocks system. For example, it should be possible to:

  1. manually consolidate blocks
  2. view a representation of the internal blocking of a dataframe (via matplotlib?)

It's not so much that we want to create and use blocks directly, but that we want to make it easier to understand the internal data model and make performance more predictable.

At the same time, we would like to disable automatic consolidation of blocks in the DataFrame constructor and when inserting new columns. Consolidation is certainly a useful feature, but it is currently not always possible to even predict when it will happen.

Most users never notice or care about consolidation. Power users (concerned about memory or performance) are at least as likely to find it frustrating as helpful, so we should make this something that they can trigger explicitly (as part of the blocks API). This would make it possible to create dataframes while guaranteeing that none of the data is copied (#9216).

cc @jonathanrocher @sinhrks @jreback @cpcloud @TomAugspurger @ARF1 @quicknir
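
As a rough illustration of what such a view might expose: the block layout can already be inspected through the private block manager (._data in 2015-era pandas, ._mgr in later versions). Both attributes are internal and subject to change, so this is only a sketch of the idea, not a proposed API:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(3), "b": np.arange(3), "c": np.arange(3.0)})

# The private manager groups same-dtype columns into 2D blocks: here the two
# int64 columns typically share one block and the float64 column gets its own.
mgr = df._mgr if hasattr(df, "_mgr") else df._data
for block in mgr.blocks:
    print(block.dtype, block.shape)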

@jreback jreback added API Design Internals Related to non-user accessible pandas implementation Needs Discussion Requires discussion from core team before further action labels Jul 13, 2015
@jonathanrocher

Yes, I am definitely a big +1 on both of these ideas. I need to understand the current state better before experimenting with such a tool. Just to clarify what @shoyer means by the blocks API: we discussed exposing these view and consolidation methods via a blocks attribute, e.g. df.blocks.view(), df.blocks.consolidate(), or something like that.

@jreback
Contributor

jreback commented Jul 14, 2015

well

df.as_blocks(copy=False) (copy flag added in master recently) gives you a view right now

and as discussed, I would like to see why this should actually be exposed to the user. What is the purpose? A good use case would go a long way. Showing an implementation is only guaranteed to cause issues down the road when one wants to change it (and cannot, because it's exposed)
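
For reference, a minimal sketch of that call as it existed at the time (as_blocks returned a dict keyed by dtype string; it was deprecated and removed in later pandas versions, so this only runs on pandas of that era):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(3), "b": np.arange(3.0)})

# returns something like {'int64': <DataFrame>, 'float64': <DataFrame>},
# one frame per dtype-block; copy=False hands back views where possible
blocks = df.as_blocks(copy=False)
print({dtype: frame.shape for dtype, frame in blocks.items()})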

@shoyer
Member Author

shoyer commented Jul 14, 2015

I don't think we should expose the full blocks API. Rather, we should expose a limited set of functionality suitable for basic understanding and manipulation of blocks (mostly .consolidate_blocks()), so we can have the option of disabling automatic consolidation. If we ever change the underlying implementation, we can just make this method a no-op.

For viewing block information, we might expose a tuple, similar to the .chunks attribute on dask arrays.
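
For comparison, the dask attribute being referenced is a plain tuple describing the partitioning, not a handle on the partitions themselves (assuming dask.array is installed):

import dask.array as da

x = da.ones((100000, 1000), chunks=(50000, 250))
# a tuple of per-axis chunk sizes -- read-only metadata about the layout
print(x.chunks)   # ((50000, 50000), (250, 250, 250, 250))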

@sinhrks
Member

sinhrks commented Jul 15, 2015

I assume exposing some API is meant to let power users experiment, rather than for daily usage? I think adding more detail to internals.rst would be very nice (#4082), but I'm not sure what level of "public" is actually required (guaranteeing backward compat, adding it to API.rst, etc.).

+1 for optionally disabling auto-consolidation; it should allow pandas to handle larger data.

@jreback
Contributor

jreback commented Jul 15, 2015

I still have not seen any reason to actually expose anything about the internals as an API.

It seems that one might want to control how the internal engine works, e.g. simply have an option on construction that allows a potentially different internal representation.

No consolidation is a nice idea on the surface, and there may be a reason to allow this and/or switch it as a default. HOWEVER, I would like to see someone do a performance study to see if it actually makes a difference in a non-trivial case.

Turning off auto-consolidation is actually trivial, so if someone would like to step up, all ears.

@quicknir

@shoyer Thanks for tagging me on this. I am a strong +1 on this. As an example of a use case, ARF1 showed in #9216 how one could use the block manager interface to create an empty DataFrame nearly 100 times faster than the naive approach. It would be nice if one could do this without accessing _implementation methods that are subject to change without warning.

When developing a C/C++ library that interfaces with pandas, memory layout is critical. A larger public interface that allows better guarantees and more explicit control of memory layout would be very helpful. I can give other examples in pandas where similar issues arise.
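
To illustrate the flavour of that use case (this is not the block-manager trick from #9216 itself, just the general point that avoiding per-column allocation and casting matters; sizes here are hypothetical and timings are machine-dependent):

import numpy as np
import pandas as pd

n, cols = 1_000_000, list("ABCDE")

# naive: build an all-NaN frame and let pandas allocate/cast per column
# %timeit pd.DataFrame(index=range(n), columns=cols, dtype="float64")

# preallocated: hand pandas a single uninitialised 2D block up front
# %timeit pd.DataFrame(np.empty((n, len(cols))), columns=cols)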

@jreback jreback added this to the Next Major Release milestone Aug 23, 2015
@jreback
Contributor

jreback commented Sep 27, 2015

jreback@44c280b

is a branch that adds a policy parameter to control how consolidation works.

It's not completely automatic at this point (e.g. you have to pass policy='split' or policy='column' to get this behavior), though this could become an option (support.consolidation). Further, for a dict we can simply default to column, I think.

note that:

  • column -> don't consolidate; IOW, this is a 'column' repr internally. This won't 'split' a passed-in ndarray though (e.g. it's still a single block to begin with). This could eventually be the default.
  • split -> force-split a passed ndarray. This can be costly, but with the upside that you get proper memory layout. This has to be a user decision as it's not always necessary / wanted.
  • needs some more tests and such
  • needs some perf testing
  • need to have various ops (e.g. concat), deal with the policy properly (may be non-trivial)

@jreback
Contributor

jreback commented Sep 27, 2015

Here are some perf considerations

In [3]: np.random.seed(1234)

# blocks
In [4]: df_blocks = DataFrame(np.random.randn(100000,1000))

# create a single ndarray, then split it
In [5]: df_split = DataFrame(np.random.randn(100000,1000),policy='split')

# create a dict of ndarrays, no need to split
In [10]: df_dsplit = DataFrame(dict([ (i,np.random.randn(100000)) for i in range(1000) ]),policy='split')

Creations

# block
In [7]: %timeit DataFrame(np.random.randn(100000,1000))
1 loops, best of 3: 3.79 s per loop

# ndarray-split
In [6]: %timeit DataFrame(np.random.randn(100000,1000),policy='split')
1 loops, best of 3: 5.93 s per loop

# dict-split
In [14]: %timeit DataFrame(dict([ (i,np.random.randn(100000)) for i in range(1000) ]),policy='split')
1 loops, best of 3: 3.93 s per loop

row-wise ops

# blocked
In [8]: %timeit df_blocks.sum()
1 loops, best of 3: 1.55 s per loop

# ndarray-split
In [9]: %timeit df_split.sum()
1 loops, best of 3: 607 ms per loop

# dict-split
In [15]: %timeit df_dsplit.sum()
1 loops, best of 3: 617 ms per loop

So you may be surprised by the speed of the splits. This is because the split columns are already C-contiguous, while the block memory is NOT: it's in C order, but not laid out well for a row-wise op.
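
The cache-layout effect being described can be reproduced with plain NumPy (a rough sketch only; the real block manager additionally stores its 2D blocks transposed, and timings are machine-dependent):

import numpy as np

arr_c = np.random.randn(100000, 1000)   # C-contiguous: each row is contiguous
arr_f = np.asfortranarray(arr_c)        # F-contiguous: each column is contiguous

# reducing along the axis that is contiguous in memory is the cache-friendly case
# %timeit arr_c.sum(axis=1)   # sums contiguous rows
# %timeit arr_f.sum(axis=1)   # same result, but strided access per row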

@shoyer
Member Author

shoyer commented Sep 28, 2015

For the creation benchmarks, it would be helpful to put generating the random values outside the loop. This does look very encouraging, though.

jreback added a commit to jreback/pandas that referenced this issue Oct 11, 2015
- closes pandas-dev#10556, add policy argument to constructors
- closes pandas-dev#9216, all passing of dict with view directly to the API
- closes pandas-dev#5902
- closes pandas-dev#8571 by defining __copy__/__deepcopy__
@jreback jreback removed the Prio-high label Mar 29, 2017
@jreback jreback modified the milestones: 2.0, Next Major Release Mar 29, 2017
@jbrockmendel
Member

Would this make the "dict of Series" description of a DataFrame more reliable? The sometimes-vexing behavior I'm thinking of is

>>> arr = np.random.randn(1000, 5)
>>> df = pd.DataFrame(arr, columns=list('ABCDE'))
>>> A = df['A']
>>> G = 2*A
>>> G.name = 'G'
>>> df['G'] = G
>>> df['G'] is G
False
>>> df['A'] is A
False

Having id(df['A']) be consistent would make keeping track of _metadata simpler in some cases.
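
A related way to check what is actually shared, assuming np.shares_memory is an acceptable proxy for "same underlying data" (the result of the second check depends on the pandas version and its copy semantics):

import numpy as np
import pandas as pd

arr = np.random.randn(1000, 5)
df = pd.DataFrame(arr, columns=list("ABCDE"))

A = df["A"]
print(df["A"] is A)                               # False: __getitem__ builds a new Series each time
print(np.shares_memory(df["A"].to_numpy(), arr))  # often True: the data may still live in the block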

@jreback
Contributor

jreback commented May 1, 2017

this issue is not going to happen for pandas 1.x. In pandas 2.x this will be moot, as there will be no more blocks.

@jreback jreback closed this as completed May 1, 2017
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue Jan 14, 2018
- closes pandas-dev#10556, add policy argument to constructors
- closes pandas-dev#9216, all passing of dict with view directly to the API
- closes pandas-dev#5902
- closes pandas-dev#8571 by defining __copy__/__deepcopy__
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue Jan 16, 2018
- closes pandas-dev#10556, add policy argument to constructors
- closes pandas-dev#9216, all passing of dict with view directly to the API
- closes pandas-dev#5902
@jorisvandenbossche
Member

jorisvandenbossche commented Jan 16, 2018

I rebased @jreback's commit to play a bit with it and do some more timings. Using the same example as above, but now excluding the array creation and covering more operations:

In [2]: np.random.seed(1234)
   ...: array = np.random.randn(100000,1000)
   ...: array_dict = dict([ (i,np.random.randn(100000)) for i in range(1000) ])

In [3]: # blocks
   ...: df_blocks = DataFrame(array)
   ...: 
   ...: # create a single ndarray, then split it
   ...: df_split = DataFrame(array, policy='split')
   ...: 
   ...: # create a dict of ndarrays, no need to split
   ...: df_dsplit = DataFrame(array_dict, policy='split')

Creation: splitting the array is much slower, but when you already have a dict of arrays it is logically faster to not consolidate them:

In [4]: %timeit DataFrame(array)
102 µs ± 932 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit DataFrame(array, policy='split')
3.01 s ± 93.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit DataFrame(array_dict, policy='split')
29.5 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit DataFrame(array_dict, policy='block')
627 ms ± 4.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Operations column-wise:

In [8]: # blocked
   ...: %timeit df_blocks.sum()
994 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: # splitted
    ...: %timeit df_split.sum()
1.99 s ± 84.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Operation row-wise:

In [11]: # blocked
    ...: %timeit df_blocks.sum(axis=1)
889 ms ± 29.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: # splitted
    ...: %timeit df_split.sum(axis=1)
1.85 s ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Slicing (much slower!):

In [21]: %timeit df_blocks.iloc[0:1000]
106 µs ± 600 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [22]: %timeit df_split.iloc[0:1000]
16.9 ms ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So in general, as expected, it gives a slowdown on the above operations (of course this is the specific case of a lot of all-float columns; only iterating over the rows with e.g. itertuples was faster with the split version). But the slicing operation in particular is much, much slower.

So in any case, if we would like to further explore this, I think it will certainly mean having to refactor the BlockManager to make operations that have to loop over the columns faster.
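
The slicing slowdown has a simple model behind it: a row slice of one consolidated 2D block is a single view, while a row slice of per-column storage has to touch every column. A plain-NumPy sketch of that difference (not the pandas code path itself):

import numpy as np

arr = np.random.randn(100000, 1000)                        # one consolidated block
cols = {i: arr[:, i].copy() for i in range(arr.shape[1])}  # one array per column

# %timeit arr[0:1000]                                # a single strided view
# %timeit {k: v[0:1000] for k, v in cols.items()}    # 1000 separate slices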

@wesm
Member

wesm commented Dec 18, 2019

I've run up against this issue myself (like fighting with my own ghost) in the context of ARROW-3789 as there are use cases where users are converting very large (tens of gigabytes) Arrow tables and wish to avoid excess allocations / memory copying. In certain narrow use cases (i.e. no strings, no nulls), it is actually possible to do fully zero-copy construction of DataFrame from Arrow memory.

This brings up the spectre of unpredictable consolidation. I don't know if any mitigations have been implemented in the last 4 years, but I'm interested to hear if there are ideas.

@ARF1

ARF1 commented Dec 20, 2019

@wesm I had a similar problem with performant bcolz integration a few years back. Would this approach help in the meantime? While this is not zero-copy, in my case the memory copy had a negligible performance impact compared to the pandas blocks construction.

This worked for my fairly limited use case then. No thorough testing was done. Also, things may have changed since then. I have not used this in years.

@wesm
Member

wesm commented Dec 24, 2019

I've seen use cases where the memory doubling will cause OOM, so zero-copy / no consolidation is essential

wesm added a commit to apache/arrow that referenced this issue Jan 15, 2020
…eries/DataFrame. Zero copy optimizations for DataFrame, add split_blocks and self_destruct options

The primary goal of this patch is to provide a way for some users to avoid memory doubling with converting from Arrow to pandas.

This took me entirely too much time to get right, but partly I was attempting to disentangle some of the technical debt and overdue refactoring in arrow_to_pandas.cc.

Summary of what's here:

- Refactor ChunkedArray->Series and Table->DataFrame conversion paths to use the exact same code rather than two implementations of the same thing with slightly different behavior. The `ArrowDeserializer` helper class is now gone
- Do zero-copy construction of internal DataFrame blocks for the case of a contiguous non-nullable array and a block with only 1 column represented
- Add `split_blocks` option to `to_pandas` which constructs one block per DataFrame column, resulting in more zero-copy opportunities. Note that pandas's internal "consolidation" can still cause memory doubling (see discussion about this in pandas-dev/pandas#10556)
- Add `self_destruct` option to `to_pandas` which releases the Table's internal buffers as soon as they are converted to the required pandas structure. This allows memory to be reclaimed by the OS as conversion is taking place rather than having a forced memory-doubling and then post-facto reclamation (which has been causing OOM for some users)

The most conservative invocation of `to_pandas` now would be `table.to_pandas(use_threads=False, split_blocks=True, self_destruct=True)`

Note that the self-destruct option makes the `Table` object unsafe for further use. This is a bit dissatisfying but I wasn't sure how else to provide this capability.

Closes #6067 from wesm/ARROW-3789 and squashes the following commits:

3b42602 <Wes McKinney> Code review comments
8f39cce <Wes McKinney> Add some documentation. Try fixing MSVC warnings
c22d280 <Wes McKinney> Fix one MSVC cast warning
4306803 <Wes McKinney> Add "split blocks" and "self destruct" options to Table.to_pandas, with zero-copy operations for improved memory use when converting from Arrow to pandas

Authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
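
For reference, the resulting invocation on the pyarrow side looks roughly like this (split_blocks and self_destruct are the options added by that patch; exact behaviour may vary by pyarrow version):

import pyarrow as pa

table = pa.table({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# one pandas block per column plus eager release of Arrow buffers;
# after self_destruct the Table must not be used again
df = table.to_pandas(use_threads=False, split_blocks=True, self_destruct=True)
print(df)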
@jorisvandenbossche
Member

I don't know if there have been any mitigations implemented in the last 4 years about this but I'm interested to see if there are ideas

Personally, I don't think getting such a feature into pandas would be that hard. The commit linked above is not that complicated. And now that we have the new metadata machinery (the "attrs" attribute), we could maybe use that to attach the "consolidation policy" as metadata to the DataFrame.

If there is some agreement in that we still would want to merge something like this, I could try to revive that branch.


Generally speaking, I think the long-term path forward is the 1D ExtensionArray-based columns, which have the direct consequence of not getting consolidated (because they are 1D), moving towards a "collection of 1D arrays" model for the DataFrame. We already have a few such dtypes now, and I think the plan is to keep adding more.
But of course, even if we get there eventually, this is long-term work (certainly until those become the default). So personally, I think it might still be worth considering the "no consolidation policy" for the shorter term.
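
A small illustration of that point, peeking at the private manager purely for inspection (requires a pandas version with the nullable extension dtypes; exact block counts are an implementation detail):

import pandas as pd

# two numpy-dtype float columns are typically consolidated into one 2D block ...
df_np = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
print(len(df_np._mgr.blocks))   # typically 1

# ... while extension dtypes stay as 1D arrays, one block per column
df_ea = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, dtype="Float64")
print(len(df_ea._mgr.blocks))   # typically 2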

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue Mar 17, 2020
- closes pandas-dev#10556, add policy argument to constructors
- closes pandas-dev#9216, all passing of dict with view directly to the API
- closes pandas-dev#5902
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue Mar 17, 2020
- closes pandas-dev#10556, add policy argument to constructors
- closes pandas-dev#9216, all passing of dict with view directly to the API
- closes pandas-dev#5902
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue May 25, 2020
- closes pandas-dev#10556, add policy argument to constructors
- closes pandas-dev#9216, all passing of dict with view directly to the API
- closes pandas-dev#5902
@jorisvandenbossche
Member

I have updated the branch adding a consolidation policy once again (starting from @TomAugspurger's recent rebase, see the linked commit by github above).

And with that, I experimented a bit comparing a few operations with a consolidated DataFrame vs non-consolidated DataFrame: see notebook at https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c

Of course, for several operations, it is faster to operate on a 2D array compared to operating on many 1D arrays. However, with a decent-sized dataframe (I used 100_000 rows x 100 columns here), the difference is generally not that big (for this size of dataframe, the actual operation dominates the overhead of doing things column-wise).
And there were actually also a few operations that are faster column-wise (although block-wise could quite likely be made just as fast, with some time investment).

My general sense (of course, based on very limited tests) is that the slowdown with non-consolidated dataframes for typical operations is not gigantic, and with some effort optimizing the BlockManager for this use case (right now there is often a lot of overhead in finding the correct block type, reconstructing blocks, etc, which is something we should be able to optimize for non-consolidated dataframes) we should be able to remove a large part of this slowdown.
And of course, there are also operations that will actually benefit from non-consolidated dataframes (especially where copies can be avoided this way).

@jreback
Contributor

jreback commented May 25, 2020

@jorisvandenbossche you are missing a big issue here

just operating column wise is not a good test

you need to actually copy the original frame into single columns where the memory gets realigned; taking a view does nothing here, which is effectively what you are doing

you will see massive speed-ups if you get proper cache access for some types of operations

please see my original PR on non-consolidation (a while back)

@jorisvandenbossche
Member

@jreback I am not really sure what you mean. Can you clarify a bit?

just operating column wise is not a good test

Why not? That's what we would do if we didn't have consolidation?
Of course, it's still using 2D blocks to store a 1D array, which is not ideal (so we should be able to optimize this if we actually have 1D blocks), but already doing this should give an idea about the overhead of doing things column-wise.

you need to actually copy the original frame into single columns where the memory gets realigned; taking a view does nothing here which is effectively what you are doing

Can you explain what you mean? (maybe a code snippet what I should do)
You mean that df2 = pd.DataFrame(arr, policy="split") is not actually properly splitting the array?
(as far as I can see, it is actually taking a copy, as the df2 is no longer a view on the original array)

please see me original PR in non consolidation (a while back)

Do you have a link? (I was not aware of a PR about this)

@jreback
Contributor

jreback commented May 25, 2020

see above : #10556 (comment)

you need to make really sure that ops are actually working with cache-aligned data and not just with a view from a 2D array, which is not aligned at all; otherwise your timings are not real

@jorisvandenbossche
Member

see above : #10556 (comment)

I am still not fully clear what you are pointing to. The linked comment explains the added features in your branch (the different policy options, mentioning that policy="split" results in "proper memory layout"), but it is this branch (well, your old branch but rebased on current master) that I have been using for running the notebook.

you need to make really sure that ops are actually working with cache aligned data and not with just a view from a 2D array which is not aligned at all, otherwise your timings are not real

Yes, but as far as I can see, this is the case (the Block's values in the "split" case are properly laid out, not a strided view of the original).
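
That claim can be checked directly on the experimental branch (the policy keyword only exists there, not in released pandas; the inspection again goes through the private manager):

import numpy as np
import pandas as pd

arr = np.random.randn(100000, 1000)
df_split = pd.DataFrame(arr, policy="split")   # 'policy' exists only on that branch

for block in list(df_split._data.blocks)[:3]:
    vals = block.values
    # each split block should be a real C-contiguous copy, not a strided view of `arr`
    print(vals.shape, vals.flags["C_CONTIGUOUS"], np.shares_memory(vals, arr))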

@jbrockmendel
Member

jbrockmendel commented May 25, 2020 via email

@jbrockmendel
Member

We now fully avoid silent consolidation. Is this still active?

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Jan 23, 2023
@mroeschke
Member

This discussion doesn't seem relevant anymore with the removal of silent consolidation, so closing.
