Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: ArrayManager -- array-based data manager for columnar store #36010

Merged
merged 42 commits into from
Jan 13, 2021
Merged
Show file tree
Hide file tree
Changes from 16 commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
a51835b
POC: ArrayManager -- array-based data manager for columnar store
jorisvandenbossche Jun 1, 2020
591579b
Update with latest master + some fixes
jorisvandenbossche Aug 27, 2020
896080a
add pd.options.mode.data_manager to switch
jorisvandenbossche Sep 4, 2020
f9c4dda
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Sep 5, 2020
d18082a
add apply_with_block workaround
jorisvandenbossche Sep 5, 2020
cf3c07a
fix alignment in apply
jorisvandenbossche Sep 5, 2020
b252c6d
reorder methods to match BlockManager
jorisvandenbossche Sep 5, 2020
0fb645e
skip json tests for now
jorisvandenbossche Sep 5, 2020
eb55fef
skip more json tests + to_csv with to_native_types
jorisvandenbossche Sep 5, 2020
d241f31
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Sep 6, 2020
47c3ee3
support both ndarrays and ExtensionArrays
jorisvandenbossche Sep 17, 2020
75f7de2
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Sep 17, 2020
f36e395
add unstack
jorisvandenbossche Sep 17, 2020
be20816
fix native types, skip quantile, hdf, stata tests
jorisvandenbossche Sep 17, 2020
8b7cc81
remove skip in the benchmarks
jorisvandenbossche Sep 17, 2020
a239f50
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Sep 17, 2020
a0ccf9a
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Sep 22, 2020
dc1b190
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Oct 16, 2020
55d38be
remove manager keyword from DataFrame constructor, add _as_manager in…
jorisvandenbossche Oct 16, 2020
3dea0d7
move new ArrayManager code to separate file
jorisvandenbossche Oct 16, 2020
1a61333
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Nov 10, 2020
9751d33
de-privatize
jbrockmendel Nov 10, 2020
e45b645
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Dec 11, 2020
3749c7d
try fix up typing
jorisvandenbossche Dec 11, 2020
af53040
add pytest option + add one github actions build to run them
jorisvandenbossche Dec 11, 2020
cc45673
fix pytest marks for skipping when using array-manager
jorisvandenbossche Dec 12, 2020
27cf215
several fixes - get tests/frame/methods tests passing
jorisvandenbossche Dec 12, 2020
f6a97df
ci - only run the tests/frame/methods tests
jorisvandenbossche Dec 12, 2020
67c4c2b
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Dec 12, 2020
670ed76
mypy fix
jorisvandenbossche Dec 12, 2020
5128ad1
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Dec 18, 2020
5c73688
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Jan 8, 2021
a9a8c2d
move to internals/construction.py
jorisvandenbossche Jan 8, 2021
c7898fb
update for latest changes - fix tests/mypy
jorisvandenbossche Jan 8, 2021
3430307
fix todo
jorisvandenbossche Jan 8, 2021
1a30013
fix import in tests
jorisvandenbossche Jan 8, 2021
ef86b1e
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Jan 10, 2021
c5548d9
add union alias to typing
jorisvandenbossche Jan 10, 2021
afe8f80
updates based on review
jorisvandenbossche Jan 10, 2021
b88c757
skip json tests to avoid segfaults
jorisvandenbossche Jan 10, 2021
ddc51d0
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Jan 12, 2021
9dc5600
fix for Label -> Hashable change in master
jorisvandenbossche Jan 12, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions pandas/core/config_init.py
Original file line number Diff line number Diff line change
Expand Up @@ -482,6 +482,14 @@ def use_inf_as_na_cb(key):
cf.register_option(
"use_inf_as_null", False, use_inf_as_null_doc, cb=use_inf_as_na_cb
)
cf.register_option(
"data_manager",
# TODO switch back to default of "block" before merging
# "block",
"array",
"internal manager type",
validator=is_one_of_factory(["block", "array"]),
)

cf.deprecate_option(
"mode.use_inf_as_null", msg=use_inf_as_null_doc, rkey="mode.use_inf_as_na"
Expand Down
23 changes: 20 additions & 3 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -129,7 +129,7 @@
from pandas.core.indexes.multi import MultiIndex, maybe_droplevels
from pandas.core.indexes.period import PeriodIndex
from pandas.core.indexing import check_bool_indexer, convert_to_index_sliceable
from pandas.core.internals import BlockManager
from pandas.core.internals import ArrayManager, BlockManager
from pandas.core.internals.construction import (
arrays_to_mgr,
dataclasses_to_dicts,
Expand Down Expand Up @@ -437,6 +437,9 @@ def __init__(
columns: Optional[Axes] = None,
dtype: Optional[Dtype] = None,
copy: bool = False,
# TODO do we want to keep this as a keyword as well? (I think it can be handy)
# can we somehow make it a "private" keyword? (`_manager` ?)
manager: Optional[str] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think i'd be ok with ```flags`` being an added keyword to the constructor (and you can then make manager a flag)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is just for convenience (to easily change back and forth) I'd prefer a private method on DataFrame to change the manager (or return a new DataFrame with a new manager).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keyword right now makes it a bit more convenient to pick a specific one in the tests, or to compare both versions side by side.

Eg instead of

with pd.options_context("mode.data_manager", "block"):
    df = pd.DataFrame(...)

you can do

df = pd.DataFrame(..., manager="block")

if you want eg a certain test to only run using a specific manager, regardless of the global setting.

I fully agree we should be careful with making this a public keyword, but I think that also for internal use eg in tests, it would be good to have a convenient way to do this.

With a private method, you are thinking of a class method like pd.DataFrame._construct_with_array_manager(...) / pd.DataFrame._construct_with_block_manager(..) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a private method, you are thinking of a class method like pd.DataFrame._construct_with_array_manager(...) / pd.DataFrame._construct_with_block_manager(..) ?

I was thinking pd.DataFrame(...)._as_array_manager(). But whatever is easiest for testing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK. For testing that should also be fine (as long as it's not testing the constructor ;)). It will be less efficient as it's only converting to the other format after initial construction (which I do now anyway, but the goal is to fix that at some point of course), but for testing purposes that doesn't matter.

Could also be a pd.DataFrame(..)._as_manager("block"/"array") ?
Because we want to have both. Eg some test might be testing specifically aspects of block -based dataframe, so even when running the tests with global option to use array manager, that test should still use block manager.

):
if data is None:
data = {}
Expand All @@ -446,7 +449,7 @@ def __init__(
if isinstance(data, DataFrame):
data = data._mgr

if isinstance(data, BlockManager):
if isinstance(data, (BlockManager, ArrayManager)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these have a common base class?

if index is None and columns is None and dtype is None and copy is False:
# GH#33357 fastpath
NDFrame.__init__(
Expand Down Expand Up @@ -555,6 +558,16 @@ def __init__(
values, index, columns, dtype=values.dtype, copy=False
)

if manager is None:
manager = get_option("mode.data_manager")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check if this slows down DataFrame() noticeably? I recall one of the options-related methods being somewhat slow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran a few timings (on this branch), but it's not fully clear:

In [1]: from pandas._config import get_option  

In [2]: %timeit get_option("mode.data_manager")  
1.89 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [3]: %timeit pd.DataFrame(np.random.randn(3, 2))
86.9 µs ± 5.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [4]: df = pd.DataFrame(np.random.randn(3, 2))   

# creating from manager (with manager type matching default of "array")
In [5]: %timeit pd.DataFrame(df._mgr) 
2.23 µs ± 465 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

# specifing the manager -> don't look up the option, you would expect it to be faster (but not much difference)
In [7]: %timeit pd.DataFrame(df._mgr, manager="array") 
1.95 µs ± 75 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [8]: df = pd.DataFrame(np.random.randn(3, 2), manager="block")   

In [9]: %timeit pd.DataFrame(df._mgr, manager="block") 
2.29 µs ± 360 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [10]: pd.options.mode.data_manager = "block"  

In [11]: %timeit pd.DataFrame(df._mgr)  
2.33 µs ± 245 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [12]: %timeit pd.DataFrame(df._mgr, manager="block")   
2.28 µs ± 246 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [13]: %timeit get_option("mode.data_manager")  
1.79 µs ± 246 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So while the get_option itself takes a bit less than 2µs, avoiding it in the DataFrame constructor (by specifying the option), you don't really see an equivalent speed-up.

In any case, I think it's only relevant for the case of passing a Manager, because once you pass any kind of data (like an array I did above), it takes much more time anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, what I forgot in the explanation above, is that right now we don't check the option for the "fastpath" case of passing just a manager object. Only when data gets passed that needs to be converted to a manager (or when the manager needs to be reindexed), then the option is checked (and in this case the constructor takes a lot more time than checking the option anyway).


if manager == "array" and not isinstance(mgr, ArrayManager):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we just pass thru _as_manager always (and relocate it to a proper place)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_as_manager expects an already initialized dataframe/manager, while that is not yet necessarily the case here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point is to always use _as_manager which can construct things rather than doing this if/then here. this can be simply a module level constructor for both Block/Array manager i think. avoids bloat here in the constructor itself and pushes this to pandas/core/construction.py

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to remove the if/else here, and defer to a function in internals/construction.py to ensure correct Manager type (and convert if needed).

# TODO proper initialization
df = DataFrame(mgr, manager="block")
arrays = [arr.copy() for arr in df._iter_column_arrays()]
mgr = ArrayManager(arrays, [mgr.axes[1], mgr.axes[0]])
# TODO check for case of manager="block" but mgr is ArrayManager

NDFrame.__init__(self, mgr)

# ----------------------------------------------------------------------
Expand Down Expand Up @@ -629,6 +642,8 @@ def _is_homogeneous_type(self) -> bool:
... "B": np.array([1, 2], dtype=np.int64)})._is_homogeneous_type
False
"""
if isinstance(self._mgr, ArrayManager):
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this different?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to return something here, as the code below relies on BlockManager internals.
So I could "replicate" the same logic but for the ArrayManager. But I think for POC-reasons I didn't yet do the effort up to now (it's also not used in many places, and it's also not causing any bugs, just a potentially missed optimization (but which is probably not yet implemented for ArrayManager anyway), don't really exactly remember it.
Now, it shouldn't be hard to add, just checking the length of the set of dtypes of columns instead of blocks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add comments where appropriate to point out places where things are temporary/POC?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this specific case, simply updated it to do the actual correct thing (check the number of unique types). I think other edits are either OK or already have a TODO note

if self._mgr.any_extension_types:
return len({block.dtype for block in self._mgr.blocks}) == 1
else:
Expand All @@ -640,6 +655,8 @@ def _can_fast_transpose(self) -> bool:
"""
Can we transpose this DataFrame without creating any new array objects.
"""
if isinstance(self._data, ArrayManager):
return False
if self._data.any_extension_types:
# TODO(EA2D) special case would be unnecessary with 2D EAs
return False
Expand Down Expand Up @@ -5351,7 +5368,7 @@ def sort_values( # type: ignore[override]
)

if ignore_index:
new_data.axes[1] = ibase.default_index(len(indexer))
new_data.set_axis(1, ibase.default_index(len(indexer)))
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved

result = self._constructor(new_data)
if inplace:
Expand Down
21 changes: 13 additions & 8 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -98,7 +98,7 @@
from pandas.core.indexes.datetimes import DatetimeIndex
from pandas.core.indexes.period import Period, PeriodIndex
import pandas.core.indexing as indexing
from pandas.core.internals import BlockManager
from pandas.core.internals import ArrayManager, BlockManager
from pandas.core.missing import find_valid_index
from pandas.core.ops import align_method_FRAME
from pandas.core.shared_docs import _shared_docs
Expand Down Expand Up @@ -196,7 +196,7 @@ class NDFrame(PandasObject, SelectionMixin, indexing.IndexingMixin):
_deprecations: FrozenSet[str] = frozenset(["get_values", "tshift"])
_metadata: List[str] = []
_is_copy = None
_mgr: BlockManager
_mgr: Union[BlockManager, ArrayManager]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should make this an actual type in _typing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls do this

_attrs: Dict[Optional[Hashable], Any]
_typ: str

Expand All @@ -205,7 +205,7 @@ class NDFrame(PandasObject, SelectionMixin, indexing.IndexingMixin):

def __init__(
self,
data: BlockManager,
data: Union[BlockManager, ArrayManager],
copy: bool = False,
attrs: Optional[Mapping[Optional[Hashable], Any]] = None,
):
Expand All @@ -222,7 +222,9 @@ def __init__(
object.__setattr__(self, "_flags", Flags(self, allows_duplicate_labels=True))

@classmethod
def _init_mgr(cls, mgr, axes, dtype=None, copy: bool = False) -> BlockManager:
def _init_mgr(
cls, mgr, axes, dtype=None, copy: bool = False
) -> Union[BlockManager, ArrayManager]:
""" passed a manager and a axes dict """
for a, axe in axes.items():
if axe is not None:
Expand All @@ -235,8 +237,9 @@ def _init_mgr(cls, mgr, axes, dtype=None, copy: bool = False) -> BlockManager:
mgr = mgr.copy()
if dtype is not None:
# avoid further copies if we can
if len(mgr.blocks) > 1 or mgr.blocks[0].values.dtype != dtype:
mgr = mgr.astype(dtype=dtype)
# TODO
# if len(mgr.blocks) > 1 or mgr.blocks[0].values.dtype != dtype:
mgr = mgr.astype(dtype=dtype)
return mgr

# ----------------------------------------------------------------------
Expand Down Expand Up @@ -4469,11 +4472,11 @@ def sort_index(
new_data = self._mgr.take(indexer, axis=baxis, verify=False)

# reconstruct axis if needed
new_data.axes[baxis] = new_data.axes[baxis]._sort_levels_monotonic()
new_data.set_axis(baxis, new_data.axes[baxis]._sort_levels_monotonic())

if ignore_index:
axis = 1 if isinstance(self, ABCDataFrame) else 0
new_data.axes[axis] = ibase.default_index(len(indexer))
new_data.set_axis(axis, ibase.default_index(len(indexer)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you comment on why these need to be changed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In hindsight, it's probably not needed anymore. Originally, I changed the axes attribute on ArrayManager to be (index, columns) order. But afterwards, I changed this to be stored in _axes, and having an axes property that switches the internal order, to match the "public" interface of BlockManager.

So with that change, I could revert those edits (I think, but can check that)

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Jan 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checked this again, and so this change is still needed (without it I get failures for the array manager).
The reason is that mgr.axes is no longer the actual "storage" of the axes for the ArrayManager, but is there stored in _axes. And because mgr.axes returns a list, it's actually updating the list in place, and not the actual storage. Which is a bit of a gotcha now with the manager's axes attribute.

But based on a quick check, this seems to be the only place where we assign to mgr.axes[i] = ..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for explaining. we should make another attempt to get axes out of FooManager before too long


result = self._constructor(new_data)

Expand Down Expand Up @@ -5438,6 +5441,8 @@ def _protect_consolidate(self, f):
Consolidate _mgr -- if the blocks have changed, then clear the
cache
"""
if isinstance(self._mgr, ArrayManager):
return f()
blocks_before = len(self._mgr.blocks)
result = f()
if len(self._mgr.blocks) != blocks_before:
Expand Down
2 changes: 2 additions & 0 deletions pandas/core/internals/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
)
from pandas.core.internals.concat import concatenate_block_managers
from pandas.core.internals.managers import (
ArrayManager,
BlockManager,
SingleBlockManager,
create_block_manager_from_arrays,
Expand All @@ -35,6 +36,7 @@
"TimeDeltaBlock",
"safe_reshape",
"make_block",
"ArrayManager",
"BlockManager",
"SingleBlockManager",
"concatenate_block_managers",
Expand Down
18 changes: 17 additions & 1 deletion pandas/core/internals/concat.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
from collections import defaultdict
import copy
import itertools
from typing import Dict, List

import numpy as np
Expand All @@ -26,7 +27,7 @@
import pandas.core.algorithms as algos
from pandas.core.arrays import DatetimeArray, ExtensionArray
from pandas.core.internals.blocks import make_block
from pandas.core.internals.managers import BlockManager
from pandas.core.internals.managers import ArrayManager, BlockManager


def concatenate_block_managers(
Expand All @@ -46,6 +47,21 @@ def concatenate_block_managers(
-------
BlockManager
"""
if isinstance(mgrs_indexers[0][0], ArrayManager):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we assuming that they are either all-ArrayManager or all-BlockManager?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we assuming that they are either all-ArrayManager or all-BlockManager?

Yes, and this concat implementation right now is very limited in general (eg only the simple case without any reindexing needed, several tests are skipped because of this, I put several TODO(ArrayManager) concat with reindexing because of this).
Concatenation is one of the big areas of work for follow-up on this initial PR.


if concat_axis == 1:
# TODO for now only fastpath without indexers
mgrs = [t[0] for t in mgrs_indexers]
arrays = [
concat_compat([mgrs[i].arrays[j] for i in range(len(mgrs))], axis=0)
for j in range(len(mgrs[0].arrays))
]
return ArrayManager(arrays, [axes[1], axes[0]])
elif concat_axis == 0:
mgrs = [t[0] for t in mgrs_indexers]
arrays = list(itertools.chain.from_iterable([mgr.arrays for mgr in mgrs]))
return ArrayManager(arrays, [axes[1], axes[0]])

concat_plans = [
_get_mgr_concatenation_plan(mgr, indexers) for mgr, indexers in mgrs_indexers
]
Expand Down
Loading