-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
POC: ArrayManager -- array-based data manager for columnar store #36010
Changes from 3 commits
a51835b
591579b
896080a
f9c4dda
d18082a
cf3c07a
b252c6d
0fb645e
eb55fef
d241f31
47c3ee3
75f7de2
f36e395
be20816
8b7cc81
a239f50
a0ccf9a
dc1b190
55d38be
3dea0d7
1a61333
9751d33
e45b645
3749c7d
af53040
cc45673
27cf215
f6a97df
67c4c2b
670ed76
5128ad1
5c73688
a9a8c2d
c7898fb
3430307
1a30013
ef86b1e
c5548d9
afe8f80
b88c757
ddc51d0
9dc5600
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -128,7 +128,7 @@ | |
from pandas.core.indexes.multi import MultiIndex, maybe_droplevels | ||
from pandas.core.indexes.period import PeriodIndex | ||
from pandas.core.indexing import check_bool_indexer, convert_to_index_sliceable | ||
from pandas.core.internals import BlockManager | ||
from pandas.core.internals import ArrayManager, BlockManager | ||
from pandas.core.internals.construction import ( | ||
arrays_to_mgr, | ||
dataclasses_to_dicts, | ||
|
@@ -446,6 +446,9 @@ def __init__( | |
columns: Optional[Axes] = None, | ||
dtype: Optional[Dtype] = None, | ||
copy: bool = False, | ||
# TODO setting default to "array" for testing purposes (the actual default | ||
# needs to stay "block" initially of course for backwards compatibility) | ||
manager: Optional[str] = None, | ||
): | ||
if data is None: | ||
data = {} | ||
|
@@ -455,7 +458,7 @@ def __init__( | |
if isinstance(data, DataFrame): | ||
data = data._mgr | ||
|
||
if isinstance(data, BlockManager): | ||
if isinstance(data, (BlockManager, ArrayManager)): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. these have a common base class? |
||
if index is None and columns is None and dtype is None and copy is False: | ||
# GH#33357 fastpath | ||
NDFrame.__init__( | ||
|
@@ -564,6 +567,16 @@ def __init__( | |
values, index, columns, dtype=values.dtype, copy=False | ||
) | ||
|
||
if manager is None: | ||
manager = get_option("mode.data_manager") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you check if this slows down There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I ran a few timings (on this branch), but it's not fully clear: In [1]: from pandas._config import get_option
In [2]: %timeit get_option("mode.data_manager")
1.89 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [3]: %timeit pd.DataFrame(np.random.randn(3, 2))
86.9 µs ± 5.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [4]: df = pd.DataFrame(np.random.randn(3, 2))
# creating from manager (with manager type matching default of "array")
In [5]: %timeit pd.DataFrame(df._mgr)
2.23 µs ± 465 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# specifing the manager -> don't look up the option, you would expect it to be faster (but not much difference)
In [7]: %timeit pd.DataFrame(df._mgr, manager="array")
1.95 µs ± 75 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: df = pd.DataFrame(np.random.randn(3, 2), manager="block")
In [9]: %timeit pd.DataFrame(df._mgr, manager="block")
2.29 µs ± 360 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [10]: pd.options.mode.data_manager = "block"
In [11]: %timeit pd.DataFrame(df._mgr)
2.33 µs ± 245 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [12]: %timeit pd.DataFrame(df._mgr, manager="block")
2.28 µs ± 246 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [13]: %timeit get_option("mode.data_manager")
1.79 µs ± 246 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) So while the In any case, I think it's only relevant for the case of passing a Manager, because once you pass any kind of data (like an array I did above), it takes much more time anyway. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, what I forgot in the explanation above, is that right now we don't check the option for the "fastpath" case of passing just a manager object. Only when data gets passed that needs to be converted to a manager (or when the manager needs to be reindexed), then the option is checked (and in this case the constructor takes a lot more time than checking the option anyway). |
||
|
||
if manager == "array" and not isinstance(mgr, ArrayManager): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why don't we just pass thru _as_manager always (and relocate it to a proper place) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. my point is to always use _as_manager which can construct things rather than doing this if/then here. this can be simply a module level constructor for both Block/Array manager i think. avoids bloat here in the constructor itself and pushes this to pandas/core/construction.py There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Updated to remove the if/else here, and defer to a function in |
||
# TODO proper initialization | ||
df = DataFrame(mgr, manager="block") | ||
arrays = [arr.copy() for arr in df._iter_column_arrays()] | ||
mgr = ArrayManager(arrays, [mgr.axes[1], mgr.axes[0]]) | ||
# TODO check for case of manager="block" but mgr is ArrayManager | ||
|
||
NDFrame.__init__(self, mgr) | ||
|
||
# ---------------------------------------------------------------------- | ||
|
@@ -638,6 +651,8 @@ def _is_homogeneous_type(self) -> bool: | |
... "B": np.array([1, 2], dtype=np.int64)})._is_homogeneous_type | ||
False | ||
""" | ||
if isinstance(self._mgr, ArrayManager): | ||
return False | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why is this different? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I need to return something here, as the code below relies on BlockManager internals. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add comments where appropriate to point out places where things are temporary/POC? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For this specific case, simply updated it to do the actual correct thing (check the number of unique types). I think other edits are either OK or already have a TODO note |
||
if self._mgr.any_extension_types: | ||
return len({block.dtype for block in self._mgr.blocks}) == 1 | ||
else: | ||
|
@@ -649,6 +664,8 @@ def _can_fast_transpose(self) -> bool: | |
""" | ||
Can we transpose this DataFrame without creating any new array objects. | ||
""" | ||
if isinstance(self._data, ArrayManager): | ||
return False | ||
if self._data.any_extension_types: | ||
# TODO(EA2D) special case would be unnecessary with 2D EAs | ||
return False | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -100,7 +100,7 @@ | |
from pandas.core.indexes.datetimes import DatetimeIndex | ||
from pandas.core.indexes.period import Period, PeriodIndex | ||
import pandas.core.indexing as indexing | ||
from pandas.core.internals import BlockManager | ||
from pandas.core.internals import ArrayManager, BlockManager | ||
from pandas.core.missing import find_valid_index | ||
from pandas.core.ops import _align_method_FRAME | ||
from pandas.core.shared_docs import _shared_docs | ||
|
@@ -197,7 +197,7 @@ class NDFrame(PandasObject, SelectionMixin, indexing.IndexingMixin): | |
_deprecations: FrozenSet[str] = frozenset(["get_values", "tshift"]) | ||
_metadata: List[str] = [] | ||
_is_copy = None | ||
_mgr: BlockManager | ||
_mgr: Union[BlockManager, ArrayManager] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should make this an actual type in _typing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. pls do this |
||
_attrs: Dict[Optional[Hashable], Any] | ||
_typ: str | ||
|
||
|
@@ -206,7 +206,7 @@ class NDFrame(PandasObject, SelectionMixin, indexing.IndexingMixin): | |
|
||
def __init__( | ||
self, | ||
data: BlockManager, | ||
data: Union[BlockManager, ArrayManager], | ||
copy: bool = False, | ||
attrs: Optional[Mapping[Optional[Hashable], Any]] = None, | ||
): | ||
|
@@ -223,7 +223,9 @@ def __init__( | |
object.__setattr__(self, "_flags", Flags(self, allows_duplicate_labels=True)) | ||
|
||
@classmethod | ||
def _init_mgr(cls, mgr, axes, dtype=None, copy: bool = False) -> BlockManager: | ||
def _init_mgr( | ||
cls, mgr, axes, dtype=None, copy: bool = False | ||
) -> Union[BlockManager, ArrayManager]: | ||
""" passed a manager and a axes dict """ | ||
for a, axe in axes.items(): | ||
if axe is not None: | ||
|
@@ -236,8 +238,9 @@ def _init_mgr(cls, mgr, axes, dtype=None, copy: bool = False) -> BlockManager: | |
mgr = mgr.copy() | ||
if dtype is not None: | ||
# avoid further copies if we can | ||
if len(mgr.blocks) > 1 or mgr.blocks[0].values.dtype != dtype: | ||
mgr = mgr.astype(dtype=dtype) | ||
# TODO | ||
# if len(mgr.blocks) > 1 or mgr.blocks[0].values.dtype != dtype: | ||
mgr = mgr.astype(dtype=dtype) | ||
return mgr | ||
|
||
# ---------------------------------------------------------------------- | ||
|
@@ -5372,6 +5375,8 @@ def _protect_consolidate(self, f): | |
Consolidate _mgr -- if the blocks have changed, then clear the | ||
cache | ||
""" | ||
if isinstance(self._mgr, ArrayManager): | ||
return f() | ||
blocks_before = len(self._mgr.blocks) | ||
result = f() | ||
if len(self._mgr.blocks) != blocks_before: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,6 @@ | ||
from collections import defaultdict | ||
import copy | ||
import itertools | ||
from typing import Dict, List | ||
|
||
import numpy as np | ||
|
@@ -26,7 +27,7 @@ | |
import pandas.core.algorithms as algos | ||
from pandas.core.arrays import ExtensionArray | ||
from pandas.core.internals.blocks import make_block | ||
from pandas.core.internals.managers import BlockManager | ||
from pandas.core.internals.managers import ArrayManager, BlockManager | ||
|
||
|
||
def concatenate_block_managers( | ||
|
@@ -46,6 +47,21 @@ def concatenate_block_managers( | |
------- | ||
BlockManager | ||
""" | ||
if isinstance(mgrs_indexers[0][0], ArrayManager): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. are we assuming that they are either all-ArrayManager or all-BlockManager? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, and this concat implementation right now is very limited in general (eg only the simple case without any reindexing needed, several tests are skipped because of this, I put several |
||
|
||
if concat_axis == 1: | ||
# TODO for now only fastpath without indexers | ||
mgrs = [t[0] for t in mgrs_indexers] | ||
arrays = [ | ||
np.concatenate([mgrs[i].arrays[j] for i in range(len(mgrs))]) | ||
for j in range(len(mgrs[0].arrays)) | ||
] | ||
return ArrayManager(arrays, [axes[1], axes[0]]) | ||
elif concat_axis == 0: | ||
mgrs = [t[0] for t in mgrs_indexers] | ||
arrays = list(itertools.chain.from_iterable([mgr.arrays for mgr in mgrs])) | ||
return ArrayManager(arrays, [axes[1], axes[0]]) | ||
|
||
concat_plans = [ | ||
_get_mgr_concatenation_plan(mgr, indexers) for mgr, indexers in mgrs_indexers | ||
] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think i'd be ok with ```flags`` being an added keyword to the constructor (and you can then make manager a flag)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is just for convenience (to easily change back and forth) I'd prefer a private method on DataFrame to change the manager (or return a new DataFrame with a new manager).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The keyword right now makes it a bit more convenient to pick a specific one in the tests, or to compare both versions side by side.
Eg instead of
you can do
if you want eg a certain test to only run using a specific manager, regardless of the global setting.
I fully agree we should be careful with making this a public keyword, but I think that also for internal use eg in tests, it would be good to have a convenient way to do this.
With a private method, you are thinking of a class method like
pd.DataFrame._construct_with_array_manager(...)
/pd.DataFrame._construct_with_block_manager(..)
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking
pd.DataFrame(...)._as_array_manager()
. But whatever is easiest for testing.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK. For testing that should also be fine (as long as it's not testing the constructor ;)). It will be less efficient as it's only converting to the other format after initial construction (which I do now anyway, but the goal is to fix that at some point of course), but for testing purposes that doesn't matter.
Could also be a
pd.DataFrame(..)._as_manager("block"/"array")
?Because we want to have both. Eg some test might be testing specifically aspects of block -based dataframe, so even when running the tests with global option to use array manager, that test should still use block manager.