-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
POC: ArrayManager -- array-based data manager for columnar store #36010
Changes from 30 commits
a51835b
591579b
896080a
f9c4dda
d18082a
cf3c07a
b252c6d
0fb645e
eb55fef
d241f31
47c3ee3
75f7de2
f36e395
be20816
8b7cc81
a239f50
a0ccf9a
dc1b190
55d38be
3dea0d7
1a61333
9751d33
e45b645
3749c7d
af53040
cc45673
27cf215
f6a97df
67c4c2b
670ed76
5128ad1
5c73688
a9a8c2d
c7898fb
3430307
1a30013
ef86b1e
c5548d9
afe8f80
b88c757
ddc51d0
9dc5600
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -137,7 +137,7 @@ | |
) | ||
from pandas.core.indexes.multi import MultiIndex, maybe_droplevels | ||
from pandas.core.indexing import check_bool_indexer, convert_to_index_sliceable | ||
from pandas.core.internals import BlockManager | ||
from pandas.core.internals import ArrayManager, BlockManager | ||
from pandas.core.internals.construction import ( | ||
arrays_to_mgr, | ||
dataclasses_to_dicts, | ||
|
@@ -513,7 +513,7 @@ def __init__( | |
if isinstance(data, DataFrame): | ||
data = data._mgr | ||
|
||
if isinstance(data, BlockManager): | ||
if isinstance(data, (BlockManager, ArrayManager)): | ||
if index is None and columns is None and dtype is None and copy is False: | ||
# GH#33357 fastpath | ||
NDFrame.__init__(self, data) | ||
|
@@ -613,8 +613,52 @@ def __init__( | |
values, index, columns, dtype=values.dtype, copy=False | ||
) | ||
|
||
manager = get_option("mode.data_manager") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you move this to the module level? otherwise this would be constantly checked There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No, because you need to be able to change the setting after initial import. |
||
|
||
if manager == "array" and not isinstance(mgr, ArrayManager): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why don't we just pass thru _as_manager always (and relocate it to a proper place) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. my point is to always use _as_manager which can construct things rather than doing this if/then here. this can be simply a module level constructor for both Block/Array manager i think. avoids bloat here in the constructor itself and pushes this to pandas/core/construction.py There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Updated to remove the if/else here, and defer to a function in |
||
# TODO proper initialization | ||
df = DataFrame(mgr) | ||
mgr = df._as_manager("array")._mgr | ||
# TODO check for case of manager="block" but mgr is ArrayManager | ||
|
||
NDFrame.__init__(self, mgr) | ||
|
||
def _as_manager(self, typ): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you import from construction from that do this here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Moved the bulk of the implementation to internals/construction.py |
||
""" | ||
Private helper function to create a DataFrame with specific manager. | ||
|
||
Parameters | ||
---------- | ||
mgr : {"block", "array"} | ||
jbrockmendel marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
Returns | ||
------- | ||
DataFrame | ||
New DataFrame using specified manager type. Is not guaranteed | ||
to be a copy or not. | ||
""" | ||
new_mgr: Union[BlockManager, ArrayManager] | ||
mgr = self._mgr | ||
if typ == "block": | ||
if isinstance(mgr, BlockManager): | ||
new_mgr = mgr | ||
else: | ||
new_mgr = arrays_to_mgr( | ||
mgr.arrays, mgr.axes[0], mgr.axes[1], mgr.axes[0], dtype=None | ||
) | ||
elif typ == "array": | ||
if isinstance(mgr, ArrayManager): | ||
new_mgr = mgr | ||
else: | ||
arrays = [arr.copy() for arr in self._iter_column_arrays()] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we require a specif ordering for the columns? (I think we actually do, namely 'F'), this copy preserves here which I don't think you actually want, you really do want to make sure these are 1-d and ordered There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is doing this no-copy not viable? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. right I think we actually need to copy these to coniguous 1d, but i would make this explict. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Indeed, generally a no-copy is not possible if you want a proper 1D array layout. |
||
new_mgr = ArrayManager(arrays, [mgr.axes[1], mgr.axes[0]]) | ||
else: | ||
raise ValueError( | ||
f"'typ' needs to be one of {{'block', 'array'}}, got '{type}'" | ||
) | ||
# fastpath of passing a manager doesn't check the option/manager class | ||
return DataFrame(new_mgr) | ||
|
||
# ---------------------------------------------------------------------- | ||
|
||
@property | ||
|
@@ -687,6 +731,8 @@ def _is_homogeneous_type(self) -> bool: | |
... "B": np.array([1, 2], dtype=np.int64)})._is_homogeneous_type | ||
False | ||
""" | ||
if isinstance(self._mgr, ArrayManager): | ||
return False | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why is this different? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I need to return something here, as the code below relies on BlockManager internals. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add comments where appropriate to point out places where things are temporary/POC? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For this specific case, simply updated it to do the actual correct thing (check the number of unique types). I think other edits are either OK or already have a TODO note |
||
if self._mgr.any_extension_types: | ||
return len({block.dtype for block in self._mgr.blocks}) == 1 | ||
else: | ||
|
@@ -697,6 +743,8 @@ def _can_fast_transpose(self) -> bool: | |
""" | ||
Can we transpose this DataFrame without creating any new array objects. | ||
""" | ||
if isinstance(self._mgr, ArrayManager): | ||
return False | ||
if self._mgr.any_extension_types: | ||
# TODO(EA2D) special case would be unnecessary with 2D EAs | ||
return False | ||
|
@@ -5480,7 +5528,7 @@ def sort_values( # type: ignore[override] | |
) | ||
|
||
if ignore_index: | ||
new_data.axes[1] = ibase.default_index(len(indexer)) | ||
new_data.set_axis(1, ibase.default_index(len(indexer))) | ||
jbrockmendel marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
result = self._constructor(new_data) | ||
if inplace: | ||
|
@@ -6025,7 +6073,10 @@ def _dispatch_frame_op(self, right, func, axis: Optional[int] = None): | |
# fails in cases with empty columns reached via | ||
# _frame_arith_method_with_reindex | ||
|
||
bm = self._mgr.operate_blockwise(right._mgr, array_op) | ||
# TODO operate_blockwise expects a manager of the same type | ||
bm = self._mgr.operate_blockwise( | ||
right._mgr, array_op # type: ignore[arg-type] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why are you adding the type ignore? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's related to the TODO: operate_blockwise complains about getting a Union[BlockManager, ArrayManager], while it either expects a BlockManger if it gets called from a BlockManager, or ArrayManager if it gets called from ArrayManager. That's something that needs to be fixed in general (hence the TODO), but as long as the actual issue is not solved, it's easier to add a type ignore. |
||
) | ||
return type(self)(bm) | ||
|
||
elif isinstance(right, Series) and axis == 1: | ||
|
@@ -8825,11 +8876,11 @@ def func(values: np.ndarray): | |
# We only use this in the case that operates on self.values | ||
return op(values, axis=axis, skipna=skipna, **kwds) | ||
|
||
def blk_func(values): | ||
def blk_func(values, axis=1): | ||
if isinstance(values, ExtensionArray): | ||
return values._reduce(name, skipna=skipna, **kwds) | ||
else: | ||
return op(values, axis=1, skipna=skipna, **kwds) | ||
return op(values, axis=axis, skipna=skipna, **kwds) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this stays 1 for BlockManager but becomes 0 for ArrayManager? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, the default stays the same. And for BlockManager, this default is used. But this way, it allows us to override it for the ArrayManager reduce method. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. does axis ever actually get passed to blk_func? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, as mentioned above, the ArrayManager.reduce method uses this |
||
|
||
def _get_data() -> DataFrame: | ||
if filter_type is None: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -102,7 +102,7 @@ | |
RangeIndex, | ||
ensure_index, | ||
) | ||
from pandas.core.internals import BlockManager | ||
from pandas.core.internals import ArrayManager, BlockManager | ||
from pandas.core.missing import find_valid_index | ||
from pandas.core.ops import align_method_FRAME | ||
from pandas.core.shared_docs import _shared_docs | ||
|
@@ -175,7 +175,7 @@ class NDFrame(PandasObject, SelectionMixin, indexing.IndexingMixin): | |
_hidden_attrs: FrozenSet[str] = frozenset(["get_values", "tshift"]) | ||
_metadata: List[str] = [] | ||
_is_copy = None | ||
_mgr: BlockManager | ||
_mgr: Union[BlockManager, ArrayManager] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should make this an actual type in _typing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. pls do this |
||
_attrs: Dict[Optional[Hashable], Any] | ||
_typ: str | ||
|
||
|
@@ -184,7 +184,7 @@ class NDFrame(PandasObject, SelectionMixin, indexing.IndexingMixin): | |
|
||
def __init__( | ||
self, | ||
data: BlockManager, | ||
data: Union[BlockManager, ArrayManager], | ||
copy: bool = False, | ||
attrs: Optional[Mapping[Optional[Hashable], Any]] = None, | ||
): | ||
|
@@ -201,7 +201,9 @@ def __init__( | |
object.__setattr__(self, "_flags", Flags(self, allows_duplicate_labels=True)) | ||
|
||
@classmethod | ||
def _init_mgr(cls, mgr, axes, dtype=None, copy: bool = False) -> BlockManager: | ||
def _init_mgr( | ||
cls, mgr, axes, dtype=None, copy: bool = False | ||
) -> Union[BlockManager, ArrayManager]: | ||
""" passed a manager and a axes dict """ | ||
for a, axe in axes.items(): | ||
if axe is not None: | ||
|
@@ -214,8 +216,9 @@ def _init_mgr(cls, mgr, axes, dtype=None, copy: bool = False) -> BlockManager: | |
mgr = mgr.copy() | ||
if dtype is not None: | ||
# avoid further copies if we can | ||
if len(mgr.blocks) > 1 or mgr.blocks[0].values.dtype != dtype: | ||
mgr = mgr.astype(dtype=dtype) | ||
# TODO | ||
# if len(mgr.blocks) > 1 or mgr.blocks[0].values.dtype != dtype: | ||
mgr = mgr.astype(dtype=dtype) | ||
return mgr | ||
|
||
# ---------------------------------------------------------------------- | ||
|
@@ -4529,11 +4532,11 @@ def sort_index( | |
new_data = self._mgr.take(indexer, axis=baxis, verify=False) | ||
|
||
# reconstruct axis if needed | ||
new_data.axes[baxis] = new_data.axes[baxis]._sort_levels_monotonic() | ||
new_data.set_axis(baxis, new_data.axes[baxis]._sort_levels_monotonic()) | ||
|
||
if ignore_index: | ||
axis = 1 if isinstance(self, ABCDataFrame) else 0 | ||
new_data.axes[axis] = ibase.default_index(len(indexer)) | ||
new_data.set_axis(axis, ibase.default_index(len(indexer))) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you comment on why these need to be changed? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In hindsight, it's probably not needed anymore. Originally, I changed the So with that change, I could revert those edits (I think, but can check that) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just checked this again, and so this change is still needed (without it I get failures for the array manager). But based on a quick check, this seems to be the only place where we assign to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. thanks for explaining. we should make another attempt to get axes out of FooManager before too long |
||
|
||
result = self._constructor(new_data) | ||
|
||
|
@@ -5506,6 +5509,8 @@ def _protect_consolidate(self, f): | |
Consolidate _mgr -- if the blocks have changed, then clear the | ||
cache | ||
""" | ||
if isinstance(self._mgr, ArrayManager): | ||
return f() | ||
blocks_before = len(self._mgr.blocks) | ||
result = f() | ||
if len(self._mgr.blocks) != blocks_before: | ||
|
@@ -5695,11 +5700,13 @@ def _to_dict_of_blocks(self, copy: bool_t = True): | |
Return a dict of dtype -> Constructor Types that | ||
each is a homogeneous dtype. | ||
|
||
Internal ONLY | ||
Internal ONLY - only works for BlockManager | ||
""" | ||
mgr = self._mgr | ||
mgr = cast(BlockManager, mgr) | ||
return { | ||
k: self._constructor(v).__finalize__(self) | ||
for k, v, in self._mgr.to_dict(copy=copy).items() | ||
for k, v, in mgr.to_dict(copy=copy).items() | ||
} | ||
|
||
def astype( | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these have a common base class?