Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: ArrayManager -- array-based data manager for columnar store #36010

Merged
merged 42 commits into from
Jan 13, 2021
Merged
Show file tree
Hide file tree
Changes from 36 commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
a51835b
POC: ArrayManager -- array-based data manager for columnar store
jorisvandenbossche Jun 1, 2020
591579b
Update with latest master + some fixes
jorisvandenbossche Aug 27, 2020
896080a
add pd.options.mode.data_manager to switch
jorisvandenbossche Sep 4, 2020
f9c4dda
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Sep 5, 2020
d18082a
add apply_with_block workaround
jorisvandenbossche Sep 5, 2020
cf3c07a
fix alignment in apply
jorisvandenbossche Sep 5, 2020
b252c6d
reorder methods to match BlockManager
jorisvandenbossche Sep 5, 2020
0fb645e
skip json tests for now
jorisvandenbossche Sep 5, 2020
eb55fef
skip more json tests + to_csv with to_native_types
jorisvandenbossche Sep 5, 2020
d241f31
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Sep 6, 2020
47c3ee3
support both ndarrays and ExtensionArrays
jorisvandenbossche Sep 17, 2020
75f7de2
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Sep 17, 2020
f36e395
add unstack
jorisvandenbossche Sep 17, 2020
be20816
fix native types, skip quantile, hdf, stata tests
jorisvandenbossche Sep 17, 2020
8b7cc81
remove skip in the benchmarks
jorisvandenbossche Sep 17, 2020
a239f50
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Sep 17, 2020
a0ccf9a
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Sep 22, 2020
dc1b190
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Oct 16, 2020
55d38be
remove manager keyword from DataFrame constructor, add _as_manager in…
jorisvandenbossche Oct 16, 2020
3dea0d7
move new ArrayManager code to separate file
jorisvandenbossche Oct 16, 2020
1a61333
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Nov 10, 2020
9751d33
de-privatize
jbrockmendel Nov 10, 2020
e45b645
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Dec 11, 2020
3749c7d
try fix up typing
jorisvandenbossche Dec 11, 2020
af53040
add pytest option + add one github actions build to run them
jorisvandenbossche Dec 11, 2020
cc45673
fix pytest marks for skipping when using array-manager
jorisvandenbossche Dec 12, 2020
27cf215
several fixes - get tests/frame/methods tests passing
jorisvandenbossche Dec 12, 2020
f6a97df
ci - only run the tests/frame/methods tests
jorisvandenbossche Dec 12, 2020
67c4c2b
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Dec 12, 2020
670ed76
mypy fix
jorisvandenbossche Dec 12, 2020
5128ad1
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Dec 18, 2020
5c73688
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Jan 8, 2021
a9a8c2d
move to internals/construction.py
jorisvandenbossche Jan 8, 2021
c7898fb
update for latest changes - fix tests/mypy
jorisvandenbossche Jan 8, 2021
3430307
fix todo
jorisvandenbossche Jan 8, 2021
1a30013
fix import in tests
jorisvandenbossche Jan 8, 2021
ef86b1e
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Jan 10, 2021
c5548d9
add union alias to typing
jorisvandenbossche Jan 10, 2021
afe8f80
updates based on review
jorisvandenbossche Jan 10, 2021
b88c757
skip json tests to avoid segfaults
jorisvandenbossche Jan 10, 2021
ddc51d0
Merge remote-tracking branch 'upstream/master' into array-manager
jorisvandenbossche Jan 12, 2021
9dc5600
fix for Label -> Hashable change in master
jorisvandenbossche Jan 12, 2021
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 19 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -132,3 +132,22 @@ jobs:
- name: Upload dev docs
run: rsync -az --delete doc/build/html/ docs@${{ secrets.server_ip }}:/usr/share/nginx/pandas/pandas-docs/dev
if: github.event_name == 'push'

data_manager:
name: Test experimental data manager
runs-on: ubuntu-latest
steps:

- name: Setting conda path
run: echo "${HOME}/miniconda3/bin" >> $GITHUB_PATH

- name: Checkout
uses: actions/checkout@v1

- name: Setup environment and build pandas
run: ci/setup_env.sh

- name: Run tests
run: |
source activate pandas-dev
pytest pandas/tests/frame/methods --array-manager
21 changes: 21 additions & 0 deletions pandas/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,19 @@ def pytest_addoption(parser):
action="store_true",
help="Fail if a test is skipped for missing data file.",
)
parser.addoption(
"--array-manager",
"--am",
action="store_true",
help="Use the experimental ArrayManager as default data manager.",
)


def pytest_sessionstart(session):
# Note: we need to set the option here and not in pytest_runtest_setup below
# to ensure this is run before creating fixture data
if session.config.getoption("--array-manager"):
pd.options.mode.data_manager = "array"


def pytest_runtest_setup(item):
Expand Down Expand Up @@ -1454,3 +1467,11 @@ def indexer_si(request):
Parametrize over __setitem__, iloc.__setitem__
"""
return request.param


@pytest.fixture
def using_array_manager(request):
"""
Fixture to check if the array manager is being used.
"""
return pd.options.mode.data_manager == "array"
6 changes: 6 additions & 0 deletions pandas/core/config_init.py
Original file line number Diff line number Diff line change
Expand Up @@ -483,6 +483,12 @@ def use_inf_as_na_cb(key):
cf.register_option(
"use_inf_as_null", False, use_inf_as_null_doc, cb=use_inf_as_na_cb
)
cf.register_option(
"data_manager",
"block",
"Internal data manager type",
validator=is_one_of_factory(["block", "array"]),
)

cf.deprecate_option(
"mode.use_inf_as_null", msg=use_inf_as_null_doc, rkey="mode.use_inf_as_na"
Expand Down
43 changes: 37 additions & 6 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -138,13 +138,14 @@
)
from pandas.core.indexes.multi import MultiIndex, maybe_droplevels
from pandas.core.indexing import check_bool_indexer, convert_to_index_sliceable
from pandas.core.internals import BlockManager
from pandas.core.internals import ArrayManager, BlockManager
from pandas.core.internals.construction import (
arrays_to_mgr,
dataclasses_to_dicts,
init_dict,
init_ndarray,
masked_rec_array_to_mgr,
mgr_to_mgr,
nested_data_to_arrays,
reorder_arrays,
sanitize_index,
Expand Down Expand Up @@ -524,7 +525,7 @@ def __init__(
if isinstance(data, DataFrame):
data = data._mgr

if isinstance(data, BlockManager):
if isinstance(data, (BlockManager, ArrayManager)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these have a common base class?

if index is None and columns is None and dtype is None and copy is False:
# GH#33357 fastpath
NDFrame.__init__(self, data)
Expand Down Expand Up @@ -602,8 +603,31 @@ def __init__(
values, index, columns, dtype=values.dtype, copy=False
)

# ensure correct Manager type according to settings
manager = get_option("mode.data_manager")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this to the module level? otherwise this would be constantly checked

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because you need to be able to change the setting after initial import.
Tom asked about performance impact of this above (#36010 (comment)), and I did some timings then. It doesn't seem to be a problem.

mgr = mgr_to_mgr(mgr, typ=manager)

NDFrame.__init__(self, mgr)

def _as_manager(self, typ: str) -> DataFrame:
"""
Private helper function to create a DataFrame with specific manager.

Parameters
----------
typ : {"block", "array"}

Returns
-------
DataFrame
New DataFrame using specified manager type. Is not guaranteed
to be a copy or not.
"""
new_mgr: Union[BlockManager, ArrayManager]
new_mgr = mgr_to_mgr(self._mgr, typ=typ)
# fastpath of passing a manager doesn't check the option/manager class
return DataFrame(new_mgr)

# ----------------------------------------------------------------------

@property
Expand Down Expand Up @@ -676,6 +700,8 @@ def _is_homogeneous_type(self) -> bool:
... "B": np.array([1, 2], dtype=np.int64)})._is_homogeneous_type
False
"""
if isinstance(self._mgr, ArrayManager):
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this different?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to return something here, as the code below relies on BlockManager internals.
So I could "replicate" the same logic but for the ArrayManager. But I think for POC-reasons I didn't yet do the effort up to now (it's also not used in many places, and it's also not causing any bugs, just a potentially missed optimization (but which is probably not yet implemented for ArrayManager anyway), don't really exactly remember it.
Now, it shouldn't be hard to add, just checking the length of the set of dtypes of columns instead of blocks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add comments where appropriate to point out places where things are temporary/POC?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this specific case, simply updated it to do the actual correct thing (check the number of unique types). I think other edits are either OK or already have a TODO note

if self._mgr.any_extension_types:
return len({block.dtype for block in self._mgr.blocks}) == 1
else:
Expand All @@ -686,6 +712,8 @@ def _can_fast_transpose(self) -> bool:
"""
Can we transpose this DataFrame without creating any new array objects.
"""
if isinstance(self._mgr, ArrayManager):
return False
if self._mgr.any_extension_types:
# TODO(EA2D) special case would be unnecessary with 2D EAs
return False
Expand Down Expand Up @@ -5507,7 +5535,7 @@ def sort_values( # type: ignore[override]
)

if ignore_index:
new_data.axes[1] = ibase.default_index(len(indexer))
new_data.set_axis(1, ibase.default_index(len(indexer)))
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved

result = self._constructor(new_data)
if inplace:
Expand Down Expand Up @@ -6052,7 +6080,10 @@ def _dispatch_frame_op(self, right, func, axis: Optional[int] = None):
# fails in cases with empty columns reached via
# _frame_arith_method_with_reindex

bm = self._mgr.operate_blockwise(right._mgr, array_op)
# TODO operate_blockwise expects a manager of the same type
bm = self._mgr.operate_blockwise(
right._mgr, array_op # type: ignore[arg-type]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you adding the type ignore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's related to the TODO: operate_blockwise complains about getting a Union[BlockManager, ArrayManager], while it either expects a BlockManger if it gets called from a BlockManager, or ArrayManager if it gets called from ArrayManager.

That's something that needs to be fixed in general (hence the TODO), but as long as the actual issue is not solved, it's easier to add a type ignore.

)
return type(self)(bm)

elif isinstance(right, Series) and axis == 1:
Expand Down Expand Up @@ -8897,11 +8928,11 @@ def func(values: np.ndarray):
# We only use this in the case that operates on self.values
return op(values, axis=axis, skipna=skipna, **kwds)

def blk_func(values):
def blk_func(values, axis=1):
if isinstance(values, ExtensionArray):
return values._reduce(name, skipna=skipna, **kwds)
else:
return op(values, axis=1, skipna=skipna, **kwds)
return op(values, axis=axis, skipna=skipna, **kwds)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this stays 1 for BlockManager but becomes 0 for ArrayManager?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the default stays the same. And for BlockManager, this default is used. But this way, it allows us to override it for the ArrayManager reduce method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does axis ever actually get passed to blk_func?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does axis ever actually get passed to blk_func?

Yes, as mentioned above, the ArrayManager.reduce method uses this


def _get_data() -> DataFrame:
if filter_type is None:
Expand Down
30 changes: 21 additions & 9 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -103,7 +103,7 @@
RangeIndex,
ensure_index,
)
from pandas.core.internals import BlockManager
from pandas.core.internals import ArrayManager, BlockManager
from pandas.core.missing import find_valid_index
from pandas.core.ops import align_method_FRAME
from pandas.core.shared_docs import _shared_docs
Expand Down Expand Up @@ -184,7 +184,7 @@ class NDFrame(PandasObject, SelectionMixin, indexing.IndexingMixin):
)
_metadata: List[str] = []
_is_copy = None
_mgr: BlockManager
_mgr: Union[BlockManager, ArrayManager]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should make this an actual type in _typing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls do this

_attrs: Dict[Optional[Hashable], Any]
_typ: str

Expand All @@ -193,7 +193,7 @@ class NDFrame(PandasObject, SelectionMixin, indexing.IndexingMixin):

def __init__(
self,
data: BlockManager,
data: Union[BlockManager, ArrayManager],
copy: bool = False,
attrs: Optional[Mapping[Optional[Hashable], Any]] = None,
):
Expand All @@ -210,7 +210,9 @@ def __init__(
object.__setattr__(self, "_flags", Flags(self, allows_duplicate_labels=True))

@classmethod
def _init_mgr(cls, mgr, axes, dtype=None, copy: bool = False) -> BlockManager:
def _init_mgr(
cls, mgr, axes, dtype=None, copy: bool = False
) -> Union[BlockManager, ArrayManager]:
""" passed a manager and a axes dict """
for a, axe in axes.items():
if axe is not None:
Expand All @@ -223,7 +225,13 @@ def _init_mgr(cls, mgr, axes, dtype=None, copy: bool = False) -> BlockManager:
mgr = mgr.copy()
if dtype is not None:
# avoid further copies if we can
if len(mgr.blocks) > 1 or mgr.blocks[0].values.dtype != dtype:
if (
isinstance(mgr, BlockManager)
and len(mgr.blocks) == 1
and mgr.blocks[0].values.dtype == dtype
):
pass
else:
mgr = mgr.astype(dtype=dtype)
return mgr

Expand Down Expand Up @@ -4547,11 +4555,11 @@ def sort_index(
new_data = self._mgr.take(indexer, axis=baxis, verify=False)

# reconstruct axis if needed
new_data.axes[baxis] = new_data.axes[baxis]._sort_levels_monotonic()
new_data.set_axis(baxis, new_data.axes[baxis]._sort_levels_monotonic())

if ignore_index:
axis = 1 if isinstance(self, ABCDataFrame) else 0
new_data.axes[axis] = ibase.default_index(len(indexer))
new_data.set_axis(axis, ibase.default_index(len(indexer)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you comment on why these need to be changed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In hindsight, it's probably not needed anymore. Originally, I changed the axes attribute on ArrayManager to be (index, columns) order. But afterwards, I changed this to be stored in _axes, and having an axes property that switches the internal order, to match the "public" interface of BlockManager.

So with that change, I could revert those edits (I think, but can check that)

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Jan 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checked this again, and so this change is still needed (without it I get failures for the array manager).
The reason is that mgr.axes is no longer the actual "storage" of the axes for the ArrayManager, but is there stored in _axes. And because mgr.axes returns a list, it's actually updating the list in place, and not the actual storage. Which is a bit of a gotcha now with the manager's axes attribute.

But based on a quick check, this seems to be the only place where we assign to mgr.axes[i] = ..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for explaining. we should make another attempt to get axes out of FooManager before too long


result = self._constructor(new_data)

Expand Down Expand Up @@ -5524,6 +5532,8 @@ def _protect_consolidate(self, f):
Consolidate _mgr -- if the blocks have changed, then clear the
cache
"""
if isinstance(self._mgr, ArrayManager):
return f()
blocks_before = len(self._mgr.blocks)
result = f()
if len(self._mgr.blocks) != blocks_before:
Expand Down Expand Up @@ -5713,11 +5723,13 @@ def _to_dict_of_blocks(self, copy: bool_t = True):
Return a dict of dtype -> Constructor Types that
each is a homogeneous dtype.

Internal ONLY
Internal ONLY - only works for BlockManager
"""
mgr = self._mgr
mgr = cast(BlockManager, mgr)
return {
k: self._constructor(v).__finalize__(self)
for k, v, in self._mgr.to_dict(copy=copy).items()
for k, v, in mgr.to_dict(copy=copy).items()
}

def astype(
Expand Down
6 changes: 4 additions & 2 deletions pandas/core/groupby/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -1085,10 +1085,12 @@ def py_fallback(bvalues: ArrayLike) -> ArrayLike:
# in the operation. We un-split here.
result = result._consolidate()
assert isinstance(result, (Series, DataFrame)) # for mypy
assert len(result._mgr.blocks) == 1
mgr = result._mgr
assert isinstance(mgr, BlockManager)
assert len(mgr.blocks) == 1

# unwrap DataFrame to get array
result = result._mgr.blocks[0].values
result = mgr.blocks[0].values
return result

def blk_func(bvalues: ArrayLike) -> ArrayLike:
Expand Down
4 changes: 4 additions & 0 deletions pandas/core/internals/__init__.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
from pandas.core.internals.array_manager import ArrayManager
from pandas.core.internals.base import DataManager
from pandas.core.internals.blocks import ( # io.pytables, io.packers
Block,
BoolBlock,
Expand Down Expand Up @@ -35,6 +37,8 @@
"TimeDeltaBlock",
"safe_reshape",
"make_block",
"DataManager",
"ArrayManager",
"BlockManager",
"SingleBlockManager",
"concatenate_block_managers",
Expand Down
Loading