Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proof of concept for Copy-on-Write implementation #41878

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
462526e
POC for Copy-on-Write
jorisvandenbossche Apr 29, 2021
41ee2b7
clean-up ArrayManager.copy implementation
jorisvandenbossche Jun 8, 2021
7a8dffc
add some comments / docstrings
jorisvandenbossche Jun 8, 2021
1c964be
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Jun 25, 2021
96b6d71
fix series slice + del operator
jorisvandenbossche Jun 25, 2021
17cedb9
fix pickle and column_setitem with dtype changes
jorisvandenbossche Jun 25, 2021
f4614c2
fix setitem on single column for full slice + update frame tests
jorisvandenbossche Jun 25, 2021
81b09c2
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Jun 26, 2021
7f183de
fix ref tracking in slicing columns
jorisvandenbossche Jun 26, 2021
693bc4f
fix series setitem -> don't update parent cache
jorisvandenbossche Jun 28, 2021
a154591
update indexing tests with new behaviour to get them passing
jorisvandenbossche Jun 28, 2021
71370c4
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Jul 12, 2021
4e785d7
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Aug 4, 2021
6741340
fix new test + skip Series.update tests for now
jorisvandenbossche Aug 4, 2021
c4527e9
remove chained assignment outside indexing tests
jorisvandenbossche Aug 4, 2021
b33aaf2
fix DataFrame.fillna + more chained setitem
jorisvandenbossche Aug 4, 2021
9fdad69
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Aug 7, 2021
f5da03f
reindex/take on columns to use copy-on-write
jorisvandenbossche Aug 8, 2021
c632757
fix some typing issues
jorisvandenbossche Aug 8, 2021
e4a5f33
fix dtype in test for windows
jorisvandenbossche Aug 8, 2021
5efad50
fix dtype in test for windows
jorisvandenbossche Aug 8, 2021
8ea48cc
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Aug 9, 2021
79b7a30
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Oct 11, 2021
cf5c7e2
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Nov 11, 2021
297662a
[ArrayManager] Array version of putmask logic
jorisvandenbossche Nov 11, 2021
a011a06
fixup copy-on-write in putmask
jorisvandenbossche Nov 11, 2021
cc09001
Update BlockManager expected results after recent behaviour changes
jorisvandenbossche Nov 11, 2021
37e7ce0
limit changes to putmask for this PR
jorisvandenbossche Nov 22, 2021
e34b9b2
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Nov 22, 2021
64377e0
Merge remote-tracking branch 'upstream/master' into am-experiment-cow
jorisvandenbossche Nov 26, 2021
0f28095
address feedback
jorisvandenbossche Nov 26, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 12 additions & 5 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -3967,8 +3967,15 @@ def _set_value(
"""
try:
if takeable:
series = self._ixs(col, axis=1)
series._set_value(index, value, takeable=True)
if isinstance(self._mgr, ArrayManager):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you not do this in internals? hate leaking ArrayManager semantics here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I added here is actually to call into the internals to do it, instead of the current frame-level methods.
It's only possible for ArrayManager, though (see https://github.com/pandas-dev/pandas/pull/41878/files#r757450838 just below), which is the reason I added a check.

# with CoW, we can't use intermediate series
# with takeable=True, we know that index is positional and
# not a generic hashable label
index = cast(int, index)
self._mgr.column_setitem(col, index, value)
else:
series = self._ixs(col, axis=1)
series._set_value(index, value, takeable=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we go down this path, we should make a BlockManager.column_setitem

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have to look into, but this might not be that straightforward, since the current path also deals with updating the cache etc

return

series = self._get_item_cache(col)
Expand Down Expand Up @@ -4900,7 +4907,7 @@ def set_axis(self, labels, axis: Axis = 0, inplace: bool = False):
"labels",
[
("method", None),
("copy", True),
("copy", None),
("level", None),
("fill_value", np.nan),
("limit", None),
Expand Down Expand Up @@ -5084,7 +5091,7 @@ def rename(
index: Renamer | None = None,
columns: Renamer | None = None,
axis: Axis | None = None,
copy: bool = True,
copy: bool | None = None,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copy=None is what changes DataFrame.rename to use a shallow copy with CoW instead of doing a full copy for the ArrayManager.
This is eventually passed to self.copy(deep=copy), and deep=None is used to signal a shallow copy for ArrayManager while for now preserving the deep copy for BlockManager.

inplace: bool = False,
level: Level | None = None,
errors: str = "ignore",
Expand Down Expand Up @@ -5900,7 +5907,7 @@ class max type
if inplace:
new_obj = self
else:
new_obj = self.copy()
new_obj = self.copy(deep=None)

new_index = default_index(len(new_obj))
if level is not None:
Expand Down
12 changes: 6 additions & 6 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -996,7 +996,7 @@ def rename(
index: Renamer | None = None,
columns: Renamer | None = None,
axis: Axis | None = None,
copy: bool_t = True,
copy: bool_t | None = None,
inplace: bool_t = False,
level: Level | None = None,
errors: str = "ignore",
Expand Down Expand Up @@ -3952,6 +3952,8 @@ def _check_setitem_copy(self, t="setting", force=False):
df.iloc[0:5]['group'] = 'a'

"""
if isinstance(self._mgr, (ArrayManager, SingleArrayManager)):
return
# return early if the check is not needed
if not (force or self._is_copy):
return
Expand Down Expand Up @@ -4906,7 +4908,7 @@ def reindex(self: NDFrameT, *args, **kwargs) -> NDFrameT:
axes, kwargs = self._construct_axes_from_arguments(args, kwargs)
method = missing.clean_reindex_fill_method(kwargs.pop("method", None))
level = kwargs.pop("level", None)
copy = kwargs.pop("copy", True)
copy = kwargs.pop("copy", None)
limit = kwargs.pop("limit", None)
tolerance = kwargs.pop("tolerance", None)
fill_value = kwargs.pop("fill_value", None)
Expand All @@ -4931,9 +4933,7 @@ def reindex(self: NDFrameT, *args, **kwargs) -> NDFrameT:
for axis, ax in axes.items()
if ax is not None
):
if copy:
return self.copy()
return self
return self.copy(deep=copy)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with copy=False we no longer get a new object. i generally like this, but its an API change

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point. I shouldn't change this in this PR (or at least not for BlockManager-based dataframe). Now for the CoW I need to pass down the copy=None, so that will unfortunately require a check for AM vs BM then.


# check if we are a multi reindex
if self._needs_reindex_multi(axes, method, level):
Expand Down Expand Up @@ -5895,7 +5895,7 @@ def astype(
return cast(NDFrameT, result)

@final
def copy(self: NDFrameT, deep: bool_t = True) -> NDFrameT:
def copy(self: NDFrameT, deep: bool_t | None = True) -> NDFrameT:
"""
Make a copy of this object's indices and data.

Expand Down
11 changes: 11 additions & 0 deletions pandas/core/indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -1840,6 +1840,17 @@ def _setitem_single_column(self, loc: int, value, plane_indexer):
"""
pi = plane_indexer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#42887 would make for a good precursor

if not hasattr(self.obj._mgr, "blocks"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be an ArrayManager test? why the different semantics?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a check to know if it is an ArrayManager, without actually importing it (core/indexing.py currently doesn't import anything from internals).
I think in another PR that might not have been merged, I added an attribute on the DataFrame that returns True/False for this, which might be easier to reuse in multiple places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yah we should for sure have 1 canonical way of doing this check so we can grep for all the places we do it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> #44676

# ArrayManager: in this case we cannot rely on getting the column
# as a Series to mutate, but need to operated on the mgr directly
if com.is_null_slice(pi) or com.is_full_slice(pi, len(self.obj)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref #44353 (doesn't need to be resolved for this PR, but will make lots of things easier)

arr = self.obj._sanitize_column(value)
self.obj._mgr.iset(loc, arr)
else:
self.obj._mgr.column_setitem(loc, plane_indexer, value)
self.obj._clear_item_cache()
return

ser = self.obj._ixs(loc, axis=1)

# perform the equivalent of a setitem on the info axis
Expand Down
Loading