Proof of concept for Copy-on-Write implementation #41878

jorisvandenbossche · 2021-06-08T19:09:17Z

An experiment to implement one of the proposals discussed in #36195, described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit

This PR adds Copy-on-Write (CoW) functionality to the DataFrame/Series when using ArrayManager.
It does this by adding a new .refs attribute to the ArrayManager that, if populated, keeps a list of weak references (one per column, so len(mgr.arrays) == len(mgr.refs)) to the arrays it is viewing.
This ensures that when we modify an array of a child manager, we can check whether it references (views) another array and, if needed, do a copy on write. Likewise, when we modify an array of a parent manager, we can check whether that array is referenced by another array and, if needed, do a copy on write in the parent frame. (Of course, a manager can be both parent and child at the same time, so both checks always happen.)
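
To make the mechanism above concrete, here is a minimal, self-contained sketch of weakref-based reference tracking. It is not the code from this PR: Column and ToyManager are hypothetical stand-ins (a small wrapper class is used because plain lists cannot be weakly referenced), and using weakref.getweakrefcount for the parent-side check is an assumption made for illustration.

```python
import weakref


class Column:
    """Hypothetical stand-in for a 1D array."""

    def __init__(self, data):
        self.data = data

    def view(self):
        # A new object that shares the same underlying buffer.
        return Column(self.data)

    def copy(self):
        # A new object with its own buffer.
        return Column(list(self.data))


class ToyManager:
    """Hypothetical manager: one Column per column, plus optional refs."""

    def __init__(self, arrays, refs=None):
        self.arrays = arrays
        self.refs = refs  # None, or one weakref (or None) per column

    def shallow_copy(self):
        # The child views every parent column and remembers what it views.
        views = [arr.view() for arr in self.arrays]
        refs = [weakref.ref(arr) for arr in self.arrays]
        return ToyManager(views, refs)

    def _has_no_reference(self, i):
        # Child-side check: are we viewing a parent column that is still alive?
        ref = None if self.refs is None else self.refs[i]
        is_child = ref is not None and ref() is not None
        # Parent-side check: does anything hold a weakref to our column?
        is_parent = weakref.getweakrefcount(self.arrays[i]) > 0
        return not (is_child or is_parent)

    def set_value(self, i, row, value):
        if not self._has_no_reference(i):
            # Copy-on-Write: copy the column being modified, clear its reference.
            self.arrays[i] = self.arrays[i].copy()
            if self.refs is not None:
                self.refs[i] = None
        self.arrays[i].data[row] = value


parent = ToyManager([Column([1, 2, 3]), Column([4, 5, 6])])
child = parent.shallow_copy()
child.set_value(0, 0, 100)   # child copies its column 0 before writing
parent.set_value(1, 0, 400)  # parent copies its column 1 before writing
```

In this sketch it is always the manager doing the write that copies, so neither mutation is visible through the other object.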

A very brief summary of the behaviour you get (longer description at #36195 (comment)):

Any subset (so also a slice, single column access, etc) uses CoW (or is already a copy)
DataFrame methods that return a new DataFrame return shallow copies (using CoW) if applicable (for this POC I implemented that for reset_index and rename to test, needs to be expanded to other methods)

I added a test_copy_on_write.py which already tests a set of cases (which of course needs to be expanded). Going through that file should give an idea of the kind of behaviours (and how they change compared to current master/BlockManager).

(I only added new, targeted tests in that file, I didn't yet start updating existing tests, as I imagine there will be quite a lot)

This is a proof-of-concept PR, so the feedback I am looking for / what I want to get out of it:

A concrete starting point for an implementation, to stir the discussion on this topic in #36195 ("Proposal for future copy / view semantics in indexing operations"), and a concrete implementation you can play with, to get feedback on the proposed semantics
Review of the actual implementation using weakrefs (it's quite simple at the moment, but not too simple? Will this be robust enough?)
Other corner cases that should certainly be tested
...

And to be clear, not everything is covered yet (there are some # TODO comments in array_manager.py, but I am only fixing those while also adding tests that require it)

TODO:

Series and DataFrame constructors

cc @pandas-dev/pandas-core

jorisvandenbossche · 2021-06-08T19:12:57Z

pandas/core/frame.py

@@ -4930,7 +4930,7 @@ def rename(
        index: Renamer | None = None,
        columns: Renamer | None = None,
        axis: Axis | None = None,
-        copy: bool = True,
+        copy: bool | None = None,


The copy=None is what changes DataFrame.rename to use a shallow copy with CoW instead of doing a full copy for the ArrayManager.
This is eventually passed to self.copy(deep=copy), and deep=None is used to signal a shallow copy for ArrayManager while for now preserving the deep copy for BlockManager.
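
As a rough illustration of the tri-state flag described above, the following sketch maps deep=None to a concrete value depending on the manager in use. The helper name resolve_deep and its using_array_manager parameter are hypothetical and are not part of pandas or of this PR.

```python
def resolve_deep(deep, using_array_manager):
    """Map the tri-state deep/copy flag to a concrete bool (illustrative only)."""
    if deep is None:
        # None means "use the default for this manager":
        # a shallow (CoW) copy for ArrayManager, a full copy for BlockManager.
        return not using_array_manager
    # An explicit True/False from the caller is always respected.
    return bool(deep)


array_manager_default = resolve_deep(None, using_array_manager=True)
block_manager_default = resolve_deep(None, using_array_manager=False)
```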

jbrockmendel · 2021-06-09T21:42:49Z

pandas/core/indexing.py

@@ -1869,6 +1869,12 @@ def _setitem_single_column(self, loc: int, value, plane_indexer):
        """
        pi = plane_indexer

+        if not getattr(self.obj._mgr, "blocks", False):


can you use explicit isinstance; e.g. at first glance empty blocks would go through here

Ah, I should have used hasattr instead of getattr here (which would avoid that empty block case).

pandas.core.indexing currently doesn't import from internals, which I suppose is the reason that I used this implicit check instead of an explicit isinstance (having it as a property on the object, which I think was added in some other (open) PR, might also help for this case)
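
The difference between the two implicit checks can be shown with a small sketch. FakeBlockManager and FakeArrayManager are hypothetical stand-ins, not the pandas classes; the point is only that getattr with a falsy default cannot tell "no blocks attribute" apart from "an empty blocks attribute".

```python
class FakeBlockManager:
    """Hypothetical stand-in that has a .blocks attribute."""

    def __init__(self, blocks):
        self.blocks = blocks


class FakeArrayManager:
    """Hypothetical stand-in that has no .blocks attribute."""

    def __init__(self, arrays):
        self.arrays = arrays


def looks_like_array_manager_getattr(mgr):
    # Also True for a block manager whose blocks are empty (falsy).
    return not getattr(mgr, "blocks", False)


def looks_like_array_manager_hasattr(mgr):
    # Only True when the attribute is missing altogether.
    return not hasattr(mgr, "blocks")


empty_bm = FakeBlockManager(blocks=())
am = FakeArrayManager(arrays=[])
```

With these stand-ins, the getattr version misclassifies empty_bm as an ArrayManager, while the hasattr version does not.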

jbrockmendel · 2021-06-10T16:37:30Z

Does this implementation have the shallow-copies issue discussed here #36195 (comment) ?

jorisvandenbossche · 2021-06-10T17:05:04Z

Yes or no, depending on how to interpret the question ;)
It doesn't have the issue, because it doesn't allow shallow copies in the current sense. Or to put it otherwise, a shallow copy (i.e. copy(deep=False)) effectively is a "CoW shallow copy" in the PR's implementation, meaning that once you modify it, data gets copied and mutations don't propagate back to the parent dataframe. If you, for some reason, want the same shallow copy semantics as we have now (i.e. mutations will propagate), then of course that's an "issue". But as mentioned in #36195 (comment), I am not really sure it is a problem that we won't have this anymore.

jbrockmendel · 2021-07-11T22:18:54Z

pandas/core/indexing.py

@@ -1894,6 +1894,16 @@ def _setitem_single_column(self, loc: int, value, plane_indexer):
        """
        pi = plane_indexer

+        if not hasattr(self.obj._mgr, "blocks"):
+            # ArrayManager


comment on why this is AM-specific?

jbrockmendel · 2021-07-12T01:10:24Z

pandas/core/internals/managers.py

@@ -351,8 +351,12 @@ def where(self: T, other, cond, align: bool, errors: str) -> T:
            errors=errors,
        )

-    def setitem(self: T, indexer, value) -> T:
-        return self.apply("setitem", indexer=indexer, value=value)
+    def setitem(self: T, indexer, value, inplace=False) -> T:


why is this change necessary?

To follow the change that was needed in the ArrayManager (to ensure every mutation of the values happens in the manager). See #41879, which splits this change out as a separate precursor PR.

jbrockmendel · 2021-07-12T01:16:34Z

Does writing to the parent trigger a copy in the child?

Update: i see in the doc it says it does (Propagating Mutation Forwards), but that doesn't appear to be implemented yet in this branch

$ hub pr checkout 41878
$ export PANDAS_DATA_MANAGER=array
$ python3

import pandas as pd

ser = pd.Series(range(10))
ser2 = ser.view("m8[ns]")

ser.iloc[0] = 10000

>>> ser2.view("i8")[0]
10000

Not sure if this is a bug or just not yet implemented in this branch.

jbrockmendel · 2021-07-12T01:18:58Z

Is there any prospect of making this independent of AM/BM?

jorisvandenbossche · 2021-07-12T07:00:05Z

Does writing to the parent trigger a copy in the child?

It triggers a copy, but in the current implementation it's always the values that are being modified that get copied. So in this case, it would trigger a copy in the parent (while still preserving the behaviour that mutations are not propagated).
(I think this is the easiest, as there can potentially be many children and children of children etc)

but that doesn't appear to be implemented yet in this branch

Yes, I didn't yet check the view() method (which is actually using the Series constructor, which I didn't yet update to track a reference to the input array). If you use a viewing indexing method for the example instead (e.g. s2 = s[:], which of course doesn't change the dtype), it works as expected:

In [9]: pd.options.mode.data_manager = "array"

In [10]: ser = pd.Series(range(5))
    ...: ser2 = ser[:]

In [11]: ser.iloc[0] = 10000

In [12]: ser2
Out[12]: 
0    0
1    1
2    2
3    3
4    4
dtype: int64

Is there any prospect of making this independent of AM/BM?

Yes, I don't think there is anything in the current implementation that ties it to the ArrayManager, and it should be possible to write a similar BlockManager version (it could have the logic of checking for weakref references / triggering a copy if needed on the Blocks). I only expect it to be a bit more complicated, at least if we want to ensure that modifying a single column doesn't necessarily trigger a copy of the full dataframe with consolidated blocks.
Even if we would move to the ArrayManager long term, we would still basically need some implementation of it in the BlockManager anyway, to be able to raise FutureWarnings about when behaviour would change (although in this case, the implementation can be simpler, as it wouldn't need the performance considerations of not copying a full block).

But, before doing this in practice, I would first want to see 1) more discussion on the actual semantics (and some more explicit agreement on that we want this), 2) more review of the implementation details (using weakrefs etc) to ensure this POC is actually possible (as this would be similar for the BM version) and 3) some discussion on how we would see an upgrade path (what kind of version bump, how do we want to provide future warnings etc)

jbrockmendel · 2021-07-12T20:53:04Z

It triggers a copy, but in the current implementation it's always the values that are being modified that get copied. So in this case, it would trigger a copy in the parent [...] (I think this is the easiest

Thanks, that makes sense. Will have to give some thought to the implications of this.

it should be possible to write a similar BlockManager version [...] But, before doing this in practice, I would first want to see [...]

Makes sense.

jbrockmendel · 2021-07-12T21:03:22Z

How does this behave w/r/t frame.values? for consolidated (1 column for AM) cases this will be a view on the underlying array, but a different object.

(also Series.to_frame, but that already behaves differently for ArrayManager xref #42512)

jbrockmendel · 2021-07-13T18:21:40Z

was the as_mutable_view idea intended to be part of this? IIUC this would mean that basically nothing is ever "just a view"?

jorisvandenbossche · 2021-07-15T13:43:34Z

was the as_mutable_view idea intended to be part of this?

No, that was part of the original proposal / discussion, but I didn't include it in my updated proposal focusing on Copy-on-Write. I also don't directly see how it could be implemented with reliable semantics across all corner cases (or at least without a complex implementation).

For example, let's assume you have a df and two viewing "child" objects df_child1 and df_child2. If you now turn df_child1 into a "mutable view", how does that affect df_child2? Mutating df_child1 is then supposed to also mutate the parent df, but still not df_child2, in which case a mutation of df_child1 would need to trigger a copy of the internal data stored in df_child2. So then you need to recursively check children of children etc. to update those ...

IIUC this would mean that basically nothing is ever "just a view"?

Indeed. Either you have identical objects (df1 is df2), and then obviously mutating one is reflected in the other (since they are just two names pointing to the same object, e.g. from doing df2 = df1), or they are not identical objects, and then it always uses copy-on-write in case the data are views (meaning it will never act as a view for the user).

jbrockmendel · 2021-11-24T23:23:45Z

pandas/core/indexing.py

+        if not hasattr(self.obj._mgr, "blocks"):
+            # ArrayManager: in this case we cannot rely on getting the column
+            # as a Series to mutate, but need to operate on the mgr directly
+            if com.is_null_slice(pi) or com.is_full_slice(pi, len(self.obj)):


xref #44353 (doesn't need to be resolved for this PR, but will make lots of things easier)

jbrockmendel · 2021-11-24T23:24:54Z

pandas/core/indexing.py

@@ -1840,6 +1840,17 @@ def _setitem_single_column(self, loc: int, value, plane_indexer):
        """
        pi = plane_indexer



#42887 would make for a good precursor

jbrockmendel · 2021-11-24T23:28:02Z

pandas/core/internals/array_manager.py

+        for i in range(len(self.arrays)):
+            if not self._has_no_reference(i):
+                # if being referenced -> perform Copy-on-Write and clear the reference
+                self.arrays[i] = self.arrays[i].copy()


this is copying more than is strictly necessary (which may be harmless). to be precise would need to examine the mask and exclude columns for which not mask[i].any()

That's one of the reasons I first wanted to refactor putmask here (#44396). But as long as we are using apply_with_block here, I am not going to bother looking into such an optimization (it would require part of the changes in #44396 anyway to split up the mask like that)

(it's also not a super important optimization to have in a first PR I think)
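
For reference, the more precise behaviour suggested in this thread (copy a column only if it is referenced and the mask actually touches it) could look roughly like the sketch below. It uses plain Python lists instead of arrays, and the function name and both parameters are hypothetical, not taken from the PR.

```python
def columns_needing_copy(mask_columns, is_referenced):
    """Return the indices of columns that are referenced AND modified by the mask.

    mask_columns: one list of booleans per column (True where a value is set).
    is_referenced: one boolean per column (True if the column is viewed or viewing).
    """
    return [
        i
        for i, (col_mask, referenced) in enumerate(zip(mask_columns, is_referenced))
        if referenced and any(col_mask)
    ]


mask_columns = [
    [False, False, False],  # column 0: referenced but untouched, no copy needed
    [False, True, False],   # column 1: referenced and modified, copy needed
    [True, True, False],    # column 2: modified but not referenced, no copy needed
]
is_referenced = [True, True, False]
to_copy = columns_needing_copy(mask_columns, is_referenced)
```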

jorisvandenbossche · 2021-11-26T13:10:31Z

@jbrockmendel thanks for the review!

jreback

so a few high level / general comments

this is leaking ArrayManager into the indexing at a high level (e.g you have to check this in indexing code), ideally this can be avoided
would like to break off parts of this that are pure refactorings as precursor PRs (as @jbrockmendel suggested)
you are hijacking the deep=None semantics to do this. we really don't like this keyword anyhow, but either can you add a new value, idk, maybe deep='cow' (not sure this is better) but maybe more obvious and readable. or I think @jbrockmendel suggested an enum (let's make it internal atm to signal this).
this is pretty convoluted when you have a method like .reset_index() that only has an inplace keyword (and not copy). i would say we should either: deprecate inplace and replace with copy first.
this is adding an enormous amount of complexity to the code (likely hiding lots of cases).. not sure what to do about this, just noting it.

jreback · 2021-11-28T20:35:49Z

pandas/core/frame.py

@@ -3967,8 +3967,15 @@ def _set_value(
        """
        try:
            if takeable:
-                series = self._ixs(col, axis=1)
-                series._set_value(index, value, takeable=True)
+                if isinstance(self._mgr, ArrayManager):


can you not do this in internals? hate leaking ArrayManager semantics here

What I added here is actually to call into the internals to do it, instead of the current frame-level methods.
It's only possible for ArrayManager, though (see https://github.com/pandas-dev/pandas/pull/41878/files#r757450838 just below), which is the reason I added a check.

jreback · 2021-11-28T20:37:42Z

pandas/core/indexing.py

@@ -1840,6 +1840,17 @@ def _setitem_single_column(self, loc: int, value, plane_indexer):
        """
        pi = plane_indexer

+        if not hasattr(self.obj._mgr, "blocks"):


shouldn't this be an ArrayManager test? why the different semantics?

This is just a check to know if it is an ArrayManager, without actually importing it (core/indexing.py currently doesn't import anything from internals).
I think in another PR that might not have been merged, I added an attribute on the DataFrame that returns True/False for this, which might be easier to reuse in multiple places.

yah we should for sure have 1 canonical way of doing this check so we can grep for all the places we do it

jorisvandenbossche · 2021-11-29T08:57:29Z

Thanks for the review!

you are hijacking the deep=None semantics to do this. we really don't like this keyword anyhow, but either can you add a new value, idk, maybe deep='cow' (not sure this is better) but maybe more obvious and readable. or I think @jbrockmendel suggested an enum (let's make it internal atm to signal this).

I don't think that deep="cow" is any better or different, both are a new value that can be passed to deep. Given that eventually, the deep=None could become deep=False again (now the deep=None is only meant as a keyword that means False for ArrayManager and True for BlockManager, to preserve the BlockManager behaviour on master), I would prefer to keep the None instead of some other sentinel (we generally use None elsewhere in cases the default depends on some other context)

this is pretty convoluted when you have a method like .reset_index() that only has an inplace keyword (and not copy). i would say we should either: deprecate inplace and replace with copy first.

What do you mean exactly by "convoluted" in this case? For the case where the user doesn't specify inplace=True manually (i.e. for the default usage of reset_index()), it is easy to add CoW behaviour to reset_index() without it having a copy keyword.
For the inplace=True case, it's a bit less clear, but short term we can probably continue what we do now: modify the object in place (i.e. equivalent to doing a setitem, df["index"] = ..)

this is adding an enormous amount of complexity to the code (likely hiding lots of cases).. not sure what to do about this, just noting it.

I think it's actually quite little code for such a big feature :) (the majority of the diff here is new/adapted tests). Now there are still many cases not covered, so it will of course still grow a bit.

jbrockmendel · 2022-11-30T02:34:09Z

@jorisvandenbossche closable?

mroeschke · 2022-12-17T19:16:13Z

Yeah, since CoW is in 1.5, albeit only for BlockManager, I think we can close this for now until ArrayManager work is picked back up. Closing

jorisvandenbossche added 2 commits June 8, 2021 20:31

POC for Copy-on-Write

462526e

clean-up ArrayManager.copy implementation

41ee2b7

jorisvandenbossche added the Needs Discussion and Copy / view semantics labels Jun 8, 2021

jorisvandenbossche mentioned this pull request Jun 8, 2021

Proposal for future copy / view semantics in indexing operations #36195

Closed

jorisvandenbossche commented Jun 8, 2021

add some comments / docstrings

7a8dffc

jorisvandenbossche mentioned this pull request Jun 8, 2021

REF: avoid mutating Series._values directly in setitem but defer to Manager method #41879

Merged

jbrockmendel reviewed Jun 9, 2021

jorisvandenbossche added 8 commits June 25, 2021 08:14

Merge remote-tracking branch 'upstream/master' into am-experiment-cow

1c964be

fix series slice + del operator

96b6d71

fix pickle and column_setitem with dtype changes

17cedb9

fix setitem on single column for full slice + update frame tests

f4614c2

Merge remote-tracking branch 'upstream/master' into am-experiment-cow

81b09c2

fix ref tracking in slicing columns

7f183de

fix series setitem -> don't update parent cache

693bc4f

update indexing tests with new behaviour to get them passing

a154591

jbrockmendel reviewed Jul 11, 2021

jbrockmendel reviewed Jul 12, 2021

Merge remote-tracking branch 'upstream/master' into am-experiment-cow

71370c4

shwina mentioned this pull request Jul 15, 2021

Add struct.explode() method rapidsai/cudf#8729

Merged