Proof of concept for Copy-on-Write implementation #41878

Closed

Conversation

jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented Jun 8, 2021

An experiment to implement one of the proposals discussed in #36195, described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit

This PR adds Copy-on-Write (CoW) functionality to the DataFrame/Series when using ArrayManager.
It does this by adding a new .refs attribute to the ArrayManager that, if populated, keeps a list of weakref references (one per column, so len(mgr.arrays) == len(mgr.refs)) to the arrays it is viewing.
This ensures that if we are modifying an array of a child manager, we can check if it is referencing (viewing) another array, and if needed do a copy on write. And also if we are modifying an array of a parent manager, we can check if that array is being referenced by another array and if needed do a copy on write in this parent frame. (of course, a manager can both be parent and child at the same time, so those two checks always happen both)
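To make the two checks concrete, here is a minimal, stdlib-only sketch of the idea. It is not the PR's code: `Column`, `ToyManager`, `view` and `setitem` are hypothetical names invented for illustration, and using `weakref.getweakrefcount` for the parent-side check is an assumption about one way to detect that an array is being viewed. `_has_no_reference` borrows the name that appears in the diff further down, but its body here is a guess.

```python
import weakref


class Column(list):
    """Toy column; a plain list cannot be weakly referenced, a subclass can."""


class ToyManager:
    """Hypothetical stand-in for a manager holding one array per column."""

    def __init__(self, arrays, refs=None):
        self.arrays = list(arrays)
        # One entry per column: a weakref to the viewed array, or None.
        self.refs = refs if refs is not None else [None] * len(self.arrays)

    def view(self):
        # A "child" shares the column objects and tracks them via weakrefs.
        return ToyManager(self.arrays, [weakref.ref(a) for a in self.arrays])

    def _has_no_reference(self, i):
        # Child check: are we viewing another array?
        # Parent check: is any child still tracking our array?
        is_viewing = self.refs[i] is not None
        is_viewed = weakref.getweakrefcount(self.arrays[i]) > 0
        return not (is_viewing or is_viewed)

    def setitem(self, i, row, value):
        if not self._has_no_reference(i):
            # Copy-on-Write: copy the column being modified, clear the reference.
            self.arrays[i] = Column(self.arrays[i])
            self.refs[i] = None
        self.arrays[i][row] = value


parent = ToyManager([Column([1, 2, 3]), Column([4, 5, 6])])
child = parent.view()

child.setitem(0, 0, 100)   # child copies column 0 before writing
parent.setitem(1, 0, 400)  # parent copies column 1, the child still views it

print(parent.arrays[0], child.arrays[0])
print(parent.arrays[1], child.arrays[1])
```

In both directions it is the object being written to that pays for the copy, so neither mutation is visible through the other manager. The sketch relies on CPython's reference counting to drop a weakref as soon as the child clears it.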

A very brief summary of the behaviour you get (longer description at #36195 (comment)):

  • Any subset (so also a slice, single column access, etc) uses CoW (or is already a copy)
  • DataFrame methods that return a new DataFrame return shallow copies (using CoW) if applicable (for this POC I implemented that for reset_index and rename to test, needs to be expanded to other methods)

I added a test_copy_on_write.py which already tests a set of cases (which of course needs to be expanded), going through that file should give an idea of the kind of behaviours (and how it changes compared to current master/BlockManager). Link to that file in the diff: click here

(I only added new, targeted tests in that file, I didn't yet start updating existing tests, as I imagine there will be quite a lot)

This is a proof-of-concept PR, so the feedback I am looking for / what I want to get out of it:

  • A concrete starting point for an implementation, to stir the discussion on this topic Proposal for future copy / view semantics in indexing operations #36195 (and a concrete implementation you can play with, to get feedback on the proposed semantics)
  • Review of the actual implementation using weakrefs (it's quite simple at the moment, but not too simple? Will this be robust enough?)
  • Other corner cases that should certainly be tested
  • ...

And to be clear, not yet everything is covered (there are some # TODO's in array_manager.py, but only fixing those while also adding tests that require it)

TODO:

  • Series and DataFrame constructors

cc @pandas-dev/pandas-core

@@ -4930,7 +4930,7 @@ def rename(
index: Renamer | None = None,
columns: Renamer | None = None,
axis: Axis | None = None,
-copy: bool = True,
+copy: bool | None = None,
Member Author

The copy=None is what changes DataFrame.rename to use a shallow copy with CoW instead of doing a full copy for the ArrayManager.
This is eventually passed to self.copy(deep=copy), and deep=None is used to signal a shallow copy for ArrayManager while for now preserving the deep copy for BlockManager.
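As a rough illustration of that tri-state keyword, the resolution could look like the sketch below. `resolve_deep` and the `"array"` / `"block"` strings are hypothetical names used only here; the PR itself passes `deep=None` through to the manager rather than calling a helper like this.

```python
def resolve_deep(deep, manager_kind):
    """Hypothetical helper: turn the tri-state `deep` into a concrete bool.

    manager_kind is "array" or "block" (labels made up for this sketch).
    """
    if deep is None:
        # None means "use the manager's default": a shallow copy with CoW
        # for the ArrayManager, a deep copy for the BlockManager.
        return manager_kind != "array"
    return bool(deep)


print(resolve_deep(None, "array"))   # shallow (CoW) copy
print(resolve_deep(None, "block"))   # deep copy, as on master
print(resolve_deep(True, "array"))   # an explicit value is always respected
```

An explicit `copy=True` or `copy=False` from the user keeps its meaning; only the default changes depending on the manager.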

@@ -1869,6 +1869,12 @@ def _setitem_single_column(self, loc: int, value, plane_indexer):
"""
pi = plane_indexer

if not getattr(self.obj._mgr, "blocks", False):
Member

can you use explicit isinstance; e.g. at first glance empty blocks would go through here

Member Author

Ah, I should have used hasattr instead of getattr here (which would avoid that empty block case).

pandas.core.indexing currently doesn't import from internals, which I suppose is the reason that I used this implicit check instead of an explicit isinstance (having it as a property on the object, which I think was added in some other (open) PR, might also help for this case)
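The difference between the two implicit checks can be shown with plain Python objects. The classes below are hypothetical stand-ins, not the pandas managers; they only mimic the presence or absence of a `blocks` attribute.

```python
class FakeBlockManager:
    """Hypothetical stand-in: has `blocks`, which may be empty."""

    def __init__(self, blocks=()):
        self.blocks = tuple(blocks)


class FakeArrayManager:
    """Hypothetical stand-in: has `arrays` but no `blocks` attribute."""

    def __init__(self, arrays=()):
        self.arrays = list(arrays)


def is_array_manager_getattr(mgr):
    # Truthiness check: an empty `blocks` tuple is falsy, so an empty
    # block-based manager is misclassified.
    return not getattr(mgr, "blocks", False)


def is_array_manager_hasattr(mgr):
    # Presence check: only asks whether the attribute exists.
    return not hasattr(mgr, "blocks")


empty_bm = FakeBlockManager()
print(is_array_manager_getattr(empty_bm))            # misclassified
print(is_array_manager_hasattr(empty_bm))            # classified correctly
print(is_array_manager_hasattr(FakeArrayManager()))
```

An explicit isinstance check, or a single canonical property on the object, avoids relying on attribute names altogether.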

@jbrockmendel
Member

Does this implementation have the shallow-copies issue discussed here #36195 (comment) ?

@jorisvandenbossche
Member Author

Yes or no, depending on how to interpret the question ;)
It doesn't have the issue, because it doesn't allow shallow copies in the current sense. Put differently, a shallow copy (i.e. copy(deep=False)) effectively is a "CoW shallow copy" in the PR's implementation, meaning that once you modify it, data gets copied and mutations don't propagate back to the parent dataframe. If you, for some reason, want the same shallow copy semantics as we have now (i.e. mutations will propagate), then of course that's an "issue". But as mentioned in #36195 (comment), I am not convinced that losing this is actually a problem.

@@ -1894,6 +1894,16 @@ def _setitem_single_column(self, loc: int, value, plane_indexer):
"""
pi = plane_indexer

if not hasattr(self.obj._mgr, "blocks"):
# ArrayManager
Member

comment on why this is AM-specific?

@@ -351,8 +351,12 @@ def where(self: T, other, cond, align: bool, errors: str) -> T:
errors=errors,
)

-def setitem(self: T, indexer, value) -> T:
-    return self.apply("setitem", indexer=indexer, value=value)
+def setitem(self: T, indexer, value, inplace=False) -> T:
Member

why is this change necessary?

Member Author

To follow the change that was needed in the ArrayManager (to ensure every mutation of the values happens in the manager); see #41879 for this change as a separate precursor PR.

@jbrockmendel
Member

jbrockmendel commented Jul 12, 2021

Does writing to the parent trigger a copy in the child?

Update: i see in the doc it says it does (Propagating Mutation Forwards), but that doesn't appear to be implemented yet in this branch

$ hub pr checkout 41878
$ export PANDAS_DATA_MANAGER=array
$ python3

import pandas as pd

ser = pd.Series(range(10))
ser2 = ser.view("m8[ns]")

ser.iloc[0] = 10000

>>> ser2.view("i8")[0]
10000

Not sure if this is a bug or just not yet implemented in this branch.

@jbrockmendel
Member

Is there any prospect of making this independent of AM/BM?

@jorisvandenbossche
Member Author

Does writing to the parent trigger a copy in the child?

It triggers a copy, but in the current implementation it's always the values that are being modified that get copied. So in this case, it would trigger a copy in the parent (while still preserving the behaviour that mutations are not propagated).
(I think this is the easiest, as there can potentially be many children and children of children etc)

but that doesn't appear to be implemented yet in this branch

Yes, I didn't yet check the view() method (which is actually using the Series constructor, which I didn't yet update to track reference to the input array). If you use a viewing indexing method for the example, (eg s2 = s[:], but which of course doesn't change the dtype), it works as expected:

In [9]: pd.options.mode.data_manager = "array"

In [10]: ser = pd.Series(range(5))
    ...: ser2 = ser[:]

In [11]: ser.iloc[0] = 10000

In [12]: ser2
Out[12]: 
0    0
1    1
2    2
3    3
4    4
dtype: int64

Is there any prospect of making this independent of AM/BM?

Yes, I don't think there is anything in the current implementation that ties it to the ArrayManager, and it should be possible to write a similar BlockManager version: it could have the logic of checking for weakref references / triggering a copy if needed on the Blocks. I only expect it to be a bit more complicated, at least if we want to ensure that modifying a single column doesn't necessarily trigger a copy of the full dataframe with consolidated blocks.
Even if we would move to the ArrayManager long term, we would still basically need some implementation of it in the BlockManager anyway, to be able to raise FutureWarnings about when behaviour would change (although in this case, the implementation can be simpler, as it wouldn't need the performance considerations of not copying a full block).

But, before doing this in practice, I would first want to see 1) more discussion on the actual semantics (and some more explicit agreement on that we want this), 2) more review of the implementation details (using weakrefs etc) to ensure this POC is actually possible (as this would be similar for the BM version) and 3) some discussion on how we would see an upgrade path (what kind of version bump, how do we want to provide future warnings etc)

@jbrockmendel
Member

It triggers a copy, but in the current implementation it's always the values that are being modified that get copied. So in this case, it would trigger a copy in the parent [...] (I think this is the easiest

Thanks, that makes sense. Will have to give some thought to the implications of this.

it should be possible to write a similar BlockManager version [...] But, before doing this in practice, I would first want to see [...]

Makes sense.

@jbrockmendel
Member

jbrockmendel commented Jul 12, 2021

How does this behave w/r/t frame.values? for consolidated (1 column for AM) cases this will be a view on the underlying array, but a different object.

(also Series.to_frame, but that already behaves differently for ArrayManager xref #42512)

@jbrockmendel
Member

was the as_mutable_view idea intended to be part of this? IIUC this would mean that basically nothing is ever "just a view"?

@jorisvandenbossche
Member Author

was the as_mutable_view idea intended to be part of this?

No, that was part of the original proposal / discussion, but I didn't include it in my updated proposal focusing on Copy-on-Write. I also don't directly see how it could be implemented with reliable semantics across all corner cases (or at least without a complex implementation).

For example, let's assume you have a df and two viewing "child" objects df_child1 and df_child2. If you would now turn df_child1 into a "mutable view", how does that affect df_child2? Mutating df_child1 is then supposed to also mutate the parent df, but still not df_child2, in which case a mutation of df_child1 would need to trigger a copy of the internal data stored in df_child2. So then you need to recursively check children of children etc. to update those ...

IIUC this would mean that basically nothing is ever "just a view"?

Indeed. Either you have identical objects (df1 is df2), and then obviously mutating the one is reflected in the other (since they are just two names pointing to the same object, e.g. from doing df2 = df1), or they are not identical objects, and then it always uses copy-on-write in case the data are views (meaning it will never act as a view for the user).

if not hasattr(self.obj._mgr, "blocks"):
# ArrayManager: in this case we cannot rely on getting the column
# as a Series to mutate, but need to operated on the mgr directly
if com.is_null_slice(pi) or com.is_full_slice(pi, len(self.obj)):
Member

xref #44353 (doesn't need to be resolved for this PR, but will make lots of things easier)

@@ -1840,6 +1840,17 @@ def _setitem_single_column(self, loc: int, value, plane_indexer):
"""
pi = plane_indexer

Member

#42887 would make for a good precursor

for i in range(len(self.arrays)):
if not self._has_no_reference(i):
# if being referenced -> perform Copy-on-Write and clear the reference
self.arrays[i] = self.arrays[i].copy()
Member

this is copying more than is strictly necessary (which may be harmless). to be precise would need to examine the mask and exclude columns for which not mask[i].any()
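A sketch of the more precise selection described here, using plain lists instead of arrays: a column needs a copy only if it is referenced and its part of the mask selects at least one row. `columns_to_copy` and its arguments are hypothetical names, and the mask is assumed to be already split per column, which the current putmask code path does not provide.

```python
def columns_to_copy(is_referenced, mask_per_column):
    """Return the indices of the columns that need a Copy-on-Write copy.

    is_referenced: one bool per column (viewing or being viewed).
    mask_per_column: one list of bools per column, True where a value is set.
    """
    return [
        i
        for i, (referenced, col_mask) in enumerate(
            zip(is_referenced, mask_per_column)
        )
        # Skip unreferenced columns and columns the mask never touches.
        if referenced and any(col_mask)
    ]


is_referenced = [True, True, False]
mask = [
    [False, False],  # referenced but untouched: no copy needed
    [True, False],   # referenced and modified: copy
    [True, True],    # modified but not referenced: write in place
]
print(columns_to_copy(is_referenced, mask))
```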

Member Author

@jorisvandenbossche jorisvandenbossche Nov 26, 2021

That's one of the reasons I first wanted to refactor putmask here (#44396). But as long as we are using apply_with_block here, I am not going to bother looking into such optimization (it would require part of the changes in #44396 anyway to split up the mask like that)

(it's also not a super important optimization to have in a first PR I think)

@jorisvandenbossche
Member Author

@jbrockmendel thanks for the review!

Contributor

@jreback jreback left a comment

so a few high level / general comments

  • this is leaking ArrayManager into the indexing at a high level (e.g you have to check this in indexing code), ideally this can be avoided
  • would like to break off parts of this that are pure refactorings as precursor PRs (as @jbrockmendel suggested)
  • you are hijacking the deep=None semantics to do this. we really don't like this keyword anyhow, but either can you add a new value, idk, maybe deep='cow' (not sure this is better) but maybe more obvious and readable. or I think @jbrockmendel suggested an enum (let's make it internal atm to signal this).
  • this is pretty convoluted when you have a method like .reset_index() that only has an inplace keyword (and not copy). i would say we should either: deprecate inplace and replace with copy first.
  • this is adding an enormous amount of complexity to the code (likely hiding lots of cases).. not sure what to do about this, just noting it.

@@ -3967,8 +3967,15 @@ def _set_value(
"""
try:
if takeable:
series = self._ixs(col, axis=1)
series._set_value(index, value, takeable=True)
if isinstance(self._mgr, ArrayManager):
Contributor

can you not do this in internals? hate leaking ArrayManager semantics here

Member Author

What I added here is actually to call into the internals to do it, instead of the current frame-level methods.
It's only possible for ArrayManager, though (see https://github.com/pandas-dev/pandas/pull/41878/files#r757450838 just below), which is the reason I added a check.

@@ -1840,6 +1840,17 @@ def _setitem_single_column(self, loc: int, value, plane_indexer):
"""
pi = plane_indexer

if not hasattr(self.obj._mgr, "blocks"):
Contributor

shouldn't this be an ArrayManager test? why the different semantics?

Member Author

This is just a check to know if it is an ArrayManager, without actually importing it (core/indexing.py currently doesn't import anything from internals).
I think in another PR that might not have been merged, I added an attribute on the DataFrame that returns True/False for this, which might be easier to reuse in multiple places.

Member

yah we should for sure have 1 canonical way of doing this check so we can grep for all the places we do it

Member Author

-> #44676

@jorisvandenbossche
Member Author

Thanks for the review!

  • you are hijacking the deep=None semantics to do this. we really don't like this keyword anyhow, but either can you add a new value, idk, maybe deep='cow' (not sure this is better) but maybe more obvious and readable. or I think @jbrockmendel suggested an enum (let's make it internal atm to signal this).

I don't think that deep="cow" is any better or different, both are a new value that can be passed to deep. Given that eventually, the deep=None could become deep=False again (now the deep=None is only meant as a keyword that means False for ArrayManager and True for BlockManager, to preserve the BlockManager behaviour on master), I would prefer to keep the None instead of some other sentinel (we generally use None elsewhere in cases the default depends on some other context)

  • this is pretty convoluted when you have a method like .reset_index() that only has an inplace keyword (and not copy). i would say we should either: deprecate inplace and replace with copy first.

What do you mean exactly with "convoluted" in this case? For the case the user doesn't specify inplace=True manually (i.e. for the default usage of reset_index()), it is easy to add CoW behaviour to reset_index() without it having a copy keyword.
For the inplace=True case, it's a bit less clear, but short term we can probably continue what we do now: modify the object in place (i.e. equivalent to doing a setitem df["index"] = ..)

  • this is adding an enormous amount of complexity to the code (likely hiding lots of cases).. not sure what to do about this, just noting it.

I think it's actually quite little code for such a big feature :) (the majority of the diff here is new/adapted tests). Now there are still many cases not covered, so it will of course still grow a bit.

@jbrockmendel
Member

@jorisvandenbossche closable?

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Nov 30, 2022
@mroeschke
Member

Yeah, since CoW is in 1.5, albeit only for BlockManager, I think we can close this for now until ArrayManager work is picked back up. Closing

@mroeschke mroeschke closed this Dec 17, 2022
@jorisvandenbossche jorisvandenbossche deleted the am-experiment-cow branch May 2, 2023 18:32
@jorisvandenbossche jorisvandenbossche restored the am-experiment-cow branch May 2, 2023 18:32
@jorisvandenbossche jorisvandenbossche deleted the am-experiment-cow branch May 2, 2023 18:33
@jorisvandenbossche jorisvandenbossche restored the am-experiment-cow branch May 2, 2023 18:33
Labels
Closing Candidate May be closeable, needs more eyeballs Copy / view semantics Needs Discussion Requires discussion from core team before further action
5 participants