REF: Add Manager.column_setitem to set values into a single column (without intermediate series) #47074

jorisvandenbossche · 2022-05-20T16:04:53Z

The column_setitem is a method I was adding in #46958, is something that can be broken off from that PR, and is IMO also a generally useful change.

Currently in some cases to set values into a single column (eg df.loc[idx, "A"] = value), we create an intermediate Series for that column, set values into that Series, and then update the DataFrame from that updated Series. I think it would be generally good to avoid creating this intermediate Series, and moving the actual/final "set" operation to the internals (as is done for other set operations).

…ithout intermediate series)

jreback

lgtm
cc @jbrockmendel

jbrockmendel · 2022-05-22T03:00:30Z

pandas/core/frame.py

@@ -3924,16 +3924,18 @@ def _set_value(
            Sets whether or not index/col interpreted as indexers
        """
        try:
+            # setitem will do validation that may raise TypeError,


should this comment refer to column_setitem instead of just setitem?

it also looks like the new method doesn't actually do the validation this comment refers to?

This is an existing comment (originally a few lines below), but so I suppose this comment was actually already not up to date anymore.
So before this PR the comment was about setitem_inplace, and that also doesn't do any validation.

setitem_inplace calls np_can_hold_element, which raises on failure

Ah, yes, I see. That is doing validation that can raise the LossySetitemError. So the new column_setitem calls setitem, which will also call np_can_hold_element, but there catching the LossySetitemError and coercing to the target dtype if needed.
That is something that the loc/iloc fallback below otherwise will also do, so I suppose this change is OK (but the comment is then indeed no longer correct, and we also don't need to catch LossySetitemError here)

and we also don't need to catch LossySetitemError here

yah, possibly also TypeError and ValueError

pandas/core/frame.py

jbrockmendel · 2022-05-22T03:02:49Z

pandas/tests/frame/indexing/test_setitem.py

@@ -1082,7 +1082,7 @@ def test_setitem_partial_column_inplace(self, consolidate, using_array_manager):
        tm.assert_numpy_array_equal(zvals, expected.values)
        assert np.shares_memory(zvals, df["z"]._values)
        if not consolidate:
-            assert df["z"]._values is zvals
+            assert df["z"]._values.base is zvals.base


this check could pass if both .base attrs are None

Yes, good catch (will update that). Now, I also not fully sure why this change was actually needed, will see if I can figure that out.

OK, so before we had this piece of code in indexing.py (that was removed in this PR):

if ser._values is orig_values: # The setitem happened inplace, so the DataFrame's values # were modified inplace. return

That ensures that it is actually still an identical array that is stored in df["z"].

Now this goes through column_setitem, where I didn't add such a check, but simply always call iset with the values from the SingleBlockManager. When creating the SingleBlockManager with iget, this creates a view of the values (return self.values[i] in Block.iget), so this is not anymore an identical array, only a viewing array.

Since the only important guarantee is that it shares the same data (which is already tested on the line above), I am inclined to just remove this additional assert for identical arrays.

Alternatively, I could do a similar check in column_setitem.

jbrockmendel · 2022-05-22T03:07:46Z

pandas/core/internals/array_manager.py

+        intermediate Series at the DataFrame level (`s = df[loc]; s[idx] = value`)
+        """
+        arr = self.arrays[loc]
+        # create temporary SingleArrayManager without ref to use setitem implementation


is the "without ref" here why you're not using .iget?

looks like if you used .iget here then the method could be shared between AM/BM?

is the "without ref" here why you're not using .iget?

The "without ref" comment is from copy pasting this from the initial CoW PR #41878 (so should probably remove that here).

But that is indeed the reason to not use iget in that PR.

pandas/core/indexing.py

jbrockmendel · 2022-05-22T03:13:12Z

pandas/core/internals/managers.py

+        """
+        col_mgr = self.iget(loc)
+        new_mgr = col_mgr.setitem((idx,), value)
+        self.iset(loc, new_mgr._block.values, inplace=True)


i think you can use _setitem_single here?

iset will indeed call _iset_single in this case, but it still does some useful preprocessing that we might not want to repeat here I think (eg converting loc in a blkno)

jbrockmendel · 2022-05-22T03:15:08Z

pandas/core/internals/array_manager.py

+        mgr = SingleArrayManager([arr], [self._axes[0]])
+        new_mgr = mgr.setitem((idx,), value)
+        # update existing ArrayManager in-place
+        self.arrays[loc] = new_mgr.arrays[0]


the "into" in the docstring suggests that the setting should occur into the existing array, so we shouldn't need to set a new array. am i misunderstanding?

Yes, that's right. But the setitem method does that (setting into the existing array), or more correctly might do that.

So it certainly looks a bit confusing here, as it indeed seems that I am fully replacing the array for the column in question.
But so if the setitem method changed the array inplace, this self.arrays[loc] = new_mgr.arrays[0] is assigning the original array again and thus not changing anything / a no-op.
It's only if setitem actually created a new array (eg the setitem caused a dtype change), that this line is doing something relevant.

In principle I could check if both arrays are identical, and only if that is not the case, do this self.arrays[loc] = new_mgr.arrays[0] (although it wouldn't actually change any behaviour, that might be more explicit)

In general I find it also a bit confusing in our internal API that setitem (both the Manager and Block method) is a method that works in place (sometimes), but does return a manager/block.

In general I find it also a bit confusing in our internal API that setitem (both the Manager and Block method) is a method that works in place (sometimes), but does return a manager/block.

yah, id be open to a name change (separate PR) for that. this is related to why i like setitem_inplace

…titem

jbrockmendel · 2022-05-22T17:15:04Z

pandas/core/frame.py

-            series._mgr.setitem_inplace(loc, value)
+                icol = self.columns.get_loc(col)
+                iindex = self.index.get_loc(index)
+            # error: Argument 2 to "column_setitem" of "BlockManager" has


might be unnecessary with a cast in the takeable case

I tried that before adding the ignore, but I don't really understand why it didn't work

I suppose I didn't try before to only do the cast inside the takeable branch, because that now seems to work

(note, I did iindex = cast(int, index), which is strictly speaking not correct, because the index can be anything that passes out is_integer function. Should I use ScalarIndexer = Union[int, np.integer] instead of int, or it maybe also doesn't matter much in practice?)

jbrockmendel · 2022-05-22T17:29:27Z

pandas/core/frame.py

+            # incompatible type "Union[Hashable, Sequence[Hashable]]";
+            # expected "Union[int, slice, ndarray[Any, Any]]"
+            self._mgr.column_setitem(icol, iindex, value)  # type: ignore[arg-type]
+            self._clear_item_cache()


could we make column_setitem always-inplace and make the clear_item_cache unecesssary?

API-wise i think the always-inplace method is a lot nicer than the less predictable one

or maybe could make setitem_inplace ignore CoW since it is explicitly inplace?

could we make column_setitem always-inplace and make the clear_item_cache unecesssary?

The main use case of column_setitem is in _iLocIndexer._setitem_single_column, which is used for setting with loc/iloc. And for that use case, we need this to be not inplace (i.e. having the dtype coercing behaviour), since that is what we need for loc/iloc.

The case here is only for at/iat setting.

I could add a separate inplace version or add an inplace keyword to column_setitem that could be used here. That would keep the current logic more intact, but since we fallback to loc/iloc anyway when the inplace setitem fails, I am not sure it would actually be very useful.

or maybe could make setitem_inplace ignore CoW since it is explicitly inplace?

Even something that is explicitly inplace from a usage perspective will need to take care of CoW in #46958

I pushed a version with an inplace keyword in the last commit (453eaba). I lean more towards "not worth it", but either way is fine for me.

@jbrockmendel do you have a preference on keeping this change with the inplace keyword for column_setitem or not?

Actually it seems we still need to catch TypeError, for example for a MultiIndex case where self.columns.get_loc(col) might not necessarily result in an integer.

So only removed LossySetitemError for now (and added a test case for setting with at with a MultiIndex, as that didn't yet seem to be covered in the indexing tests)

OK, can you add a comment about why TypeError is needed. what about ValueError?

Will add a comment about TypeError.
I am not fully sure about the ValueError. Is it possible for a setitem operation to raise a ValueError? (it seems that validation methods (like _validate_setitem_value) will mostly raise TypeErrors?)

Now, this catching of a ValueError alraedy was here before, so I am hesitant to remove it without looking further in detail at it. I would prefer to leave that for another PR.

So there are some cases where setitem actually raises a ValueError, eg when setting with an array-like that is not of length 1 (in this case of scalar indexer at or iat).
Now, the fallback to loc/iloc will then most likely also raise a ValueError (so catching it might not necessarily add that much). But at least in some cases it seems to change the error message.

For example:

>>> df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="Int64"), 'b': [0.1, 0.2, 0.3]}) >>> df.iat[0, 0] = np.array([0, 1]) ... ValueError: Must have equal len keys and value when setting with an iterable

Without catching/reraising in iloc, the error would be "ValueError: setting an array element with a sequence", which is slightly less informative.

I suppose we should actually see to make this handling consistent in the different code paths (so it directly raises the more informative error message in the first place), but that's out of scope for this PR.

…titem

pandas/core/indexing.py

jbrockmendel · 2022-05-23T20:38:51Z

pandas/core/indexing.py

            else:
                # in case of slice
-                ser = value[pi]
+                value = value[pi]
+            self.obj._iset_item(loc, value)
        else:
            # set the item, first attempting to operate inplace, then


is this comment still accurate/helpful?

Hmm, not really I think (but also not before this PR). The logic of "first attempt to operate inplace, then falling back to casting if necessary" is contained within the column_setitem call (and before this PR it was contained in the setitem call)

Now, it was added in 573c063, and from that point of view: I would say the comment is still as accurate as it was when it was added :)

But I think we can also remove it now, it's the general behaviour of loc/iloc to fallback to casting

Might be worth discussing the gameplan for after the deprecation inside _setitem_single_column is enforced. What I had in mind was that inside of doing the can_hold_element check as part of take_split_path, we would do a try/except falling back to the split path. Does that idea play nicely with what you're doing here?

I would think that enforcing the deprecation would mean to remove the checks for "null/full slice" and so stop using _iset_item to replace the column in those cases, and always use column_setitem ?

pandas/core/internals/array_manager.py

This reverts commit 453eaba.

…titem

jorisvandenbossche · 2022-05-31T01:00:44Z

pandas/tests/frame/indexing/test_setitem.py

@@ -1090,8 +1090,6 @@ def test_setitem_partial_column_inplace(self, consolidate, using_array_manager):
        # check setting occurred in-place
        tm.assert_numpy_array_equal(zvals, expected.values)
        assert np.shares_memory(zvals, df["z"]._values)
-        if not consolidate:
-            assert df["z"]._values is zvals


See #47074 (comment) for comment about this removal

…titem

jorisvandenbossche · 2022-05-31T21:27:25Z

@jbrockmendel apart from the two comments you just added (will update for that), do you have any remaining substantial comments? Otherwise I would want to merge this so I can update #46958 on top of this.

…titem

jorisvandenbossche · 2022-06-07T21:39:49Z

Given there are no more comments, I am planning to merge this

…ithout intermediate series) (pandas-dev#47074)

REF: Add Manager.column_setitem to set values into a single column (w…

0e4c58e

…ithout intermediate series)

jorisvandenbossche requested a review from jbrockmendel May 20, 2022 16:04

jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation labels May 20, 2022

handle typing

a2aa8aa

jorisvandenbossche mentioned this pull request May 20, 2022

API: New copy / view semantics using Copy-on-Write #46958

Merged

jreback approved these changes May 21, 2022

View reviewed changes

jreback added this to the 1.5 milestone May 21, 2022

jbrockmendel reviewed May 22, 2022

View reviewed changes

pandas/core/frame.py Outdated Show resolved Hide resolved

jbrockmendel reviewed May 22, 2022

View reviewed changes

pandas/core/indexing.py Show resolved Hide resolved

jbrockmendel reviewed May 22, 2022

View reviewed changes

jorisvandenbossche added 2 commits May 22, 2022 11:58

Merge remote-tracking branch 'upstream/main' into internals-column-se…

ce0649b

…titem

dedent in _set_value

103d1fe

jbrockmendel reviewed May 22, 2022

View reviewed changes

jorisvandenbossche added 3 commits May 23, 2022 09:48

Merge remote-tracking branch 'upstream/main' into internals-column-se…

d20b0cb

…titem

add inplace keyword

453eaba

update typing

be740ad

jbrockmendel reviewed May 23, 2022

View reviewed changes

pandas/core/indexing.py Show resolved Hide resolved

jbrockmendel reviewed May 23, 2022

View reviewed changes

pandas/core/internals/array_manager.py Outdated Show resolved Hide resolved

jorisvandenbossche added 5 commits May 24, 2022 23:41

Revert "add inplace keyword"

e63c7f6

This reverts commit 453eaba.

Merge remote-tracking branch 'upstream/main' into internals-column-se…

025a3d4

…titem

only catch KeyError

caf7be8

only stop catching LossySetitemError

8d7ee1a

small updates

25e903b

jorisvandenbossche commented May 31, 2022

View reviewed changes

jorisvandenbossche added 3 commits May 31, 2022 14:03

fix AM case

5e30199

remove is_exact_shape_match

9d4566f

Merge remote-tracking branch 'upstream/main' into internals-column-se…

faed070

…titem

jorisvandenbossche added 3 commits June 1, 2022 00:05

add comment about catching the errors

ea063e6

Merge remote-tracking branch 'upstream/main' into internals-column-se…

3f30cab

…titem

Merge remote-tracking branch 'upstream/main' into internals-column-se…

db8e866

…titem

jorisvandenbossche merged commit b99ec4a into pandas-dev:main Jun 8, 2022

jorisvandenbossche deleted the internals-column-setitem branch June 8, 2022 20:13

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

REF: Add Manager.column_setitem to set values into a single column (w…

f814c26

…ithout intermediate series) (pandas-dev#47074)

rhshadrach mentioned this pull request Aug 29, 2022

REGR: Adding a row via DataFrame.at #48296

Closed

MarcoGorelli mentioned this pull request Sep 23, 2022

BUG: DataFrame.at fails when setting a whole row #48729

Closed

3 tasks

phofl mentioned this pull request Nov 16, 2022

PERF: assigning in a for loop #49729

Closed

3 tasks

jorisvandenbossche mentioned this pull request Nov 18, 2022

PERF: first try inplace setitem for .at indexer #49772

Merged

5 tasks

REF: Add Manager.column_setitem to set values into a single column (without intermediate series) #47074

REF: Add Manager.column_setitem to set values into a single column (without intermediate series) #47074

Conversation

jorisvandenbossche commented May 20, 2022

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche May 23, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented May 31, 2022

jorisvandenbossche commented Jun 7, 2022

jorisvandenbossche May 23, 2022 •

edited

Loading