Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: Add Manager.column_setitem to set values into a single column (without intermediate series) #47074

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
18 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 11 additions & 11 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -98,7 +98,6 @@
)

from pandas.core.dtypes.cast import (
LossySetitemError,
can_hold_element,
construct_1d_arraylike_from_scalar,
construct_2d_arraylike_from_scalar,
Expand Down Expand Up @@ -3945,17 +3944,18 @@ def _set_value(
"""
try:
if takeable:
series = self._ixs(col, axis=1)
loc = index
icol = col
iindex = cast(int, index)
else:
series = self._get_item_cache(col)
loc = self.index.get_loc(index)

# setitem_inplace will do validation that may raise TypeError,
# ValueError, or LossySetitemError
series._mgr.setitem_inplace(loc, value)

except (KeyError, TypeError, ValueError, LossySetitemError):
icol = self.columns.get_loc(col)
iindex = self.index.get_loc(index)
self._mgr.column_setitem(icol, iindex, value)
self._clear_item_cache()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we make column_setitem always-inplace and make the clear_item_cache unecesssary?

API-wise i think the always-inplace method is a lot nicer than the less predictable one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe could make setitem_inplace ignore CoW since it is explicitly inplace?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we make column_setitem always-inplace and make the clear_item_cache unecesssary?

The main use case of column_setitem is in _iLocIndexer._setitem_single_column, which is used for setting with loc/iloc. And for that use case, we need this to be not inplace (i.e. having the dtype coercing behaviour), since that is what we need for loc/iloc.

The case here is only for at/iat setting.

I could add a separate inplace version or add an inplace keyword to column_setitem that could be used here. That would keep the current logic more intact, but since we fallback to loc/iloc anyway when the inplace setitem fails, I am not sure it would actually be very useful.

or maybe could make setitem_inplace ignore CoW since it is explicitly inplace?

Even something that is explicitly inplace from a usage perspective will need to take care of CoW in #46958

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a version with an inplace keyword in the last commit (453eaba). I lean more towards "not worth it", but either way is fine for me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel do you have a preference on keeping this change with the inplace keyword for column_setitem or not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it seems we still need to catch TypeError, for example for a MultiIndex case where self.columns.get_loc(col) might not necessarily result in an integer.

So only removed LossySetitemError for now (and added a test case for setting with at with a MultiIndex, as that didn't yet seem to be covered in the indexing tests)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, can you add a comment about why TypeError is needed. what about ValueError?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add a comment about TypeError.
I am not fully sure about the ValueError. Is it possible for a setitem operation to raise a ValueError? (it seems that validation methods (like _validate_setitem_value) will mostly raise TypeErrors?)

Now, this catching of a ValueError alraedy was here before, so I am hesitant to remove it without looking further in detail at it. I would prefer to leave that for another PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there are some cases where setitem actually raises a ValueError, eg when setting with an array-like that is not of length 1 (in this case of scalar indexer at or iat).
Now, the fallback to loc/iloc will then most likely also raise a ValueError (so catching it might not necessarily add that much). But at least in some cases it seems to change the error message.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example:

>>> df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="Int64"), 'b': [0.1, 0.2, 0.3]})
>>> df.iat[0, 0] = np.array([0, 1])
...
ValueError: Must have equal len keys and value when setting with an iterable

Without catching/reraising in iloc, the error would be "ValueError: setting an array element with a sequence", which is slightly less informative.

I suppose we should actually see to make this handling consistent in the different code paths (so it directly raises the more informative error message in the first place), but that's out of scope for this PR.


except (KeyError, TypeError, ValueError):
# get_loc might raise a KeyError for missing labels (falling back
# to (i)loc will do expansion of the index)
# column_setitem will do validation that may raise TypeError or ValueError
# set using a non-recursive method & reset the cache
if takeable:
self.iloc[index, col] = value
Expand Down
2 changes: 0 additions & 2 deletions pandas/core/indexers/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,6 @@
check_setitem_lengths,
deprecate_ndim_indexing,
is_empty_indexer,
is_exact_shape_match,
is_list_like_indexer,
is_scalar_indexer,
is_valid_positional_slice,
Expand All @@ -23,7 +22,6 @@
"check_setitem_lengths",
"validate_indices",
"maybe_convert_indices",
"is_exact_shape_match",
"length_of_indexer",
"deprecate_ndim_indexing",
"unpack_1tuple",
Expand Down
26 changes: 1 addition & 25 deletions pandas/core/indexers/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,10 +11,7 @@

import numpy as np

from pandas._typing import (
AnyArrayLike,
ArrayLike,
)
from pandas._typing import AnyArrayLike
from pandas.util._exceptions import find_stack_level

from pandas.core.dtypes.common import (
Expand Down Expand Up @@ -294,27 +291,6 @@ def maybe_convert_indices(indices, n: int, verify: bool = True):
# Unsorted


def is_exact_shape_match(target: ArrayLike, value: ArrayLike) -> bool:
"""
Is setting this value into this target overwriting the entire column?

Parameters
----------
target : np.ndarray or ExtensionArray
value : np.ndarray or ExtensionArray

Returns
-------
bool
"""
return (
len(value.shape) > 0
and len(target.shape) > 0
and value.shape[0] == target.shape[0]
and value.size == target.size
)


def length_of_indexer(indexer, target=None) -> int:
"""
Return the expected length of target[indexer]
Expand Down
27 changes: 9 additions & 18 deletions pandas/core/indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,6 @@
from pandas.core.indexers import (
check_array_indexer,
is_empty_indexer,
is_exact_shape_match,
is_list_like_indexer,
is_scalar_indexer,
length_of_indexer,
Expand Down Expand Up @@ -1951,8 +1950,7 @@ def _setitem_single_column(self, loc: int, value, plane_indexer):
"""
pi = plane_indexer
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved

ser = self.obj._ixs(loc, axis=1)
orig_values = ser._values
orig_values = self.obj._get_column_array(loc)

# perform the equivalent of a setitem on the info axis
# as we have a null slice or a slice with full bounds
Expand All @@ -1963,7 +1961,8 @@ def _setitem_single_column(self, loc: int, value, plane_indexer):
pass
elif (
is_array_like(value)
and is_exact_shape_match(ser, value)
jorisvandenbossche marked this conversation as resolved.
Show resolved Hide resolved
and len(value.shape) > 0
and self.obj.shape[0] == value.shape[0]
and not is_empty_indexer(pi)
):
if is_list_like(pi):
Expand All @@ -1972,31 +1971,23 @@ def _setitem_single_column(self, loc: int, value, plane_indexer):
# in case of slice
value = value[pi]
else:
# set the item, first attempting to operate inplace, then
# falling back to casting if necessary; see
# _whatsnew_130.notable_bug_fixes.setitem_column_try_inplace

orig_values = ser._values
ser._mgr = ser._mgr.setitem((pi,), value)

if ser._values is orig_values:
# The setitem happened inplace, so the DataFrame's values
# were modified inplace.
return
self.obj._iset_item(loc, ser)
# set value into the column (first attempting to operate inplace, then
# falling back to casting if necessary)
self.obj._mgr.column_setitem(loc, plane_indexer, value)
self.obj._clear_item_cache()
return

# We will not operate in-place, but will attempt to in the future.
# To determine whether we need to issue a FutureWarning, see if the
# setting in-place would work, i.e. behavior will change.
warn = can_hold_element(ser._values, value)
warn = can_hold_element(orig_values, value)
# Don't issue the warning yet, as we can still trim a few cases where
# behavior will not change.

self.obj._iset_item(loc, value)

if warn:
new_values = self.obj._ixs(loc, axis=1)._values
new_values = self.obj._get_column_array(loc)

if (
isinstance(new_values, np.ndarray)
Expand Down
16 changes: 16 additions & 0 deletions pandas/core/internals/array_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@
is_datetime64_ns_dtype,
is_dtype_equal,
is_extension_array_dtype,
is_integer,
is_numeric_dtype,
is_object_dtype,
is_timedelta64_ns_dtype,
Expand Down Expand Up @@ -869,6 +870,21 @@ def iset(
self.arrays[mgr_idx] = value_arr
return

def column_setitem(self, loc: int, idx: int | slice | np.ndarray, value) -> None:
"""
Set values ("setitem") into a single column (not setting the full column).

This is a method on the ArrayManager level, to avoid creating an
intermediate Series at the DataFrame level (`s = df[loc]; s[idx] = value`)
"""
if not is_integer(loc):
raise TypeError("The column index should be an integer")
arr = self.arrays[loc]
mgr = SingleArrayManager([arr], [self._axes[0]])
new_mgr = mgr.setitem((idx,), value)
# update existing ArrayManager in-place
self.arrays[loc] = new_mgr.arrays[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the "into" in the docstring suggests that the setting should occur into the existing array, so we shouldn't need to set a new array. am i misunderstanding?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right. But the setitem method does that (setting into the existing array), or more correctly might do that.

So it certainly looks a bit confusing here, as it indeed seems that I am fully replacing the array for the column in question.
But so if the setitem method changed the array inplace, this self.arrays[loc] = new_mgr.arrays[0] is assigning the original array again and thus not changing anything / a no-op.
It's only if setitem actually created a new array (eg the setitem caused a dtype change), that this line is doing something relevant.

In principle I could check if both arrays are identical, and only if that is not the case, do this self.arrays[loc] = new_mgr.arrays[0] (although it wouldn't actually change any behaviour, that might be more explicit)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I find it also a bit confusing in our internal API that setitem (both the Manager and Block method) is a method that works in place (sometimes), but does return a manager/block.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I find it also a bit confusing in our internal API that setitem (both the Manager and Block method) is a method that works in place (sometimes), but does return a manager/block.

yah, id be open to a name change (separate PR) for that. this is related to why i like setitem_inplace


def insert(self, loc: int, item: Hashable, value: ArrayLike) -> None:
"""
Insert item at selected position.
Expand Down
11 changes: 11 additions & 0 deletions pandas/core/internals/managers.py
Original file line number Diff line number Diff line change
Expand Up @@ -1188,6 +1188,17 @@ def _iset_single(
self.blocks = new_blocks
return

def column_setitem(self, loc: int, idx: int | slice | np.ndarray, value) -> None:
"""
Set values ("setitem") into a single column (not setting the full column).

This is a method on the BlockManager level, to avoid creating an
intermediate Series at the DataFrame level (`s = df[loc]; s[idx] = value`)
"""
col_mgr = self.iget(loc)
new_mgr = col_mgr.setitem((idx,), value)
self.iset(loc, new_mgr._block.values, inplace=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you can use _setitem_single here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iset will indeed call _iset_single in this case, but it still does some useful preprocessing that we might not want to repeat here I think (eg converting loc in a blkno)


def insert(self, loc: int, item: Hashable, value: ArrayLike) -> None:
"""
Insert item at selected position.
Expand Down
2 changes: 0 additions & 2 deletions pandas/tests/frame/indexing/test_setitem.py
Original file line number Diff line number Diff line change
Expand Up @@ -1099,8 +1099,6 @@ def test_setitem_partial_column_inplace(self, consolidate, using_array_manager):
# check setting occurred in-place
tm.assert_numpy_array_equal(zvals, expected.values)
assert np.shares_memory(zvals, df["z"]._values)
if not consolidate:
assert df["z"]._values is zvals
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #47074 (comment) for comment about this removal


def test_setitem_duplicate_columns_not_inplace(self):
# GH#39510
Expand Down
13 changes: 13 additions & 0 deletions pandas/tests/indexing/test_at.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
CategoricalDtype,
CategoricalIndex,
DataFrame,
MultiIndex,
Series,
Timestamp,
)
Expand Down Expand Up @@ -96,6 +97,18 @@ def test_at_setitem_categorical_missing(self):

tm.assert_frame_equal(df, expected)

def test_at_setitem_multiindex(self):
df = DataFrame(
np.zeros((3, 2), dtype="int64"),
columns=MultiIndex.from_tuples([("a", 0), ("a", 1)]),
)
df.at[0, "a"] = 10
expected = DataFrame(
[[10, 10], [0, 0], [0, 0]],
columns=MultiIndex.from_tuples([("a", 0), ("a", 1)]),
)
tm.assert_frame_equal(df, expected)


class TestAtSetItemWithExpansion:
def test_at_setitem_expansion_series_dt64tz_value(self, tz_naive_fixture):
Expand Down
4 changes: 3 additions & 1 deletion pandas/tests/indexing/test_partial.py
Original file line number Diff line number Diff line change
Expand Up @@ -266,7 +266,7 @@ def test_partial_setting(self):
with pytest.raises(IndexError, match=msg):
s.iat[3] = 5.0

def test_partial_setting_frame(self):
def test_partial_setting_frame(self, using_array_manager):
df_orig = DataFrame(
np.arange(6).reshape(3, 2), columns=["A", "B"], dtype="int64"
)
Expand All @@ -279,6 +279,8 @@ def test_partial_setting_frame(self):
df.iloc[4, 2] = 5.0

msg = "index 2 is out of bounds for axis 0 with size 2"
if using_array_manager:
msg = "list index out of range"
with pytest.raises(IndexError, match=msg):
df.iat[4, 2] = 5.0

Expand Down