REF: Add Manager.column_setitem to set values into a single column (without intermediate series) #47074
Changes from all commits
0e4c58e
a2aa8aa
ce0649b
103d1fe
d20b0cb
453eaba
be740ad
e63c7f6
025a3d4
caf7be8
8d7ee1a
25e903b
5e30199
9d4566f
faed070
ea063e6
3f30cab
db8e866
@@ -36,6 +36,7 @@
     is_datetime64_ns_dtype,
     is_dtype_equal,
     is_extension_array_dtype,
+    is_integer,
     is_numeric_dtype,
     is_object_dtype,
     is_timedelta64_ns_dtype,
@@ -869,6 +870,21 @@ def iset(
             self.arrays[mgr_idx] = value_arr
         return

+    def column_setitem(self, loc: int, idx: int | slice | np.ndarray, value) -> None:
+        """
+        Set values ("setitem") into a single column (not setting the full column).
+
+        This is a method on the ArrayManager level, to avoid creating an
+        intermediate Series at the DataFrame level (`s = df[loc]; s[idx] = value`)
+        """
+        if not is_integer(loc):
+            raise TypeError("The column index should be an integer")
+        arr = self.arrays[loc]
+        mgr = SingleArrayManager([arr], [self._axes[0]])
+        new_mgr = mgr.setitem((idx,), value)
+        # update existing ArrayManager in-place
+        self.arrays[loc] = new_mgr.arrays[0]

the "into" in the docstring suggests that the setting should occur into the existing array, so we shouldn't need to set a new array. am i misunderstanding?

Yes, that's right. But the So it certainly looks a bit confusing here, as it indeed seems that I am fully replacing the array for the column in question. In principle I could check if both arrays are identical, and only if that is not the case, do this

In general I find it also a bit confusing in our internal API that

yah, id be open to a name change (separate PR) for that. this is related to why i like setitem_inplace

+
     def insert(self, loc: int, item: Hashable, value: ArrayLike) -> None:
         """
         Insert item at selected position.
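To make the exchange above concrete, here is a minimal toy sketch (not pandas code) of why a method documented as setting "into" a column can still end with an assignment back to `self.arrays[loc]`. Everything in it is an assumption made for illustration: `ToyArrayManager` and `setitem_maybe_coerce` are hypothetical names, columns are stdlib `array.array` objects, and only integer row positions are handled. The helper writes into the existing buffer when the value fits and returns a new, wider column otherwise, so the caller has to store whatever comes back.

```python
from array import array
from numbers import Integral


def setitem_maybe_coerce(col, idx, value):
    """Hypothetical non-inplace setitem: return the column holding the result.

    The same object is returned when ``value`` fits the column's type;
    otherwise a new float64 column is created and returned.
    """
    try:
        col[idx] = value  # array('q') raises TypeError for a float
        return col
    except TypeError:
        widened = array("d", list(col))  # upcast int64 -> float64
        widened[idx] = value
        return widened


class ToyArrayManager:
    """Hypothetical stand-in for a manager that stores one array per column."""

    def __init__(self, arrays):
        self.arrays = list(arrays)

    def column_setitem(self, loc, idx, value):
        if not isinstance(loc, Integral):
            raise TypeError("The column index should be an integer")
        # Store the result back: it may or may not be the original object.
        self.arrays[loc] = setitem_maybe_coerce(self.arrays[loc], idx, value)
```

In this sketch, setting an integer leaves `mgr.arrays[0]` as the very same object, while setting `1.5` into an integer column replaces it with a new float column. An identity check on the stored array therefore only holds for the first case.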
@@ -1188,6 +1188,17 @@ def _iset_single(
         self.blocks = new_blocks
         return

+    def column_setitem(self, loc: int, idx: int | slice | np.ndarray, value) -> None:
+        """
+        Set values ("setitem") into a single column (not setting the full column).
+
+        This is a method on the BlockManager level, to avoid creating an
+        intermediate Series at the DataFrame level (`s = df[loc]; s[idx] = value`)
+        """
+        col_mgr = self.iget(loc)
+        new_mgr = col_mgr.setitem((idx,), value)
+        self.iset(loc, new_mgr._block.values, inplace=True)

i think you can use _setitem_single here?

+
     def insert(self, loc: int, item: Hashable, value: ArrayLike) -> None:
         """
         Insert item at selected position.
@@ -1099,8 +1099,6 @@ def test_setitem_partial_column_inplace(self, consolidate, using_array_manager):
         # check setting occurred in-place
         tm.assert_numpy_array_equal(zvals, expected.values)
         assert np.shares_memory(zvals, df["z"]._values)
-        if not consolidate:
-            assert df["z"]._values is zvals

See #47074 (comment) for comment about this removal

     def test_setitem_duplicate_columns_not_inplace(self):
         # GH#39510

could we make column_setitem always-inplace and make the clear_item_cache unnecessary?
API-wise i think the always-inplace method is a lot nicer than the less predictable one

or maybe could make setitem_inplace ignore CoW since it is explicitly inplace?

The main use case of `column_setitem` is in `_iLocIndexer._setitem_single_column`, which is used for setting with `loc`/`iloc`. And for that use case, we need this to be not inplace (i.e. having the dtype coercing behaviour), since that is what we need for loc/iloc.

The case here is only for `at`/`iat` setting.

I could add a separate inplace version or add an inplace keyword to `column_setitem` that could be used here. That would keep the current logic more intact, but since we fall back to loc/iloc anyway when the inplace setitem fails, I am not sure it would actually be very useful.

Even something that is explicitly inplace from a usage perspective will need to take care of CoW in #46958

I pushed a version with an `inplace` keyword in the last commit (453eaba). I lean more towards "not worth it", but either way is fine for me.
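
As a rough illustration of the two behaviours being weighed here, the following toy sketch contrasts a strict in-place write with a coercing write behind one `inplace` keyword. It is not the pandas implementation: `toy_column_setitem` is a hypothetical function, the columns are stdlib `array.array` objects, and "coercion" is reduced to widening an int64 column to float64.

```python
from array import array


def toy_column_setitem(columns, loc, idx, value, inplace=False):
    """Hypothetical sketch of a single-column setitem with an ``inplace`` switch.

    inplace=True  -> write into the existing buffer or raise; the column
                     object and its type never change.
    inplace=False -> widen the column to float64 when the value does not fit,
                     then store the (possibly new) column back.
    """
    col = columns[loc]
    if inplace:
        col[idx] = value  # array('q') raises TypeError for a float
        return
    try:
        col[idx] = value
    except TypeError:
        col = array("d", list(col))  # coerce int64 -> float64
        col[idx] = value
    columns[loc] = col
```

With `inplace=True` a float written into an integer column raises and leaves the data untouched, which is the predictable behaviour asked for above. With the default, the same call succeeds but swaps in a new column, which is the behaviour the `loc`/`iloc` path is described as needing.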

@jbrockmendel do you have a preference on keeping this change with the `inplace` keyword for `column_setitem` or not?

Actually it seems we still need to catch TypeError, for example for a MultiIndex case where `self.columns.get_loc(col)` might not necessarily result in an integer.

So only removed LossySetitemError for now (and added a test case for setting with `at` with a MultiIndex, as that didn't yet seem to be covered in the indexing tests)

OK, can you add a comment about why TypeError is needed. what about ValueError?

Will add a comment about TypeError.
I am not fully sure about the ValueError. Is it possible for a setitem operation to raise a ValueError? (it seems that validation methods (like `_validate_setitem_value`) will mostly raise TypeErrors?)

Now, this catching of a ValueError was already here before, so I am hesitant to remove it without looking at it in more detail. I would prefer to leave that for another PR.

So there are some cases where setitem actually raises a ValueError, e.g. when setting with an array-like that is not of length 1 (in this case of scalar indexer `at` or `iat`).

Now, the fallback to loc/iloc will then most likely also raise a ValueError (so catching it might not necessarily add that much). But at least in some cases it seems to change the error message.

For example:
Without catching/reraising in iloc, the error would be "ValueError: setting an array element with a sequence", which is slightly less informative.
I suppose we should actually make this handling consistent across the different code paths (so that the more informative error message is raised directly in the first place), but that's out of scope for this PR.
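
The try-then-fall-back pattern discussed in this thread can be sketched roughly as follows. This is a toy model under stated assumptions, not the pandas code path: `set_inplace`, `set_general` and `set_value` are hypothetical names, columns are stdlib `array.array` objects, and the wording of the second `ValueError` is invented purely to show how a fallback can change the message the user sees.

```python
from array import array
from numbers import Integral


def set_inplace(columns, loc, idx, value):
    """Strict fast path: write into the existing buffer or raise."""
    if not isinstance(loc, Integral):
        # e.g. a column lookup that resolved to a slice, not one position
        raise TypeError("The column index should be an integer")
    if isinstance(value, (list, tuple)):
        # terse low-level error for a non-scalar value
        raise ValueError("setting an array element with a sequence")
    columns[loc][idx] = value  # array('q') raises TypeError for a float


def _write(col, idx, value):
    """Write ``value``; widen int64 -> float64 when it does not fit."""
    try:
        col[idx] = value
        return col
    except TypeError:
        widened = array("d", list(col))
        widened[idx] = value
        return widened


def set_general(columns, loc, idx, value):
    """Stand-in for the general path: validates, coerces, handles slices."""
    if isinstance(value, (list, tuple)):
        # hypothetical, more descriptive message (assumption, not pandas text)
        raise ValueError(f"expected a scalar for one cell, got {len(value)} values")
    locs = range(len(columns))[loc] if isinstance(loc, slice) else [loc]
    for i in locs:
        columns[i] = _write(columns[i], idx, value)


def set_value(columns, loc, idx, value):
    """Try the fast in-place write first; fall back to the general path."""
    try:
        set_inplace(columns, loc, idx, value)
    except (TypeError, ValueError):
        set_general(columns, loc, idx, value)
```

In this sketch the `TypeError` branch covers both a value that does not fit the column and a non-integer column position, and the `ValueError` branch does not rescue the operation but does replace the terse message with the one raised by the general path.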