-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF/API: DataFrame.__setitem__ never operate in-place #39510
REF/API: DataFrame.__setitem__ never operate in-place #39510
Conversation
…titem-frame-no-defer
…titem-frame-no-defer
else: | ||
if isinstance(value, DataFrame): | ||
check_key_length(self.columns, key, value) | ||
for k1, k2 in zip(key, value.columns): | ||
self[k1] = value[k2] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
really like to move this stuff out of here into another module, but i guess that's for another day
raise NotImplementedError | ||
|
||
if np.shape(value)[-1] != len(ilocs): | ||
raise ValueError("Columns must be same length as key") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are all of the branches covered? this is a lot of new code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not the NotImplementedError on L3276; not sure what to do there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but the rest is covered? what about the previous paths this was taking, can they be removed now? (e.g. they are uncovered)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated with a test that hits the two raise ValueError("Columns must be same length as key")
lines. everything other than the NotImplementedError is reached
getting this and #39163 in will make it easier to keep ArrayManager behavior in sync with non-ArrayManager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give some explanation what this PR is exactly changing? (it's hard to see that from the code changed in frame.py)
expected["c"] = expected["c"].astype(arr.dtype) | ||
expected["d"] = expected["d"].astype(arr.dtype) | ||
assert expected["c"].dtype == arr.dtype | ||
assert expected["d"].dtype == arr.dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What changed here? (I wouldn't expect the above test to change behaviour related to this PR, as it's adding new columns)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC this was needed for the 32bit builds
There are cases where |
I think we talked about this a few weeks back. I prepared a few tests back then but did not get around to push this forward. The tests are checking, if we are really not operating inplace. Do you think it is worth adding them?
|
absolutely. will check that these pass on the branch and then update. thanks |
ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify the whatsnew a bit more about what actually changed / what the impact is for the user? (this is not really a bug fix ..)
The only part to call a bugfix is the int32 changes (see affected tests). The changed exception messages hardly seem worth noting |
The whatsnew note now says "some cases". Eg could make that more specific? |
To try to clarify what I mean. I didn't have time to review the code of this PR in enough detail to know what it changes, but I would still expect to get an idea about that based on the (whatsnew) description. But now neither the PR description nor the whatsnew gives me a clear idea of the (potential) changes. |
ATM Luckily some of the tests suggested by @phofl fails in master:
|
OK, so a consequence is that the resulting dtype can change compared to master when you are setting multiple columns. For example (using this branch):
On master, the above preserves the all-float dtypes. Do you know if that is specific to that case? (multiple columns, and setting with a scalar) From quick testing, it seems that other cases (like setting multiple columns with a dataframe, or array, .. ) didn't try to preserve the dtype if possible. |
The only place where we go through
which is pretty well summed up by "multiple columns" |
I would personally then add a small subsection to the "Notable bug fixes" part showing the example of setting multiple columns now having different dtypes as result (there is already a subsection for iloc/loc changes (actually trying inplace instead of replacing), which is quite related (but opposite) change. |
self.iloc[:, indexer] = value | ||
self._iset_not_inplace(key, value) | ||
|
||
def _iset_not_inplace(self, key, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment (or informal docstring) here stating briefly which cases this covers? (eg I think only for listlike key and value?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea, will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, LMK if you find this informative
I think comments are addressed, so this is ready for another look (once the CI runs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you merge master
# hypothetically would return obj.iloc[:, i] | ||
if isinstance(obj, np.ndarray): | ||
return obj[..., i] | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the future would be really nice to move lots of this elsewhere (e.g. indexing.py or similar)
rebased, green ex two issues on master being addressed elsewhere |
thanks @jbrockmendel nice! |
There is a perf regression between 1.2.5 and 1.3.x that points to this PR that is not shown on the asv dashboard https://pandas.pydata.org/speed/pandas/#indexing.InsertColumns.time_assign_list_like_with_setitem?python=3.8&Cython=0.29.21&jinja2=
profiling gives
before
so spends some time in _iset_not_inplace Is this expected? |
With the current implementation, yes. I think a deep(ish) refactor can avoid the perf hit, but it's not a 1-day thing. cc @phofl BTW is there an option you can pass to the asv profile command like strip_dirs? (ideally just strip everything up through "site-packages/") |
xref #38896 (comment)