-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix fillna on multi indexed Dataframe doesn't work #47774
Conversation
pandas/core/indexing.py
Outdated
@@ -1936,7 +1936,11 @@ def _setitem_with_indexer_frame_value(self, indexer, value: DataFrame, name: str | |||
|
|||
else: | |||
for loc in ilocs: | |||
item = self.obj.columns[loc] | |||
level_diff = self.obj.columns.nlevels - value.columns.nlevels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a change in behavior. I am not sure, if we want this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is probably not a good idea, I just tried to avoid unexpected NAN
when setting values of a multi-indexed Dataframe
, from my observation, the unexpected NAN
may be caused by item
(column index ("x", "a")) in line 1940 is not a column index of value
(DataFrame
has columns index "a")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The underlying root cause is, that getitem reduces the dimension of the MultiIndex. Not sure how to avoid this when setting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. fixing the underlying issue is probably not an option for the regression fix, i.e. the issue that this PR closes listed in the OP.
A targeted PR for #47649 that could be backported would not address the underlying issue but could maybe make changes around the change made in #47327 to maybe convert the indexer to one that works #47649 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines 6677 to 6679 in a7c5773
result.loc[:, k] = result[k].fillna( | |
v, limit=limit, downcast=downcast_k | |
) |
I was making some changes by adding
.values
at end of line 6679, looks like that works well. Maybe we could ignore the index in fillna
since we know result[k]
is part of result
. @simonjayhawkins
I also looked into this one and found that the issue is that somewhere downstream we explicitly check if the column names match exactly, but since we just do: So if we change the line refered to above to: if result[k].ndim > 1:
res = result[[k]].fillna(
v, limit=limit, downcast=downcast_k
)
else:
res = result[k].fillna(
v, limit=limit, downcast=downcast_k
)
result.loc[:, k] = res the tests work. Feel free to continue from here if you think that's a suitable way. |
The list indexer makes a copy and hence would avoid operating inplace I think, also we have this issue in multiple places, we have to fix this in loc itself, but not sure how to achieve this right now |
Just checked, this solution works inplace. Let me break down as I understand the problem. Suppose we do: pdf = pd.DataFrame(
{
("x", "a"): [np.nan, 2, 3, 4, np.nan, 6],
("x", "b"): [1, 2, np.nan, 4, np.nan, np.nan],
("y", "c"): [1, 2, 3, 4, np.nan, np.nan],
},
index=np.random.rand(6),
)
>>> pdf.fillna({"x": -1}) then at Line 6677 in a7c5773
we will hand over [In]: df
[Out]:
a b
0.666291 -1.0 1.0
0.874578 2.0 2.0
0.845121 3.0 -1.0
0.224814 4.0 4.0
0.442914 -1.0 -1.0
0.224514 6.0 -1.0 to pandas/pandas/core/indexing.py Lines 1938 to 1946 in a7c5773
And here in line 1940 things get off the rails. Because now we explicitly check if ('x', 'a') is in df which of course is not the case since we eliminated the multiindex by the we indexed. Fixing the indexing in case of dataframes fixes this issue.
|
The main issue is the following: df.loc[a, :] = df.loc[a, :] * 2 does not work anymore if the MultiIndex is reduced. Switching to dataframe indexers is not a general solution, hence not desirable imo. Did you check if a view on the original dataframe gets updated too? Like view = df[:] This should update the view too |
Yes, it changes a view, too. But I see your point. Just as @xr-chen already tried, every change in |
pandas/pandas/core/indexing.py Lines 1825 to 1829 in a7c5773
I found that if we call _setitem_with_indexer_2d_value instead of _setitem_with_indexer_frame_value when setting values of a multi-indexed Dataframe, df.loc[a, :] = df.loc[a, :] * 2 will work, #47649 will be solved, and all tests in \pandas\tests\indexing will pass. Is that a suitable solution?
|
I think this would also change the behavior in general, similar to what you’ve tried initially |
What about changing |
Lines 6677 to 6679 in 3a39d25
Could we do result.loc[:, k] = result[k].fillna( v, limit=limit, downcast=downcast_k ).values in fillna ? Since in this specific senario, we don't care result 's index. And we add a conveat in documentation, to change value of a multi-indexed dataframe under level a , we should use df.loc[a, :] = df.loc[[a], :] * 2 rather than df.loc[a, :] = df.loc[a, :] * 2 which would bring unexpected NAN to original DataFrame.
|
Any suggestions on this regression fix? @phofl @simonjayhawkins |
sorry @xr-chen for the slow response. milestoning 1.4.4 to keep it on my radar and will look soon |
I would be ok with the values changes for now, if you can add a TODO and link to the issue that was opened about the multiplication case. We should fix this in loc in the future. |
Like TODO: we should fix GH47649 by fixing the issue in loc mentioned in #45751 in the future? |
Something like: Revert when issue number… is fixed |
pandas/core/generic.py
Outdated
result.loc[:, k] = ( | ||
result[k].fillna(v, limit=limit, downcast=downcast_k).values | ||
) | ||
# TODO: Revert to result.loc[:, k] = result[k].fillna(...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put this right next to the todo and we want to use loc on both sides I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean make comments like this?
# TODO: result.loc[:, k] = result.loc[:, k].fillna(
# v, limit=limit, downcast=downcast_k
# )
# Revert when GH45751 is fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx
# GH 47649 | ||
pdf = DataFrame( | ||
{ | ||
("x", "a"): [np.nan, 2.0, 3.0, 4.0, np.nan, 6.0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you simplify the DataFrame? 2 rows should be sufficient.
Also please add a test with ea dtypes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, could you be more specific about what dtypes we want in the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Int64 for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more or less ready
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xr-chen will merge on green.
This comment was marked as resolved.
This comment was marked as resolved.
…ame doesn't work
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.