Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix fillna on multi indexed Dataframe doesn't work #47774

Merged
merged 9 commits into from
Aug 23, 2022

Conversation

xr-chen
Copy link
Contributor

@xr-chen xr-chen commented Jul 17, 2022

@@ -1936,7 +1936,11 @@ def _setitem_with_indexer_frame_value(self, indexer, value: DataFrame, name: str

else:
for loc in ilocs:
item = self.obj.columns[loc]
level_diff = self.obj.columns.nlevels - value.columns.nlevels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a change in behavior. I am not sure, if we want this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is probably not a good idea, I just tried to avoid unexpected NAN when setting values of a multi-indexed Dataframe, from my observation, the unexpected NAN may be caused by item (column index ("x", "a")) in line 1940 is not a column index of value (DataFrame has columns index "a")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The underlying root cause is, that getitem reduces the dimension of the MultiIndex. Not sure how to avoid this when setting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. fixing the underlying issue is probably not an option for the regression fix, i.e. the issue that this PR closes listed in the OP.

A targeted PR for #47649 that could be backported would not address the underlying issue but could maybe make changes around the change made in #47327 to maybe convert the indexer to one that works #47649 (comment).

Copy link
Contributor Author

@xr-chen xr-chen Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas/pandas/core/generic.py

Lines 6677 to 6679 in a7c5773

result.loc[:, k] = result[k].fillna(
v, limit=limit, downcast=downcast_k
)

I was making some changes by adding .values at end of line 6679, looks like that works well. Maybe we could ignore the index in fillna since we know result[k] is part of result. @simonjayhawkins

@xr-chen xr-chen marked this pull request as draft July 18, 2022 20:19
@CloseChoice
Copy link
Member

I also looked into this one and found that the issue is that somewhere downstream we explicitly check if the column names match exactly, but since we just do:
https://github.com/pandas-dev/pandas/blob/main/pandas/core/generic.py#L6677
we only hand the columns ['a', 'b'] to loc and not the correct multiindex.

So if we change the line refered to above to:

                    if result[k].ndim > 1:
                        res = result[[k]].fillna(
                            v, limit=limit, downcast=downcast_k
                        )
                    else:
                        res = result[k].fillna(
                            v, limit=limit, downcast=downcast_k
                        )
                    result.loc[:, k] = res

the tests work. Feel free to continue from here if you think that's a suitable way.

@phofl
Copy link
Member

phofl commented Jul 19, 2022

The list indexer makes a copy and hence would avoid operating inplace I think, also we have this issue in multiple places, we have to fix this in loc itself, but not sure how to achieve this right now

@CloseChoice
Copy link
Member

CloseChoice commented Jul 19, 2022

The list indexer makes a copy and hence would avoid operating inplace I think, also we have this issue in multiple places, we have to fix this in loc itself, but not sure how to achieve this right now

Just checked, this solution works inplace.

Let me break down as I understand the problem. Suppose we do:

pdf = pd.DataFrame(
    {
        ("x", "a"): [np.nan, 2, 3, 4, np.nan, 6],
        ("x", "b"): [1, 2, np.nan, 4, np.nan, np.nan],
        ("y", "c"): [1, 2, 3, 4, np.nan, np.nan],
    },
    index=np.random.rand(6),
)

>>> pdf.fillna({"x": -1})

then at

result.loc[:, k] = result[k].fillna(

we will hand over

[In]: df
[Out]:
            a    b
0.666291 -1.0  1.0
0.874578  2.0  2.0
0.845121  3.0 -1.0
0.224814  4.0  4.0
0.442914 -1.0 -1.0
0.224514  6.0 -1.0

to loc[:, k] which then gets further downstreamed up until

pandas/pandas/core/indexing.py

Lines 1938 to 1946 in a7c5773

for loc in ilocs:
item = self.obj.columns[loc]
if item in value:
sub_indexer[1] = item
val = self._align_series(
tuple(sub_indexer), value[item], multiindex_indexer
)
else:
val = np.nan

And here in line 1940 things get off the rails. Because now we explicitly check if ('x', 'a') is in df which of course is not the case since we eliminated the multiindex by the we indexed. Fixing the indexing in case of dataframes fixes this issue.

@phofl
Copy link
Member

phofl commented Jul 19, 2022

The main issue is the following:

df.loc[a, :] = df.loc[a, :] * 2

does not work anymore if the MultiIndex is reduced. Switching to dataframe indexers is not a general solution, hence not desirable imo.

Did you check if a view on the original dataframe gets updated too? Like

view = df[:]
df…

This should update the view too

@CloseChoice
Copy link
Member

The main issue is the following:

df.loc[a, :] = df.loc[a, :] * 2

does not work anymore if the MultiIndex is reduced. Switching to dataframe indexers is not a general solution, hence not desirable imo.

Did you check if a view on the original dataframe gets updated too? Like

view = df[:] df…

This should update the view too

Yes, it changes a view, too. But I see your point. Just as @xr-chen already tried, every change in _setitem_with_indexer_frame_value would lead to other undesired changes in behaviour. So, that's a tough one

@xr-chen
Copy link
Contributor Author

xr-chen commented Jul 19, 2022

pandas/pandas/core/indexing.py

Lines 1825 to 1829 in a7c5773

if isinstance(value, ABCDataFrame):
self._setitem_with_indexer_frame_value(indexer, value, name)
elif np.ndim(value) == 2:
self._setitem_with_indexer_2d_value(indexer, value)

I found that if we call _setitem_with_indexer_2d_value instead of _setitem_with_indexer_frame_value when setting values of a multi-indexed Dataframe, df.loc[a, :] = df.loc[a, :] * 2 will work, #47649 will be solved, and all tests in \pandas\tests\indexing will pass. Is that a suitable solution?

@phofl
Copy link
Member

phofl commented Jul 19, 2022

I think this would also change the behavior in general, similar to what you’ve tried initially

@CloseChoice
Copy link
Member

What about changing df.loc in such a fashion that it doesn't reduce the multiindex? That's also a change in behaviour but is it one we can't do?

@xr-chen
Copy link
Contributor Author

xr-chen commented Jul 19, 2022

pandas/pandas/core/generic.py

Lines 6677 to 6679 in 3a39d25

result.loc[:, k] = result[k].fillna(
v, limit=limit, downcast=downcast_k
)

Could we do
result.loc[:, k] = result[k].fillna( v, limit=limit, downcast=downcast_k ).values
in fillna? Since in this specific senario, we don't care result's index. And we add a conveat in documentation, to change value of a multi-indexed dataframe under level a, we should use df.loc[a, :] = df.loc[[a], :] * 2 rather than df.loc[a, :] = df.loc[a, :] * 2 which would bring unexpected NAN to original DataFrame.

@xr-chen xr-chen marked this pull request as ready for review July 20, 2022 15:46
@mroeschke mroeschke added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate DataFrame DataFrame data structure labels Jul 22, 2022
@xr-chen
Copy link
Contributor Author

xr-chen commented Aug 3, 2022

Any suggestions on this regression fix? @phofl @simonjayhawkins

@simonjayhawkins
Copy link
Member

sorry @xr-chen for the slow response. milestoning 1.4.4 to keep it on my radar and will look soon

@simonjayhawkins simonjayhawkins added this to the 1.4.4 milestone Aug 3, 2022
@phofl
Copy link
Member

phofl commented Aug 3, 2022

I would be ok with the values changes for now, if you can add a TODO and link to the issue that was opened about the multiplication case. We should fix this in loc in the future.

@xr-chen
Copy link
Contributor Author

xr-chen commented Aug 3, 2022

Like TODO: we should fix GH47649 by fixing the issue in loc mentioned in #45751 in the future?

@phofl
Copy link
Member

phofl commented Aug 3, 2022

Something like:

Revert when issue number… is fixed

result.loc[:, k] = (
result[k].fillna(v, limit=limit, downcast=downcast_k).values
)
# TODO: Revert to result.loc[:, k] = result[k].fillna(...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put this right next to the todo and we want to use loc on both sides I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean make comments like this?

# TODO: result.loc[:, k] = result.loc[:, k].fillna(
#     v, limit=limit, downcast=downcast_k
# )
# Revert when GH45751 is fixed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx

# GH 47649
pdf = DataFrame(
{
("x", "a"): [np.nan, 2.0, 3.0, 4.0, np.nan, 6.0],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you simplify the DataFrame? 2 rows should be sufficient.

Also please add a test with ea dtypes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, could you be more specific about what dtypes we want in the test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Int64 for example

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more or less ready

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xr-chen will merge on green.

doc/source/whatsnew/v1.4.4.rst Outdated Show resolved Hide resolved
@simonjayhawkins simonjayhawkins merged commit ae291e9 into pandas-dev:main Aug 23, 2022
@lumberbot-app

This comment was marked as resolved.

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Aug 23, 2022
simonjayhawkins added a commit that referenced this pull request Aug 23, 2022
…Dataframe doesn't work) (#48216)

Backport PR #47774: BUG: Fix fillna on multi indexed Dataframe doesn't work

Co-authored-by: Xingrong Chen <56777910+xr-chen@users.noreply.github.com>
@xr-chen xr-chen deleted the fillna branch August 23, 2022 16:29
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DataFrame DataFrame data structure Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: multi-index df fillna with value not working after 1.4.3
6 participants