BUG: Fix fillna on multi indexed Dataframe doesn't work #47774

xr-chen · 2022-07-17T23:34:50Z

closes BUG: multi-index df fillna with value not working after 1.4.3 #47649 (Replace xxxx with the Github issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

phofl · 2022-07-18T00:09:09Z

pandas/core/indexing.py

@@ -1936,7 +1936,11 @@ def _setitem_with_indexer_frame_value(self, indexer, value: DataFrame, name: str

        else:
            for loc in ilocs:
-                item = self.obj.columns[loc]
+                level_diff = self.obj.columns.nlevels - value.columns.nlevels


This is a change in behavior. I am not sure, if we want this.

Yeah, this is probably not a good idea, I just tried to avoid unexpected NAN when setting values of a multi-indexed Dataframe, from my observation, the unexpected NAN may be caused by item (column index ("x", "a")) in line 1940 is not a column index of value (DataFrame has columns index "a")

The underlying root cause is, that getitem reduces the dimension of the MultiIndex. Not sure how to avoid this when setting.

yes. fixing the underlying issue is probably not an option for the regression fix, i.e. the issue that this PR closes listed in the OP.

A targeted PR for #47649 that could be backported would not address the underlying issue but could maybe make changes around the change made in #47327 to maybe convert the indexer to one that works #47649 (comment).

pandas/pandas/core/generic.py

Lines 6677 to 6679 in a7c5773

result.loc[:, k] = result[k].fillna(

v, limit=limit, downcast=downcast_k

)

I was making some changes by adding .values at end of line 6679, looks like that works well. Maybe we could ignore the index in fillna since we know result[k] is part of result. @simonjayhawkins

CloseChoice · 2022-07-19T16:11:57Z

I also looked into this one and found that the issue is that somewhere downstream we explicitly check if the column names match exactly, but since we just do:
https://github.com/pandas-dev/pandas/blob/main/pandas/core/generic.py#L6677
we only hand the columns ['a', 'b'] to loc and not the correct multiindex.

So if we change the line refered to above to:

                    if result[k].ndim > 1:
                        res = result[[k]].fillna(
                            v, limit=limit, downcast=downcast_k
                        )
                    else:
                        res = result[k].fillna(
                            v, limit=limit, downcast=downcast_k
                        )
                    result.loc[:, k] = res

the tests work. Feel free to continue from here if you think that's a suitable way.

phofl · 2022-07-19T16:38:40Z

The list indexer makes a copy and hence would avoid operating inplace I think, also we have this issue in multiple places, we have to fix this in loc itself, but not sure how to achieve this right now

CloseChoice · 2022-07-19T17:21:29Z

The list indexer makes a copy and hence would avoid operating inplace I think, also we have this issue in multiple places, we have to fix this in loc itself, but not sure how to achieve this right now

Just checked, this solution works inplace.

Let me break down as I understand the problem. Suppose we do:

pdf = pd.DataFrame(
    {
        ("x", "a"): [np.nan, 2, 3, 4, np.nan, 6],
        ("x", "b"): [1, 2, np.nan, 4, np.nan, np.nan],
        ("y", "c"): [1, 2, 3, 4, np.nan, np.nan],
    },
    index=np.random.rand(6),
)

>>> pdf.fillna({"x": -1})

then at

pandas/pandas/core/generic.py

Line 6677 in a7c5773

result.loc[:, k] = result[k].fillna(

we will hand over

[In]: df
[Out]:
            a    b
0.666291 -1.0  1.0
0.874578  2.0  2.0
0.845121  3.0 -1.0
0.224814  4.0  4.0
0.442914 -1.0 -1.0
0.224514  6.0 -1.0

to loc[:, k] which then gets further downstreamed up until

pandas/pandas/core/indexing.py

Lines 1938 to 1946 in a7c5773

    
           for loc in ilocs: 
        
               item = self.obj.columns[loc] 
        
               if item in value: 
        
                   sub_indexer[1] = item 
        
                   val = self._align_series( 
        
                       tuple(sub_indexer), value[item], multiindex_indexer 
        
                   ) 
        
               else: 
        
                   val = np.nan

And here in line 1940 things get off the rails. Because now we explicitly check if ('x', 'a') is in df which of course is not the case since we eliminated the multiindex by the we indexed. Fixing the indexing in case of dataframes fixes this issue.

phofl · 2022-07-19T17:24:38Z

The main issue is the following:

df.loc[a, :] = df.loc[a, :] * 2

does not work anymore if the MultiIndex is reduced. Switching to dataframe indexers is not a general solution, hence not desirable imo.

Did you check if a view on the original dataframe gets updated too? Like

view = df[:]
df…

This should update the view too

CloseChoice · 2022-07-19T17:34:21Z

The main issue is the following:

df.loc[a, :] = df.loc[a, :] * 2

does not work anymore if the MultiIndex is reduced. Switching to dataframe indexers is not a general solution, hence not desirable imo.

Did you check if a view on the original dataframe gets updated too? Like

view = df[:] df…

This should update the view too

Yes, it changes a view, too. But I see your point. Just as @xr-chen already tried, every change in _setitem_with_indexer_frame_value would lead to other undesired changes in behaviour. So, that's a tough one

xr-chen · 2022-07-19T18:26:13Z

pandas/pandas/core/indexing.py

Lines 1825 to 1829 in a7c5773

    
           if isinstance(value, ABCDataFrame): 
        
               self._setitem_with_indexer_frame_value(indexer, value, name) 
        
           elif np.ndim(value) == 2: 
        
               self._setitem_with_indexer_2d_value(indexer, value)

I found that if we call _setitem_with_indexer_2d_value instead of _setitem_with_indexer_frame_value when setting values of a multi-indexed Dataframe, df.loc[a, :] = df.loc[a, :] * 2 will work, #47649 will be solved, and all tests in \pandas\tests\indexing will pass. Is that a suitable solution?

phofl · 2022-07-19T18:28:10Z

I think this would also change the behavior in general, similar to what you’ve tried initially

CloseChoice · 2022-07-19T19:22:32Z

What about changing df.loc in such a fashion that it doesn't reduce the multiindex? That's also a change in behaviour but is it one we can't do?

xr-chen · 2022-07-19T19:26:29Z

pandas/pandas/core/generic.py

Lines 6677 to 6679 in 3a39d25

    
           result.loc[:, k] = result[k].fillna( 
        
               v, limit=limit, downcast=downcast_k 
        
           )

Could we do
result.loc[:, k] = result[k].fillna( v, limit=limit, downcast=downcast_k ).values
in fillna? Since in this specific senario, we don't care result's index. And we add a conveat in documentation, to change value of a multi-indexed dataframe under level a, we should use df.loc[a, :] = df.loc[[a], :] * 2 rather than df.loc[a, :] = df.loc[a, :] * 2 which would bring unexpected NAN to original DataFrame.

xr-chen · 2022-08-03T16:05:22Z

Any suggestions on this regression fix? @phofl @simonjayhawkins

simonjayhawkins · 2022-08-03T16:16:55Z

sorry @xr-chen for the slow response. milestoning 1.4.4 to keep it on my radar and will look soon

phofl · 2022-08-03T17:06:44Z

I would be ok with the values changes for now, if you can add a TODO and link to the issue that was opened about the multiplication case. We should fix this in loc in the future.

xr-chen · 2022-08-03T18:05:04Z

Like TODO: we should fix GH47649 by fixing the issue in loc mentioned in #45751 in the future?

phofl · 2022-08-03T18:08:05Z

Something like:

Revert when issue number… is fixed

phofl · 2022-08-03T21:41:25Z

pandas/core/generic.py

                    result.loc[:, k] = (
                        result[k].fillna(v, limit=limit, downcast=downcast_k).values
                    )
+                    # TODO: Revert to result.loc[:, k] = result[k].fillna(...)


Please put this right next to the todo and we want to use loc on both sides I think

You mean make comments like this?

# TODO: result.loc[:, k] = result.loc[:, k].fillna( # v, limit=limit, downcast=downcast_k # ) # Revert when GH45751 is fixed

phofl · 2022-08-12T10:21:18Z

pandas/tests/frame/methods/test_fillna.py

+        # GH 47649
+        pdf = DataFrame(
+            {
+                ("x", "a"): [np.nan, 2.0, 3.0, 4.0, np.nan, 6.0],


Could you simplify the DataFrame? 2 rows should be sufficient.

Also please add a test with ea dtypes

Sure, could you be more specific about what dtypes we want in the test?

Int64 for example

phofl

This is more or less ready

simonjayhawkins

Thanks @xr-chen will merge on green.

doc/source/whatsnew/v1.4.4.rst

…ame doesn't work

…Dataframe doesn't work) (#48216) Backport PR #47774: BUG: Fix fillna on multi indexed Dataframe doesn't work Co-authored-by: Xingrong Chen <56777910+xr-chen@users.noreply.github.com>

…7774)

fix 47649

462134d

phofl reviewed Jul 18, 2022

View reviewed changes

xr-chen marked this pull request as draft July 18, 2022 20:19

fix fillna ignore underlying issue

1bcdaef

xr-chen marked this pull request as ready for review July 20, 2022 15:46

fix pre commit

5409961

mroeschke added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate DataFrame DataFrame data structure labels Jul 22, 2022

xr-chen requested review from simonjayhawkins and phofl July 29, 2022 00:43

simonjayhawkins added this to the 1.4.4 milestone Aug 3, 2022

add TODO

764f0ab

phofl reviewed Aug 3, 2022

View reviewed changes

update TODO

f3d79a9

phofl reviewed Aug 12, 2022

View reviewed changes

phofl mentioned this pull request Aug 12, 2022

REGR: fix eval with inplace=True to correctly update column values inplace #47550

Merged

5 tasks

Add small changes

e384fd6

phofl reviewed Aug 18, 2022

View reviewed changes

Merge branch 'main' into fillna

336f92d

phofl approved these changes Aug 19, 2022

View reviewed changes

Merge branch 'main' into fillna

79a542b

simonjayhawkins approved these changes Aug 23, 2022

View reviewed changes

jorisvandenbossche approved these changes Aug 23, 2022

View reviewed changes

doc/source/whatsnew/v1.4.4.rst Outdated Show resolved Hide resolved

Update doc/source/whatsnew/v1.4.4.rst

15db9ab

simonjayhawkins merged commit ae291e9 into pandas-dev:main Aug 23, 2022

This comment was marked as resolved.

Sign in to view

lumberbot-app bot added the Still Needs Manual Backport label Aug 23, 2022

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Aug 23, 2022

Backport PR pandas-dev#47774: BUG: Fix fillna on multi indexed Datafr…

43c0f71

…ame doesn't work

simonjayhawkins mentioned this pull request Aug 23, 2022

Backport PR #47774 on branch 1.4.x (BUG: Fix fillna on multi indexed Dataframe doesn't work) #48216

Merged

simonjayhawkins added MultiIndex and removed Still Needs Manual Backport labels Aug 23, 2022

xr-chen deleted the fillna branch August 23, 2022 16:29

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022

BUG: Fix fillna on multi indexed Dataframe doesn't work (pandas-dev#4…

56d07f8

…7774)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Fix fillna on multi indexed Dataframe doesn't work #47774

BUG: Fix fillna on multi indexed Dataframe doesn't work #47774

xr-chen commented Jul 17, 2022 •

edited

Loading

phofl Jul 18, 2022

xr-chen Jul 18, 2022

phofl Jul 18, 2022

simonjayhawkins Jul 20, 2022

xr-chen Jul 26, 2022 •

edited

Loading

CloseChoice commented Jul 19, 2022

phofl commented Jul 19, 2022

CloseChoice commented Jul 19, 2022 •

edited

Loading

phofl commented Jul 19, 2022

CloseChoice commented Jul 19, 2022

xr-chen commented Jul 19, 2022

phofl commented Jul 19, 2022

CloseChoice commented Jul 19, 2022

xr-chen commented Jul 19, 2022 •

edited

Loading

xr-chen commented Aug 3, 2022

simonjayhawkins commented Aug 3, 2022

phofl commented Aug 3, 2022

xr-chen commented Aug 3, 2022

phofl commented Aug 3, 2022

phofl Aug 3, 2022

xr-chen Aug 3, 2022

phofl Aug 3, 2022

phofl Aug 12, 2022

xr-chen Aug 12, 2022

phofl Aug 18, 2022

phofl left a comment

simonjayhawkins left a comment

This comment was marked as resolved.

	result.loc[:, k] = result[k].fillna(
	v, limit=limit, downcast=downcast_k
	)

BUG: Fix fillna on multi indexed Dataframe doesn't work #47774

BUG: Fix fillna on multi indexed Dataframe doesn't work #47774

Conversation

xr-chen commented Jul 17, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xr-chen Jul 26, 2022 • edited Loading

Choose a reason for hiding this comment

CloseChoice commented Jul 19, 2022

phofl commented Jul 19, 2022

CloseChoice commented Jul 19, 2022 • edited Loading

phofl commented Jul 19, 2022

CloseChoice commented Jul 19, 2022

xr-chen commented Jul 19, 2022

phofl commented Jul 19, 2022

CloseChoice commented Jul 19, 2022

xr-chen commented Jul 19, 2022 • edited Loading

xr-chen commented Aug 3, 2022

simonjayhawkins commented Aug 3, 2022

phofl commented Aug 3, 2022

xr-chen commented Aug 3, 2022

phofl commented Aug 3, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

phofl left a comment

Choose a reason for hiding this comment

simonjayhawkins left a comment

Choose a reason for hiding this comment

This comment was marked as resolved.

xr-chen commented Jul 17, 2022 •

edited

Loading

xr-chen Jul 26, 2022 •

edited

Loading

CloseChoice commented Jul 19, 2022 •

edited

Loading

xr-chen commented Jul 19, 2022 •

edited

Loading