Filtering dataframe with sparse column leads to NAs in sparse column #27781

stelsemeyer · 2019-08-06T15:00:41Z

Code Sample, a copy-pastable example if possible

import pandas as pd

df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), 'B': [1,2,3]})
# df1_filtered will have NAs in column A
df1_filtered = df1.loc[df1['B'] != 2]

df2 = pd.DataFrame({"A": pd.SparseArray([0, 1, 0]), 'B': [1,2,3]})
# df2_filtered has no NAs in column A
df2_filtered = df2.loc[df2['B'] != 2]

where df1_filtered will look like

	A	B
0	NaN	1
2	NaN	3

and df2_filtered like

	A	B
0	0	1
2	0	3

Problem description

Filtering a dataframe with an all-zero sparse column can lead to NAs in the sparse column.

Expected Output

Both data frames should be the same, as filtering a dataframe with non-missing data should not lead to missing data.

Output of `pd.show_versions()`

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

commit : None

pandas : 0.25.0
numpy : 1.16.2
pytz : 2019.1
dateutil : 2.8.0
pip : 19.2.1
setuptools : 39.1.0
Cython : None
pytest : 4.3.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.6.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : 0.2.1
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
s3fs : 0.2.0
scipy : 1.2.1
sqlalchemy : 1.3.5
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2019-08-06T20:45:57Z

@stelsemeyer Thanks for the report!
This seems to be a regression compared to 0.23 (compared to SparseSeries).

This is a bug in the SparseArray.take implementation, when allow_fill=True it uses a wrong fill value (nan instead of 0):

In [3]: df1.A.array.take([0, 2]) 
Out[3]: 
[0, 0]
Fill: 0
BlockIndex
Block locations: array([], dtype=int32)
Block lengths: array([], dtype=int32)

In [4]: df1.A.array.take([0, 2], allow_fill=True) 
Out[4]: 
[nan, nan]
Fill: 0
IntIndex
Indices: array([0, 1], dtype=int32)

(both above should give the same result)

Always welcome to take a look to see how it could be fixed!

stelsemeyer · 2019-08-07T14:54:34Z

@jorisvandenbossche: Thanks for investigating!
I checked the SparseArray, a naive solution would be to use self.fill_value if fill_value is None in _take_with_fill, here:

pandas/pandas/core/arrays/sparse.py

Lines 1173 to 1174 in d1accd0

    
           if fill_value is None: 
        
               fill_value = self.dtype.na_value

TomAugspurger · 2019-08-19T17:16:23Z

Removing from the 0.25.1 milestone, but if anyone is working on this LMK and we can probably get it in.

@stelsemeyer your proposal looks reasonable.

scottgigante · 2019-09-17T01:43:28Z

I think this is a related issue:

>>> import pandas as pd
>>> X = pd.DataFrame([[0,1,0], [1,0,0], [1,1,0]]).astype(pd.SparseDtype(float, fill_value=0.0))
>>> X
     0    1    2
0  0.0  1.0  0.0
1  1.0  0.0  0.0
2  1.0  1.0  0.0
>>> X.loc[0]
0    0.0
1    1.0
2    0.0
Name: 0, dtype: Sparse[float64, 0.0]
>>> X.loc[[0,1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN
>>> X.iloc[[0,1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN

scottgigante · 2019-09-21T18:54:00Z

I edited the proposed line, but to no avail. The error in @jorisvandenbossche's answer is resolved:

>>> import pandas as pd
>>> df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), 'B': [1,2,3]})
>>> df1.A.array.take([0, 2])
[0, 0]
Fill: 0
BlockIndex
Block locations: array([], dtype=int32)
Block lengths: array([], dtype=int32)
>>> df1.A.array.take([0, 2], allow_fill=True)
[0, 0]
Fill: 0
IntIndex
Indices: array([], dtype=int32)

but my and @stelsemeyer's issues remain.

>>> df1.loc[df1['B'] != 2]
    A  B
0 NaN  1
2 NaN  3
>>> 
>>> X = pd.DataFrame([[0,1,0], [1,0,0], [1,1,0]]).astype(pd.SparseDtype(float, fill_value=0.0))
>>> X.loc[[0,1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN

Seems to me that there is another problem here:

pandas/pandas/core/internals/managers.py

Line 1262 in a45760f

fill_value if fill_value is not None else blk.fill_value,

>>> blk = X._data.blocks[2]
>>> blk.take_nd(indexer=np.array([0,1]), axis=1).values
[0.0, 0.0]
Fill: 0.0
IntIndex
Indices: array([], dtype=int32)
>>> blk.take_nd(indexer=np.array([0,1]), axis=1, fill_tuple=(blk.fill_value,)).values
[nan, nan]
Fill: 0.0
IntIndex
Indices: array([0, 1], dtype=int32)

which is because of a discrepancy between blk.fill_value and blk.dtype.fill_value

>>> blk.fill_value
nan
>>> blk.dtype.fill_value
0.0

I don't know if we should a) reference blk.dtype.fill_value or b) make blk.dtype.fill_value consistent with blk.fill_value.

scottgigante · 2019-09-28T01:25:27Z

@TomAugspurger any thoughts on this? I'm happy to write the PR, just need some guidance.

TomAugspurger · 2019-10-02T22:55:38Z

Mmm I'm not sure I understand the issue. But note that doing a .take which introduces missing values via a -1 in the indices should result in a NaN in the output, regardless of the fill value. Not sure if that helps or not.

scottgigante · 2019-10-04T16:12:38Z

Can you give an example of how/why that would happen? I don't understand quite how we should get a NaN when the fill_value is not nan.

scottgigante · 2019-10-04T17:19:52Z

Here's my proposed solution:

Replace

    @property
    def fill_value(self):
        # Used in reindex_indexer
        return self.values.dtype.na_value

from https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L1730 with

    @property
    def fill_value(self):
        # Used in reindex_indexer
        try:
            return self.values.dtype.fill_value
        except AttributeError:
            return self.values.dtype.na_value

Thoughts?

…7781

akdor1154 · 2019-12-13T01:10:56Z

Pretty sure my dataset is showing this. Seems to apply to some columns but not all.
strange_test.pickle.gz

If anyone wants a large real-world set to test this on:

import pandas as pd
test = pd.read_pickle('strange_test.pickle.gz')
any(test.DistinctSKUs_SEPB.isna())
> False
any(test.loc[lambda _: _.IsBTS].DistinctSKUs_SEPB.isna())
> True # !?!

scottgigante · 2019-12-13T06:31:24Z

@akdor1154 can you try this monkey patch and see if it solves your issue?

def fill_value(self):
    # Used in reindex_indexer
    try:
        return self.values.dtype.fill_value
    except AttributeError:
        return self.values.dtype.na_value

from pandas.core.internals.blocks import ExtensionBlock

setattr(ExtensionBlock, "fill_value", property(fill_value))

mroeschke · 2020-04-20T00:37:14Z

As mentioned #29321 (comment), it may be an issue as well when the sparse series matches the fill value

scottgigante · 2020-04-28T21:25:20Z

Pretty sure my monkey patch works. I can write a PR if I can get approval from @TomAugspurger or @jorisvandenbossche

>>> import pandas as pd
>>> df1 = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 0]), 'B': [1,2,3]})
>>> df1.loc[df1['B'] != 2]
    A  B
0 NaN  1
2 NaN  3
>>> def fill_value(self):
...     # Used in reindex_indexer
...     try:
...         return self.values.dtype.fill_value
...     except AttributeError:
...         return self.values.dtype.na_value
...
>>> from pandas.core.internals.blocks import ExtensionBlock
>>>
>>> setattr(ExtensionBlock, "fill_value", property(fill_value))
>>> df1.loc[df1['B'] != 2]
   A  B
0  0  1
2  0  3

>>> import pandas as pd
>>> X = pd.DataFrame([[0,1,0], [1,0,0], [1,1,0]]).astype(
...     pd.SparseDtype(float, fill_value=0.0))
>>> X.loc[[0,1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN
>>> def fill_value(self):
...     # Used in reindex_indexer
...     try:
...         return self.values.dtype.fill_value
...     except AttributeError:
...         return self.values.dtype.na_value
...
>>> from pandas.core.internals.blocks import ExtensionBlock
>>>
>>> setattr(ExtensionBlock, "fill_value", property(fill_value))
>>> X.loc[[0,1]]
     0    1    2
0  0.0  1.0  0.0
1  1.0  0.0  0.0

>>> import pandas as pd
>>> test = pd.read_pickle('strange_test.pickle.gz')
>>> any(test.DistinctSKUs_SEPB.isna())
False
>>> any(test.loc[lambda _: _.IsBTS].DistinctSKUs_SEPB.isna())
True
>>> def fill_value(self):
...     # Used in reindex_indexer
...     try:
...         return self.values.dtype.fill_value
...     except AttributeError:
...         return self.values.dtype.na_value
...
>>> from pandas.core.internals.blocks import ExtensionBlock
>>>
>>> setattr(ExtensionBlock, "fill_value", property(fill_value))
>>> any(test.loc[lambda _: _.IsBTS].DistinctSKUs_SEPB.isna())
False

akdor1154 · 2020-04-29T04:59:55Z

@scottgigante sorry, from memory I tested it at the time and it worked, thanks.

connesy · 2020-05-13T07:05:44Z

@scottgigante Just tested your monkey patch, it works for me in pandas 1.0.2.

…as-dev#29563

Co-authored-by: Jeff Reback <jeff@reback.net>

pandas-dev#29563 (pandas-dev#34158)" This reverts commit a94b13a.

TomAugspurger · 2020-07-15T14:13:51Z

The fix here is being revereted in #35287. Some discussion on a potential fix at #35286 (comment).

cc @scottgigante if you want to take another shot :)

mroeschke · 2021-07-10T19:27:35Z

This looks fixed on master. Could use a test

In [4]: df1_filtered
Out[4]:
   A  B
0  0  1
2  0  3

In [5]: df2_filtered
Out[5]:
   A  B
0  0  1
2  0  3

EnerH · 2023-02-28T09:55:19Z

What about this:

df1_filtered['A'] = df1_filtered['A'].fillna(0)

Similarly, to change the NaN values in column 'A' of df2_filtered to 0, you can use the same method:

df2_filtered['A'] = df2_filtered['A'].fillna(0)

ConnorMcKinley · 2023-03-29T18:26:46Z

Take

jorisvandenbossche added Bug Sparse Sparse Data Type Regression Functionality that used to work in a prior pandas version labels Aug 6, 2019

jorisvandenbossche added this to the 0.25.1 milestone Aug 6, 2019

TomAugspurger modified the milestones: 0.25.1, Contributions Welcome Aug 19, 2019

scottgigante added a commit to KrishnaswamyLab/scprep that referenced this issue Oct 4, 2019

patch pandas sparse dataframe loc, monkey patches pandas-dev/pandas#2…

3ba22b9

…7781

TomAugspurger mentioned this issue Nov 7, 2019

Pandas dataframe sub-setting returns NaN's when whole column matches fill value #29321

Closed

memeplex mentioned this issue Nov 12, 2019

Sparse DataFrame with all-zeros column returns NA for fancy index #29563

Closed

TomAugspurger mentioned this issue Dec 4, 2019

Indexing fully 0-filled SparseArrays produces Nan-filled SparseArrays #30047

Closed

scottgigante added a commit to scottgigante/pandas that referenced this issue May 13, 2020

patch ExtensionBlock fill_value, closes pandas-dev#27781

022d44a

scottgigante added a commit to scottgigante/pandas that referenced this issue May 13, 2020

patch ExtensionBlock fill_value, closes pandas-dev#27781

4c1aa8c

scottgigante added a commit to scottgigante/pandas that referenced this issue May 13, 2020

patch ExtensionBlock fill_value, closes pandas-dev#27781, closes pand…

a4744ea

…as-dev#29563

scottgigante mentioned this issue May 13, 2020

REGR: use dtype.fill_value in ExtensionBlock.fill_value where available #34158

Merged

5 tasks

jreback removed this from the Contributions Welcome milestone May 14, 2020

jreback added this to the 1.1 milestone May 14, 2020

scottgigante added a commit to scottgigante/pandas that referenced this issue May 14, 2020

patch ExtensionBlock fill_value, closes pandas-dev#27781, closes pand…

3c96b60

…as-dev#29563

scottgigante added a commit to scottgigante/pandas that referenced this issue May 17, 2020

patch ExtensionBlock fill_value, closes pandas-dev#27781, closes pand…

0e8c204

…as-dev#29563

scottgigante added a commit to scottgigante/pandas that referenced this issue May 17, 2020

patch ExtensionBlock fill_value, closes pandas-dev#27781, closes pand…

e8d7648

…as-dev#29563

scottgigante added a commit to scottgigante/pandas that referenced this issue May 18, 2020

patch ExtensionBlock fill_value, closes pandas-dev#27781, closes pand…

c161ae9

…as-dev#29563

scottgigante added a commit to scottgigante/pandas that referenced this issue May 20, 2020

patch ExtensionBlock fill_value, closes pandas-dev#27781, closes pand…

5188e03

…as-dev#29563

jreback closed this as completed in #34158 Jun 1, 2020

jreback added a commit that referenced this issue Jun 1, 2020

patch ExtensionBlock fill_value, closes #27781, closes #29563 (#34158)

a94b13a

Co-authored-by: Jeff Reback <jeff@reback.net>

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jul 15, 2020

Revert "patch ExtensionBlock fill_value, closes pandas-dev#27781, closes

58dccd3

pandas-dev#29563 (pandas-dev#34158)" This reverts commit a94b13a.

TomAugspurger reopened this Jul 15, 2020

TomAugspurger modified the milestones: 1.1, Contributions Welcome Jul 15, 2020

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Jul 10, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

github-actions bot assigned ConnorMcKinley Mar 29, 2023

ConnorMcKinley mentioned this issue Apr 19, 2023

TST: Add regression test for NaN sparse columns #52772

Merged

2 tasks

mroeschke closed this as completed in #52772 Apr 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Filtering dataframe with sparse column leads to NAs in sparse column #27781

Filtering dataframe with sparse column leads to NAs in sparse column #27781

stelsemeyer commented Aug 6, 2019 •

edited

Loading

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

jorisvandenbossche commented Aug 6, 2019

stelsemeyer commented Aug 7, 2019

TomAugspurger commented Aug 19, 2019 •

edited

Loading

scottgigante commented Sep 17, 2019

scottgigante commented Sep 21, 2019

scottgigante commented Sep 28, 2019

TomAugspurger commented Oct 2, 2019

scottgigante commented Oct 4, 2019

scottgigante commented Oct 4, 2019

akdor1154 commented Dec 13, 2019 •

edited

Loading

scottgigante commented Dec 13, 2019

mroeschke commented Apr 20, 2020

scottgigante commented Apr 28, 2020

akdor1154 commented Apr 29, 2020

connesy commented May 13, 2020

TomAugspurger commented Jul 15, 2020

mroeschke commented Jul 10, 2021

EnerH commented Feb 28, 2023 •

edited

Loading

ConnorMcKinley commented Mar 29, 2023

Filtering dataframe with sparse column leads to NAs in sparse column #27781

Filtering dataframe with sparse column leads to NAs in sparse column #27781

Comments

stelsemeyer commented Aug 6, 2019 • edited Loading

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line] INSTALLED VERSIONS

jorisvandenbossche commented Aug 6, 2019

stelsemeyer commented Aug 7, 2019

TomAugspurger commented Aug 19, 2019 • edited Loading

scottgigante commented Sep 17, 2019

scottgigante commented Sep 21, 2019

scottgigante commented Sep 28, 2019

TomAugspurger commented Oct 2, 2019

scottgigante commented Oct 4, 2019

scottgigante commented Oct 4, 2019

akdor1154 commented Dec 13, 2019 • edited Loading

scottgigante commented Dec 13, 2019

mroeschke commented Apr 20, 2020

scottgigante commented Apr 28, 2020

akdor1154 commented Apr 29, 2020

connesy commented May 13, 2020

TomAugspurger commented Jul 15, 2020

mroeschke commented Jul 10, 2021

EnerH commented Feb 28, 2023 • edited Loading

ConnorMcKinley commented Mar 29, 2023

stelsemeyer commented Aug 6, 2019 •

edited

Loading

Output of `pd.show_versions()`

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

TomAugspurger commented Aug 19, 2019 •

edited

Loading

akdor1154 commented Dec 13, 2019 •

edited

Loading

EnerH commented Feb 28, 2023 •

edited

Loading