Filtering dataframe with sparse column leads to NAs in sparse column #27781

Closed
stelsemeyer opened this issue Aug 6, 2019 · 19 comments · Fixed by #34158 or #52772
Labels: good first issue · Needs Tests (Unit test(s) needed to prevent regressions) · Regression (Functionality that used to work in a prior pandas version) · Sparse (Sparse Data Type)

Comments

@stelsemeyer commented Aug 6, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd

df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), 'B': [1,2,3]})
# df1_filtered will have NAs in column A
df1_filtered = df1.loc[df1['B'] != 2]

df2 = pd.DataFrame({"A": pd.SparseArray([0, 1, 0]), 'B': [1,2,3]})
# df2_filtered has no NAs in column A
df2_filtered = df2.loc[df2['B'] != 2]

where df1_filtered will look like

	A	B
0	NaN	1
2	NaN	3

and df2_filtered like

	A	B
0	0	1
2	0	3

Problem description

Filtering a dataframe with an all-zero sparse column can lead to NAs in the sparse column.

Expected Output

Both data frames should be the same, as filtering a dataframe with non-missing data should not lead to missing data.
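
As a quick sanity check (my own sketch, not part of the original report; it uses pd.arrays.SparseArray, the spelling used later in this thread), the expected behaviour can be asserted directly:

import pandas as pd

df1 = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 0]), "B": [1, 2, 3]})
df1_filtered = df1.loc[df1["B"] != 2]

# On an affected version this assertion fails because column A becomes [NaN, NaN];
# on a fixed version the sparse fill value 0 is preserved and it passes.
assert not df1_filtered["A"].isna().any()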

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None

pandas : 0.25.0
numpy : 1.16.2
pytz : 2019.1
dateutil : 2.8.0
pip : 19.2.1
setuptools : 39.1.0
Cython : None
pytest : 4.3.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.6.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : 0.2.1
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
s3fs : 0.2.0
scipy : 1.2.1
sqlalchemy : 1.3.5
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@jorisvandenbossche added the Bug, Sparse, and Regression labels Aug 6, 2019
@jorisvandenbossche added this to the 0.25.1 milestone Aug 6, 2019
@jorisvandenbossche (Member) commented

@stelsemeyer Thanks for the report!
This seems to be a regression compared to 0.23 (relative to the old SparseSeries behaviour).

This is a bug in the SparseArray.take implementation: when allow_fill=True it uses the wrong fill value (nan instead of 0):

In [3]: df1.A.array.take([0, 2]) 
Out[3]: 
[0, 0]
Fill: 0
BlockIndex
Block locations: array([], dtype=int32)
Block lengths: array([], dtype=int32)

In [4]: df1.A.array.take([0, 2], allow_fill=True) 
Out[4]: 
[nan, nan]
Fill: 0
IntIndex
Indices: array([0, 1], dtype=int32)

(both above should give the same result)

You are always welcome to take a look and see how it could be fixed!

@stelsemeyer (Author) commented

@jorisvandenbossche: Thanks for investigating!
I checked SparseArray; a naive solution would be to use self.fill_value when fill_value is None in _take_with_fill, here:

if fill_value is None:
    fill_value = self.dtype.na_value
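
To make the two candidate defaults concrete, here is a small illustration (my own sketch, not part of the original comment):

import pandas as pd

arr = pd.arrays.SparseArray([0, 0, 0])  # integer data, fill_value defaults to 0

# What _take_with_fill currently falls back to vs. what the proposal suggests:
print(arr.dtype.na_value)  # nan -> current fallback, turns filled positions into NaN
print(arr.fill_value)      # 0   -> proposed fallback, preserves the sparse zeros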

@TomAugspurger (Contributor) commented Aug 19, 2019

Removing from the 0.25.1 milestone, but if anyone is working on this LMK and we can probably get it in.

@stelsemeyer your proposal looks reasonable.

@TomAugspurger modified the milestones: 0.25.1, Contributions Welcome Aug 19, 2019
@scottgigante (Contributor) commented

I think this is a related issue:

>>> import pandas as pd
>>> X = pd.DataFrame([[0,1,0], [1,0,0], [1,1,0]]).astype(pd.SparseDtype(float, fill_value=0.0))
>>> X
     0    1    2
0  0.0  1.0  0.0
1  1.0  0.0  0.0
2  1.0  1.0  0.0
>>> X.loc[0]
0    0.0
1    1.0
2    0.0
Name: 0, dtype: Sparse[float64, 0.0]
>>> X.loc[[0,1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN
>>> X.iloc[[0,1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN

@scottgigante (Contributor) commented

I edited the proposed line, but to no avail. The error in @jorisvandenbossche's comment is resolved:

>>> import pandas as pd
>>> df1 = pd.DataFrame({"A": pd.SparseArray([0, 0, 0]), 'B': [1,2,3]})
>>> df1.A.array.take([0, 2])
[0, 0]
Fill: 0
BlockIndex
Block locations: array([], dtype=int32)
Block lengths: array([], dtype=int32)
>>> df1.A.array.take([0, 2], allow_fill=True)
[0, 0]
Fill: 0
IntIndex
Indices: array([], dtype=int32)

but my and @stelsemeyer's issues remain.

>>> df1.loc[df1['B'] != 2]
    A  B
0 NaN  1
2 NaN  3
>>> 
>>> X = pd.DataFrame([[0,1,0], [1,0,0], [1,1,0]]).astype(pd.SparseDtype(float, fill_value=0.0))
>>> X.loc[[0,1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN

Seems to me that there is another problem here:

fill_value if fill_value is not None else blk.fill_value,

>>> import numpy as np
>>> blk = X._data.blocks[2]
>>> blk.take_nd(indexer=np.array([0,1]), axis=1).values
[0.0, 0.0]
Fill: 0.0
IntIndex
Indices: array([], dtype=int32)
>>> blk.take_nd(indexer=np.array([0,1]), axis=1, fill_tuple=(blk.fill_value,)).values
[nan, nan]
Fill: 0.0
IntIndex
Indices: array([0, 1], dtype=int32)

which is because of a discrepancy between blk.fill_value and blk.dtype.fill_value

>>> blk.fill_value
nan
>>> blk.dtype.fill_value
0.0

I don't know if we should a) reference blk.dtype.fill_value or b) make blk.dtype.fill_value consistent with blk.fill_value.

@scottgigante (Contributor) commented

@TomAugspurger any thoughts on this? I'm happy to write the PR, just need some guidance.

@TomAugspurger (Contributor) commented

Mmm I'm not sure I understand the issue. But note that doing a .take which introduces missing values via a -1 in the indices should result in a NaN in the output, regardless of the fill value. Not sure if that helps or not.
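
To illustrate that distinction (my own sketch, not from the comment above): with allow_fill=True, a -1 index marks a genuinely missing value, which is filled with NaN even when the sparse fill value is 0:

import pandas as pd

arr = pd.arrays.SparseArray([0.0, 1.0, 0.0], fill_value=0.0)

# The -1 position is a missing value introduced by the take itself,
# so NaN is expected there regardless of the array's fill_value of 0.0.
result = arr.take([0, -1], allow_fill=True)
print(result)  # [0.0, nan]: the second element is NaN, not 0.0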

@scottgigante (Contributor) commented

Can you give an example of how/why that would happen? I don't quite understand how we would get a NaN when the fill_value is not nan.

@scottgigante (Contributor) commented

Here's my proposed solution:

Replace

    @property
    def fill_value(self):
        # Used in reindex_indexer
        return self.values.dtype.na_value

from https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L1730 with

    @property
    def fill_value(self):
        # Used in reindex_indexer
        try:
            return self.values.dtype.fill_value
        except AttributeError:
            return self.values.dtype.na_value

Thoughts?

@akdor1154 commented Dec 13, 2019

Pretty sure my dataset is showing this. Seems to apply to some columns but not all.
strange_test.pickle.gz

If anyone wants a large real-world dataset to test this on:

import pandas as pd

test = pd.read_pickle('strange_test.pickle.gz')
any(test.DistinctSKUs_SEPB.isna())
# False
any(test.loc[lambda _: _.IsBTS].DistinctSKUs_SEPB.isna())
# True  !?!

@scottgigante (Contributor) commented

@akdor1154 can you try this monkey patch and see if it solves your issue?

def fill_value(self):
    # Used in reindex_indexer
    try:
        return self.values.dtype.fill_value
    except AttributeError:
        return self.values.dtype.na_value

from pandas.core.internals.blocks import ExtensionBlock

setattr(ExtensionBlock, "fill_value", property(fill_value))

@mroeschke (Member) commented

As mentioned in #29321 (comment), it may be an issue as well when the sparse series matches the fill value.

@scottgigante (Contributor) commented

Pretty sure my monkey patch works. I can write a PR if I can get approval from @TomAugspurger or @jorisvandenbossche

>>> import pandas as pd
>>> df1 = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 0]), 'B': [1,2,3]})
>>> df1.loc[df1['B'] != 2]
    A  B
0 NaN  1
2 NaN  3
>>> def fill_value(self):
...     # Used in reindex_indexer
...     try:
...         return self.values.dtype.fill_value
...     except AttributeError:
...         return self.values.dtype.na_value
...
>>> from pandas.core.internals.blocks import ExtensionBlock
>>>
>>> setattr(ExtensionBlock, "fill_value", property(fill_value))
>>> df1.loc[df1['B'] != 2]
   A  B
0  0  1
2  0  3
>>> import pandas as pd
>>> X = pd.DataFrame([[0,1,0], [1,0,0], [1,1,0]]).astype(
...     pd.SparseDtype(float, fill_value=0.0))
>>> X.loc[[0,1]]
     0    1   2
0  0.0  1.0 NaN
1  1.0  0.0 NaN
>>> def fill_value(self):
...     # Used in reindex_indexer
...     try:
...         return self.values.dtype.fill_value
...     except AttributeError:
...         return self.values.dtype.na_value
...
>>> from pandas.core.internals.blocks import ExtensionBlock
>>>
>>> setattr(ExtensionBlock, "fill_value", property(fill_value))
>>> X.loc[[0,1]]
     0    1    2
0  0.0  1.0  0.0
1  1.0  0.0  0.0
>>> import pandas as pd
>>> test = pd.read_pickle('strange_test.pickle.gz')
>>> any(test.DistinctSKUs_SEPB.isna())
False
>>> any(test.loc[lambda _: _.IsBTS].DistinctSKUs_SEPB.isna())
True
>>> def fill_value(self):
...     # Used in reindex_indexer
...     try:
...         return self.values.dtype.fill_value
...     except AttributeError:
...         return self.values.dtype.na_value
...
>>> from pandas.core.internals.blocks import ExtensionBlock
>>>
>>> setattr(ExtensionBlock, "fill_value", property(fill_value))
>>> any(test.loc[lambda _: _.IsBTS].DistinctSKUs_SEPB.isna())
False

@akdor1154 commented

@scottgigante sorry, from memory I tested it at the time and it worked, thanks.

@connesy (Contributor) commented May 13, 2020

@scottgigante Just tested your monkey patch, it works for me in pandas 1.0.2.

scottgigante added a commit to scottgigante/pandas that referenced this issue May 13, 2020
scottgigante added a commit to scottgigante/pandas that referenced this issue May 13, 2020
@jreback removed this from the Contributions Welcome milestone May 14, 2020
@jreback added this to the 1.1 milestone May 14, 2020
jreback added a commit that referenced this issue Jun 1, 2020
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jul 15, 2020
@TomAugspurger (Contributor) commented

The fix here is being reverted in #35287. There is some discussion of a potential fix at #35286 (comment).

cc @scottgigante if you want to take another shot :)

@TomAugspurger reopened this Jul 15, 2020
@TomAugspurger modified the milestones: 1.1, Contributions Welcome Jul 15, 2020
@mroeschke (Member) commented

This looks fixed on master. It could use a test:

In [4]: df1_filtered
Out[4]:
   A  B
0  0  1
2  0  3

In [5]: df2_filtered
Out[5]:
   A  B
0  0  1
2  0  3
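
A regression test along these lines could cover it (a rough sketch; the test name and the expected-frame construction are illustrative, not taken from the pandas test suite):

import pandas as pd
import pandas._testing as tm


def test_loc_bool_filter_preserves_sparse_fill_value():
    # GH#27781: boolean filtering should not introduce NaN into an
    # all-fill-value sparse column
    df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 0]), "B": [1, 2, 3]})
    result = df.loc[df["B"] != 2]
    expected = pd.DataFrame(
        {"A": pd.arrays.SparseArray([0, 0]), "B": [1, 3]}, index=[0, 2]
    )
    tm.assert_frame_equal(result, expected)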

@mroeschke added the good first issue and Needs Tests labels and removed the Bug label Jul 10, 2021
@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@EnerH commented Feb 28, 2023

What about this:

df1_filtered['A'] = df1_filtered['A'].fillna(0)

Similarly, to change the NaN values in column 'A' of df2_filtered to 0, you can use the same method:

df2_filtered['A'] = df2_filtered['A'].fillna(0)

@ConnorMcKinley (Contributor) commented

Take
