Filtering dataframe with sparse column leads to NAs in sparse column #27781
@stelsemeyer Thanks for the report! This is a bug in the
(both above should give the same result) Always welcome to take a look to see how it could be fixed!

@jorisvandenbossche: Thanks for investigating! (See pandas/core/arrays/sparse.py, lines 1173 to 1174 at d1accd0.)
Removing from the 0.25.1 milestone, but if anyone is working on this LMK and we can probably get it in. @stelsemeyer your proposal looks reasonable.
I think this is a related issue:
I edited the proposed line, but to no avail. The error in @jorisvandenbossche's answer is resolved, but my and @stelsemeyer's issues remain.
Seems to me that there is another problem here (see pandas/core/internals/managers.py, line 1262 at a45760f),
which is because of a discrepancy between
I don't know if we should a) reference
@TomAugspurger any thoughts on this? I'm happy to write the PR, just need some guidance.
Mmm I'm not sure I understand the issue. But note that doing a
Can you give an example of how/why that would happen? I don't understand quite how we should get a NaN when the fill_value is not nan.
Here's my proposed solution: replace
from https://github.com/pandas-dev/pandas/blob/master/pandas/core/internals/blocks.py#L1730 with
Thoughts?
Pretty sure my dataset is showing this. Seems to apply to some columns but not all. If anyone wants a large real-world set to test this on:
import pandas as pd
test = pd.read_pickle('strange_test.pickle.gz')
any(test.DistinctSKUs_SEPB.isna())
> False
any(test.loc[lambda _: _.IsBTS].DistinctSKUs_SEPB.isna())
> True # !?!
@akdor1154 can you try this monkey patch and see if it solves your issue?
As mentioned in #29321 (comment), it may also be an issue when the sparse series matches the fill value.
Pretty sure my monkey patch works. I can write a PR if I can get approval from @TomAugspurger or @jorisvandenbossche
@scottgigante sorry, from memory I tested it at the time and it worked, thanks.
@scottgigante Just tested your monkey patch, it works for me in

Revert: pandas-dev#29563 (pandas-dev#34158). This reverts commit a94b13a.
The fix here is being reverted in #35287. Some discussion on a potential fix at #35286 (comment). cc @scottgigante if you want to take another shot :)
This looks fixed on master. Could use a test.
What about this:
Similarly, to change the NaN values in column 'A' of df2_filtered to 0, you can use the same method:
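The snippet this comment refers to is not preserved in this copy; presumably it used fillna. A minimal sketch of that workaround, with the contents of df2_filtered assumed for illustration:

```python
import numpy as np
import pandas as pd

# Stand-in for the filtered frame whose column 'A' came back as NaN
# (the real df2_filtered is defined in the elided code sample above).
df2_filtered = pd.DataFrame({"A": [np.nan, np.nan], "B": [3, 4]})

# Replace the spurious NaNs in column 'A' with 0
df2_filtered["A"] = df2_filtered["A"].fillna(0)
print(df2_filtered["A"].tolist())  # -> [0.0, 0.0]
```

Note this only papers over the symptom; the underlying take/reindex bug discussed above is what the PRs in this thread address.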
Take
Code Sample, a copy-pastable example if possible

where df1_filtered will look like

and df2_filtered like

Problem description
Filtering a dataframe with an all-zero sparse column can lead to NAs in the sparse column.
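The original code sample did not survive in this copy; a minimal sketch of the reported setup (column names and sizes are assumptions, not the reporter's exact code) would be:

```python
import numpy as np
import pandas as pd

# A frame with an all-zero sparse column (fill_value=0) and a dense column
df = pd.DataFrame({
    "A": pd.arrays.SparseArray(np.zeros(4), fill_value=0),
    "B": [1, 2, 3, 4],
})

# Boolean filtering; on pandas 0.25 the sparse column came back as NaN,
# while the expectation is that it stays all zeros.
df_filtered = df[df["B"] > 2]
print(df_filtered["A"].isna().any())  # False once the bug is fixed
```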
Expected Output
Both data frames should be the same, as filtering a dataframe with non-missing data should not lead to missing data.
Output of pd.show_versions()

INSTALLED VERSIONS
commit : None
pandas : 0.25.0
numpy : 1.16.2
pytz : 2019.1
dateutil : 2.8.0
pip : 19.2.1
setuptools : 39.1.0
Cython : None
pytest : 4.3.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.6.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : 0.2.1
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
s3fs : 0.2.0
scipy : 1.2.1
sqlalchemy : 1.3.5
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None