Nan in the multi index get cast into string after a subtraction #7031

Closed
alphaho opened this issue May 4, 2014 · 6 comments
Labels: Bug, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), MultiIndex

Comments

@alphaho

alphaho commented May 4, 2014

related #6322

Hi,

I accidentally found that after subtracting df1 from df2, the NaN in the MultiIndex is turned into the string 'nan' rather than remaining a float. Oddly, this only happens when parse_dates is set to False.

The following is the code to reproduce:

import StringIO
from pandas import DataFrame

index = ['ID', 'Index']

input1 = '''"value","ID","Index"
"0.0","B9",""
"0.0","B",""
'''

input2 = '''"value","ID","Index"
"1.5375969","M","Blah"
'''


df1 = DataFrame.from_csv(StringIO.StringIO(input1), index_col=index, parse_dates=False)
df2 = DataFrame.from_csv(StringIO.StringIO(input2), index_col=index, parse_dates=False)
## The following two lines can work as expected
#df1 = DataFrame.from_csv(file1, index_col=index, )
#df2 = DataFrame.from_csv(file2, index_col=index, )

diff = df2['value'] - df1['value']

print diff.index[0], diff.index[1]  # => ('B', 'nan') ('B9', 'nan')
print df1.index[0], df1.index[1]  # => ('B9', nan) ('B', nan)

Versions of pandas and its dependencies:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.5.final.0
python-bits: 64
OS: Darwin
OS-release: 13.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: zh_HK.UTF-8

pandas: 0.13.1
Cython: None
numpy: 1.8.1
scipy: None
statsmodels: None
IPython: 1.1.0
sphinx: None
patsy: None
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.2
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
sqlalchemy: None
lxml: None
bs4: None
html5lib: None
bq: None
apiclient: None
@jreback
Contributor

jreback commented May 4, 2014

I see the result you are getting, but in general you SHOULD NOT have NaNs in the index (or MultiIndex) at all, except possibly as an intermediate step. It really doesn't make sense for most operations. In fact, you can't even reset_index on df1, as it's unclear what to do with the 2nd level.

Bottom line: this is a 'feature' that could maybe be implemented at some point, but it's quite tricky.

So avoid for now; I'll leave the issue open (and link to another).

In [28]: df2['value'] - df1['value']
Out[28]: 
ID  Index
B   nan     NaN
B9  nan     NaN
M   Blah    NaN
Name: value, dtype: float64

In [29]: df2
Out[29]: 
             value
ID Index          
M  Blah   1.537597

[1 rows x 1 columns]

In [30]: df1
Out[30]: 
          value
ID Index       
B9 NaN        0
B  NaN        0

[2 rows x 1 columns]

@alphaho
Author

alphaho commented May 8, 2014

Thanks, I'll avoid using the nan as index for now.

jreback modified the milestones: 0.16.0, Next Major Release (Mar 3, 2015)
@dsm054
Contributor

dsm054 commented Jun 7, 2017

As of 0.20.2, this no longer seems to be an issue:

In [19]: diff.index
Out[19]: 
MultiIndex(levels=[['B', 'B9', 'M'], ['Blah']],
           labels=[[0, 1, 2], [-1, -1, 0]],
           names=['ID', 'Index'],
           sortorder=0)

In [20]: diff.index[0][1] is np.nan
Out[20]: True

In [21]: diff.index[1][1] is np.nan
Out[21]: True

So I think this can be closed.
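
For anyone re-running this on a current pandas under Python 3: the original reproducer relies on the Python 2 StringIO module and on DataFrame.from_csv, which has since been removed. A rough equivalent might look like the sketch below. It assumes pd.read_csv with io.StringIO is an acceptable stand-in for from_csv (read_csv does not parse dates unless asked, which matches parse_dates=False), and the helper is_float_nan is a hypothetical name introduced only for this check.

```python
import io
import math

import pandas as pd

index = ["ID", "Index"]

input1 = '''"value","ID","Index"
"0.0","B9",""
"0.0","B",""
'''

input2 = '''"value","ID","Index"
"1.5375969","M","Blah"
'''

# Assumption: read_csv replaces the removed DataFrame.from_csv here.
# Empty "Index" fields are read as missing values by default.
df1 = pd.read_csv(io.StringIO(input1), index_col=index)
df2 = pd.read_csv(io.StringIO(input2), index_col=index)

# Subtraction aligns the two Series on the union of their MultiIndexes.
diff = df2["value"] - df1["value"]


def is_float_nan(x):
    """Hypothetical helper: True only for a real float NaN, not the string 'nan'."""
    return isinstance(x, float) and math.isnan(x)


# Show each index entry along with the type of its second-level label.
for key in diff.index:
    print(key, type(key[1]).__name__)

# Collect any labels that were turned into the literal string 'nan' (the reported bug).
string_nans = [key for key in diff.index if key[1] == "nan"]
print("string 'nan' labels:", string_nans)
```

If the list printed on the last line is empty, the missing labels were not converted to strings by the subtraction.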

@jreback
Contributor

jreback commented Jun 7, 2017

Can you do a PR with this test?

@dsm054
Contributor

dsm054 commented Jun 7, 2017

Sure, will do.

@jorisvandenbossche
Member

Closed by #16625
