ERR: raise on invalid columns using a fixed HDFStore #13492

Closed
amanhanda opened this issue Jun 20, 2016 · 12 comments · Fixed by #16444
Labels: Error Reporting (incorrect or improved errors from pandas), IO HDF5 (read_hdf, HDFStore)

@amanhanda

Code Sample

import datetime
import numpy as np
import pandas as pd

idx = pd.Index(pd.to_datetime([datetime.date(2000, 1, 1), datetime.date(2000, 1, 2)]), name='cols')
idx1 = pd.Index(pd.to_datetime([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2)]), name='rows')
s = pd.DataFrame(np.arange(4).reshape(2,2), columns=idx, index=idx1)
print(type(s.index.name))
# The type is str
<type 'str'>
s.reset_index()
cols       rows  2000-01-01 00:00:00  2000-01-02 00:00:00
0    2010-01-01                    0                    1
1    2010-01-02                    2                    3
with pd.HDFStore("/logs/tmp/test.h5", "w") as store:
    store.put("test", s, "fixed")
# When reading the data from HDF5, the index name comes back as a numpy.string_

with pd.HDFStore("/logs/tmp/test.h5", "r") as store:
    s1 = store["test"]
type(s1.index.name)
numpy.string_
# reset_index() then fails: the np.concatenate line in DatetimeIndex.insert raises a ValueError,
# which the except clause does not catch, so the fallback to an object index never runs

s1.reset_index()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-93-f61766d7f5c1> in <module>()
----> 1 s1.reset_index()

/auto/energymdl2/anaconda/envs/commod_20160516/lib/python2.7/site-packages/pandas/core/frame.pyc in reset_index(self, level, drop, inplace, col_level, col_fill)
   2731                     name = tuple(name_lst)
   2732             values = _maybe_casted_values(self.index)
-> 2733             new_obj.insert(0, name, values)
   2734
   2735         new_obj.index = new_index

/auto/energymdl2/anaconda/envs/commod_20160516/lib/python2.7/site-packages/pandas/core/frame.pyc in insert(self, loc, column, value, allow_duplicates)
   2228         value = self._sanitize_column(column, value)
   2229         self._data.insert(
-> 2230             loc, column, value, allow_duplicates=allow_duplicates)
   2231
   2232     def assign(self, **kwargs):

/auto/energymdl2/anaconda/envs/commod_20160516/lib/python2.7/site-packages/pandas/core/internals.pyc in insert(self, loc, item, value, allow_duplicates)
   3100             self._blknos = np.insert(self._blknos, loc, len(self.blocks))
   3101
-> 3102         self.axes[0] = self.items.insert(loc, item)
   3103
   3104         self.blocks += (block,)

/auto/energymdl2/anaconda/envs/commod_20160516/lib/python2.7/site-packages/pandas/tseries/index.pyc in insert(self, loc, item)
   1505             item = _to_m8(item, tz=self.tz)
   1506         try:
-> 1507             new_dates = np.concatenate((self[:loc].asi8, [item.view(np.int64)],
   1508                                         self[loc:].asi8))
   1509             if self.tz is not None:

ValueError: new type not compatible with array.

# The except clause does not catch ValueError

.../pandas/tseries/index.py
   1720         freq = None
   1721
   1722         if isinstance(item, (datetime, np.datetime64)):
   1723             self._assert_can_do_op(item)
   1724             if not self._has_same_tz(item):
   1725                 raise ValueError(
   1726                     'Passed item and index have different timezone')
   1727             # check freq can be preserved on edge cases
   1728             if self.size and self.freq is not None:
   1729                 if ((loc == 0 or loc == -len(self)) and
   1730                         item + self.freq == self[0]):
   1731                     freq = self.freq
   1732                 elif (loc == len(self)) and item - self.freq == self[-1]:
   1733                     freq = self.freq
   1734             item = _to_m8(item, tz=self.tz)
   1735         try:
1> 1736             new_dates = np.concatenate((self[:loc].asi8, [item.view(np.int64)],
   1737                                         self[loc:].asi8))
   1738             if self.tz is not None:
   1739                 new_dates = tslib.tz_convert(new_dates, 'UTC', self.tz)
   1740             return DatetimeIndex(new_dates, name=self.name, freq=freq,
   1741                                  tz=self.tz)
   1742
   1743         except (AttributeError, TypeError):
   1744
   1745             # fall back to object index
   1746             if isinstance(item, compat.string_types):
   1747                 return self.asobject.insert(loc, item)
   1748             raise TypeError(
   1749                 "cannot insert DatetimeIndex with incompatible label")
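
The traceback suggests why a plain str name would not hit this: a built-in str has no .view attribute, so item.view(np.int64) raises AttributeError and the fallback runs, whereas a numpy string scalar does have .view and appears to raise ValueError from that call instead. A minimal sketch of the difference, assuming only numpy is installed (try_view_as_int64 is a hypothetical helper for illustration, not pandas code):

```python
import numpy as np


def try_view_as_int64(item):
    """Mimic the try/except shape of DatetimeIndex.insert around item.view."""
    try:
        item.view(np.int64)
    except (AttributeError, TypeError) as exc:
        # These are the exceptions the quoted except clause handles
        return "caught: " + type(exc).__name__
    except ValueError as exc:
        # This one is not handled by the quoted except clause
        return "escapes: " + type(exc).__name__
    return "ok"


print(try_view_as_int64("rows"))             # built-in str
print(try_view_as_int64(np.str_("rows")))    # numpy unicode scalar
print(try_view_as_int64(np.bytes_(b"rows"))) # numpy bytes scalar
```

The built-in str should land in the handled branch, while both numpy scalars should report a ValueError, which matches the uncaught exception in the traceback.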

Expected Output

cols       rows  2000-01-01 00:00:00  2000-01-02 00:00:00
0    2010-01-01                    0                    1
1    2010-01-02                    2                    3

output of pd.show_versions()

# Problem occurs in 0.16.2 and 0.18.1

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-573.7.1.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.1
setuptools: 20.7.0
Cython: 0.24
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: 0.7.2
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.5.2
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.4.3
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None


@jreback
Contributor

jreback commented Jun 20, 2016

not really sure what you are doing.

pls show an exact reproduction.

In [7]: idx = pd.Index(pd.to_datetime([datetime.date(2000, 1, 1), datetime.date(2000, 1, 2)]), name='cols')

In [8]: idx1 = pd.Index(pd.to_datetime([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2)]), name='rows')

In [9]: s = pd.DataFrame(np.arange(4).reshape(2,2), columns=idx, index=idx1)

In [10]: s.to_hdf('test.h5','df',mode='w',format='table')                                          

In [11]: pd.read_hdf('test.h5','df')                                                               
Out[11]: 
cols        2000-01-01  2000-01-02
rows                              
2010-01-01           0           1
2010-01-02           2           3

In [12]: s.to_hdf('test.h5','df',mode='w',format='fixed')                                          

In [13]: pd.read_hdf('test.h5','df')
Out[13]: 
cols        2000-01-01  2000-01-02
rows                              
2010-01-01           0           1
2010-01-02           2           3

In [14]: pd.__version__
Out[14]: u'0.18.1'

@jreback jreback added the Can't Repro and IO HDF5 labels Jun 20, 2016
@amanhanda
Author

I am using the HDFStore interface. With your code snippet, please try calling reset_index() on the returned frame when format="fixed".


In [36]: s.to_hdf('test.h5','df',mode='w', format="fixed")

In [37]: s1 = pd.read_hdf('test.h5','df')

In [38]: s1.reset_index()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-38-f61766d7f5c1> in <module>()
----> 1 s1.reset_index()

/auto/energymdl2/anaconda/envs/commod_20160516_pd18/lib/python2.7/site-packages/pandas/core/frame.pyc in reset_index(self, level, drop, inplace, col_level, col_fill)
   2959                     name = tuple(name_lst)
   2960             values = _maybe_casted_values(self.index)
-> 2961             new_obj.insert(0, name, values)
   2962
   2963         new_obj.index = new_index

/auto/energymdl2/anaconda/envs/commod_20160516_pd18/lib/python2.7/site-packages/pandas/core/frame.pyc in insert(self, loc, column, value, allow_duplicates)
   2447         value = self._sanitize_column(column, value)
   2448         self._data.insert(loc, column, value,
-> 2449                           allow_duplicates=allow_duplicates)
   2450
   2451     def assign(self, **kwargs):

/auto/energymdl2/anaconda/envs/commod_20160516_pd18/lib/python2.7/site-packages/pandas/core/internals.pyc in insert(self, loc, item, value, allow_duplicates)
   3514
   3515         # insert to the axis; this could possibly raise a TypeError
-> 3516         new_axis = self.items.insert(loc, item)
   3517
   3518         block = make_block(values=value, ndim=self.ndim,

/auto/energymdl2/anaconda/envs/commod_20160516_pd18/lib/python2.7/site-packages/pandas/tseries/index.pyc in insert(self, loc, item)
   1734             item = _to_m8(item, tz=self.tz)
   1735         try:
-> 1736             new_dates = np.concatenate((self[:loc].asi8, [item.view(np.int64)],
   1737                                         self[loc:].asi8))
   1738             if self.tz is not None:

ValueError: new type not compatible with array.


@jreback
Contributor

jreback commented Jun 20, 2016

I see. Well, that's not really supported; you must have strings for column names. We did a fix for tables IIRC.
#10098 is related (but the check isn't there).

want to do a pull-request?

@jreback jreback added the Difficulty Novice and Error Reporting labels and removed the Can't Repro label Jun 20, 2016
@jreback jreback added this to the Next Major Release milestone Jun 20, 2016
@jreback jreback changed the title from "DataFrame reset_index() fails when data frame read from HDF5." to "ERR: raise on invalid columns using a fixed HDFStore" Jun 20, 2016
@amanhanda
Author

The index name is a str in the source DataFrame. The type only changes to numpy.string_ after the frame is stored to HDF5 and read back.
The column name is "cols" and the index name is "rows". Both are strings.

I have not done a pull request before. This would be my first. Will give it a shot.

@jreback
Contributor

jreback commented Jun 21, 2016

The fixed format is not very respectful of attributes like this;
the table format generally works more smoothly.

@makmanalp
Contributor

Hi! I'm at the sprints at PyCon and am looking to pick this up! I managed to reproduce the issue, although for the type I get:

In [27]: type(s1.index.name)
Out[27]: numpy.str_

instead of numpy.string_ but perhaps that's a naming difference across numpy versions ('1.12.1' here).

Same issue arises when reading the table with read_hdf instead of HDFStore and doing a reset_index().

In terms of expected behavior, I'm not entirely certain what we want here - should we be casting the numpy.str_ to a string? (seems reasonable - unsure why they're incompatible in the first place).

@makmanalp
Contributor

Also can confirm that this doesn't happen with table.

@TomAugspurger
Contributor

@makmanalp yeah, I think the best thing to do would be to cast np.str_ to a python string. Hopefully we don't hit any encoding issues... It's not clear to me whether np.str_ is a python 3 str (unicode) or a python 2 str (bytes)

@makmanalp
Contributor

On my python3 installation, I'm finding that np.string_ is just the same as np.bytes_, which is different from np.str_. So perhaps there is some py2/3 trickiness here. I'll give it a first stab and perhaps try it on both somehow.
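
A quick way to check this relationship on a given installation is sketched below, assuming Python 3 and numpy. The describe helper is hypothetical, and np.string_ is deliberately not used because it is only an alias and is not available in every numpy version:

```python
import numpy as np


def describe(scalar_type):
    """Report which built-in string types a numpy scalar type subclasses."""
    return {
        "is_bytes": issubclass(scalar_type, bytes),
        "is_str": issubclass(scalar_type, str),
    }


print("np.bytes_:", describe(np.bytes_))
print("np.str_:  ", describe(np.str_))

# Converting a np.str_ with str() gives back a built-in str
name = np.str_("rows")
print(type(name).__name__, "->", type(str(name)).__name__)
```

If np.bytes_ reports as bytes and np.str_ as str, then on Python 3 a bytes-like name needs decoding while a str-like name only needs converting to the built-in type.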

@TomAugspurger
Contributor

Ugh that's unfortunate. I guess we should know the encoding inside the HDF reader.
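
To make the casting idea concrete, here is a rough sketch of a name normalizer in plain Python. It is an illustration only, not the code that ended up in pandas: ensure_text_name and its encoding parameter are hypothetical, and the UTF-8 default is an assumption, since the real encoding would have to come from the HDF reader.

```python
def ensure_text_name(name, encoding="utf-8"):
    """Return string-like names as a built-in str; pass anything else through.

    Hypothetical helper: `encoding` is assumed to be known by the caller.
    """
    if isinstance(name, bytes):
        # bytes and its subclasses (numpy.bytes_ is one) need an explicit decode
        return name.decode(encoding)
    if isinstance(name, str):
        # str subclasses (numpy.str_ is one) are rebuilt as an exact built-in str
        return str(name)
    # None, tuples, numbers, etc. are left unchanged
    return name


print(repr(ensure_text_name(b"rows")))
print(repr(ensure_text_name("cols")))
print(repr(ensure_text_name(None)))
```

Passing non-string names through unchanged keeps the helper safe to apply to every index and column name, including None and tuple names.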

@makmanalp
Contributor

makmanalp commented May 22, 2017

Single-file example for easy reproduction:

import pandas as pd
import numpy as np
import datetime

idx = pd.Index(pd.to_datetime([datetime.date(2000, 1, 1), datetime.date(2000, 1, 2)]), name='cols')
idx1 = pd.Index(pd.to_datetime([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2)]), name='rows')
s = pd.DataFrame(np.arange(4).reshape(2,2), columns=idx, index=idx1)

with pd.HDFStore("test.h5", "w") as store:
    store.put("test", s, "fixed")

with pd.HDFStore("test.h5", "r") as store:
    s1 = store["test"]

# s1.reset_index()

@makmanalp
Contributor

So, I just made a PR. It's only a first stab at the issue, but hopefully it's in the right direction! Please let me know how happy you are with this fix and what I can do to get it release-ready!

makmanalp added a commit to makmanalp/pandas that referenced this issue Jun 1, 2017
@jreback jreback modified the milestones: 0.20.2, Next Major Release Jun 2, 2017
TomAugspurger pushed a commit that referenced this issue Jun 4, 2017
* BUG: Handle numpy strings in index names in HDF5 #13492

* REF: refactor to _ensure_str
yarikoptic added a commit to neurodebian/pandas that referenced this issue Jul 12, 2017
Version 0.20.2
