ERR: raise on invalid columns using a fixed HDFStore #13492

amanhanda · 2016-06-20T21:15:19Z

Code Sample

import datetime
import numpy as np
import pandas as pd

idx = pd.Index(pd.to_datetime([datetime.date(2000, 1, 1), datetime.date(2000, 1, 2)]), name='cols')
idx1 = pd.Index(pd.to_datetime([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2)]), name='rows')
s = pd.DataFrame(np.arange(4).reshape(2,2), columns=idx, index=idx1)
print type(s.index.name)
# The type is str
<type 'str'>
s.reset_index()
cols       rows  2000-01-01 00:00:00  2000-01-02 00:00:00
0    2010-01-01                    0                    1
1    2010-01-02                    2                    3
with pd.HDFStore("/logs/tmp/test.h5", "w") as store:
    store.put("test", s, "fixed")
# When reading the data from HDF5, the index name comes back as a numpy.string_

with pd.HDFStore("/logs/tmp/test.h5", "r") as store:
    s1 = store["test"]
type(s1.index.name)
numpy.string_
# reset_index() then fails: the np.concatenate call in DatetimeIndex.insert raises a ValueError.
# The code does not catch it, so the columns are never converted from DatetimeIndex to an object index.

s1.reset_index()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-93-f61766d7f5c1> in <module>()
----> 1 s1.reset_index()

/auto/energymdl2/anaconda/envs/commod_20160516/lib/python2.7/site-packages/pandas/core/frame.pyc in reset_index(self, level, drop, inplace, col_level, col_fill)
   2731                     name = tuple(name_lst)
   2732             values = _maybe_casted_values(self.index)
-> 2733             new_obj.insert(0, name, values)
   2734
   2735         new_obj.index = new_index

/auto/energymdl2/anaconda/envs/commod_20160516/lib/python2.7/site-packages/pandas/core/frame.pyc in insert(self, loc, column, value, allow_duplicates)
   2228         value = self._sanitize_column(column, value)
   2229         self._data.insert(
-> 2230             loc, column, value, allow_duplicates=allow_duplicates)
   2231
   2232     def assign(self, **kwargs):

/auto/energymdl2/anaconda/envs/commod_20160516/lib/python2.7/site-packages/pandas/core/internals.pyc in insert(self, loc, item, value, allow_duplicates)
   3100             self._blknos = np.insert(self._blknos, loc, len(self.blocks))
   3101
-> 3102         self.axes[0] = self.items.insert(loc, item)
   3103
   3104         self.blocks += (block,)

/auto/energymdl2/anaconda/envs/commod_20160516/lib/python2.7/site-packages/pandas/tseries/index.pyc in insert(self, loc, item)
   1505             item = _to_m8(item, tz=self.tz)
   1506         try:
-> 1507             new_dates = np.concatenate((self[:loc].asi8, [item.view(np.int64)],
   1508                                         self[loc:].asi8))
   1509             if self.tz is not None:

ValueError: new type not compatible with array.

# The except clause does not catch ValueError

.../pandas/tseries/index.py
   1720         freq = None
   1721
   1722         if isinstance(item, (datetime, np.datetime64)):
   1723             self._assert_can_do_op(item)
   1724             if not self._has_same_tz(item):
   1725                 raise ValueError(
   1726                     'Passed item and index have different timezone')
   1727             # check freq can be preserved on edge cases
   1728             if self.size and self.freq is not None:
   1729                 if ((loc == 0 or loc == -len(self)) and
   1730                         item + self.freq == self[0]):
   1731                     freq = self.freq
   1732                 elif (loc == len(self)) and item - self.freq == self[-1]:
   1733                     freq = self.freq
   1734             item = _to_m8(item, tz=self.tz)
   1735         try:
-> 1736             new_dates = np.concatenate((self[:loc].asi8, [item.view(np.int64)],
   1737                                         self[loc:].asi8))
   1738             if self.tz is not None:
   1739                 new_dates = tslib.tz_convert(new_dates, 'UTC', self.tz)
   1740             return DatetimeIndex(new_dates, name=self.name, freq=freq,
   1741                                  tz=self.tz)
   1742
   1743         except (AttributeError, TypeError):
   1744
   1745             # fall back to object index
   1746             if isinstance(item, compat.string_types):
   1747                 return self.asobject.insert(loc, item)
   1748             raise TypeError(
   1749                 "cannot insert DatetimeIndex with incompatible label")
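
A minimal sketch of why the fallback in the listing above is skipped, assuming only NumPy is installed. A plain Python str has no view method, so item.view fails with AttributeError and the except clause can fall back to an object index. A NumPy string scalar does have view, and it fails with ValueError instead. The helper name view_error is hypothetical and exists only for this illustration.

```python
import numpy as np


def view_error(item):
    """Return the name of the exception raised by item.view(np.int64), or None."""
    try:
        item.view(np.int64)
    except Exception as exc:  # broad on purpose: we only report the type
        return type(exc).__name__
    return None


# A plain str has no .view attribute, so the attribute lookup itself fails.
print(view_error("rows"))
# A NumPy string scalar has .view, but its item size does not match int64.
print(view_error(np.str_("rows")))
```

The first call reports AttributeError, which the except clause handles; the second reports ValueError, which escapes it.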

Expected Output

cols       rows  2000-01-01 00:00:00  2000-01-02 00:00:00
0    2010-01-01                    0                    1
1    2010-01-02                    2                    3

output of `pd.show_versions()`

# Problem occurs in 0.16.2 and 0.18.1

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-573.7.1.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.1
setuptools: 20.7.0
Cython: 0.24
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: 0.7.2
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.5.2
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.4.3
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None


jreback · 2016-06-20T21:35:46Z

not really sure what you are doing.

pls show an exact reproduction.

In [7]: idx = pd.Index(pd.to_datetime([datetime.date(2000, 1, 1), datetime.date(2000, 1, 2)]), name='cols')

In [8]: idx1 = pd.Index(pd.to_datetime([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2)]), name='rows')

In [9]: s = pd.DataFrame(np.arange(4).reshape(2,2), columns=idx, index=idx1)

In [10]: s.to_hdf('test.h5','df',mode='w',format='table')                                          

In [11]: pd.read_hdf('test.h5','df')                                                               
Out[11]: 
cols        2000-01-01  2000-01-02
rows                              
2010-01-01           0           1
2010-01-02           2           3

In [12]: s.to_hdf('test.h5','df',mode='w',format='fixed')                                          

In [13]: pd.read_hdf('test.h5','df')
Out[13]: 
cols        2000-01-01  2000-01-02
rows                              
2010-01-01           0           1
2010-01-02           2           3

In [14]: pd.__version__
Out[14]: u'0.18.1'

amanhanda · 2016-06-20T21:46:46Z

I am using the HDFStore interface. With your code snippet, please try calling reset_index() on the returned frame when format="fixed".


In [36]: s.to_hdf('test.h5','df',mode='w', format="fixed")

In [37]: s1 = pd.read_hdf('test.h5','df')

In [38]: s1.reset_index()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-38-f61766d7f5c1> in <module>()
----> 1 s1.reset_index()

/auto/energymdl2/anaconda/envs/commod_20160516_pd18/lib/python2.7/site-packages/pandas/core/frame.pyc in reset_index(self, level, drop, inplace, col_level, col_fill)
   2959                     name = tuple(name_lst)
   2960             values = _maybe_casted_values(self.index)
-> 2961             new_obj.insert(0, name, values)
   2962
   2963         new_obj.index = new_index

/auto/energymdl2/anaconda/envs/commod_20160516_pd18/lib/python2.7/site-packages/pandas/core/frame.pyc in insert(self, loc, column, value, allow_duplicates)
   2447         value = self._sanitize_column(column, value)
   2448         self._data.insert(loc, column, value,
-> 2449                           allow_duplicates=allow_duplicates)
   2450
   2451     def assign(self, **kwargs):

/auto/energymdl2/anaconda/envs/commod_20160516_pd18/lib/python2.7/site-packages/pandas/core/internals.pyc in insert(self, loc, item, value, allow_duplicates)
   3514
   3515         # insert to the axis; this could possibly raise a TypeError
-> 3516         new_axis = self.items.insert(loc, item)
   3517
   3518         block = make_block(values=value, ndim=self.ndim,

/auto/energymdl2/anaconda/envs/commod_20160516_pd18/lib/python2.7/site-packages/pandas/tseries/index.pyc in insert(self, loc, item)
   1734             item = _to_m8(item, tz=self.tz)
   1735         try:
-> 1736             new_dates = np.concatenate((self[:loc].asi8, [item.view(np.int64)],
   1737                                         self[loc:].asi8))
   1738             if self.tz is not None:

ValueError: new type not compatible with array.

jreback · 2016-06-20T22:15:24Z

I see. Well, that's not really supported; you must have strings for column names. We did a fix for tables, IIRC.
#10098 is related (but the check isn't there).

want to do a pull-request?

amanhanda · 2016-06-21T13:07:32Z

The index name is string in the source data frame. Storing it to hdf5 and retrieving it back is when the type changes to numpy.string_.
The column name is "cols" and index name is "rows". Both strings.

I have not done a pull request before. This would be my first. Will give it a shot.

jreback · 2016-06-21T13:09:29Z

fixed is not very respectful of attributes like this
table generally works in a smoother way

makmanalp · 2017-05-22T19:29:18Z

Hi! I'm at the sprints at pycon and am looking to pick this up! Managed to reproduce the issue even though for the type I get:

In [27]: type(s1.index.name)
Out[27]: numpy.str_

instead of numpy.string_ but perhaps that's a naming difference across numpy versions ('1.12.1' here).

Same issue arises when reading the table with read_hdf instead of HDFStore and doing a reset_index().

In terms of expected behavior, I'm not entirely certain what we want here - should we be casting the numpy.str_ to a string? (seems reasonable - unsure why they're incompatible in the first place).

makmanalp · 2017-05-22T19:33:27Z

Also can confirm that this doesn't happen with table.

TomAugspurger · 2017-05-22T19:47:46Z

@makmanalp yeah, I think the best thing to do would be to cast np.str_ to a python string. Hopefully we don't hit any encoding issues... It's not clear to me whether np.str_ is a python 3 str (unicode) or a python 2 str (bytes)

makmanalp · 2017-05-22T20:45:00Z

On my python3 installation, I'm finding that np.string_ is just the same as np.bytes_, which is different from np.str_. So perhaps there is some py2/3 trickiness here. I'll give it a first stab and perhaps try it on both somehow.
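
A quick check of how the NumPy string scalar types relate to the built-in types on Python 3, as a hedged illustration of the py2/3 concern above. It uses only np.str_ and np.bytes_, since the np.string_ alias is not available in every NumPy version.

```python
import numpy as np

# np.str_ holds text and subclasses the built-in str on Python 3.
print(issubclass(np.str_, str))
# np.bytes_ holds raw bytes and subclasses the built-in bytes.
print(issubclass(np.bytes_, bytes))
# The two are distinct types.
print(np.str_ is np.bytes_)

# A name that comes back as bytes has to be decoded before it is text.
# The utf-8 encoding here is an assumption for the example.
bytes_name = np.bytes_(b"rows")
decoded = bytes_name.decode("utf-8")
print(type(decoded).__name__, decoded)
```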

TomAugspurger · 2017-05-22T20:48:26Z

Ugh that's unfortunate. I guess we should know the encoding inside the HDF reader.
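
One way the cast discussed above could look, as a minimal sketch rather than pandas' own code. The helper name to_plain_str and its encoding parameter (defaulting to utf-8) are assumptions made for this example.

```python
import numpy as np


def to_plain_str(name, encoding="utf-8"):
    """Convert a NumPy string scalar (or bytes) to a built-in str.

    Hypothetical helper for illustration. Other values, such as None or
    integers, are returned unchanged.
    """
    if isinstance(name, (np.str_, np.bytes_)):
        name = name.item()  # unwrap to the built-in str or bytes
    if isinstance(name, bytes):
        name = name.decode(encoding)  # the encoding is an assumption
    return name


for raw in (np.str_("rows"), np.bytes_(b"rows"), "rows", None, 3):
    cleaned = to_plain_str(raw)
    print(repr(raw), "->", repr(cleaned), type(cleaned).__name__)
```

Leaving non-string names untouched keeps integer or missing axis names working as before.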

makmanalp · 2017-05-22T21:21:56Z

Single-file example for easy reproduction:

import pandas as pd
import numpy as np
import datetime

idx = pd.Index(pd.to_datetime([datetime.date(2000, 1, 1), datetime.date(2000, 1, 2)]), name='cols')
idx1 = pd.Index(pd.to_datetime([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2)]), name='rows')
s = pd.DataFrame(np.arange(4).reshape(2,2), columns=idx, index=idx1)

with pd.HDFStore("test.h5", "w") as store:
    store.put("test", s, "fixed")

with pd.HDFStore("test.h5", "r") as store:
    s1 = store["test"]

# s1.reset_index()
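
On affected versions, a possible workaround (an assumption, not something confirmed in this thread) is to replace the index name with a plain Python string before calling reset_index(). The sketch below simulates the round trip by assigning a NumPy string scalar as the index name, so it does not need an HDF5 file or PyTables.

```python
import datetime

import numpy as np
import pandas as pd

cols = pd.Index(pd.to_datetime([datetime.date(2000, 1, 1), datetime.date(2000, 1, 2)]), name="cols")
rows = pd.Index(pd.to_datetime([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2)]), name="rows")
s1 = pd.DataFrame(np.arange(4).reshape(2, 2), columns=cols, index=rows)

# Simulate the name type that the fixed-format reader returned in the report.
s1.index.name = np.str_("rows")

# Workaround: coerce the name back to a built-in str first.
if isinstance(s1.index.name, np.str_):
    s1.index.name = s1.index.name.item()

result = s1.reset_index()
print(type(s1.index.name).__name__)
print(result.columns[0])
```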

makmanalp · 2017-05-23T00:20:23Z

So, I just made a PR, it's just a first stab at the issue but hopefully it's in the right direction! Please let me know how happy you are with this fix and what I can do to get it release-ready!


jreback added the Can't Repro and IO HDF5 (read_hdf, HDFStore) labels Jun 20, 2016

jreback added the Difficulty Novice and Error Reporting (Incorrect or improved errors from pandas) labels and removed the Can't Repro label Jun 20, 2016

jreback added this to the Next Major Release milestone Jun 20, 2016

jreback changed the title ~~DataFrame reset_index() fails when data frame read from HDF5.~~ ERR: raise on invalid columns using a fixed HDFStore Jun 20, 2016

makmanalp added a commit to makmanalp/pandas that referenced this issue May 23, 2017

BUG: Handle np strings in index.name in HDF pandas-dev#13492

48b91c5

makmanalp mentioned this issue May 23, 2017

BUG: Handle numpy strings in index names in HDF #13492 #16444

Merged

makmanalp added a commit to makmanalp/pandas that referenced this issue Jun 1, 2017

BUG: Handle np strings in index.name in HDF pandas-dev#13492

dbd8b4c

makmanalp added a commit to makmanalp/pandas that referenced this issue Jun 1, 2017

BUG: Handle np strings in index.name in HDF pandas-dev#13492

a90b215

makmanalp added a commit to makmanalp/pandas that referenced this issue Jun 1, 2017

BUG: Handle numpy strings in index names in HDF5 pandas-dev#13492

ab75d27

makmanalp added a commit to makmanalp/pandas that referenced this issue Jun 1, 2017

BUG: Handle numpy strings in index names in HDF5 pandas-dev#13492

90f63b0

jreback modified the milestones: 0.20.2, Next Major Release Jun 2, 2017

TomAugspurger closed this as completed in #16444 Jun 4, 2017

TomAugspurger pushed a commit that referenced this issue Jun 4, 2017

BUG: convert numpy strings in index names in HDF #13492 (#16444)

18c316b

* BUG: Handle numpy strings in index names in HDF5 #13492 * REF: refactor to _ensure_str

TomAugspurger pushed a commit that referenced this issue Jun 4, 2017

BUG: convert numpy strings in index names in HDF #13492 (#16444)

7286bc7

* BUG: Handle numpy strings in index names in HDF5 #13492 * REF: refactor to _ensure_str (cherry picked from commit 18c316b)
