Pandas pytables interface doesn't create empty table datasets #13016

Open
damionw opened this issue Apr 28, 2016 · 27 comments
Labels
Bug IO HDF5 read_hdf, HDFStore

Comments

@damionw

damionw commented Apr 28, 2016

Pandas used to allow writing empty HDF5 datasets through its PyTables interface code. However, after upgrading from 0.11 to 0.17, we discovered that this behaviour is intentionally short-circuited. The library behaves as though the dataset is being written, but silently ignores the request, so the resulting HDF5 file doesn't contain the requested table.

The offending code is in pandas/io/pytables.py:_write_to_group()

    # we don't want to store a table node at all if are object is 0-len
    # as there are not dtypes
    if getattr(value, 'empty', None) and (format == 'table' or append):
        return

We've worked around it by patching our installed copy of pandas, but we'd like to know the rationale behind this code before submitting a pull request. The comment implies that the lack of dtypes in the dataset is the cause; however, each pandas column has type information even if it is empty.

Any clarification would be appreciated.
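
To illustrate the point about dtypes, here is a minimal sketch. It is not pandas code: the helper name `would_skip_write` is hypothetical and only mirrors the condition quoted above. It shows that a zero-length slice of a DataFrame still carries per-column dtypes, and which argument combinations would trip the guard.

```python
import pandas as pd


def would_skip_write(value, format=None, append=False):
    """Mirror of the guard quoted above (hypothetical helper, not a pandas API)."""
    return bool(getattr(value, 'empty', None) and (format == 'table' or append))


df = pd.DataFrame([[x, x * 2] for x in range(12)], columns=['one', 'two'])
empty = df[0:0]  # zero rows, same columns

# Column type information survives the empty slice.
print(empty.dtypes)

print(would_skip_write(empty, format='table'))  # empty + table format
print(would_skip_write(empty, append=True))     # empty + append
print(would_skip_write(empty, format='fixed'))  # empty + fixed format
print(would_skip_write(df, format='table'))     # non-empty frame
```

Under this reading of the condition, only the empty frame combined with `format='table'` or `append=True` is skipped; the fixed-format case and the non-empty frame pass through.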

@jreback
Contributor

jreback commented Apr 28, 2016

Pls show a copy-pastable example, and show versions for old/new (esp. PyTables versions).

@jreback jreback added the IO HDF5 read_hdf, HDFStore label Apr 28, 2016
@jreback
Contributor

jreback commented Apr 28, 2016

I don't believe this was ever supported in PyTables for table format.

@damionw
Author

damionw commented Apr 28, 2016

Hard to believe since we were using it :-)

In 2012, Wes McKinney merged a patch into pandas related to this issue:

#1707

Seen here:

603e5ae

And, as to the current behaviour, why is it not an error? It doesn't fail; it just doesn't write any of the supporting structures.

@jreback
Contributor

jreback commented Apr 28, 2016

You are referring to fixed format. Pls show an example and version, including a clean file with generated metadata.

@damionw
Author

damionw commented Apr 28, 2016

Example

#! /usr/bin/env python
import pandas as pd

h5 = pd.HDFStore("/tmp/test.h5", mode="w")

try:
    df = pd.DataFrame([[_x, _x * 2] for _x in range(12)], columns=['one', 'two'], index=None)

    h5.put("full", df[:], format='table', data_columns=list(df.keys())) # Write full dataset
    h5.put("empty", df[0:0], format='table', data_columns=list(df.keys())) # Write empty dataset
finally:
    h5.close()

h5 = pd.HDFStore("/tmp/test.h5", mode="r")

for key in ["full", "empty"]:
    print("*** Examining table [{}] ***".format(key))

    try:
        print(h5[key].head())
    except KeyError as _exception:
        # The key is missing when the write was silently skipped
        print("{} does not exist in hdf5 file".format(key))

    print("")

h5.close()

@jreback
Contributor

jreback commented Apr 28, 2016

In [2]: import pandas as pd

In [3]: pd.__version__
Out[3]: '0.11.0'

In [4]: import numpy as np

In [5]: np.__version__
Out[5]: '1.7.1'

In [6]: import tables

In [7]: tables.__version__
Out[7]: '2.4.0'

In [8]: df = pd.DataFrame([[_x, _x * 2] for _x in range(12)], columns=['one', 'two'], index=None)

In [9]: 

In [9]: df
Out[9]: 
    one  two
0     0    0
1     1    2
2     2    4
3     3    6
4     4    8
5     5   10
6     6   12
7     7   14
8     8   16
9     9   18
10   10   20
11   11   22

In [10]: h5 = pd.HDFStore("test.h5", mode="w")
In [11]: h5.put("full", df[:], format='table', data_columns=list(df.keys())) # Write full dataset

In [12]: h5.put("empty", df[0:0], format='table', data_columns=list(df.keys())) # Write empty dataset

In [13]: h5.close()

In [14]: h5 = pd.HDFStore("test.h5", mode="r")

In [15]: h5
Out[15]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/empty            frame        (shape->[1,2]) 
/full             frame        (shape->[12,2])

In [16]: h5.root.empty
Out[16]: 
/empty (Group) ''
  children := ['block0_items' (Array), 'block0_values' (Array), 'axis0' (Array), 'axis1' (Array)]

In [20]: h5['empty']
ValueError: Shape of passed values is (2, 0), indices imply (2, 1)

In [22]: h5['full']
Out[22]: 
    one  two
0     0    0
1     1    2
2     2    4
3     3    6
4     4    8
5     5   10
6     6   12
7     7   14
8     8   16
9     9   18
10   10   20
11   11   22

So this looks broken in that version.

So you will have to be more specific on what you are doing.

@damionw
Author

damionw commented Apr 28, 2016

The above example writes the underlying data structures in version 0.11, but doesn't understand the dimensions (y==0) during reading.

In >0.16 the write is ignored.

@jreback
Contributor

jreback commented Apr 28, 2016

not sure I understand then. you said it worked in 0.11, what does that have to do with it then?

> Hard to believe since we were using it :-)

@damionw
Author

damionw commented Apr 28, 2016

We're writing a data structure that can be empty. Then we're reading the data structure in another program. The current method silently elides the existence of the table, so the reading program would have to catch an exception and fake the data structure.
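
The reader-side workaround described above might look like the following sketch. `read_or_empty` and the `schema` mapping are hypothetical names. The store is assumed to be any mapping-like object that raises `KeyError` for a missing key, which is what the example script earlier in this thread catches when reading from the HDF5 file.

```python
import pandas as pd


def read_or_empty(store, key, schema):
    """Return store[key], or an empty frame built from `schema` if the key is missing.

    `schema` maps column name -> dtype string; it is the reader's own assumption
    about what the writer would have stored.
    """
    try:
        return store[key]
    except KeyError:
        # The table was never written, so fake an empty one with the expected dtypes.
        return pd.DataFrame(
            {name: pd.Series(dtype=dtype) for name, dtype in schema.items()}
        )


# A plain dict stands in for the opened HDF5 store in this sketch.
fake_store = {"full": pd.DataFrame({"one": [0, 1], "two": [0, 2]})}
schema = {"one": "int64", "two": "int64"}

full = read_or_empty(fake_store, "full", schema)
empty = read_or_empty(fake_store, "empty", schema)
print(len(full), len(empty))
print(empty.dtypes)
```

The drawback is that the reader has to duplicate the schema, which is exactly the information that would have been in the file had the empty table been written.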

@jreback
Contributor

jreback commented Apr 28, 2016

and from the comments in the PR

> fixed this, though required a bit of a hackjob (pytables doesn't like zero-length objects)

@jreback
Contributor

jreback commented Apr 28, 2016

@damionw you are saying that it worked though. I want you to prove it. I don't recall this EVER working for table=True (different option back then), because of the PyTables limitation; it was a work-around for fixed.

@damionw
Author

damionw commented Apr 28, 2016

It worked because we aren't reading the data structure using pandas (in this use case). However, that should work in pandas too, shouldn't it? Simply ignoring the write request doesn't seem to me to be the right way to handle it.

I'll happily fix it if there isn't a reason not to, which is the question I'm asking.

@jreback
Contributor

jreback commented Apr 28, 2016

I am still not clear on what you aim to fix or even what's broken. Pls show a complete example. Yes, you can't read back the empty table in that version, but that has been fixed since.

In [19]: pd.read_hdf('../test.h5','empty').dtypes
Out[19]: 
one    int64
two    int64
dtype: object

In [20]: pd.__version__
Out[20]: '0.18.0+176.gb13ddd5'

@damionw
Author

damionw commented Apr 28, 2016

Pandas normally writes supporting information for tables into the HDF5 file; the table name is actually used for the group that contains the various indices and data tables. In version 0.11, this information is written to the HDF5 file whether the data is empty or not. In later versions, it is not written at all when the data is empty.

The problem being addressed is that, now, the very existence of the desired table is prevented when it is empty.

Your above example will read the entry once it's created, but won't allow it to be written (at least on 0.17). Please advise if 0.18 now allows writing empty tables (using table=True).

I'm unsure what more of an example you require to show the behaviour. Could you help me understand what else I need to provide?

@jreback
Contributor

jreback commented Apr 28, 2016

@damionw you need to show a complete example of what you are expecting given a certain input set, on a certain version. I showed that above. That works. So show something that does what you think you want.

@damionw
Author

damionw commented Apr 28, 2016

The same example I gave works as expected when pandas is patched by removing the described return statement.

@damionw
Author

damionw commented Apr 28, 2016

I can provide the resulting hdf5 files for all cases offline if that would help

@damionw
Author

damionw commented Apr 28, 2016

Correction: For 0.16 and beyond, the table=True format creates a group with the desired table name and a table named "table" underneath. All of that is omitted when the dataset is empty.

@jreback
Contributor

jreback commented Apr 28, 2016

@damionw pls write a test that fails using the current version of pandas.

@damionw
Author

damionw commented May 19, 2016

Will do. Thanks

DKW


@lafrech

lafrech commented Oct 30, 2017

I think I'm facing the same issue.

Here's how to reproduce:

import pandas as pd
from pandas import HDFStore


# Prints 0.20.3
print(pd.__version__)

emptydf = pd.DataFrame({'col_1': [], 'col_2': []}, index=[])

with HDFStore("test.h5", 'w') as store:

    assert not store.keys()

    # append -> no table created
    store.append('empty', emptydf)
    assert not store.keys()

    # put, 'table' format -> no table created
    store.put('empty', emptydf, format='table')
    # No table created
    assert not store.keys()

    # put, default format -> array created
    store.put('empty', emptydf)
    assert store.keys() == ['/empty']

store.close()

My use case

I'm writing an API to store timeseries and I would like to separate creation/deletion of timeseries ID and data write/delete in a timeseries.

In other words, I want to be able to do

# Returns empty list []
list_ids()

# Raises "ID does not exist" exception
save(new_id, new_data)

# Creates new timeseries ID
create(new_id)

# Returns [new_id, ]
list_ids()

# Writes data (this time, ID exists)
save(new_id, new_data)

but I don't know how to create an empty timeseries because it won't be written in the file. I could allow save to auto create timeseries, but this wouldn't solve the issue of the ID not being listed until there actually is data in it, therefore not being advertised in the list.

The only workaround I see is to maintain an ID list somewhere else, which I'd rather avoid.
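
For reference, the contract I'm after can be sketched in memory as follows. This is only an illustration of the separate-ID-list workaround, not a proposal for pandas: `TimeseriesIndex` and the `sensor_a` ID are hypothetical, and a real implementation would have to persist the ID list alongside the HDF5 file.

```python
class TimeseriesIndex:
    """In-memory stand-in for an ID list kept apart from the stored data."""

    def __init__(self):
        # Maps each known ID to its rows; an empty list is a valid, listed series.
        self._series = {}

    def list_ids(self):
        return sorted(self._series)

    def create(self, ts_id):
        # Registering an ID does not require any data.
        self._series.setdefault(ts_id, [])

    def save(self, ts_id, rows):
        if ts_id not in self._series:
            raise KeyError("ID does not exist: {!r}".format(ts_id))
        self._series[ts_id].extend(rows)


index = TimeseriesIndex()
print(index.list_ids())  # no IDs yet

try:
    index.save("sensor_a", [(0, 1.5)])
except KeyError as exc:
    print("rejected:", exc)

index.create("sensor_a")
print(index.list_ids())  # the empty series is already listed
index.save("sensor_a", [(0, 1.5)])
```

If empty tables could be written, the store's own key list would play the role of `_series` and no second bookkeeping structure would be needed.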

@damionw
Author

damionw commented Oct 30, 2017 via email

@vss888

vss888 commented Mar 22, 2019

Could you please let me know what the conclusion is: will it be fixed? I can see that it was scheduled for the "Next Major Release" milestone, which has since been deleted, which (I guess) means the fix is no longer scheduled, even though it is an "Effort Low" issue.

@jreback
Contributor

jreback commented Mar 22, 2019

this is a problem with PyTables and would have to be fixed there

we have 3000 issues so things get fixed when someone submits a patch

@vss888

vss888 commented Mar 22, 2019

In this case, a patch has been proposed above:

pytables.py#L1365

explicitly skips writing an object if object.empty is True. All we need to do is to delete the condition (assuming that it does not break anything else).

@damionw
Author

damionw commented Mar 22, 2019 via email

@arw2019
Member

arw2019 commented Oct 11, 2020

xref PyTables/PyTables#592

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
7 participants