Pandas pytables interface doesn't create empty table datasets #13016
Comments
pls show a copy pastable example as well as show versions for old/new (esp pytables versions) |
I don't believe this was ever supported in PyTables for table format. |
you are referring to fixed format. Pls show an example and version, including a clean file with generated meta data. |
Example
#! /usr/bin/env python
h5 = pd.HDFStore("/tmp/test.h5", mode="w")
try:
finally:
h5 = pd.HDFStore("/tmp/test.h5", mode="r")
for key in ["full", "empty"]:
h5.close() |
So this looks broken in that version. So you will have to be more specific on what you are doing. |
The above example writes the underlying data structures in version 0.11, but doesn't understand the dimensions (y==0) during reading. In >0.16 the write is ignored. |
not sure I understand then. you said it worked in 0.11, what does that have to do with it then?
|
We're writing a data structure that can be empty. Then we're reading the data structure in another program. The current method silently elides the existence of the table, so the reading program would have to catch an exception and fake the data structure. |
and from the comments in the PR
|
@damionw you are saying that it worked though. I want you to prove it. I don't recall this EVER working for |
It worked because we aren't reading the data structure using pandas (in this use case). However, that should work in pandas too, shouldn't it? Simply ignoring the write request doesn't seem to me to be the way to handle it. I'll happily fix it if there isn't a reason not to, which is the question I'm asking. |
I am still not clear on what you aim to fix or even what's broken. Pls show a complete example. Yes you can't read back the empty in that version, but that has been fixed since.
|
Pandas normally writes supporting information for tables into the HDF5 file; the table name is actually used for the group that contains the various indices and data tables. In version 0.11 this information is written to the HDF5 file whether the data is empty or not. In later versions, it is not written at all when the data is empty. The problem being addressed is that, now, the very existence of the desired table is prevented when it is empty. Your above example will read the entry once it's created, but won't allow it to be written (at least on 0.17). Please advise if 0.18 now allows writing empty tables (using table=True). I'm unsure what more of an example you require for showing the behaviour. Could you help me understand what else I need to provide? |
@damionw you need to show a complete example of what you are expecting given a certain input set, on a certain version. I showed that above. That works. So show something that does what you think you want. |
The same example I gave works as expected when pandas is patched by removing the described return statement. |
I can provide the resulting hdf5 files for all cases offline if that would help |
Correction: For 0.16 and beyond, the table=True format creates a group with the desired table name and a table named "table" underneath. All of that is omitted when the dataset is empty. |
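The layout described in this correction can be checked from Python. The following is only a minimal sketch, not code from this thread: it assumes pandas with PyTables installed, and the helper name `probe_table_writes`, the file name `probe.h5`, and the sample column values are made up for illustration. It writes one populated and one empty frame with `format="table"`, then reports whether each key is listed and whether its group has a child node named "table". What it reports for the empty frame depends on the pandas version in use.

```python
import os
import tempfile

import pandas as pd  # HDFStore also needs the PyTables package ("tables")


def probe_table_writes(path):
    """Write a populated and an empty frame in table format, then report
    for each key whether the store lists it and whether its group holds
    a child node named "table". (Hypothetical helper for illustration.)"""
    frames = {
        "full": pd.DataFrame({"col_1": [1.0, 2.0], "col_2": [3.0, 4.0]}),
        "empty": pd.DataFrame({"col_1": [], "col_2": []}),
    }
    with pd.HDFStore(path, mode="w") as store:
        for key, frame in frames.items():
            store.put(key, frame, format="table")

    report = {}
    with pd.HDFStore(path, mode="r") as store:
        for key in frames:
            node = store.get_node(key)  # None when the group is absent
            report[key] = {
                "listed": key in store,
                "has_table_node": node is not None and "table" in node,
            }
    return report


with tempfile.TemporaryDirectory() as tmp:
    report = probe_table_writes(os.path.join(tmp, "probe.h5"))

for key, result in report.items():
    print(key, result)
```

Comparing the "full" and "empty" entries on a given installation shows directly whether the empty write was dropped.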
@damionw pls write a test, using the current version of pandas that fails. |
Will do. Thanks
DKW
(In reply to Jeff Reback, Thu, Apr 28, 2016 at 12:45 PM.)
|
I think I'm facing the same issue. Here's how to reproduce:
import pandas as pd
from pandas import HDFStore
# Prints 0.20.3
print(pd.__version__)
emptydf = pd.DataFrame({'col_1': [], 'col_2': []}, index=[])
with HDFStore("test.h5", 'w') as store:
    assert not store.keys()
    # append -> no table created
    store.append('empty', emptydf)
    assert not store.keys()
    # put, 'table' format -> no table created
    store.put('empty', emptydf, format='table')
    # No table created
    assert not store.keys()
    # put, default format -> array created
    store.put('empty', emptydf)
    assert store.keys() == ['/empty']
    store.close()
My use case
I'm writing an API to store timeseries and I would like to separate creation/deletion of timeseries ID and data write/delete in a timeseries. In other words, I want to be able to do
# Returns empty list []
list_ids()
# Raises "ID does not exist" exception
save(new_id, new_data)
# Creates new timeseries ID
create(new_id)
# Returns [new_id, ]
list_ids()
# Writes data (this time, ID exists)
save(new_id, new_data)
but I don't know how to create an empty timeseries because it won't be written in the file. I could allow save to auto create timeseries, but this wouldn't solve the issue of the ID not being listed until there actually is data in it, therefore not being advertised in the list. The only workaround I see is to maintain an ID list somewhere else, which I'd rather avoid. |
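To make the intended contract concrete, here is a minimal in-memory sketch of those semantics. It is not the commenter's API: the class `TimeseriesRegistry`, the exception `UnknownIdError`, and the row format are all hypothetical, and a plain dict stands in for the HDF5 file. The point it illustrates is that an ID exists from `create` onward, independently of whether any data has been saved under it.

```python
class UnknownIdError(KeyError):
    """Raised when saving to an ID that has not been created (hypothetical)."""


class TimeseriesRegistry:
    """In-memory stand-in: IDs exist independently of their data."""

    def __init__(self):
        self._series = {}  # ID -> list of (timestamp, value) rows

    def list_ids(self):
        return sorted(self._series)

    def create(self, ts_id):
        # setdefault keeps existing rows if the ID was already created.
        self._series.setdefault(ts_id, [])

    def save(self, ts_id, rows):
        if ts_id not in self._series:
            raise UnknownIdError(ts_id)
        self._series[ts_id].extend(rows)


registry = TimeseriesRegistry()
ids_before = registry.list_ids()

try:
    registry.save("sensor_a", [(0, 1.5)])
    save_rejected = False
except UnknownIdError:
    save_rejected = True

registry.create("sensor_a")
ids_after_create = registry.list_ids()
registry.save("sensor_a", [(0, 1.5)])
```

With an HDF5-backed store, the empty list held for a freshly created ID corresponds to the empty table that the store currently declines to write.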
I have an old changeset that allows this, which we're using in production.
Unfortunately, I haven't had the cycles to construct the tests, etc to make
it palatable enough to make an acceptable pull request.
Also, since I did it in April 2016, it's fallen behind quite a few pandas
updates.
https://github.com/damionw/pandas on branch allow-empty-hdf5-datasets
Damion K. Wilson
(In reply to Jérôme Lafréchoux, Mon, Oct 30, 2017 at 12:56 PM.)
|
Could you please let me know what the conclusion is: will it be fixed? I can see that it was scheduled for the "Next Major Release" milestone, which has since been deleted. I guess that means the fix is not scheduled anymore, even though it is an "Effort Low" issue. |
this is a problem with PyTables and would have to be fixed there
we have 3000 issues so things get fixed when someone submits a patch |
In this case, a patch has been proposed above: explicitly skips writing an object |
Understood, thanks.
I started preparing a patch and never got around to finalising and
submitting it. We've been successfully using the patched version ever
since, though
Damion K. Wilson
(In reply to Jeff Reback, Fri, Mar 22, 2019 at 2:07 PM.)
|
Pandas used to allow the writing of empty HDF5 datasets through its pytables interface code. However, after upgrading to 0.17 (from 0.11), we've discovered that this behaviour is intentionally short-circuited. The library behaves as though the dataset is being written, but simply ignores the request, and the resulting HDF5 file doesn't contain the requested table.
The offending code is in pandas/io/pytables.py:_write_to_group()
We've worked around it by patching our installed copy of pandas, but we'd like to know the rationale behind this code before submitting a pull request. The comment implies that the lack of dtypes in the dataset is the cause; however, each pandas column has type information even if empty.
Any clarification would be appreciated.