ENH: write Table meta-data (non_index_axes) as a CArray (rather than as meta-data) #6245
Comments
There is a limit on the amount of meta-data that can be stored within a node; for example, a string representation of the columns is stored there (along with lots of other things). So it's not a column limit per se, but an internal limit. As we have discussed before, it is in general not a good idea to store very wide tables. You might want to explore making these 3-d (a Panel); then these become long tables, which is quite efficient. Maybe something like http://pandas.pydata.org/pandas-docs/dev/reshaping.html#reshaping-by-melt might help. e.g. 1500 columns * 4 characters * 8 bytes per character = 48kb (plus other meta-data whose exact size I don't know).
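A minimal sketch of the melt idea mentioned above, assuming a frame like the one in the report (1500 float32 columns); the `id` column and the file/key names are invented for illustration:

```python
import numpy as np
import pandas as pd

# a wide frame like the one in the report: 1500 float32 columns
wide = pd.DataFrame(np.random.rand(100, 1500).astype('float32'),
                    columns=['A%d' % i for i in range(1500)])
wide['id'] = np.arange(len(wide))

# melt collapses the 1500 data columns into (variable, value) pairs,
# giving a long table with a handful of fixed columns
long_df = pd.melt(wide, id_vars=['id'])

with pd.HDFStore('data.h5') as store:
    store.append('df_long', long_df, data_columns=['id', 'variable'])
```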
But this does not happen when you prepend 'A' to each integer column name in the DF above and write the same data to an HDF5 table, which is strange... Maybe the string representation of the column names is smaller than the int64s? So we are limited in how wide HDF5 tables can be because the column names might be too long? I'll look into the performance of storing the data as a Panel. I guess I could also break the data frame up across multiple nodes (not-so-wide chunks), but then I have to pay the penalty of concatenating those columns on read...
Why is the plain HDF write (non-table, just serialized) not limited by this?
the IIRC. I did this so that when you ...
That would be fantastic, Jeff! I know very wide tables are not as efficient. Sometimes for commonality analysis and optimizations I have no choice but to use very wide data. Thanks!
I don't think it's very difficult to do; the only issue is that we'll have to deal with some new version stuff (e.g. the version of the table that is written will need to be updated from the current 0.10.1).
I was actually gonna ask about the version thing a week ago but forgot.
Yep... it hasn't been changed since then. It's exactly for a change like this. This will be easily backward compatible, but an older version of pandas would not be able to read the new format (though at least it can 'figure out' from the version string that it won't work). E.g. the new version would be 0.14, so 0.14 would be able to read it (and the prior formats); writing to an existing store would have to respect that store's version (e.g. don't auto-upgrade). Maybe we'd have to add a 'compat' flag to write in the original format. But that's why I added it in the first place, to enable changes.
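For reference, a small sketch of how that version string can be inspected with PyTables; the attribute names below are what HDFStore writes on its group nodes as far as I can tell, and the file/key are carried over from the earlier example:

```python
import tables

with tables.open_file('data.h5', mode='r') as h5:
    group = h5.get_node('/df_long')
    print(group._v_attrs.pandas_type)     # e.g. 'frame_table'
    print(group._v_attrs.pandas_version)  # e.g. '0.10.1'
```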
Good deal, thanks for the explanation. 👍
Where in the code should I look to write/read the Table column meta-data as a CArray itself? I'm finally in the position where I need to create long and wide HDF5 append-able tables, and would love to see this limit disappear. I know we have to be careful about backward compatibility with older HDF5 table formats. Any advice is appreciated.
https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L2895 is where the meta-data is written; then in the next function you can read it in (determine whether it's a CArray and convert it to a list). Shouldn't be difficult to change (even with backward compat), lmk.
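A rough sketch of the general idea, not the pandas implementation itself: write the (possibly very long) list of column names into its own CArray node instead of a node attribute, then read it back as a plain Python list. The node name and file path are invented for illustration:

```python
import numpy as np
import tables

columns = ['A%d' % i for i in range(1500)]
names = np.array(columns, dtype='S')        # fixed-width byte strings

with tables.open_file('meta.h5', mode='w') as h5:
    atom = tables.StringAtom(itemsize=names.dtype.itemsize)
    carr = h5.create_carray('/', 'non_index_axes_0', atom, shape=names.shape)
    carr[:] = names                         # names live in array data, not node attrs

with tables.open_file('meta.h5', mode='r') as h5:
    node = h5.get_node('/non_index_axes_0')
    restored = [name.decode('utf-8') for name in node[:]]

print(restored[:3])                         # ['A0', 'A1', 'A2']
```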
@jreback I'm interested in picking this up - to summarize, the approach is to store the list-valued meta-data (e.g. non_index_axes) as CArrays rather than as node attributes, and convert them back to lists on read?
@hhuuggoo ohhh great!
I think you only need to worry about the meta-data that are lists ATM; e.g. I wouldn't change the names, as that would make backward compat really difficult.
For testing, I would make a couple of test tables in 0.16.1 that comprise different dtypes/indexes, maybe with compression, etc.; they can all be small. This is to ensure that files written by prior versions can still be read. It will be a forward-compat break, IOW an older version of pandas cannot read a 0.17.0-created HDFStore, but I don't think that's a big deal.
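A possible shape for such fixtures, sketched with assumed file names and keys (the actual fixture layout would live in the pandas test suite):

```python
import numpy as np
import pandas as pd

frames = {
    'floats': pd.DataFrame(np.random.randn(5, 3), columns=list('abc')),
    'mixed': pd.DataFrame({'i': np.arange(5),
                           's': list('vwxyz'),
                           'ts': pd.date_range('2015-01-01', periods=5)}),
}

# small compressed fixture written by the *old* pandas version,
# checked in and read back by the new code to verify compatibility
with pd.HDFStore('legacy_table_0.16.1.h5', complevel=9, complib='zlib') as store:
    for key, df in frames.items():
        store.append(key, df, data_columns=True)
```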
@jreback
I get this very strange error when writing very wide HDF5 tables. In this case a random float32 array with 1500 columns can't be written as an HDF5 table. However, if I rename the columns it seems to write fine...
Not sure what's going on.
-Gagi
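A hedged reproduction sketch of what is described above (exact behaviour depends on the pandas/PyTables versions; the file and key names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 1500).astype('float32'))

# failing case reported above: default int64 column names
with pd.HDFStore('wide.h5') as store:
    store.append('wide_int_cols', df)   # raises once the column meta-data overflows the node attrs

# reported workaround: prepend 'A' to every column name
df.columns = ['A%d' % c for c in df.columns]
with pd.HDFStore('wide.h5') as store:
    store.append('wide_str_cols', df)   # writes fine in the report
```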