BUG: issue in HDFStore with too many selectors in a where #2755

Merged
merged 4 commits into from
Feb 10, 2013


jreback
Contributor

@jreback jreback commented Jan 25, 2013

  • this is very hard to reproduce, and can give a ValueError or just crash (somewhere in numexpr or hdf5 it has trouble parsing the expression, and can run out of memory); the easiest solution is just to limit the number of selectors

    just to be clear, you really have to do something like this:

    store.select('df', [Term('index', ['A', 'B', 'C', ...])])
    where the term has more than 31 selectors (and fewer than 61) to produce this error, and even then
    it only triggers sometimes.

    This doesn't change functionality at all: if there are more than the specified number of selectors, it just
    does a filter (e.g. brings in the whole table and then reindexes). This is really only an issue
    if you actually try to specify many values that the index can be (which usually isn't a string anyhow)

    but I did hit it! so it's a 'buglet'

    here's the issue in PyTables (but it is actually somewhere in numexpr)
    http://sourceforge.net/mailarchive/message.php?msg_id=30390757

  • added cleanup code to force removal of output files if the testing is interrupted by ctrl-c

  • added dotted (attribute) access to stores, e.g. store.df == store['df']
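
As a rough sketch of the fallback described in the first bullet: when a term carries more values than some limit, no query expression is built and the rows are filtered after reading. This is not the pandas implementation; `MAX_SELECTORS`, `build_condition` and `select_rows` are hypothetical names, the table is a plain list of dicts, and the limit of 31 is simply taken from the description above.

```python
# Hypothetical sketch; not the actual pandas/PyTables code.
MAX_SELECTORS = 31  # assumed limit, taken from the description above


def build_condition(column, values):
    """Return an OR-ed expression string, or None if there are too many values."""
    if len(values) > MAX_SELECTORS:
        return None  # caller must fall back to filtering
    return " | ".join("({} == {!r})".format(column, v) for v in values)


def select_rows(table, column, values):
    """Select rows of `table` (a list of dicts) whose `column` is in `values`."""
    condition = build_condition(column, values)
    if condition is None:
        # Too many selectors: read everything, then filter in memory.
        wanted = set(values)
        return [row for row in table if row[column] in wanted]
    # Few selectors: a real store would hand `condition` to the query engine.
    # Here the same membership test is evaluated directly.
    return [row for row in table if row[column] in values]


table = [{"index": "k%d" % i, "value": i} for i in range(100)]
few = select_rows(table, "index", ["k1", "k2"])
many = select_rows(table, "index", ["k%d" % i for i in range(40)])
print(len(few), len(many))
```

Both paths return the same rows; only the amount of data read before filtering differs.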

@alvorithm

@jreback I read about the facility to map object->(e.g.)int16 in io.rst.

Will this happen through a dtype header argument to HDFStore.put/append?

This is crucial for my project, where I have incoming data rates of ~12Gb/hr at 2 bytes per data point. Storing these integers using 8 bytes (int64) is the greatest barrier to my using HDFStore instead of a custom-made interface to PyTables.
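
To make the storage cost concrete, the arithmetic can be sketched as follows. It assumes that "~12Gb/hr" means roughly 12 gigabytes per hour of int16 samples; the unit is not spelled out above, so treat that as an assumption.

```python
BYTES_PER_HOUR_INT16 = 12e9   # assumption: ~12 GB/hr of raw int16 data
INT16_SIZE = 2                # bytes per data point, as stated above
INT64_SIZE = 8                # bytes per data point if stored as int64

points_per_hour = BYTES_PER_HOUR_INT16 / INT16_SIZE
bytes_per_hour_int64 = points_per_hour * INT64_SIZE

print("points per hour: %.1e" % points_per_hour)
print("int64 GB per hour: %.0f" % (bytes_per_hour_int64 / 1e9))
print("growth factor: %.0f" % (bytes_per_hour_int64 / BYTES_PER_HOUR_INT16))
```

Under that assumption, storing the same samples as int64 multiplies the volume by four before compression.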

EDIT: I see now that this note is in reference to separate PR #2708. Thanks

@jreback
Contributor Author

jreback commented Feb 7, 2013

if you install PR #2708, then PyTables WILL support int16 (it will store whatever dtypes are passed in). Keep in mind that certain operations on ints may upcast (see Issue #2794). I did a pseudo-example in #2759. Can you give a mini-example of your workflow?

@alvorithm

Thanks @jreback
I am currently checking PR #2708. Where would I have to look to help improve support of int16 re. upcasting?

My workflow:

the heavy (raw or preprocessed) data

  • I am ingesting (hopefully via HDFStore) up to ~70Gb daily of binary interleaved 20 or 30kHz 64-channel electrophysiology data into a HDF5 file (flat group structure). These voltages are int16.
  • the "primary key" to these time series is a time column spanning up to a day with 20 or 30kHz resolution.
  • some of the channels can be meaningfully grouped in fours, but only rarely will operations occur on rows across four columns (median/avg); most of the processing is on a single time series, the whole column.
  • the most frequent operation is to obtain ranges of channel data as defined by the time column (which is for now just int64, but should eventually become integer-based datetime64's with microsecond resolution). Better yet if this time index, which is regularly spaced, could be stored generatively (and serialized as a leaf attribute instead of a whole column; this is something the indirection through HDFStore could support).
  • I got the impression that append_to|select_as_multiple could help in this scenario: all channels referenced to a single index.
  • A fairly frequent operation is to run multithreaded (via FFTW) FFTs or FFT-based convolutions, and cache the result in another HDF5 file for future use
    • at the moment each channel is a carray and operations are on a single chunk, which taxes memory too much (casts also occur in FFTW or when calling numexpr).
    • blosc compression attains a factor of 1.8 and does not improve going from complevel 5 to complevel 9. Blosc is way faster than anything else, and speed is more important here than compression.
  • values are all sacrosanct, i.e. because experimental, once written they should not be modified. In a sense what I need is a column database.
  • another operation is downsampling by factor 4.
  • most useful operations on this raw-data: thresholding, chunk-wise averages/SDs.
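
The "generative" time index mentioned in the list above can be sketched like this: with a regularly spaced index, only the start time and the sample rate need to be stored, and a time range maps to a row range by arithmetic. `RegularTimeIndex` and its methods are hypothetical names for illustration, not part of HDFStore or PyTables.

```python
class RegularTimeIndex(object):
    """Regularly spaced time index described by two numbers instead of a column."""

    def __init__(self, start_us, rate_hz):
        self.start_us = start_us      # timestamp of row 0, in microseconds
        self.rate_hz = rate_hz        # samples per second, e.g. 20000 or 30000

    def row_for(self, t_us):
        """First row whose timestamp is >= t_us."""
        elapsed = max(t_us - self.start_us, 0)
        # Ceiling division on integers avoids floating point rounding.
        return -(-elapsed * self.rate_hz // 1000000)

    def time_for(self, row):
        """Timestamp of a row, in microseconds (truncated to an integer)."""
        return self.start_us + row * 1000000 // self.rate_hz

    def row_range(self, t0_us, t1_us):
        """Half-open row range covering timestamps in [t0_us, t1_us)."""
        return self.row_for(t0_us), self.row_for(t1_us)


idx = RegularTimeIndex(start_us=0, rate_hz=20000)
start, stop = idx.row_range(1000000, 2000000)   # from second 1 to second 2
print(start, stop)
```

The resulting row range could then be used to read only that slice of each channel, if the reader supports selecting by row position.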

the metadata and computed tables

  • need to store far more (relationally, dtype-wise) complex tables that are either metadata or computation results, but these can go elsewhere, even to SQL stores if needed. The 'typical worst' of these tables may have about 1E7 rows and 2-5 columns.
  • but these tables are constantly supplemented with synchronized computed columns, so again I think the right approach is to add these computed columns referenced to the original table as selector.

For computed tables, I have a derived class of HDFStore that catches a failed getitem and looks up the appropriate compute function. This way you can empty your 'cache' and have the results reconstructed. Ideally these functions should report the git SHA1 of the repo they live in along with parameters used somewhere (not implemented).
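
A minimal sketch of that compute-on-miss pattern, using a plain `dict` subclass as a stand-in for the HDFStore subclass. The real class is not shown in this thread, so `ComputingStore`, `register` and the example function are all hypothetical.

```python
class ComputingStore(dict):
    """Dict-backed stand-in for a store that rebuilds missing entries on demand."""

    def __init__(self):
        super(ComputingStore, self).__init__()
        self._compute = {}   # key -> zero-argument function producing the value

    def register(self, key, func):
        """Associate a key with the function that can (re)compute it."""
        self._compute[key] = func

    def __missing__(self, key):
        # Called by dict.__getitem__ only when `key` is absent.
        if key not in self._compute:
            raise KeyError(key)
        value = self._compute[key]()
        self[key] = value    # cache for future lookups
        return value


calls = []


def thresholded():
    calls.append(1)          # record each real computation
    return [v for v in [3, -7, 12, 1] if abs(v) > 5]


store = ComputingStore()
store.register("thresholded", thresholded)
first = store["thresholded"]     # computed on first access
second = store["thresholded"]    # served from the cache
store.clear()                    # empty the cache
third = store["thresholded"]     # reconstructed on the next access
print(first, len(calls))
```

Clearing the store removes the cached values but keeps the registered functions, so results are rebuilt lazily.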

the resources

Single machine with 12 cores, 32GB of ram, data on SATA3-connected spinning disks.

As to how I can help: solo developer for this application (scientific, not compsci background :/), grok python, functional bias, willing to contribute to pandas on HDFStore, relational extensions, interval algebra and spectral tools, and currently working out the git workflow.

what next?

Sorry for the long post. I contemplated enhancing HDFStore in Q3 2012 and am extremely happy to see your progress (and grateful for this work!). Please tell me if you'd like to move some of these points for private / pydata list discussion.

@jreback
Contributor Author

jreback commented Feb 7, 2013

sounds interesting....prob should move this discussion off-list....contact me jeff@reback.net

here's some thoughts about data organization which is of course the key thing in PyTables:

each of your 'channels' is an index with a small number of columns, so it makes sense to store each
in a single node. that way it's easy to select by time range or read in everything.

I have found that having very 'wide' tables (meaning a large number of columns) reduces performance;
that is the reason for append_to_multiple/select_as_multiple - to essentially have a 'sharded'
setup

try out with the dtypes PR; I think you should be able to read in and store your tables in
int16.

What I mean about manipulations is essentially this: if you do an operation which (possibly) introduces
a nan, then there will be upcasting (because pandas doesn't currently have an integer NaN, and must
cast to float for various things). I have tried to mitigate this, but depending on exactly what you are
doing you may still get upcasting. That said, there are various ways to deal with this. Best to
take a smallish set of data and try out your workflow, checking at each step.
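
A small illustration of the upcasting described above, assuming a reasonably recent pandas; the series and its values are made up for the example.

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype="int16")
print(s.dtype)            # int16

# Reindexing onto a label that does not exist introduces a missing value,
# which forces a cast to a float dtype.
r = s.reindex([0, 1, 2, 3])
print(r.dtype)            # a float dtype

# Once the missing values are dealt with, the data can be cast back explicitly.
back = r.fillna(0).astype("int16")
print(back.dtype)         # int16
```

Checking `.dtype` after each step like this is one way to see where a workflow leaves int16.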

I believe blosc actually only has an on/off mode. It doesn't do variable compression (I read
this somewhere on the pytables website - don't really remember); but it is a great compressor (of
course the underlying data really determines how much it helps).

Always glad to have help on expanding HDFStore; I use it on a daily basis, but I don't really
do out-of-core type computations (nor do I use integer types), so all feedback/changes are welcomed.

Jeff

@scottkidder

Please do not move this conversation to a private location. I would like to be involved; I have a somewhat similar workflow to meteore and am continuing to hammer on the pandas pytables interface.

@jreback
Contributor Author

jreback commented Feb 7, 2013

you are right!

@jreback
Contributor Author

jreback commented Feb 10, 2013

@wesm I think you can merge this PR then cherry pick meteore changes from #2824 after

wesm added a commit that referenced this pull request Feb 10, 2013
@wesm wesm merged commit 7065ff0 into pandas-dev:master Feb 10, 2013
@wesm
Member

wesm commented Feb 10, 2013

merged, thank you sirs
