ENH: support for msgpack serialization/deserialization #3525
Conversation
@y-p can we add msgpack to the travis build (the full one?) |
wait a minute, why aren't you using tox? |
hahah...you do keep pushing on tox....breaking down and installing tox..... I just did |
This would be so great. I keep getting unreproducible pickle failures (SystemError) telling me no exception is set. I would love to be able to serialize without using pytables/pickle/npy. |
I started on avro, but it seemed harder to do...this is pretty straightforward |
avro looks kind of insane |
what about the namespace? to keep it consistent? |
next question, I 'borrowed' the numpy impl (though I modified it) from this: https://github.com/lebedov/msgpack_numpy, license looks ok.....do we include a reference (in our code)? should I send him an e-mail? |
That looks fine to me (the namespace). Looks like the impl calls |
Here's why it's a prototype |
Hm. I love coding in Cython and it looks like there's a C interface, so I could hack an interface for you if u want. Not sure if it's worth it though. Maybe for row counts on the order of 1e6 we could dispatch to Cython... |
well, also, we need to tread very lightly here; data serialization bugs are a big deal. Or maybe add a "safe=True" option that reads back the data and performs a comparison |
so out-of-the-box msgpack IS faster...it's my datetimes that are screwing me; pickle is probably storing a tuple of values, plus the tz string? string index vs datetimeindex |
re licensing: IANAL, but the license is 2-clause BSD so that's compatible with pandas. In any case, a copy of the LICENSE goes in LICENSES/, and as a courtesy drop the author an e-mail. |
Better....had to avoid all of the boxing/unboxing in the index...... |
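A minimal sketch of what "avoiding the boxing/unboxing" could look like (my guess at the approach, not the PR's exact code): ship the DatetimeIndex as its raw int64 nanoseconds plus the tz string, instead of boxing one Timestamp per element.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2013-01-01', periods=1000, tz='US/Eastern')

# pack: raw i8 buffer + tz name, no per-element Timestamp objects
payload = (idx.asi8.tobytes(), str(idx.tz))

# unpack: rebuild from the buffer in one vectorized call
i8 = np.frombuffer(payload[0], dtype='i8')
restored = pd.to_datetime(i8, utc=True).tz_convert(payload[1])
assert restored.equals(idx)
```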
hmmm....ok...I already integrated the code (I had to change it a bit)....so I'll put the license file and the banner at the top....thxs |
@y-p how do I get an e-mail address for someone on github? |
clone a repo they've committed to and use git log |
This is really awesome by the way. I'm +1000 on having a pickle-independent serialization format. Was waiting for the superhero (you) to do it. I'm gonna try to find some time in the next few days to play with this a bit to look at some of the low level details. It might be worth bringing in the python-msgpack cython code so we have full control over things |
@wesm definitely worth investigating. I pushed some docs updates; will finish up the rest of the missing types soon. |
strongly urge 0.12. big change + more time for tire kicking in master. |
@y-p changed to 0.12 @wesm now supports all pandas types (incl Sparse), docs included; added iterator support. Still open on whether we can implement some sort of random accessibility...... go for perf improvements! |
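For reference, a hedged sketch of the iterator support just mentioned, assuming the `iterator` flag as it later shipped in the packers module (to_msgpack/read_msgpack were removed in pandas 1.0):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(10, 2))
df2 = pd.DataFrame(np.random.rand(10, 2))

# multiple args are streamed into one file, in order
pd.to_msgpack('frames.msg', df1, df2)

# iterator=True yields the stored objects one at a time
for obj in pd.read_msgpack('frames.msg', iterator=True):
    print(type(obj))
```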
DOC: install.rst mention
DOC: added license from msgpack_numpy
PERF: changed Timestamp and DatetimeIndex serialization for speedups
add vb_suite benchmarks
ENH: added to_msgpack method in generic.py, and default import into pandas
TST: all packers to always be imported, fail on usage with no msgpack installed
ENH: provide automatic list if multiple args passed to to_msgpack
DOC: changed docs to 0.12
ENH: iterator support for stream unpacking …,IntIndex,BlockIndex
This is so cool. Can you save arbitrary objects, e.g., a dict of dicts of DataFrames? |
With 1e7 rows there is sadness. Also, not sure if this is expected, but:

```python
df = DataFrame(rand(int(1e7), 16))
pd.to_msgpack('huge.msg', df)   # takes a long time :(
!du huge.msg                    # 1795796000
df.values.nbytes == 1280000000
# difference of ~500 MB?
```
|
yes |
I think we need to use the msgpack array type to store the ndarrays |
Are the extra bytes from converting to a list? |
not sure (ndarrays are stored as 1-dim lists), so could be |
I'm not sure that's strictly true. Conceptually yes, but lists are bulkier because they can be heterogeneously typed, whereas ndarrays are almost always homogeneously typed. Anyway, I'm not trying to be pedantic here, just trying to get to the bottom of efficiently storing arrays. |
going to try with an array type; it's clear lists are packed element by element, which is prob inefficient in terms of storage size (as they need markers between elements) and speed, e.g. one-by-one.... |
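A quick way to check that claim (assumes the `msgpack` package): packing element by element costs a type marker per value, while the raw buffer packs as a single bin object with one small header.

```python
import numpy as np
import msgpack

arr = np.random.rand(1_000_000)
as_list = msgpack.packb(arr.tolist())    # ~9 bytes per float64 (marker + payload)
as_bytes = msgpack.packb(arr.tobytes())  # ~8 bytes per float64 + a few-byte header
print(len(as_list), len(as_bytes), arr.nbytes)
```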
Where do you see arrays in the Python impl? I only see it in the C impl. |
pack_array_header is a def, so I should be able to call it from python..... |
Ah, yes. silly me. |
I think we definitely need to drop into the cython code to handle ndarrays. The issue is that they are converted to lists (which are then packed via arrays I think). However, we can skip this by just defining a numpy type in the packer |
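In the spirit of that idea, a minimal sketch of "defining a numpy type in the packer" via msgpack's hooks (close to what msgpack_numpy does, not the exact code in this PR): encode an ndarray as a small dict of metadata plus its raw buffer, and rebuild it on unpack.

```python
import numpy as np
import msgpack

def encode(obj):
    # called for any type msgpack can't pack natively
    if isinstance(obj, np.ndarray):
        return {b'nd': True,
                b'dtype': obj.dtype.str,
                b'shape': obj.shape,
                b'data': obj.tobytes()}
    return obj

def decode(obj):
    # rebuild ndarrays from the metadata dict; pass everything else through
    if b'nd' in obj:
        return np.frombuffer(obj[b'data'],
                             dtype=obj[b'dtype']).reshape(obj[b'shape'])
    return obj

packed = msgpack.packb(np.arange(10.0), default=encode)
arr = msgpack.unpackb(packed, object_hook=decode)
```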
Yep I was actually thinking the same thing. Probably need to iterate |
might need to be careful here. the c code uses the somewhat classic misinterpretation of unions, namely that

```c
#include <stdio.h>
#include <inttypes.h>

union X {
    double x;
    int64_t y;
};

int main(void) {
    union X u;
    u.x = 1.0;
    printf("%" PRId64 "\n", u.y);  /* reads a member other than the one last written */
    return 0;
}
```

is technically not correct: you can only safely read back the union member that was most recently written. see here for the occurrence. Though, I'm not sure if something like |
so I put up a commit, 5a02cdf, which implements, on a testing basis, compression using zlib (included in the python dist) and blosc (pip installable), specified with the compress keyword. I am compressing just the non-object numpy arrays at the very lowest level; this writes a byte string, which I think msgpack handles in cython code directly..... This is the 10m rows dataset.
|
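A hedged sketch of the compression option being described, using the `compress` keyword as it later shipped in the packers API (removed in pandas 1.0):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(int(1e6), 16))
df.to_msgpack('frame_zlib.msg', compress='zlib')    # zlib ships with python
df.to_msgpack('frame_blosc.msg', compress='blosc')  # needs `pip install blosc`
```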
profile of pickle vs using blosc for compression
|
for comparison, writing in HDF5 (no compression): this is pretty comparable (format-wise).
This is slow (but appendable and in a queryable format).
Sizes are comparable.
|
The HDF5 reads are blazing fast...... |
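Roughly, the HDF5 side of the comparison looks like this (a sketch; `format='table'` is the later spelling of the appendable/queryable store, and PyTables must be installed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(int(1e6), 16))
df.to_hdf('store_fixed.h5', key='df')                  # fixed format: fast, not appendable
df.to_hdf('store_table.h5', key='df', format='table')  # slower, but appendable/queryable
result = pd.read_hdf('store_fixed.h5', key='df')       # the blazing-fast read path
```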
I think an additional feature we can add here is to support pickled objects as a savable type (right now this will fail); that way, if something doesn't exist natively you could still save it (e.g. say you want to save an XLS file or something)....shouldn't be too hard to do |
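A sketch of that pickle-fallback idea with hypothetical helper names (not the PR's code): anything without a native msgpack representation gets pickled into a payload the decoder recognizes and unpickles.

```python
import pickle
import msgpack

def encode_fallback(obj):
    # called by msgpack for any type it cannot pack natively
    return {b'pickled': True, b'data': pickle.dumps(obj)}

def decode_fallback(obj):
    if obj.get(b'pickled'):
        return pickle.loads(obj[b'data'])
    return obj

packed = msgpack.packb({'anything': object()}, default=encode_fallback)
restored = msgpack.unpackb(packed, object_hook=decode_fallback)
```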
closing in favor of #3831 |
warning: prototype!
msgpack serialization/deserialization
Here are 2 features which I think msgpack supports, but I have to look further: it can survive a
change in the schema (which is not directly stored), and it is MUCH more
transparent than pickle
usage is exactly like pickle (aside from it being in a different namespace), allowing
arbitrary combinations of storage; e.g. this supports the added storage of pandas objects,
but obviously can also store
{ 'frame1' : df1, 'frame2' : df2 }
etc.

storage example
DataFrame(np.random.rand(1000, 10))
on my machine stores as a 128k file, and scales pretty well, e.g. 10k rows is 1.26mb
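As a concrete version of the pickle-like usage described above (the packers API as it later shipped; removed in pandas 1.0):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.rand(1000, 10))
df2 = pd.DataFrame(np.random.rand(1000, 10))

pd.to_msgpack('frames.msg', {'frame1': df1, 'frame2': df2})
data = pd.read_msgpack('frames.msg')  # round-trips the dict of frames
```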
Not completely happy with the `packers` name, any suggestions?

closes #686