
ENH: support for msgpack serialization/deserialization #3525

Closed

Conversation

@jreback (Contributor) commented May 3, 2013

warning: prototype!

msgpack serialization/deserialization

  • support all pandas objects: Timestamp, Period, all index types, Series, DataFrame, Panel, the Sparse suite
  • docs included (in io.rst)
  • iterator support
  • top-level api support

Here are a few aspects of msgpack that I still have to look into further:

  • no support for compression directly, but you can compress the file yourself (e.g. with gzip; see the sketch just after this list)
  • access is sequential
  • versioning is not that hard, because it's pretty easy to deal with a change in the schema (which is not directly stored), and this is MUCH more transparent than pickle
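
A minimal sketch of the gzip-on-top idea from the first bullet, using the to_msgpack/read_msgpack API from this PR (the file names are just placeholders):

import gzip
import shutil
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 10))

# write the msgpack file, then compress the whole thing with gzip
pd.to_msgpack('foo.msg', df)
with open('foo.msg', 'rb') as src, gzip.open('foo.msg.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)

# decompress back to a plain msgpack file before reading
with gzip.open('foo.msg.gz', 'rb') as src, open('foo_roundtrip.msg', 'wb') as dst:
    shutil.copyfileobj(src, dst)
result = pd.read_msgpack('foo_roundtrip.msg')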

usage is exactly like pickle (aside from being in a different namespace), allowing
arbitrary combinations of storage: pandas objects are supported natively, but you can
obviously also store things like { 'frame1' : df1, 'frame2' : df2 }, etc.
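
For example, a short sketch of storing several frames at once with the top-level API from this PR:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
df2 = pd.DataFrame(np.random.randn(5, 3), columns=list('XYZ'))

# a plain dict of pandas objects goes into a single msgpack file
pd.to_msgpack('frames.msg', {'frame1': df1, 'frame2': df2})

frames = pd.read_msgpack('frames.msg')
print(frames['frame1'])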

storage example: DataFrame(np.random.rand(1000,10)) stores as a 128k file on my machine,
and it scales pretty well, e.g. 10k rows is 1.26mb

Not completely happy with the packers name; any suggestions?

closes #686

In [1]: df = DataFrame(randn(10,2),
   ...:                      columns=list('AB'),
   ...:                      index=date_range('20130101',periods=10))

In [2]: pd.to_msgpack('foo.msg',df)

In [3]: pd.read_msgpack('foo.msg')
Out[3]: 
                   A         B
2013-01-01  0.676700 -1.702599
2013-01-02 -0.070164 -1.368716
2013-01-03 -0.877145 -1.427964
2013-01-04 -0.295715 -0.176954
2013-01-05  0.566986  0.588918
2013-01-06 -0.307070  1.541773
2013-01-07  1.302388  0.689701
2013-01-08  0.165292  0.273496
2013-01-09 -3.492113 -1.178075
2013-01-10 -1.069521  0.848614

@jreback (Contributor, Author) commented May 3, 2013

@y-p can we add msgpack to the travis build (the full one?)

@ghost commented May 3, 2013

wait a minute, why aren't you using scripts/use_build_cache.py + tox/detox?
It'll make you so much more productive. You may in fact prove that there's a
schwarzschild radius for coding and turn into a black hole.

@jreback (Contributor, Author) commented May 3, 2013

hahah... you do keep pushing on tox.... breaking down and installing tox now.

I just ran tox in my main pandas repo and it looks like it's installing stuff...

@cpcloud (Member) commented May 3, 2013

This would be so great. I keep getting unreproducible pickle failures (SystemError) telling me no exception is set. I would love to be able to serialize without using pytables/pickle/npy.

@jreback (Contributor, Author) commented May 3, 2013

I started on avro, but it seemed harder to do... this is pretty straightforward

@cpcloud (Member) commented May 3, 2013

avro looks kind of insane

@jreback (Contributor, Author) commented May 3, 2013

what about this namespace?

df.to_msgpack(path)
com.to_msgpack(path,dict/list/obj of stuff)
read_msgpack(path)

to keep consistent?

@jreback (Contributor, Author) commented May 3, 2013

next question: I 'borrowed' the numpy impl (though I modified it) from https://github.com/lebedov/msgpack_numpy, and the license looks ok... do we include a reference (in our code)? should I send him an e-mail?

@cpcloud (Member) commented May 3, 2013

That looks fine to me (the namespace). Looks like the impl calls ndarray.tolist which will be horribly slow for large arrays.
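
A rough illustration of the concern (not from the PR): converting to a Python list touches every element one at a time, whereas a raw-bytes dump is a single buffer copy.

import timeit
import numpy as np

arr = np.random.randn(1000000)

# element-by-element conversion to Python objects (what the borrowed impl does)
t_list = timeit.timeit(arr.tolist, number=10)

# single raw buffer copy that a bytes-based encoding could use instead
t_bytes = timeit.timeit(arr.tostring, number=10)

print('tolist  :', t_list)
print('tostring:', t_bytes)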

@jreback (Contributor, Author) commented May 3, 2013

Here's why it's a prototype:
slower and bigger than pickle... hmm...

In [11]: df
Out[11]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10000 entries, 2000-01-01 00:00:00 to 2027-05-18 00:00:00
Freq: D
Data columns (total 10 columns):
0    10000  non-null values
1    10000  non-null values
2    10000  non-null values
3    10000  non-null values
4    10000  non-null values
5    10000  non-null values
6    10000  non-null values
7    10000  non-null values
8    10000  non-null values
9    10000  non-null values
dtypes: float64(10)

In [12]: %timeit packers.save('foo.msg',df)
10 loops, best of 3: 59.1 ms per loop

In [13]: %timeit df.save('foo.pickle')
100 loops, best of 3: 11 ms per loop

-rw-rw-r--  1 jreback users  880622 May  3 16:21 foo.pickle
-rw-rw-r--  1 jreback users 1300971 May  3 16:21 foo.msg

@cpcloud (Member) commented May 3, 2013

Hm. I love coding in Cython and it looks like there's a C interface, so I could hack an interface for you if you want. Not sure if it's worth it though. Maybe for row counts on the order of 1e6 we could dispatch to Cython...

@ghost commented May 3, 2013

well, save/load suggest a privileged status, but we'd like to nudge users
towards the new format when this becomes fully operational, so not sure about the naming.

Also, need to tread very lightly here, data serialization bugs are a big deal, we can't just
play cowboy with people's data. Maybe add an option to embed a version of the pickle data
in the msgpack file for a major version or two? that way people can recover from errors
at the cost of increased storage space.

Or maybe add a "safe=True" option that reads back the data and performs a comparison,
to warn the user if something goes wrong?
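
A minimal sketch of what that "safe" round trip could look like, written here as a hypothetical wrapper (to_msgpack_safe is not part of this PR; it assumes the PR's top-level API and the era's pandas.util.testing helpers):

import pandas as pd
from pandas.util.testing import assert_frame_equal

def to_msgpack_safe(path, df):
    # write, immediately read back, and compare against the original frame
    pd.to_msgpack(path, df)
    roundtripped = pd.read_msgpack(path)
    assert_frame_equal(df, roundtripped)  # raises if anything was lost or corrupted

At the cost of one extra read, this would surface serialization bugs while the original data is still in memory.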

@jreback (Contributor, Author) commented May 3, 2013

so out-of-the-box msgpack IS faster... it's my datetimes that are screwing me; pickle is probably
a lot smarter. I was just doing isoformat()... what's the fastest way of serializing datetimes?

a tuple of values plus a tz string?
some ordinal trick?
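
One obvious candidate (a sketch, not necessarily what ends up in the PR) is to ship the index's underlying int64 nanosecond values plus the tz/freq metadata, rather than formatting each element:

import pandas as pd

idx = pd.date_range('20000101', periods=25000, freq='H')

# encode: raw int64 nanoseconds since the epoch, no per-element boxing
i8 = idx.asi8
meta = {'freq': 'H', 'tz': None}

# decode: rebuild the DatetimeIndex straight from the int64 values
rebuilt = pd.DatetimeIndex(i8, freq=meta['freq'], tz=meta['tz'])
assert rebuilt.equals(idx)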

string index

In [1]: index = [rands(10) for _ in xrange(25000)]

In [2]: df = DataFrame({'float1' : randn(25000),
   ...:                 'float2' : randn(25000)},
   ...:                index=index)

In [3]: %timeit df.to_msgpack('foo.msg')
100 loops, best of 3: 14.2 ms per loop

In [4]: %timeit df.save('foo.pickle')
10 loops, best of 3: 20.1 ms per loop

datetimeindex

In [9]: %timeit df.to_msgpack('foo.msg')
10 loops, best of 3: 120 ms per loop

In [10]: %timeit df.save('foo.pickle')
100 loops, best of 3: 8.2 ms per loop

@ghost commented May 3, 2013

re licensing: IANAL, but the license is 2-clause BSD, so that's compatible with pandas.
If you're using practically the entire thing, perhaps just make it a dependency; if only parts,
try to put them in a separate file, with the licensing banner at the top.

In any case, a copy of LICENSE goes in LICENSES/, and as a courtesy drop the author an
email to let him know his code is useful and that it's being integrated into a fairly
high-profile library, as a form of thanks.

@jreback (Contributor, Author) commented May 3, 2013

Better....had to avoid all of the boxing/unboxing in the index......

In [13]: index=date_range('20000101',periods=25000,freq='H')

In [14]: df = DataFrame({'float1' : randn(25000),
                'float2' : randn(25000)},
               index=index)

In [15]: %timeit df.save('foo.pickle')
100 loops, best of 3: 9.38 ms per loop

In [16]: %timeit pd.to_msgpack('foo.msg',df)
100 loops, best of 3: 13 ms per loop

@jreback (Contributor, Author) commented May 3, 2013

hmmm... ok... I already integrated the code (I had to change it a bit), so I'll put in the license file and the banner at the top... thanks

@jreback (Contributor, Author) commented May 3, 2013

@y-p how do I get an e-mail address for someone on github?

@wesm (Member) commented May 3, 2013

clone a repo they've committed to and use git log

@wesm (Member) commented May 3, 2013

This is really awesome by the way. I'm +1000 on having a pickle-independent serialization format. Was waiting for the superhero (you) to do it. I'm gonna try to find some time in the next few days to play with this a bit to look at some of the low level details. It might be worth bringing in the python-msgpack cython code so we have full control over things

@jreback (Contributor, Author) commented May 3, 2013

@wesm definitely worth investigating. I pushed some docs updates; will finish up the rest of the missing types soon.

@ghost commented May 4, 2013

strongly urge 0.12. big change + more time for tire kicking in master.

@jreback (Contributor, Author) commented May 4, 2013

@y-p changed to 0.12

@wesm now supports all pandas types (incl Sparse), docs included, added iterator support, and added the
ability to append to a msgpack file (you can actually do this with pickle too, FYI)... you need to open the file with a+b

still open on whether we can implement some sort of random accessibility...

go for perf improvements!

DOC: install.rst mention

DOC: added license from msgpack_numpy

PERF: changed Timestamp and DatetimeIndex serialization for speedups

      add vb_suite benchmarks

ENH: added to_msgpack method in generic.py, and default import into pandas

TST: all packers to always be imported, fail on usage with no msgpack installed
ENH: provide automatic list if multiple args passed to to_msgpack

DOC: changed docs to 0.12

ENH: iterator support for stream unpacking
@cpcloud (Member) commented May 4, 2013

This is so cool. Can you save arbitrary objects, e.g., a dict of dicts of DataFrames?

@cpcloud (Member) commented May 4, 2013

With 1e7 rows there is sadness. Also, not sure if this is expected but

df = DataFrame(rand(1e7, 16))
pd.to_msgpack('huge.msg', df) # takes a long time :(
!du huge.msg # 1795796000
df.values.nbytes == 1280000000
# difference of ~500 MB?

@jreback (Contributor, Author) commented May 4, 2013

yes

@jreback (Contributor, Author) commented May 4, 2013

I think we need to use the msgpack array type to store the ndarrays,
but I haven't found an example of its use

@cpcloud (Member) commented May 4, 2013

Are the extra bytes from converting to a list?

@jreback (Contributor, Author) commented May 4, 2013

not sure (ndarrays are stored as 1-dim lists), so it could be

@cpcloud (Member) commented May 4, 2013

I'm not sure that's strictly true. Conceptually yes, but lists are bulkier because they can be heterogeneously typed, whereas ndarrays are almost always homogeneously typed. Anyway, I'm not trying to be pedantic here, just trying to get to the bottom of efficiently storing arrays.

@jreback (Contributor, Author) commented May 4, 2013

going to try with an array type; it's clear that packing lists element by element is probably inefficient both in storage size (as they need markers between elements) and in speed, since everything goes one by one...

@cpcloud (Member) commented May 4, 2013

Where do you see arrays in the Python impl? I only see it in the C impl.

@jreback (Contributor, Author) commented May 4, 2013

pack_array_header is a def, so I should be able to call it from python.....
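
A rough sketch of calling it from Python (this assumes msgpack-python's Packer with its default autoreset behaviour, where each call returns the packed bytes):

import msgpack
import numpy as np

arr = np.arange(5, dtype=np.int64)
packer = msgpack.Packer()

# emit an array header for len(arr) elements, then pack each element in turn
buf = packer.pack_array_header(len(arr))
for x in arr:
    buf += packer.pack(int(x))

print(msgpack.unpackb(buf))  # -> [0, 1, 2, 3, 4]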

@cpcloud (Member) commented May 4, 2013

Ah, yes. silly me.

@jreback (Contributor, Author) commented May 4, 2013

I think we definitely need to drop into the Cython code to handle ndarrays. The issue is that they are converted to lists (which are then packed via arrays, I think). However, we can skip this by just defining a numpy type in the packer
and then writing it directly.
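
A sketch of that idea using the pure-Python hooks (the Cython version would do the same thing without the Python-level round trip); the 'nd' marker key and the encode/decode helpers are made up for illustration:

import msgpack
import numpy as np

def encode(obj):
    # write an ndarray as raw bytes plus the metadata needed to rebuild it
    if isinstance(obj, np.ndarray):
        return {'nd': True,
                'dtype': obj.dtype.str,
                'shape': obj.shape,
                'data': obj.tostring()}
    return obj

def decode(obj):
    if obj.get('nd'):
        return np.frombuffer(obj['data'], dtype=obj['dtype']).reshape(obj['shape'])
    return obj

arr = np.random.randn(3, 4)
buf = msgpack.packb({'arr': arr}, default=encode)
out = msgpack.unpackb(buf, object_hook=decode)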

@cpcloud (Member) commented May 4, 2013

Yep I was actually thinking the same thing. Probably need to iterate
explicitly with integers as well

@cpcloud (Member) commented May 5, 2013

Might need to be careful here. The C code uses the somewhat classic misinterpretation of unions, namely that

union X {
    double x;
    int64_t y;
};

union X u;
u.x = 1.0;
printf("%lld", (long long)u.y);

is technically not correct: strictly speaking, you may only read back the union member that was most recently written. See here for the occurrence. Though I'm not sure if something like uint64_t v = *(int64_t*)&my_double; is any safer. I'm pretty sure that numpy does it the same way as the latter, since it calls PyArray_NewFromDescr, which takes a raw char* to the data that is then interpreted as the new type.

@jreback (Contributor, Author) commented May 6, 2013

so I put up a commit, 5a02cdf, which implements compression on a testing basis, using zlib (included in the python dist) and blosc (pip installable), specified with the compress keyword to to_msgpack. Unfortunately, as I cannot pass the compression variable very easily, it is a bit of a giant hack, just to see if this makes any sense.

I am compressing just the non-object numpy arrays at the very lowest level (via .tostring, then the compression);
see this: https://github.com/FrancescAlted/python-blosc/wiki/Quick-User's-Guide

This writes a byte string, which I think msgpack handles directly in the cython code...
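
Roughly, the values block gets compressed as a byte string before packing; a sketch (variable names are made up, and it assumes python-blosc's compress/decompress on raw bytes):

import zlib
import blosc  # pip installable
import numpy as np

arr = np.random.randn(1000000)
raw = arr.tostring()                       # raw float64 bytes for the values block

z_bytes = zlib.compress(raw)               # stdlib, always available, slower
b_bytes = blosc.compress(raw, typesize=8)  # blosc, geared towards numeric data

# round trip: decompress and view the bytes as float64 again
back = np.frombuffer(blosc.decompress(b_bytes), dtype=np.float64)
assert (back == arr).all()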

This is the 10m rows dataset.

In [1]: df = DataFrame({'float1' : randn(10000000),'float2' : randn(10000000)},index=date_range('20000101',periods=10000000,freq='s'))

In [2]: %timeit df.save('foo.pickle')
1 loops, best of 3: 2.45 s per loop

In [3]: %timeit df.to_msgpack('foo.msg.no_compress')
1 loops, best of 3: 4.24 s per loop

In [4]: %timeit df.to_msgpack('foo.msg.zlib',compress='zlib')
1 loops, best of 3: 9.8 s per loop

In [6]: %timeit df.to_msgpack('foo.msg.zlib',compress='zlib')
1 loops, best of 3: 9.68 s per loop

In [7]: %timeit df.to_msgpack('foo.msg.blosc',compress='blosc')
1 loops, best of 3: 3.21 s per loop

In [8]: %timeit df.load('foo.pickle')
1 loops, best of 3: 281 ms per loop

In [9]: %timeit pd.read_msgpack('foo.msg.zlib')
1 loops, best of 3: 2.31 s per loop

In [10]: %timeit pd.read_msgpack('foo.msg.blosc')
1 loops, best of 3: 1.61 s per loop

In [11]: %timeit pd.read_msgpack('foo.msg.no_compress')
1 loops, best of 3: 3.36 s per loop

In [12]: quit()

[goat-jreback-~/pandas] ls -ltr foo.*
-rw-rw-r-- 1 jreback users 240000602 May  5 20:20 foo.pickle
-rw-rw-r-- 1 jreback users 270000301 May  5 20:21 foo.msg.no_compress
-rw-rw-r-- 1 jreback users 243708088 May  5 20:23 foo.msg.zlib
-rw-rw-r-- 1 jreback users 244501486 May  5 20:23 foo.msg.blosc
So pickle is still the king, though blosc is pretty close.

I had to turn off the encoding (not even sure it should be set anyhow.....), and the unicode tests fail now.

@jreback (Contributor, Author) commented May 6, 2013

profile of pickle vs msgpack using blosc for compression

In [3]: %prun df.save('foo.pickle')
         18 function calls in 2.382 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.171    2.171    2.171    2.171 {method 'close' of 'file' objects}
        1    0.159    0.159    0.191    0.191 {cPickle.dump}
        1    0.032    0.032    0.032    0.032 index.py:519(__reduce__)
        1    0.020    0.020    0.020    0.020 {open}
        2    0.000    0.000    0.000    0.000 index.py:330(__reduce__)

In [2]: %prun df.to_msgpack('foo.msg.blosc',compress='blosc')
         76 function calls in 3.274 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.230    2.230    2.230    2.230 {method 'close' of 'file' objects}
        1    0.509    0.509    0.871    0.871 {method 'pack' of 'msgpack._packer.Packer' objects}
        1    0.156    0.156    0.156    0.156 {method 'tolist' of 'numpy.ndarray' objects}
        1    0.135    0.135    0.135    0.135 {blosc.blosc_extension.compress}
        1    0.090    0.090    0.090    0.090 {method 'write' of 'file' objects}
        1    0.065    0.065    0.065    0.065 {method 'tostring' of 'numpy.ndarray' objects}
        1    0.062    0.062    0.062    0.062 {open}
        1    0.011    0.011    3.274    3.274 packers.py:85(to_msgpack)

@jreback (Contributor, Author) commented May 6, 2013

for comparison to writing in HDF5 (no compression)

This is pretty comparable (format wise)

In [11]: %timeit df.to_hdf('foo.h5.no_table','df')
1 loops, best of 3: 2.31 s per loop

This is slow (but appendable and in a queryable format)

In [12]: %timeit df.to_hdf('foo.h5','df',table=True)
1 loops, best of 3: 25.2 s per loop

Sizes are comparable

[goat-jreback-~/pandas] ls -ltr foo.h*
-rw-rw-r-- 1 jreback users 249344000 May  5 20:43 foo.h5
-rw-rw-r-- 1 jreback users 240007312 May  5 20:44 foo.h5.no_table

@jreback (Contributor, Author) commented May 6, 2013

The HDF5 reads are blazing fast......

In [7]: %timeit pd.read_hdf('foo.h5','df')
1 loops, best of 3: 302 ms per loop

In [8]: %timeit pd.read_hdf('foo.h5.no_table','df')
10 loops, best of 3: 90 ms per loop

@jreback (Contributor, Author) commented May 9, 2013

I think an additional feature we can add here is to support pickled objects as a savable type (right now this will
raise because it doesn't know how to pack them)

that way, if something doesn't exist natively you could still save it (e.g. say you want to save an XLS file or something)... shouldn't be too hard to do
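
A rough sketch of such a fallback using msgpack's default/object_hook hooks (not this PR's code; the 'pickled' marker key and the Widget class are made up for illustration):

import pickle
import msgpack

class Widget(object):
    def __init__(self, value):
        self.value = value

def encode(obj):
    # anything we don't know how to pack natively falls back to pickle
    return {'pickled': True, 'data': pickle.dumps(obj)}

def decode(obj):
    if obj.get('pickled'):
        return pickle.loads(obj['data'])
    return obj

buf = msgpack.packb({'thing': Widget(42)}, default=encode)
out = msgpack.unpackb(buf, object_hook=decode)
print(out['thing'].value)  # 42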

@jreback (Contributor, Author) commented Jun 10, 2013

closing in favor of #3831
