
ENH: support for msgpack serialization/deserialization #3525

Closed

Conversation

@jreback (Contributor) commented May 3, 2013

warning: prototype!

msgpack serialization/deserialization

  • support all pandas objects: Timestamp, Period, all index types, Series, DataFrame, Panel, the Sparse suite
  • docs included (in io.rst)
  • iterator support
  • top-level api support

Here are a few aspects of msgpack that I still have to look into further:

  • no support for compression directly, but you can compress the file yourself (e.g. with gzip; see the sketch just after this list)
  • access is sequential
  • versioning is not that hard, because it's pretty easy to deal with a change in the schema (which is not directly stored), and this is MUCH more transparent than pickle
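
A minimal sketch of the gzip-on-top idea from the first bullet, using the to_msgpack/read_msgpack API from this PR (the file names are just placeholders):

import gzip
import shutil
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 10))

# write the msgpack file, then compress the whole thing with gzip
pd.to_msgpack('foo.msg', df)
with open('foo.msg', 'rb') as src, gzip.open('foo.msg.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)

# decompress back to a plain msgpack file before reading
with gzip.open('foo.msg.gz', 'rb') as src, open('foo_roundtrip.msg', 'wb') as dst:
    shutil.copyfileobj(src, dst)
result = pd.read_msgpack('foo_roundtrip.msg')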

usage is exactly like pickle (aside from being in a different namespace), allowing
arbitrary combinations of storage: pandas objects are supported natively, but you can
obviously also store things like { 'frame1' : df1, 'frame2' : df2 }, etc.
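
For example, a short sketch of storing several frames at once with the top-level API from this PR:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
df2 = pd.DataFrame(np.random.randn(5, 3), columns=list('XYZ'))

# a plain dict of pandas objects goes into a single msgpack file
pd.to_msgpack('frames.msg', {'frame1': df1, 'frame2': df2})

frames = pd.read_msgpack('frames.msg')
print(frames['frame1'])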

storage example: DataFrame(np.random.rand(1000,10)) stores as a 128k file on my machine,
and it scales pretty well, e.g. 10k rows is 1.26mb

Not completely happy with the packers name; any suggestions?

closes #686

In [1]: df = DataFrame(randn(10,2),
   ...:                      columns=list('AB'),
   ...:                      index=date_range('20130101',periods=10))

In [2]: pd.to_msgpack('foo.msg',df)

In [3]: pd.read_msgpack('foo.msg')
Out[3]: 
                   A         B
2013-01-01  0.676700 -1.702599
2013-01-02 -0.070164 -1.368716
2013-01-03 -0.877145 -1.427964
2013-01-04 -0.295715 -0.176954
2013-01-05  0.566986  0.588918
2013-01-06 -0.307070  1.541773
2013-01-07  1.302388  0.689701
2013-01-08  0.165292  0.273496
2013-01-09 -3.492113 -1.178075
2013-01-10 -1.069521  0.848614

@jreback (Contributor, Author) commented May 3, 2013

@y-p can we add msgpack to the travis build (the full one?)

@ghost commented May 3, 2013

wait a minute, why aren't you using scripts/use_build_cache.py + tox/detox?
It'll make you so much more productive. You may in fact prove that there's a
schwarzschild radius for coding and turn into a black hole.

@jreback (Contributor, Author) commented May 3, 2013

hahah... you do keep pushing on tox.... breaking down and installing tox now.

I just ran tox in my main pandas repo and it looks like it's installing stuff...

@cpcloud (Member) commented May 3, 2013

This would be so great. I keep getting unreproducible pickle failures (SystemError) telling me no exception is set. I would love to be able to serialize without using pytables/pickle/npy.

@jreback (Contributor, Author) commented May 3, 2013

I started on avro, but it seemed harder to do... this is pretty straightforward

@cpcloud (Member) commented May 3, 2013

avro looks kind of insane

@jreback (Contributor, Author) commented May 3, 2013

what about this namespace?

df.to_msgpack(path)
com.to_msgpack(path,dict/list/obj of stuff)
read_msgpack(path)

to keep consistent?

@jreback (Contributor, Author) commented May 3, 2013

next question: I 'borrowed' the numpy impl (though I modified it) from https://github.com/lebedov/msgpack_numpy, and the license looks ok... do we include a reference (in our code)? should I send him an e-mail?

@cpcloud (Member) commented May 3, 2013

That looks fine to me (the namespace). Looks like the impl calls ndarray.tolist which will be horribly slow for large arrays.
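
A rough illustration of the concern (not from the PR): converting to a Python list touches every element one at a time, whereas a raw-bytes dump is a single buffer copy.

import timeit
import numpy as np

arr = np.random.randn(1000000)

# element-by-element conversion to Python objects (what the borrowed impl does)
t_list = timeit.timeit(arr.tolist, number=10)

# single raw buffer copy that a bytes-based encoding could use instead
t_bytes = timeit.timeit(arr.tostring, number=10)

print('tolist  :', t_list)
print('tostring:', t_bytes)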

@jreback (Contributor, Author) commented May 3, 2013

Here's why it's a prototype:
slower and bigger than pickle... hmm...

In [11]: df
Out[11]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10000 entries, 2000-01-01 00:00:00 to 2027-05-18 00:00:00
Freq: D
Data columns (total 10 columns):
0    10000  non-null values
1    10000  non-null values
2    10000  non-null values
3    10000  non-null values
4    10000  non-null values
5    10000  non-null values
6    10000  non-null values
7    10000  non-null values
8    10000  non-null values
9    10000  non-null values
dtypes: float64(10)

In [12]: %timeit packers.save('foo.msg',df)
10 loops, best of 3: 59.1 ms per loop

In [13]: %timeit df.save('foo.pickle')
100 loops, best of 3: 11 ms per loop

-rw-rw-r--  1 jreback users  880622 May  3 16:21 foo.pickle
-rw-rw-r--  1 jreback users 1300971 May  3 16:21 foo.msg

@cpcloud (Member) commented May 3, 2013

Hm. I love coding in Cython and it looks like there's a C interface, so I could hack an interface for you if you want. Not sure if it's worth it though. Maybe for row counts on the order of 1e6 we could dispatch to Cython...

@ghost commented May 3, 2013

well, save/load suggest a privileged status, but we'd like to nudge users
towards the new format when this becomes fully operational, so not sure about the naming.

Also, need to tread very lightly here, data serialization bugs are a big deal, we can't just
play cowboy with people's data. Maybe add an option to embed a version of the pickle data
in the msgpack file for a major version or two? that way people can recover from errors
at the cost of increased storage space.

Or maybe add a "safe=True" option that reads back the data and performs a comparison,
to warn the user if something goes wrong?
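
A minimal sketch of what that "safe" round trip could look like, written here as a hypothetical wrapper (to_msgpack_safe is not part of this PR; it assumes the PR's top-level API and the era's pandas.util.testing helpers):

import pandas as pd
from pandas.util.testing import assert_frame_equal

def to_msgpack_safe(path, df):
    # write, immediately read back, and compare against the original frame
    pd.to_msgpack(path, df)
    roundtripped = pd.read_msgpack(path)
    assert_frame_equal(df, roundtripped)  # raises if anything was lost or corrupted

At the cost of one extra read, this would surface serialization bugs while the original data is still in memory.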

@jreback (Contributor, Author) commented May 3, 2013

so out-of-the-box msgpack IS faster... it's my datetimes that are screwing me; pickle is probably
a lot smarter. I was just doing isoformat()... what's the fastest way of serializing datetimes?

a tuple of values plus a tz string?
some ordinal trick?
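
One obvious candidate (a sketch, not necessarily what ends up in the PR) is to ship the index's underlying int64 nanosecond values plus the tz/freq metadata, rather than formatting each element:

import pandas as pd

idx = pd.date_range('20000101', periods=25000, freq='H')

# encode: raw int64 nanoseconds since the epoch, no per-element boxing
i8 = idx.asi8
meta = {'freq': 'H', 'tz': None}

# decode: rebuild the DatetimeIndex straight from the int64 values
rebuilt = pd.DatetimeIndex(i8, freq=meta['freq'], tz=meta['tz'])
assert rebuilt.equals(idx)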

string index

In [1]: index = [rands(10) for _ in xrange(25000)]

In [2]: df = DataFrame({'float1' : randn(25000),
   ...:                 'float2' : randn(25000)},
   ...:                index=index)

In [3]: %timeit df.to_msgpack('foo.msg')
100 loops, best of 3: 14.2 ms per loop

In [4]: %timeit df.save('foo.pickle')
10 loops, best of 3: 20.1 ms per loop

datetimeindex

In [9]: %timeit df.to_msgpack('foo.msg')
10 loops, best of 3: 120 ms per loop

In [10]: %timeit df.save('foo.pickle')
100 loops, best of 3: 8.2 ms per loop

@ghost commented May 3, 2013

re licensing: IANAL, but the license is 2-clause BSD, so that's compatible with pandas.
If you're using practically the entire thing, perhaps just make it a dependency; if only parts,
try to put them in a separate file, with the licensing banner at the top.

In any case, a copy of LICENSE goes in LICENSES/, and as a courtesy drop the author an
email to let him know his code is useful and that it's being integrated into a fairly
high-profile library, as a form of thanks.

@jreback (Contributor, Author) commented May 3, 2013

Better....had to avoid all of the boxing/unboxing in the index......

In [13]: index=date_range('20000101',periods=25000,freq='H')

In [14]: df = DataFrame({'float1' : randn(25000),
                'float2' : randn(25000)},
               index=index)

In [15]: %timeit df.save('foo.pickle')
100 loops, best of 3: 9.38 ms per loop

In [16]: %timeit pd.to_msgpack('foo.msg',df)
100 loops, best of 3: 13 ms per loop

@jreback (Contributor, Author) commented May 3, 2013

hmmm... ok... I already integrated the code (I had to change it a bit), so I'll put in the license file and the banner at the top... thanks

@jreback (Contributor, Author) commented May 3, 2013

@y-p how do I get an e-mail address for someone on github?

@wesm (Member) commented May 3, 2013

clone a repo they've committed to and use git log

@wesm (Member) commented May 3, 2013

This is really awesome by the way. I'm +1000 on having a pickle-independent serialization format. Was waiting for the superhero (you) to do it. I'm gonna try to find some time in the next few days to play with this a bit to look at some of the low level details. It might be worth bringing in the python-msgpack cython code so we have full control over things

@jreback (Contributor, Author) commented May 3, 2013

@wesm definitely worth investigating. I pushed some docs updates; will finish up the rest of the missing types soon.

@ghost commented May 4, 2013

strongly urge 0.12. big change + more time for tire kicking in master.

@jreback (Contributor, Author) commented May 4, 2013

@y-p changed to 0.12

@wesm now supports all pandas types (incl Sparse), docs included, added iterator support, and added the
ability to append to a msgpack file (you can actually do this with pickle too, FYI)... you need to open the file with a+b

still open on whether we can implement some sort of random accessibility...

go for perf improvements!

DOC: install.rst mention

DOC: added license from msgpack_numpy

PERF: changed Timestamp and DatetimeIndex serialization for speedups

      add vb_suite benchmarks

ENH: added to_msgpack method in generic.py, and default import into pandas

TST: all packers to always be imported, fail on usage with no msgpack installed
ENH: provide automatic list if multiple args passed to to_msgpack

DOC: changed docs to 0.12

ENH: iterator support for stream unpacking
@cpcloud (Member) commented May 4, 2013

This is so cool. Can you save arbitrary objects, e.g., a dict of dicts of DataFrames?

@cpcloud (Member) commented May 4, 2013

With 1e7 rows there is sadness. Also, not sure if this is expected but

df = DataFrame(rand(1e7, 16))
pd.to_msgpack('huge.msg', df) # takes a long time :(
!du huge.msg # 1795796000
df.values.nbytes == 1280000000
# difference of ~500 MB?

@jreback (Contributor, Author) commented May 4, 2013

yes

@jreback (Contributor, Author) commented May 4, 2013

I think we need to use the msgpack array type to store the ndarrays,
but I haven't found an example of its use

@cpcloud (Member) commented May 4, 2013

Are the extra bytes from converting to a list?

@jreback (Contributor, Author) commented May 4, 2013

not sure (ndarrays are stored as 1-dim lists), so it could be

@cpcloud (Member) commented May 4, 2013

I'm not sure that's strictly true. Conceptually yes, but lists are bulkier because they can be heterogeneously typed, whereas ndarrays are almost always homogeneously typed. Anyway, I'm not trying to be pedantic here, just trying to get to the bottom of efficiently storing arrays.

@jreback (Contributor, Author) commented May 4, 2013

going to try with an array type; it's clear that packing lists element by element is probably inefficient both in storage size (as they need markers between elements) and in speed, since everything goes one by one...

@cpcloud (Member) commented May 4, 2013

Where do you see arrays in the Python impl? I only see it in the C impl.

@jreback (Contributor, Author) commented May 4, 2013

pack_array_header is a def, so I should be able to call it from python.....
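
A rough sketch of calling it from Python (this assumes msgpack-python's Packer with its default autoreset behaviour, where each call returns the packed bytes):

import msgpack
import numpy as np

arr = np.arange(5, dtype=np.int64)
packer = msgpack.Packer()

# emit an array header for len(arr) elements, then pack each element in turn
buf = packer.pack_array_header(len(arr))
for x in arr:
    buf += packer.pack(int(x))

print(msgpack.unpackb(buf))  # -> [0, 1, 2, 3, 4]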

@cpcloud (Member) commented May 4, 2013

Ah, yes. silly me.

@jreback (Contributor, Author) commented May 4, 2013

I think we definitely need to drop into the Cython code to handle ndarrays. The issue is that they are converted to lists (which are then packed via arrays, I think). However, we can skip this by just defining a numpy type in the packer
and then writing it directly.
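
A sketch of that idea using the pure-Python hooks (the Cython version would do the same thing without the Python-level round trip); the 'nd' marker key and the encode/decode helpers are made up for illustration:

import msgpack
import numpy as np

def encode(obj):
    # write an ndarray as raw bytes plus the metadata needed to rebuild it
    if isinstance(obj, np.ndarray):
        return {'nd': True,
                'dtype': obj.dtype.str,
                'shape': obj.shape,
                'data': obj.tostring()}
    return obj

def decode(obj):
    if obj.get('nd'):
        return np.frombuffer(obj['data'], dtype=obj['dtype']).reshape(obj['shape'])
    return obj

arr = np.random.randn(3, 4)
buf = msgpack.packb({'arr': arr}, default=encode)
out = msgpack.unpackb(buf, object_hook=decode)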

@cpcloud (Member) commented May 4, 2013

Yep I was actually thinking the same thing. Probably need to iterate
explicitly with integers as well

@cpcloud (Member) commented May 5, 2013

Might need to be careful here. The C code uses the somewhat classic misinterpretation of unions, namely that

union X {
    double x;
    int64_t y;
};

union X u;
u.x = 1.0;
printf("%lld", (long long)u.y);

is technically not correct: strictly speaking, you may only read back the union member that was most recently written. See here for the occurrence. Though I'm not sure if something like uint64_t v = *(int64_t*)&my_double; is any safer. I'm pretty sure that numpy does it the same way as the latter, since it calls PyArray_NewFromDescr, which takes a raw char* to the data that is then interpreted as the new type.

@jreback (Contributor, Author) commented May 6, 2013

so I put up a commit, 5a02cdf, which implements compression on a testing basis, using zlib (included in the python dist) and blosc (pip installable), specified with the compress keyword to to_msgpack. Unfortunately, as I cannot pass the compression variable very easily, it is a bit of a giant hack, just to see if this makes any sense.

I am compressing just the non-object numpy arrays at the very lowest level (via .tostring, then the compression);
see this: https://github.com/FrancescAlted/python-blosc/wiki/Quick-User's-Guide

This writes a byte string, which I think msgpack handles directly in the cython code...
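
Roughly, the values block gets compressed as a byte string before packing; a sketch (variable names are made up, and it assumes python-blosc's compress/decompress on raw bytes):

import zlib
import blosc  # pip installable
import numpy as np

arr = np.random.randn(1000000)
raw = arr.tostring()                       # raw float64 bytes for the values block

z_bytes = zlib.compress(raw)               # stdlib, always available, slower
b_bytes = blosc.compress(raw, typesize=8)  # blosc, geared towards numeric data

# round trip: decompress and view the bytes as float64 again
back = np.frombuffer(blosc.decompress(b_bytes), dtype=np.float64)
assert (back == arr).all()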

This is the 10m rows dataset.

In [1]: df = DataFrame({'float1' : randn(10000000),'float2' : randn(10000000)},index=date_range('20000101',periods=10000000,freq='s'))

In [2]: %timeit df.save('foo.pickle')
1 loops, best of 3: 2.45 s per loop

In [3]: %timeit df.to_msgpack('foo.msg.no_compress')
1 loops, best of 3: 4.24 s per loop

In [4]: %timeit df.to_msgpack('foo.msg.zlib',compress='zlib')
1 loops, best of 3: 9.8 s per loop

In [6]: %timeit df.to_msgpack('foo.msg.zlib',compress='zlib')
1 loops, best of 3: 9.68 s per loop

In [7]: %timeit df.to_msgpack('foo.msg.blosc',compress='blosc')
1 loops, best of 3: 3.21 s per loop

In [8]: %timeit df.load('foo.pickle')
1 loops, best of 3: 281 ms per loop

In [9]: %timeit pd.read_msgpack('foo.msg.zlib')
1 loops, best of 3: 2.31 s per loop

In [10]: %timeit pd.read_msgpack('foo.msg.blosc')
1 loops, best of 3: 1.61 s per loop

In [11]: %timeit pd.read_msgpack('foo.msg.no_compress')
1 loops, best of 3: 3.36 s per loop

In [12]: quit()

[goat-jreback-~/pandas] ls -ltr foo.*
-rw-rw-r-- 1 jreback users 240000602 May  5 20:20 foo.pickle
-rw-rw-r-- 1 jreback users 270000301 May  5 20:21 foo.msg.no_compress
-rw-rw-r-- 1 jreback users 243708088 May  5 20:23 foo.msg.zlib
-rw-rw-r-- 1 jreback users 244501486 May  5 20:23 foo.msg.blosc
So pickle is still the king, though blosc is pretty close.

I had to turn off the encoding (not even sure it should be set anyhow.....), and the unicode tests fail now.

@jreback (Contributor, Author) commented May 6, 2013

profile of pickle vs msgpack using blosc for compression

In [3]: %prun df.save('foo.pickle')
         18 function calls in 2.382 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.171    2.171    2.171    2.171 {method 'close' of 'file' objects}
        1    0.159    0.159    0.191    0.191 {cPickle.dump}
        1    0.032    0.032    0.032    0.032 index.py:519(__reduce__)
        1    0.020    0.020    0.020    0.020 {open}
        2    0.000    0.000    0.000    0.000 index.py:330(__reduce__)

In [2]: %prun df.to_msgpack('foo.msg.blosc',compress='blosc')
         76 function calls in 3.274 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.230    2.230    2.230    2.230 {method 'close' of 'file' objects}
        1    0.509    0.509    0.871    0.871 {method 'pack' of 'msgpack._packer.Packer' objects}
        1    0.156    0.156    0.156    0.156 {method 'tolist' of 'numpy.ndarray' objects}
        1    0.135    0.135    0.135    0.135 {blosc.blosc_extension.compress}
        1    0.090    0.090    0.090    0.090 {method 'write' of 'file' objects}
        1    0.065    0.065    0.065    0.065 {method 'tostring' of 'numpy.ndarray' objects}
        1    0.062    0.062    0.062    0.062 {open}
        1    0.011    0.011    3.274    3.274 packers.py:85(to_msgpack)

@jreback (Contributor, Author) commented May 6, 2013

for comparison to writing in HDF5 (no compression)

This is pretty comparable (format wise)

In [11]: %timeit df.to_hdf('foo.h5.no_table','df')
1 loops, best of 3: 2.31 s per loop

This is slow (but appendable and in a queryable format)

In [12]: %timeit df.to_hdf('foo.h5','df',table=True)
1 loops, best of 3: 25.2 s per loop

Sizes are comparable

[goat-jreback-~/pandas] ls -ltr foo.h*
-rw-rw-r-- 1 jreback users 249344000 May  5 20:43 foo.h5
-rw-rw-r-- 1 jreback users 240007312 May  5 20:44 foo.h5.no_table

@jreback (Contributor, Author) commented May 6, 2013

The HDF5 reads are blazing fast......

In [7]: %timeit pd.read_hdf('foo.h5','df')
1 loops, best of 3: 302 ms per loop

In [8]: %timeit pd.read_hdf('foo.h5.no_table','df')
10 loops, best of 3: 90 ms per loop

@jreback (Contributor, Author) commented May 9, 2013

I think an additional feature we can add here is to support pickled objects as a savable type (right now this will
raise because it doesn't know how to pack them)

that way, if something doesn't exist natively you could still save it (e.g. say you want to save an XLS file or something)... shouldn't be too hard to do
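
A rough sketch of such a fallback using msgpack's default/object_hook hooks (not this PR's code; the 'pickled' marker key and the Widget class are made up for illustration):

import pickle
import msgpack

class Widget(object):
    def __init__(self, value):
        self.value = value

def encode(obj):
    # anything we don't know how to pack natively falls back to pickle
    return {'pickled': True, 'data': pickle.dumps(obj)}

def decode(obj):
    if obj.get('pickled'):
        return pickle.loads(obj['data'])
    return obj

buf = msgpack.packb({'thing': Widget(42)}, default=encode)
out = msgpack.unpackb(buf, object_hook=decode)
print(out['thing'].value)  # 42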

@jreback (Contributor, Author) commented Jun 10, 2013

closing in favor of #3831
