read_msgpack returns a List of length Zero #13362

Closed
javierorozco opened this issue Jun 4, 2016 · 6 comments

@javierorozco

I have a very large DataFrame that takes 10.7 GB when serialized with `to_csv`, but only 6.7 GB when using `to_msgpack`. When reading the serialized file back with `read_msgpack`, I get a list of length zero instead of a DataFrame.

The same IO process on a 200 MB DataFrame works perfectly, i.e. a DataFrame is returned by `read_msgpack`.

Code Sample, a copy-pastable example if possible

type(df) -> pd.DataFrame
len(df) -> 39227674
df.to_msgpack('foo.msg')
df2 = pd.read_msgpack('foo.msg')
type(df2) -> list
len(df2) -> 0

Expected Output

type(df2) -> pd.DataFrame
len(df2) -> 39227674
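
Since the real data can't be shared, a self-contained sketch along these lines should reproduce the setup. The synthetic data, column names, and row count are illustrative only, and `to_msgpack`/`read_msgpack` were later deprecated and removed from pandas, so this only applies to old versions such as the 0.18.1 shown below:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the real frame: ~39M rows of mixed int/float/object
# columns (building this needs several GB of RAM, like the original data).
n = 39227674
df = pd.DataFrame({
    'FrameNo': np.arange(n, dtype='int64'),
    'Pitch': np.random.rand(n),
    'Attention': np.where(np.random.rand(n) > 0.5, 'yes', 'no'),
})

df.to_msgpack('foo.msg')            # completes without raising
df2 = pd.read_msgpack('foo.msg')    # expected a DataFrame back

print(type(df2))  # reported: a list instead of a DataFrame
print(len(df2))   # reported: 0
```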

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-49-virtual
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.1.2
pip: 8.1.2
setuptools: 1.4
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.0.0
sphinx: None
patsy: None

@jreback
Contributor

jreback commented Jun 4, 2016

pls show df.info(). did you serialize in 0.18.1 as well? what error do you get, exactly, and what code are you using, exactly?

@jreback
Contributor

jreback commented Jun 4, 2016

is this the same as #12905 ?

@javierorozco
Author

Hi @jreback,

I encoded / decoded with 0.18.1

This is the output of df.info()

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39227674 entries, 0 to 39227673
Data columns (total 21 columns):
SourceMediaID int64
SessionID int64
AlgorithmID int64
FrameNo int64
AbsoluteStartTime int64
AbsoluteEndTime int64
RelativeStartTime int64
RelativeEndTime int64
RelativeStartSecond int64
Pitch float64
Roll float64
Yaw float64
Happy float64
Surprised float64
Angry float64
Sad float64
Disgusted float64
Scared float64
Attention object
Approach object
Duration int64
dtypes: float64(9), int64(10), object(2)
memory usage: 6.1+ GB

This is my code:

```
df.to_msgpack('foo.msg')
df2 = pd.read_msgpack('foo.msg')

type(df)
Out[5]: pandas.core.frame.DataFrame

type(df2)
Out[6]: list

len(df2)
Out[7]: 0
```

I don't get errors from reading or writing, I just get a list instead of the DataFrame I saved.
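
Since nothing raises on either side, a cheap guard right after writing would have caught this. Just a sketch, with a tiny stand-in frame in place of the real 39M-row one:

```python
import pandas as pd

df = pd.DataFrame({'a': range(10)})  # stand-in; the real frame is the large one above

df.to_msgpack('foo.msg')
roundtrip = pd.read_msgpack('foo.msg')

# Both the write and the read succeed silently, so verify the round trip
# before trusting the file (or deleting the source data).
assert isinstance(roundtrip, pd.DataFrame), type(roundtrip)
assert len(roundtrip) == len(df), (len(roundtrip), len(df))
```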

@jreback
Contributor

jreback commented Jun 4, 2016

no idea - it might be too big to save in this format

@jreback
Contributor

jreback commented Jun 4, 2016

you'll have to experiment and show a reproducible example.

@javierorozco
Author

I agree with your analysis in #12905: the data frame written at once cannot be larger than 4 GB. Therefore, I applied your suggestion of slicing the DataFrame into a dictionary of chunks.

In[4]: medias = [df[df.ID == x] for x in df.ID.unique()]
In[5]: pd.to_msgpack('test.pak', { 'chunk_{0}'.format(i): chunk for i, chunk in enumerate(medias)})
In[6]: df = pd.concat(pd.read_msgpack('test.pak').values(), axis=0)
In[7]: df.shape
Out[7]: (39227674, 21)

Then, my 10.7 GB CSV file was converted into a 7.0 GB msgpack, with an 86% reduction in loading time. I also find this solution advantageous for slicing the data for partial reads.
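
For anyone else hitting this, here is the same workaround sketched with fixed-size row chunks instead of an ID column. The chunk size and the stand-in frame are arbitrary choices; the only real constraint is keeping each chunk comfortably under 4 GB:

```python
import numpy as np
import pandas as pd

# Stand-in for the large frame discussed above (illustrative only).
df = pd.DataFrame({'ID': np.repeat(np.arange(100), 50000),
                   'x': np.random.rand(5000000)})

CHUNK_ROWS = 2000000  # illustrative; pick so each serialized chunk stays well under 4 GB

# Write: slice the frame by rows and store the pieces as a dict of named chunks.
chunks = {
    'chunk_{0}'.format(i): df.iloc[start:start + CHUNK_ROWS]
    for i, start in enumerate(range(0, len(df), CHUNK_ROWS))
}
pd.to_msgpack('test.pak', chunks)

# Read: reassemble in chunk order (dict keys are not guaranteed to come back
# in insertion order, so sort by the numeric suffix).
parts = pd.read_msgpack('test.pak')
df2 = pd.concat(
    [parts[k] for k in sorted(parts, key=lambda k: int(k.split('_')[1]))],
    axis=0,
)
assert df2.shape == df.shape
```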

Thanks
