read_msgpack returns a List of length Zero #13362

Closed
javierorozco opened this issue Jun 4, 2016 · 6 comments

@javierorozco

I have a very large DataFrame that takes 10.7 GB when serialized with `to_csv`, but only 6.7 GB when using `to_msgpack`. When reading the serialized file back with `read_msgpack`, I get a list of length zero instead of a DataFrame.

The same IO process on a 200 MB DataFrame works perfectly, i.e. a DataFrame is returned by `read_msgpack`.

Code Sample, a copy-pastable example if possible

type(df) -> pd.DataFrame
len(df) -> 39227674
df.to_msgpack('foo.msg')
df2 = pd.read_msgpack('foo.msg')
type(df2) -> list
len(df2) -> 0

Expected Output

type(df2) -> pd.DataFrame
len(df2) -> 39227674
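
Since the real data can't be shared, a self-contained sketch along these lines should reproduce the setup. The synthetic data, column names, and row count are illustrative only, and `to_msgpack`/`read_msgpack` were later deprecated and removed from pandas, so this only applies to old versions such as the 0.18.1 shown below:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the real frame: ~39M rows of mixed int/float/object
# columns (building this needs several GB of RAM, like the original data).
n = 39227674
df = pd.DataFrame({
    'FrameNo': np.arange(n, dtype='int64'),
    'Pitch': np.random.rand(n),
    'Attention': np.where(np.random.rand(n) > 0.5, 'yes', 'no'),
})

df.to_msgpack('foo.msg')            # completes without raising
df2 = pd.read_msgpack('foo.msg')    # expected a DataFrame back

print(type(df2))  # reported: a list instead of a DataFrame
print(len(df2))   # reported: 0
```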

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-49-virtual
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.1.2
pip: 8.1.2
setuptools: 1.4
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.0.0
sphinx: None
patsy: None

@jreback
Contributor

jreback commented Jun 4, 2016

pls show df.info(). did you serialize in 0.18.1 as well? what error do you get, exactly, and what code are you using, exactly?

@jreback
Contributor

jreback commented Jun 4, 2016

is this the same as #12905 ?

@javierorozco
Author

Hi @jreback,

I encoded / decoded with 0.18.1

This is the output of df.info()

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39227674 entries, 0 to 39227673
Data columns (total 21 columns):
SourceMediaID int64
SessionID int64
AlgorithmID int64
FrameNo int64
AbsoluteStartTime int64
AbsoluteEndTime int64
RelativeStartTime int64
RelativeEndTime int64
RelativeStartSecond int64
Pitch float64
Roll float64
Yaw float64
Happy float64
Surprised float64
Angry float64
Sad float64
Disgusted float64
Scared float64
Attention object
Approach object
Duration int64
dtypes: float64(9), int64(10), object(2)
memory usage: 6.1+ GB

This is my code:

```
df.to_msgpack('foo.msg')
df2 = pd.read_msgpack('foo.msg')

type(df)
Out[5]: pandas.core.frame.DataFrame

type(df2)
Out[6]: list

len(df2)
Out[7]: 0
```

I don't get errors from reading or writing, I just get a list instead of the DataFrame I saved.
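
Since nothing raises on either side, a cheap guard right after writing would have caught this. Just a sketch, with a tiny stand-in frame in place of the real 39M-row one:

```python
import pandas as pd

df = pd.DataFrame({'a': range(10)})  # stand-in; the real frame is the large one above

df.to_msgpack('foo.msg')
roundtrip = pd.read_msgpack('foo.msg')

# Both the write and the read succeed silently, so verify the round trip
# before trusting the file (or deleting the source data).
assert isinstance(roundtrip, pd.DataFrame), type(roundtrip)
assert len(roundtrip) == len(df), (len(roundtrip), len(df))
```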

@jreback
Contributor

jreback commented Jun 4, 2016

no idea - it might be too big to save in this format

@jreback
Contributor

jreback commented Jun 4, 2016

you'll have to experiment and show a reproducible example.

@javierorozco
Author

I agree with your analysis in #12905: the data frame written at once cannot be larger than 4 GB. Therefore, I applied your suggestion of slicing the DataFrame into a dictionary of chunks.

In[4]: medias = [df[df.ID == x] for x in df.ID.unique()]
In[5]: pd.to_msgpack('test.pak', { 'chunk_{0}'.format(i): chunk for i, chunk in enumerate(medias)})
In[6]: df = pd.concat(pd.read_msgpack('test.pak').values(), axis=0)
In[7]: df.shape
Out[7]: (39227674, 21)

Then, my 10.7 GB CSV file was converted into a 7.0 GB msgpack, with an 86% reduction in loading time. I also find this solution advantageous for slicing the data for partial reads.
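
For anyone else hitting this, here is the same workaround sketched with fixed-size row chunks instead of an ID column. The chunk size and the stand-in frame are arbitrary choices; the only real constraint is keeping each chunk comfortably under 4 GB:

```python
import numpy as np
import pandas as pd

# Stand-in for the large frame discussed above (illustrative only).
df = pd.DataFrame({'ID': np.repeat(np.arange(100), 50000),
                   'x': np.random.rand(5000000)})

CHUNK_ROWS = 2000000  # illustrative; pick so each serialized chunk stays well under 4 GB

# Write: slice the frame by rows and store the pieces as a dict of named chunks.
chunks = {
    'chunk_{0}'.format(i): df.iloc[start:start + CHUNK_ROWS]
    for i, start in enumerate(range(0, len(df), CHUNK_ROWS))
}
pd.to_msgpack('test.pak', chunks)

# Read: reassemble in chunk order (dict keys are not guaranteed to come back
# in insertion order, so sort by the numeric suffix).
parts = pd.read_msgpack('test.pak')
df2 = pd.concat(
    [parts[k] for k in sorted(parts, key=lambda k: int(k.split('_')[1]))],
    axis=0,
)
assert df2.shape == df.shape
```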

Thanks
