
ValueError: EXT data is too large #12905

Closed
randomgambit opened this issue Apr 15, 2016 · 22 comments
Labels
Compat (pandas objects compatibility with Numpy or Python functions), Error Reporting (Incorrect or improved errors from pandas)

Comments

@randomgambit

Hi guys,

I am happy to help you improve msgpack.
I tried to export my massive dataframe this morning using msgpack, and I got this error:

ValueError: EXT data is too large

What does that mean? Is there a size limit?

@jreback
Contributor

jreback commented Apr 16, 2016

@randomgambit can you post something more?

cc @kawochen

@jreback jreback added Msgpack Compat pandas objects compatibility with Numpy or Python functions labels Apr 16, 2016
@randomgambit
Author

Hi Jeff, I don't have my computer in front of me, but it's the exact same dataframe as in my post on the slowness of to_csv.

@randomgambit
Author

tell me what information you need

@jreback
Contributor

jreback commented Apr 16, 2016

Ahh ok.

Can you reference that issue here as well then (and post the df.info())?

It may break if you have an object column that actually has object types in it (and not strings).

@randomgambit
Author

randomgambit commented Apr 16, 2016

hello @jreback @kawochen

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10762587 entries, 0 to 12864511
Columns: 275 entries, bagent_name to index_month
dtypes: bool(1), datetime64[ns](16), float64(30), int32(1), int64(172), object(53), timedelta64[ns](2)
memory usage: 22.0+ GB

Other problems related to this df are discussed in #12885.

Hope that helps.

@jreback
Contributor

jreback commented Apr 17, 2016

Does this work with a smaller slice of your frame?

Do you have any non-string object data? In other words, run something like:

In [1]: df = tm.makeMixedDataFrame()

In [2]: df
Out[2]: 
     A    B     C          D
0  0.0  0.0  foo1 2009-01-01
1  1.0  1.0  foo2 2009-01-02
2  2.0  0.0  foo3 2009-01-05
3  3.0  1.0  foo4 2009-01-06
4  4.0  0.0  foo5 2009-01-07

In [3]: df.apply(pd.lib.infer_dtype)
Out[3]: 
A    floating
B    floating
C      string
D    datetime
dtype: object

If you get anything like mixed-..., then you need to stringify (or coerce) before this would be expected to work.

Storing giant opaque frames like this is generally not very useful, as it forces you to load them entirely into memory to work with them.
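
A rough way to apply that check across a wide frame, as a sketch only (it assumes the same pandas-0.18-era pd.lib.infer_dtype shown above; the helper name is made up):

import pandas as pd

def mixed_object_columns(df):
    # infer_dtype returns labels like 'string', 'mixed', 'mixed-integer', ...
    # only object columns can hide non-string Python objects
    inferred = df.select_dtypes(include=['object']).apply(pd.lib.infer_dtype)
    return inferred[inferred.str.startswith('mixed')].index.tolist()

# columns returned here are candidates for df[col] = df[col].astype(str)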

@randomgambit
Author

randomgambit commented Apr 18, 2016

Hi Jeff aka @jreback

Thanks! A couple of points:

  • Most of my work consists of doing regression analysis on large data samples. That means it's actually a necessity for me to load the whole sample into memory (and to have a quick-and-dirty way to save and load my data at any step of the processing).
  • I used your df.apply(pd.lib.infer_dtype) and checked for mixed types. BTW, THAT is a great function that I did not know about. I recommend you put it in the tutorial (under "do I really know my data types?" ;-)
  • Once I ran to_msgpack it seemed to work, going from 0 to 2 GB on my disk after 30 minutes of processing. Then it stayed there for a long time, so I killed it. I don't know whether msgpack is supposed to write continuously (so hitting F5 like crazy actually shows the .msg file growing) or is doing some sort of chunk-by-chunk exporting. I'll try ASAP with a smaller sample (see the sketch below).
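
A quick way to run that smaller-sample test, as a sketch (the slice size and file name are made up; DataFrame.to_msgpack is the pandas-0.18-era API used in this thread):

small = df.iloc[:100000]              # first 100k of the 10.7M rows
small.to_msgpack('sample.msg')        # well under the size that fails
pd.read_msgpack('sample.msg').info()  # round-trip to confirm the slice survives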

@randomgambit
Author

@jreback is it possible to write to msgpack by iterating over the dataframe? I suspect trying to export the whole dataframe at once is too heavy a task.

@jreback
Contributor

jreback commented Apr 19, 2016

It's possible but not efficient.
I think this is because of an unknown datatype.

Please see if you can narrow down exactly when this occurs (e.g. remove columns until it works).
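
A rough elimination loop along those lines, as a sketch (the scratch file name is made up, and single-column writes on a frame this size will still take a while):

bad_cols = []
for col in df.columns:
    try:
        df[[col]].to_msgpack('one_col.msg')   # overwrite the same scratch file each time
    except (ValueError, TypeError):
        bad_cols.append(col)                  # columns that fail to serialize on their own
print(bad_cols)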

@kawochen
Contributor

This line. It's not clear to me why it's testing for (2**32)-1. I think we can try (size_t)-1 instead.

@jreback
Contributor

jreback commented Apr 22, 2016

I wonder if msgpack uses a single byte for size, which is limited.

@kawochen
Contributor

xref msgpack/msgpack-python#181

@kawochen
Contributor

In the C code size_t is used.

@randomgambit
Author

Hi @kawochen and @jreback. Sorry, I was really busy recently. Do you want me to try something on my side?

@jreback
Contributor

jreback commented Apr 22, 2016

So the bigger issue here is whether we should actually allow bigger than 2**32-1 (4 GB) total bytes in a single write. The short answer is no. The long answer is to do multiple writes. Most chunked systems don't even let you do this kind of size in a single write; though if they do let you do it, they will chunk-write it (and reassemble before handing it back to you).

All that said, msgpack is not a chunked system. So we could simply document this limitation and have the user do whatever chunking is necessary. A user could do this. Of course this could actually be an interface (maybe inside pandas), but it could live as a separate library.

In [8]: N = int(1e6)

In [9]: df = DataFrame({'A' : np.random.randint(0,10,size=N), 'B' : np.random.randn(N)})

In [10]: chunks = 10
In [12]: pd.to_msgpack('test.pak', { 'chunk_{0}'.format(i):chunk for i, chunk in enumerate(np.split(df, chunks)) })

In [13]: pd.read_msgpack('test.pak').keys()
Out[13]: 
['chunk_1',
 'chunk_0',
 'chunk_3',
 'chunk_2',
 'chunk_5',
 'chunk_4',
 'chunk_7',
 'chunk_6',
 'chunk_9',
 'chunk_8']

In [15]: pd.read_msgpack('test.pak').values()[0].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 100000 to 199999
Data columns (total 2 columns):
A    100000 non-null int64
B    100000 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.5 MB
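
Reassembling that chunked store back into a single frame is then straightforward; a sketch based on the example above:

parts = pd.read_msgpack('test.pak')    # dict of 'chunk_i' -> DataFrame
order = sorted(parts, key=lambda k: int(k.split('_')[1]))
df2 = pd.concat([parts[k] for k in order])   # np.split preserved the index, so the order comes back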

@jreback
Contributor

jreback commented Apr 22, 2016

cc @llllllllll do you do anything like this?

@randomgambit
Author

So my understanding is that msgpack is not ready for writing such large files in one go, correct? This memory thing you are mentioning is responsible for not writing the whole dataframe, I guess?

@jreback
Contributor

jreback commented Apr 22, 2016

No, it won't work; maybe we could change it, but there are good reasons not to. You are much better off chunking when storing. Large opaque stores are not good for lots of reasons. As I said, maybe we could support a chunking layer on top, but I am a bit reluctant to create yet another thing that is somewhat endemic to pandas (and not a 'standard').
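
For what it's worth, such a layer need not be much code. A minimal sketch (the function names are hypothetical, not a pandas API, and np.array_split is used so uneven row counts still split cleanly):

import numpy as np
import pandas as pd

def to_msgpack_chunked(path, df, n_chunks=10):
    # split on write so each chunk stays under the 2**32-1 byte EXT limit
    # (assuming n_chunks is chosen large enough for the frame)
    pd.to_msgpack(path, {'chunk_{0}'.format(i): part
                         for i, part in enumerate(np.array_split(df, n_chunks))})

def read_msgpack_chunked(path):
    # read the dict of chunks back and concatenate in chunk order
    parts = pd.read_msgpack(path)
    keys = sorted(parts, key=lambda k: int(k.split('_')[1]))
    return pd.concat([parts[k] for k in keys])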

@randomgambit
Author

You say large opaque store, but I don't have mixed types anymore! I cleaned everything and stringified and astypified all the rogue columns (yes, these are new words).

@jreback
Contributor

jreback commented Apr 22, 2016

@randomgambit opaque as in a binary blob. It is not indexable. You can retrieve the entire blob or not.

As your data gets bigger, being able to retrieve only part of it becomes a more desirable property.

@llllllllll
Contributor

@jreback We haven't needed to send anything larger than an int32 could hold so we are not chunking it up. I plan to do chunking on the blaze server for other reasons though.

@jreback jreback added Error Reporting Incorrect or improved errors from pandas Difficulty Intermediate labels Oct 11, 2016
@jreback jreback added this to the Next Major Release milestone Oct 11, 2016
@simonjayhawkins
Member

msgpack is deprecated #30112
