
ValueError: EXT data is too large #12905

Closed
randomgambit opened this issue Apr 15, 2016 · 22 comments
Labels
Compat (pandas objects compatibility with Numpy or Python functions), Error Reporting (Incorrect or improved errors from pandas)

Comments

@randomgambit

Hi guys,

I am happy to help you improve msgpack.
I tried to export my massive dataframe this morning using msgpack, and I got this error:

ValueError: EXT data is too large

What does that mean? Is there a size limit?

@jreback
Contributor

jreback commented Apr 16, 2016

@randomgambit can you post something more?

cc @kawochen

@jreback jreback added Msgpack Compat pandas objects compatibility with Numpy or Python functions labels Apr 16, 2016
@randomgambit
Author

Hi Jeff, I don't have my computer in front of me, but it's the exact same dataframe as in my post on the slowness of to_csv.

@randomgambit
Author

tell me what information you need

@jreback
Contributor

jreback commented Apr 16, 2016

Ahh ok.

Can you reference that issue here as well then (and post the df.info())?

It may break if you have an object column that actually has object types in it (and not strings).

@randomgambit
Author

randomgambit commented Apr 16, 2016

hello @jreback @kawochen

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10762587 entries, 0 to 12864511
Columns: 275 entries, bagent_name to index_month
dtypes: bool(1), datetime64[ns](16), float64(30), int32(1), int64(172), object(53), timedelta64[ns](2)
memory usage: 22.0+ GB

Other problems related to this df are discussed in #12885.

Hope that helps.

@jreback
Contributor

jreback commented Apr 17, 2016

Does this work with a smaller slice of your frame?

Do you have any non-string object data? In other words, run something like:

In [1]: df = tm.makeMixedDataFrame()

In [2]: df
Out[2]: 
     A    B     C          D
0  0.0  0.0  foo1 2009-01-01
1  1.0  1.0  foo2 2009-01-02
2  2.0  0.0  foo3 2009-01-05
3  3.0  1.0  foo4 2009-01-06
4  4.0  0.0  foo5 2009-01-07

In [3]: df.apply(pd.lib.infer_dtype)
Out[3]: 
A    floating
B    floating
C      string
D    datetime
dtype: object

If you get anything like mixed-..., then you need to stringify (or coerce) before this would be expected to work.

Storing giant opaque frames like this is generally not very useful, as it forces you to load them entirely into memory to work with them.
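
A rough way to apply that check across a wide frame, as a sketch only (it assumes the same pandas-0.18-era pd.lib.infer_dtype shown above; the helper name is made up):

import pandas as pd

def mixed_object_columns(df):
    # infer_dtype returns labels like 'string', 'mixed', 'mixed-integer', ...
    # only object columns can hide non-string Python objects
    inferred = df.select_dtypes(include=['object']).apply(pd.lib.infer_dtype)
    return inferred[inferred.str.startswith('mixed')].index.tolist()

# columns returned here are candidates for df[col] = df[col].astype(str)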

@randomgambit
Author

randomgambit commented Apr 18, 2016

Hi Jeff aka @jreback

Thanks! A couple of points:

  • Most of my work consists of doing regression analysis on large data samples. That means it's actually a necessity for me to load the whole sample into memory (and to have a quick-and-dirty way to save and load my data at any step of the processing).
  • I used your df.apply(pd.lib.infer_dtype) and checked for mixed types. BTW, THAT is a great function that I did not know about. I recommend you put it in the tutorial (under "do I really know my data types?" ;-)
  • Once I ran to_msgpack it seemed to work, going from 0 to 2 GB on my disk after 30 minutes of processing. Then it stayed there for a long time, so I killed it. I don't know whether msgpack is supposed to write continuously (so hitting F5 like crazy actually shows the .msg file growing) or is doing some sort of chunk-by-chunk exporting. I'll try ASAP with a smaller sample (see the sketch below).
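
A quick way to run that smaller-sample test, as a sketch (the slice size and file name are made up; DataFrame.to_msgpack is the pandas-0.18-era API used in this thread):

small = df.iloc[:100000]              # first 100k of the 10.7M rows
small.to_msgpack('sample.msg')        # well under the size that fails
pd.read_msgpack('sample.msg').info()  # round-trip to confirm the slice survives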

@randomgambit
Author

@jreback is it possible to write to msgpack by iterating over the dataframe? I suspect trying to export the whole dataframe at once is too heavy a task.

@jreback
Contributor

jreback commented Apr 19, 2016

It's possible but not efficient.
I think this is because of an unknown datatype.

Please see if you can narrow down exactly when this occurs (e.g. remove columns until it works).
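
A rough elimination loop along those lines, as a sketch (the scratch file name is made up, and single-column writes on a frame this size will still take a while):

bad_cols = []
for col in df.columns:
    try:
        df[[col]].to_msgpack('one_col.msg')   # overwrite the same scratch file each time
    except (ValueError, TypeError):
        bad_cols.append(col)                  # columns that fail to serialize on their own
print(bad_cols)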

@kawochen
Contributor

This line. It's not clear to me why it's testing for (2**32)-1. I think we can try (size_t)-1 instead.

@jreback
Contributor

jreback commented Apr 22, 2016

I wonder if msgpack uses a single byte for size, which is limited.

@kawochen
Contributor

xref msgpack/msgpack-python#181

@kawochen
Contributor

In the C code size_t is used.

@randomgambit
Author

Hi @kawochen and @jreback. Sorry, I was really busy recently. Do you want me to try something on my side?

@jreback
Contributor

jreback commented Apr 22, 2016

So the bigger issue here is whether we should actually allow bigger than 2**32-1 (4 GB) total bytes in a single write. The short answer is no. The long answer is to do multiple writes. Most chunked systems don't even let you do this kind of size in a single write; though if they do let you do it, they will chunk-write it (and reassemble before handing it back to you).

All that said, msgpack is not a chunked system. So we could simply document this limitation and have the user do whatever chunking is necessary. A user could do this. Of course this could actually be an interface (maybe inside pandas), but it could live as a separate library.

In [8]: N = int(1e6)

In [9]: df = DataFrame({'A' : np.random.randint(0,10,size=N), 'B' : np.random.randn(N)})

In [10]: chunks = 10
In [12]: pd.to_msgpack('test.pak', { 'chunk_{0}'.format(i):chunk for i, chunk in enumerate(np.split(df, chunks)) })

In [13]: pd.read_msgpack('test.pak').keys()
Out[13]: 
['chunk_1',
 'chunk_0',
 'chunk_3',
 'chunk_2',
 'chunk_5',
 'chunk_4',
 'chunk_7',
 'chunk_6',
 'chunk_9',
 'chunk_8']

In [15]: pd.read_msgpack('test.pak').values()[0].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 100000 to 199999
Data columns (total 2 columns):
A    100000 non-null int64
B    100000 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.5 MB
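
Reassembling that chunked store back into a single frame is then straightforward; a sketch based on the example above:

parts = pd.read_msgpack('test.pak')    # dict of 'chunk_i' -> DataFrame
order = sorted(parts, key=lambda k: int(k.split('_')[1]))
df2 = pd.concat([parts[k] for k in order])   # np.split preserved the index, so the order comes back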

@jreback
Contributor

jreback commented Apr 22, 2016

cc @llllllllll do you do anything like this?

@randomgambit
Author

So my understanding is that msgpack is not ready for writing such large files in one go, correct? This memory thing you are mentioning is responsible for not writing the whole dataframe, I guess?

@jreback
Contributor

jreback commented Apr 22, 2016

No, it won't work; maybe we could change it, but there are good reasons not to. You are much better off chunking when storing. Large opaque stores are not good for lots of reasons. As I said, maybe we could support a chunking layer on top, but I am a bit reluctant to create yet another thing that is somewhat endemic to pandas (and not a 'standard').
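
For what it's worth, such a layer need not be much code. A minimal sketch (the function names are hypothetical, not a pandas API, and np.array_split is used so uneven row counts still split cleanly):

import numpy as np
import pandas as pd

def to_msgpack_chunked(path, df, n_chunks=10):
    # split on write so each chunk stays under the 2**32-1 byte EXT limit
    # (assuming n_chunks is chosen large enough for the frame)
    pd.to_msgpack(path, {'chunk_{0}'.format(i): part
                         for i, part in enumerate(np.array_split(df, n_chunks))})

def read_msgpack_chunked(path):
    # read the dict of chunks back and concatenate in chunk order
    parts = pd.read_msgpack(path)
    keys = sorted(parts, key=lambda k: int(k.split('_')[1]))
    return pd.concat([parts[k] for k in keys])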

@randomgambit
Author

You say large opaque store, but I don't have mixed types anymore! I cleaned everything and stringified and astypified all the rogue columns (yes, these are new words).

@jreback
Contributor

jreback commented Apr 22, 2016

@randomgambit opaque as in a binary blob. It is not indexable. You can retrieve the entire blob or not.

As your data gets bigger, being able to retrieve only part of it becomes a more desirable property.

@llllllllll
Contributor

@jreback We haven't needed to send anything larger than an int32 could hold so we are not chunking it up. I plan to do chunking on the blaze server for other reasons though.

@jreback jreback added Error Reporting Incorrect or improved errors from pandas Difficulty Intermediate labels Oct 11, 2016
@jreback jreback added this to the Next Major Release milestone Oct 11, 2016
@simonjayhawkins
Member

msgpack is deprecated #30112
