
improving the speed of to_csv #12885

Closed
randomgambit opened this issue Apr 13, 2016 · 17 comments
Labels: IO CSV, Performance


@randomgambit

Hello,

I don't know if this is possible, but it would be great to find a way to speed up the to_csv method in pandas.

With my admittedly large dataframe of 20 million observations and 50 variables, it takes literally hours to export the data to a CSV file.

Reading the CSV back into pandas is much faster, though. I wonder what the bottleneck is here and what can be done to improve the write speed.

CSV files are ubiquitous and a great way to share data (without being too nerdy with HDF5 and other subtleties). What do you think?

@jreback
Contributor

jreback commented Apr 13, 2016

duplicate of #3186

Well, using 1/10 of the rows, about 800MB in memory:

In [5]: df = DataFrame(np.random.randn(2000000,50))

In [6]: df.memory_usage().sum()
Out[6]: 800000072

In [7]: df.memory_usage().sum()/1000000
Out[7]: 800

In [8]: %time df.to_csv('test.csv')
CPU times: user 2min 53s, sys: 3.71 s, total: 2min 56s
Wall time: 2min 57s

About 8.5MB/sec in raw throughput, way below I/O speeds, so obviously there is quite some room to improve.
You might be interested in this blog here.

Of course, there is really no reason at all to use CSV unless you are forced to.

In [11]: %time df.to_hdf('test.hdf','df')
CPU times: user 10.5 ms, sys: 588 ms, total: 599 ms
Wall time: 2.22 s

@jreback jreback closed this as completed Apr 13, 2016
@jreback jreback added the Performance and IO CSV labels Apr 13, 2016
@jreback
Contributor

jreback commented Apr 13, 2016

This is also part of the documentation: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations

@randomgambit
Author

Hi @jreback, thanks for your help!

Unfortunately:

  • yes, I am forced
  • I have mixed types in my columns and somehow to_hdf fails

@jreback
Contributor

jreback commented Apr 13, 2016

What do you mean by mixed types?

You should NEVER have mixed types (e.g. Python objects). That is a big, big no-no.

@jreback
Contributor

jreback commented Apr 13, 2016

Show a df.info(); you can pass format='table', which is a somewhat slower but much more flexible format for writing HDF5.
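A minimal sketch of this (the file name and keys here are made up):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

# fixed format: the default; fast, but not appendable or queryable
df.to_hdf('store.h5', 'df_fixed', format='fixed')

# table format: somewhat slower to write, but appendable and queryable
df.to_hdf('store.h5', 'df_table', format='table')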

@randomgambit
Author

I will run the to_hdf line tomorrow and tell you my error message. It may be due to the fact that I have string columns and/or some miscoded observations...

@randomgambit
Author

Yes, thanks, I will run df.info(). Keep in touch. Thanks Jeff.

@randomgambit
Author

Good morning Jeff (@jreback),

Please find my df.info() below:

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10762587 entries, 0 to 12864511
Columns: 275 entries, bagent_name to index_month
dtypes: bool(1), datetime64[ns](16), float64(30), int32(1), int64(172), object(53), timedelta64[ns](2)
memory usage: 22.0+ GB

Do you see anything wrong? I have many string columns here.

@randomgambit
Author

And this is what I get when I try:

df.to_hdf(path + 'main.h5','stored',format='table',mode='w')

TypeError: Cannot serialize the column [zip_dw] because
its data contents are [mixed-integer-float] object dtype

@jreback
Contributor

jreback commented Apr 13, 2016

So you probably have something like this. It's actually pretty hard to construct this; if you are not doing it explicitly, then please show how it's constructed, as this should always be converted to a concrete dtype.
You NEVER want object dtypes, except if they are strings. Yes, you can use them, but they are completely non-performant and generally cannot be serialized. These are by definition Python objects.

In [7]: df = DataFrame({'A' : [1,2,3], 'B' : Series([1,2.0,3],dtype=object)})

In [8]: df
Out[8]: 
   A  B
0  1  1
1  2  2
2  3  3

In [9]: df.dtypes
Out[9]: 
A     int64
B    object
dtype: object

In [10]: df.to_hdf('test.h5','df',format='fixed')
pandas/core/generic.py:1096: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer-float,key->block1_values] [items->['B']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)

In [11]: df.to_hdf('test.h5','df',format='table')
TypeError: Cannot serialize the column [B] because
its data contents are [mixed-integer-float] object dtype

@randomgambit
Author

Hi Jeff, yes, I think you are right. zip_dw is a variable that contains a zip code.
In the data it looks like 95966, but the dtype is object.

The key questions are:

  • is there something in pandas to force an (HDF5-compliant) conversion for these object columns?
  • I actually thought HDF5 could handle strings and fast-search them. Am I wrong?

Thanks again @jreback

@jreback
Contributor

jreback commented Apr 13, 2016

Of course it can handle strings, but it makes them fixed-width. That's not your issue, though: you have actual Python objects (an actual integer and a float object) that are NOT numpy types (that's why the column is object).

YOU are in charge of your dtypes. So you either need to stringify them, or you can leave them as object and use only the fixed format (it will work, but it pickles the data).

@randomgambit you have to understand and be cognizant of your data and types. Pandas provides lots of tools and mostly does not let you shoot yourself in the foot, but it cannot do 'everything', even though it will try very hard to infer things.
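A hedged sketch of those two options, using the zip_dw column from this thread (the DataFrame construction is a hypothetical stand-in for the real data):

import pandas as pd

# hypothetical mixed column: an int, a float and a string, hence object dtype
df = pd.DataFrame({'zip_dw': [95966, 95966.0, 'unknown']})

# option 1: stringify the column, after which the table format serializes fine
df['zip_dw'] = df['zip_dw'].astype(str)
df.to_hdf('main.h5', 'stored', format='table', mode='w')

# option 2: leave it as object and use the fixed format instead
# (works, but pickles the column and emits a PerformanceWarning)
# df.to_hdf('main.h5', 'stored', format='fixed', mode='w')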

@randomgambit
Author

Thanks @jreback, and sorry if I am bothering you with such basic questions. I come from a language (Stata) that is much less flexible, where everything is either a float or a string.

To recap, the problem here is that I have a column that contains some numbers (floats and integers) and some strings, which is why pandas treats it as object. You are saying that this mixed-type column causes performance issues and cannot be stored well in HDF5. To fix it, I should probably use either to_numeric or astype('str') on this column, so that every value in the column has a single dtype.

Is this correct?

@randomgambit
Author

Following our chat, I have bought a book about HDF5 and Python; that will help me understand this storage system better. @jreback, if you can just tell me whether my reasoning above is correct, that would help. Thanks, and keep up the great work with pandas!

@jreback
Contributor

jreback commented Apr 13, 2016

So you would typically do something like this. Pandas supports many dtypes; you want to type things as much as possible. All of these types are supported when serialized to HDF5 (table format), though, for example, strings become fixed-width and NaNs are replaced with a string.
In [31]: s = Series(['a',1,2.0])

In [32]: s
Out[32]:
0      a
1      1
2      2
dtype: object

In [33]: pd.to_numeric(s, errors='coerce')
Out[33]:
0    NaN
1    1.0
2    2.0
dtype: float64
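Applied back to the zip_dw column from earlier, that would look something like the line below (a hypothetical application; note that errors='coerce' turns genuinely non-numeric entries into NaN, so astype(str) is the safer route if every zip code must survive intact):

df['zip_dw'] = pd.to_numeric(df['zip_dw'], errors='coerce')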

@PCJimmmy

PCJimmmy commented Aug 5, 2019

So how about an answer to the initial question, rather than getting off track?

I need the CSV format.

Can pandas improve the speed of writing that file format?

@awildturtok

I've managed to reduce write time by 90% using pyarrow to write the pandas data frame:

import pyarrow as pa
from pyarrow import csv

out = pa.Table.from_pandas(out_pd)  # convert the pandas frame to an Arrow table
del out_pd                          # drop the pandas copy to free memory
csv.write_csv(out, out_file)        # write with Arrow's CSV writer
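For reference, a self-contained sketch of the same approach (the frame here is a made-up stand-in for out_pd):

import pandas as pd
import pyarrow as pa
from pyarrow import csv

df = pd.DataFrame({'a': range(1_000_000), 'b': 1.5})  # stand-in data

table = pa.Table.from_pandas(df)  # convert to an Arrow table
csv.write_csv(table, 'out.csv')   # write with Arrow's C++ CSV writer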
