
improving the speed of to_csv #12885

Closed
randomgambit opened this issue Apr 13, 2016 · 17 comments
Labels: IO CSV, Performance


@randomgambit

Hello,

I don't know if this is possible, but it would be great to find a way to speed up the to_csv method in pandas.

With my admittedly large dataframe of 20 million observations and 50 variables, it takes literally hours to export the data to a CSV file.

Reading the CSV back into pandas is much faster, though. I wonder what the bottleneck is here and what can be done to improve the write speed.

CSV files are ubiquitous and a great way to share data (without being too nerdy with HDF5 and other subtleties). What do you think?

@jreback
Contributor

jreback commented Apr 13, 2016

duplicate of #3186

Well, using 1/10 of the rows, about 800MB in memory:

In [5]: df = DataFrame(np.random.randn(2000000,50))

In [6]: df.memory_usage().sum()
Out[6]: 800000072

In [7]: df.memory_usage().sum()/1000000
Out[7]: 800

In [8]: %time df.to_csv('test.csv')
CPU times: user 2min 53s, sys: 3.71 s, total: 2min 56s
Wall time: 2min 57s

About 8.5MB/sec in raw throughput, way below I/O speeds, so obviously there is quite some room to improve.
You might be interested in this blog here.

Of course, there is really no reason at all to use CSV unless you are forced to.

In [11]: %time df.to_hdf('test.hdf','df')
CPU times: user 10.5 ms, sys: 588 ms, total: 599 ms
Wall time: 2.22 s

@jreback jreback closed this as completed Apr 13, 2016
@jreback jreback added the Performance and IO CSV labels Apr 13, 2016
@jreback
Contributor

jreback commented Apr 13, 2016

This is also part of the documentation: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations

@randomgambit
Author

Hi @jreback, thanks for your help!

Unfortunately:

  • yes, I am forced
  • I have mixed types in my columns and somehow to_hdf fails

@jreback
Contributor

jreback commented Apr 13, 2016

What do you mean by mixed types?

You should NEVER have mixed types (e.g. Python objects). That is a big, big no-no.

@jreback
Contributor

jreback commented Apr 13, 2016

Show a df.info(); you can pass format='table', which is a somewhat slower but much more flexible format for writing HDF5.
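A minimal sketch of this (the file name and keys here are made up):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})

# fixed format: the default; fast, but not appendable or queryable
df.to_hdf('store.h5', 'df_fixed', format='fixed')

# table format: somewhat slower to write, but appendable and queryable
df.to_hdf('store.h5', 'df_table', format='table')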

@randomgambit
Author

I will run the to_hdf line tomorrow and tell you my error message. It may be due to the fact that I have string columns and/or some miscoded observations...

@randomgambit
Author

Yes, thanks, I will run df.info(). Keep in touch. Thanks Jeff.

@randomgambit
Author

Good morning Jeff (@jreback),

Please find my df.info() below:

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10762587 entries, 0 to 12864511
Columns: 275 entries, bagent_name to index_month
dtypes: bool(1), datetime64[ns](16), float64(30), int32(1), int64(172), object(53), timedelta64[ns](2)
memory usage: 22.0+ GB

Do you see anything wrong? I have many string columns here.

@randomgambit
Author

And this is what I get when I try:

df.to_hdf(path + 'main.h5','stored',format='table',mode='w')

TypeError: Cannot serialize the column [zip_dw] because
its data contents are [mixed-integer-float] object dtype

@jreback
Contributor

jreback commented Apr 13, 2016

So you probably have something like this. It's actually pretty hard to construct this; if you are not doing it explicitly, then please show how it's constructed, as this should always be converted to a concrete dtype.
You NEVER want object dtypes, except if they are strings. Yes, you can use them, but they are completely non-performant and generally cannot be serialized. These are by definition Python objects.

In [7]: df = DataFrame({'A' : [1,2,3], 'B' : Series([1,2.0,3],dtype=object)})

In [8]: df
Out[8]: 
   A  B
0  1  1
1  2  2
2  3  3

In [9]: df.dtypes
Out[9]: 
A     int64
B    object
dtype: object

In [10]: df.to_hdf('test.h5','df',format='fixed')
pandas/core/generic.py:1096: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer-float,key->block1_values] [items->['B']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)

In [11]: df.to_hdf('test.h5','df',format='table')
TypeError: Cannot serialize the column [B] because
its data contents are [mixed-integer-float] object dtype

@randomgambit
Author

Hi Jeff, yes, I think you are right. zip_dw is a variable that contains a zip code.
In the data it looks like 95966, but the dtype is object.

The key questions are:

  • is there something in pandas to force an (HDF5-compliant) conversion for these object columns?
  • I actually thought HDF5 could handle strings and fast-search them. Am I wrong?

Thanks again @jreback

@jreback
Contributor

jreback commented Apr 13, 2016

Of course it can handle strings, but it makes them fixed-width. That's not your issue, though: you have actual Python objects (an actual integer and a float object) that are NOT numpy types (that's why the column is object).

YOU are in charge of your dtypes. So you either need to stringify them, or you can leave them as object and use only the fixed format (it will work, but it pickles the data).

@randomgambit you have to understand and be cognizant of your data and types. Pandas provides lots of tools and mostly does not let you shoot yourself in the foot, but it cannot do 'everything', even though it will try very hard to infer things.
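A hedged sketch of those two options, using the zip_dw column from this thread (the DataFrame construction is a hypothetical stand-in for the real data):

import pandas as pd

# hypothetical mixed column: an int, a float and a string, hence object dtype
df = pd.DataFrame({'zip_dw': [95966, 95966.0, 'unknown']})

# option 1: stringify the column, after which the table format serializes fine
df['zip_dw'] = df['zip_dw'].astype(str)
df.to_hdf('main.h5', 'stored', format='table', mode='w')

# option 2: leave it as object and use the fixed format instead
# (works, but pickles the column and emits a PerformanceWarning)
# df.to_hdf('main.h5', 'stored', format='fixed', mode='w')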

@randomgambit
Author

Thanks @jreback, and sorry if I am bothering you with such basic questions. I come from a language (Stata) that is much less flexible, where everything is either a float or a string.

To recap, the problem here is that I have a column that contains some numbers (floats and integers) and some strings, which is why pandas treats it as object. You are saying that this mixed-type column causes performance issues and cannot be stored well in HDF5. To fix it, I should probably use either to_numeric or astype('str') on this column, so that every value in the column has a single dtype.

Is this correct?

@randomgambit
Author

Following our chat, I have bought a book about HDF5 and Python; that will help me understand this storage system better. @jreback, if you can just tell me whether my reasoning above is correct, that would help. Thanks, and keep up the great work with pandas!

@jreback
Contributor

jreback commented Apr 13, 2016

So you would typically do something like this. Pandas supports many dtypes; you want to type things as much as possible. All of these types are supported when serialized to HDF5 (table format), though, for example, strings become fixed-width and NaNs are replaced with a string.
In [31]: s = Series(['a',1,2.0])

In [32]: s
Out[32]:
0      a
1      1
2      2
dtype: object

In [33]: pd.to_numeric(s, errors='coerce')
Out[33]:
0    NaN
1    1.0
2    2.0
dtype: float64
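Applied back to the zip_dw column from earlier, that would look something like the line below (a hypothetical application; note that errors='coerce' turns genuinely non-numeric entries into NaN, so astype(str) is the safer route if every zip code must survive intact):

df['zip_dw'] = pd.to_numeric(df['zip_dw'], errors='coerce')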

@PCJimmmy

PCJimmmy commented Aug 5, 2019

So how about an answer to the initial question, rather than getting off track?

I need the CSV format.

Can pandas improve the speed of writing that file format?

@awildturtok

I've managed to reduce write time by 90% using pyarrow to write the pandas data frame:

import pyarrow as pa
from pyarrow import csv

out = pa.Table.from_pandas(out_pd)  # convert the pandas frame to an Arrow table
del out_pd                          # drop the pandas copy to free memory
csv.write_csv(out, out_file)        # write with Arrow's CSV writer
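For reference, a self-contained sketch of the same approach (the frame here is a made-up stand-in for out_pd):

import pandas as pd
import pyarrow as pa
from pyarrow import csv

df = pd.DataFrame({'a': range(1_000_000), 'b': 1.5})  # stand-in data

table = pa.Table.from_pandas(df)  # convert to an Arrow table
csv.write_csv(table, 'out.csv')   # write with Arrow's C++ CSV writer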
