improving the speed of to_csv #12885
Duplicate of #3186. Well, using 1/10 the rows, about 800MB in memory, I get about 8.5 MB/sec in raw throughput, way below IO speeds, so there is obviously quite some room to improve. Of course, there is really no reason at all to use CSV unless you are forced to.
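For context, a minimal sketch of how a throughput figure like that could be measured; the frame shape and file name here are illustrative stand-ins, not the original benchmark data:

```python
import os
import time

import numpy as np
import pandas as pd

# Synthetic frame standing in for the real data (about 80MB of floats).
df = pd.DataFrame(np.random.randn(1_000_000, 10))

start = time.time()
df.to_csv("bench.csv")
elapsed = time.time() - start

# Throughput = bytes actually written divided by wall-clock write time.
size_mb = os.path.getsize("bench.csv") / 1e6
print(f"{size_mb / elapsed:.1f} MB/sec raw write throughput")
```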
This is also part of the documentation: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
Hi @jreback, thanks for your help! Unfortunately...
What do you mean by "you should NEVER have mixed types (e.g. Python objects); that is a big, big no-no"?
show a `df.info()`
I will run the `to_hdf` line tomorrow and tell you my error message. It may be due to the fact that I have string columns and/or some miscoded observations...
Yes, thanks, I will run the `df.info()`. Keep in touch. Thanks Jeff.
Good morning Jeff (@jreback). Please find my `df.info()` below. Do you see anything wrong? I have many string columns here.
And this is what I get when I try `to_hdf`:
So you probably have something like this. It's actually pretty hard to construct this. If you are not doing this explicitly then please show how it's constructed, as this should always be converted to a concrete dtype.
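A hedged sketch of the kind of accidentally mixed column being described; the values are made up for illustration:

```python
import pandas as pd

# A column holding raw Python objects: an int, a float, and strings all at once.
# Pandas has no concrete numpy dtype covering all of these, so the column
# falls back to `object` dtype.
s = pd.Series([94110, 2134.5, "02134", "unknown"])
print(s.dtype)                 # object
print([type(x) for x in s])    # int, float, str, str
```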
Hi Jeff, yes, I think you are right. `zip_dw` is a variable that contains a zipcode. The key questions are:

Thanks again @jreback
Of course it can handle strings, but it makes them fixed width. But that's not your issue. You have actual Python objects (an actual integer and a float object) that are NOT in numpy (e.g. that's why the column is `object` dtype). YOU are in charge of your dtypes. So you either need to stringify them, or you can leave them as `object` and use only the `fixed` format. @randomgambit you have to understand and be cognizant of your data and types. Pandas provides lots of tools and mostly does not allow you to shoot yourself, but it cannot do 'everything', even though it will try very hard to infer things.
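A minimal sketch of the two options just described, assuming a DataFrame `df` with the mixed `zip_dw` column from earlier in the thread:

```python
# Option 1: stringify the mixed column so it has one consistent type,
# which lets it serialize to the queryable HDF5 table format.
df["zip_dw"] = df["zip_dw"].astype(str)
df.to_hdf("out.h5", key="df", format="table")

# Option 2: leave the column as `object` and write with the fixed format,
# which stores the frame as-is but does not support querying.
# df.to_hdf("out.h5", key="df", format="fixed")
```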
Thanks @jreback, and sorry if I bother you with such basic questions. I come from a language (Stata) that is much less flexible and where everything is either a float or a string. To recap, the problem here is that I have a column that contains some numbers (floats and integers) and some strings. This is why pandas treats them as objects. You are saying that this mixed-type column generates performance issues and cannot be well stored in HDF5. To fix this, I should probably either use `to_numeric` to coerce the strings, or convert everything in the column to strings. Is this correct?
Following our chat, I have bought a book about HDF5 and Python; that will help me understand this storage system better. @jreback, if you can just tell me whether my reasoning above is correct, that would help. Thanks, and keep up the great work with pandas!
So you would typically do something like this. Pandas supports many dtypes; you want to type as much as possible. All of these types are supported when serialized to HDF5 (table format), though, for example, strings become fixed width and NaNs are replaced with a string.

```
In [32]: s

In [33]: pd.to_numeric(s, errors='coerce')
```
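A self-contained version of that snippet; the series contents are made up here, since the original session output was not captured:

```python
import pandas as pd

# A series mixing real numbers with strings, like the zipcode column above.
s = pd.Series([1, 2.5, "apple", "02134"])

# errors='coerce' turns anything non-numeric into NaN, yielding a clean
# float64 column instead of an object column.
pd.to_numeric(s, errors="coerce")
# 0       1.0
# 1       2.5
# 2       NaN
# 3    2134.0
# dtype: float64
```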
So how about an answer to the initial question rather than getting off track? I need CSV format. Can pandas improve the speed of writing that file format?
I've managed to reduce write time by 90% using pyarrow to write the pandas DataFrame:

```python
import pyarrow as pa
from pyarrow import csv

out = pa.Table.from_pandas(out_pd)  # convert the pandas DataFrame to an Arrow table
del out_pd                          # free the pandas copy
csv.write_csv(out, out_file)        # write the Arrow table out as CSV
```
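(Here `out_pd` and `out_file` are the commenter's own DataFrame and output path. The speedup presumably comes from Arrow doing the CSV serialization in C++ rather than in Python.)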
Hello,

I don't know if that is possible, but it would be great to find a way to speed up the `to_csv` method in pandas. In my admittedly large dataframe with 20 million observations and 50 variables, it takes literally hours to export the data to a csv file. Reading the csv in pandas is much faster though. I wonder what the bottleneck is here and what can be done to improve the data transfer. Csv files are ubiquitous and a great way to share data (without being too nerdy with `hdf5` and other subtleties). What do you think?
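A minimal sketch of the kind of export being described, with synthetic data scaled well down from the 20-million-row case:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 20M x 50 frame (scaled down to stay quick).
df = pd.DataFrame(np.random.randn(200_000, 50))

# chunksize controls how many rows are formatted per batch; most of the
# cost is in converting each value to text rather than in disk IO, which
# is consistent with the throughput numbers quoted earlier in the thread.
df.to_csv("export.csv", chunksize=100_000)
```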