
REGR: to_csv problems with zip compression and large dataframes #38714

Closed
chmielcode opened this issue Dec 27, 2020 · 8 comments · Fixed by #38728
Labels
Bug · IO CSV (read_csv, to_csv) · Regression (functionality that used to work in a prior pandas version)

Comments

@chmielcode

chmielcode commented Dec 27, 2020

Code Sample, a copy-pastable example

import pandas as pd
import io
f = io.BytesIO()
d = pd.DataFrame({'a':[1]*5000})
d.to_csv(f, compression='zip')
f.seek(0)
pd.read_csv(f, compression='zip')

Problem description

Writing large (over 1163 rows) dataframes to csv with zip compression (inferred or explicit; to file or io.BytesIO) creates a corrupted zip file.
ValueError: Multiple files found in ZIP file. Only one file per ZIP: ['zip', 'zip', 'zip', 'zip', 'zip']
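Until a fix lands, the corruption can be sidestepped by building the full CSV payload first and writing it to the archive in a single call, so the zip ends up with exactly one member. Below is a minimal stdlib-only sketch of that idea (the member name `data.csv` and the generated rows are illustrative; with pandas, the string returned by `df.to_csv()` could be passed to `writestr` the same way):

```python
import io
import zipfile

# Build the complete CSV payload up front (here: a synthetic 5000-row table).
rows = "\n".join(f"{i},{i}" for i in range(5000))
payload = ("a,b\n" + rows + "\n").encode()

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    # A single writestr() call creates exactly one archive member,
    # avoiding the one-member-per-write() behavior behind this bug.
    zf.writestr("data.csv", payload)

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    names = zf.namelist()
print(names)  # ['data.csv']
```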

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.8.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Polish_Poland.1250

pandas : 1.2.0
numpy : 1.19.3
pytz : 2020.5
dateutil : 2.8.1
pip : 20.3.3
setuptools : 51.1.0.post20201221
Cython : None
pytest : 6.2.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.2
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.4
sqlalchemy : None
tables : None
tabulate : None
xarray : 0.16.2
xlrd : None
xlwt : None
numba : 0.52.0

@chmielcode chmielcode added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 27, 2020
@simonjayhawkins simonjayhawkins added IO CSV read_csv, to_csv and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 27, 2020
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Dec 27, 2020
@simonjayhawkins
Member

Thanks @chmielcode for the report.

The code sample failed on previous versions with TypeError: a bytes-like object is required, not 'str'

Writing large (over 1163 rows) dataframes to csv with zip compression (inferred or explicit; to file or io.BytesIO) creates a corrupted zip file.

is there a combination of inferred/explicit compression and buffer type that worked previously and now fails?

@chmielcode
Author

chmielcode commented Dec 27, 2020

@simonjayhawkins Thank you for the quick response. I only noticed this problem after upgrading to 1.2.0, when my data caching system started failing. No issues with 1.1.5.

The same happens with a string path as the first argument, which is how I normally use this method. BytesIO in the example code was used to keep the example as clean as possible (no write to disk).

d = pd.DataFrame({'a':[1]*2188})
p = R"T:\test.csv.zip"  # replace with available path
d.to_csv(p)
pd.read_csv(p)

Output: Multiple files found in ZIP file. Only one file per ZIP: ['T:/test.csv.zip', 'T:/test.csv.zip']

The error message reports 2 files for 1164–2188 rows and 3 files for 2189–3213 rows (one additional file per 1024 rows). The larger the frame, the more files are reported in the zip archive.

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 27, 2020
@simonjayhawkins
Member

No issues with 1.1.5.

first bad commit: [3b88446] support binary file handles in to_csv (#35129) cc @twoertwein

@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Dec 27, 2020
@simonjayhawkins simonjayhawkins changed the title BUG: to_csv problems with zip compression and large dataframes REGR: to_csv problems with zip compression and large dataframes Dec 27, 2020
@twoertwein
Member

sorry about that, I will look into it! I assume that write() is called multiple times for large files. ZipFile creates a new file (with the same name) within the zipfile for each write call.
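This ZipFile behavior is easy to demonstrate with the stdlib alone: writing to the same member name twice produces two archive members rather than one appended stream (the name `data.csv` here is just illustrative):

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # Each write of the "same" file adds a new archive member instead of
    # appending to the existing one, mirroring what happened when write()
    # was called repeatedly on the zip handle.
    zf.writestr("data.csv", "a\n1\n")
    zf.writestr("data.csv", "1\n1\n")  # duplicate member with the same name

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    names = zf.namelist()
print(names)  # ['data.csv', 'data.csv']
```

(Recent Python versions emit a `UserWarning: Duplicate name` here, but the duplicate member is still written.)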

@twoertwein
Member

I made a PR.

@chmielcode your initial example needs a call to seek.

import pandas as pd
import io
f = io.BytesIO()
d = pd.DataFrame({'a':[1]*5000})
d.to_csv(f, compression='zip')
f.seek(0)
pd.read_csv(f, compression='zip')

@simonjayhawkins simonjayhawkins modified the milestones: Contributions Welcome, 1.2.1 Dec 27, 2020
@chmielcode
Author

chmielcode commented Dec 27, 2020

@twoertwein Thank you very much. I've updated the example. It works without seek(0), but now it's clear that the missing seek is not the cause.

@twoertwein
Member

I was testing whether setting chunksize to a large value could be a temporary workaround before 1.2.1 is released: d.to_csv(f, compression='zip', chunksize=1000000000000000000000). Unfortunately, it seems that https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/writers.pyx#L14 calls writerows() for each chunk, potentially issuing multiple write() calls, so that doesn't help.
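The underlying reason the chunksize workaround fails can be sketched without pandas: the stdlib csv writer, which writerows() builds on, issues one write() call on the underlying buffer per row, so even a single huge chunk still produces many write() calls (the `CountingBuffer` class here is a hypothetical helper for illustration):

```python
import csv
import io

class CountingBuffer(io.StringIO):
    """Text buffer that counts how often write() is invoked."""
    def __init__(self):
        super().__init__()
        self.write_calls = 0

    def write(self, s):
        self.write_calls += 1
        return super().write(s)

buf = CountingBuffer()
# One writerows() call over five rows still triggers one write() per row.
csv.writer(buf).writerows([[i] for i in range(5)])
print(buf.write_calls)  # 5
```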

@twoertwein
Copy link
Member

@chmielcode yes, you are right: for zip compression you don't need a seek (it seems that ZipFile seeks internally). But if you use gzip/bz2/xz/no compression, then you need seek(0).
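The seek(0) requirement for the other codecs can be seen with gzip from the stdlib: after writing, the buffer's position sits at the end, and reading without rewinding yields nothing.

```python
import gzip
import io

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(b"a\n1\n")

buf.seek(0)  # rewind; after writing, the position is at end-of-buffer
with gzip.GzipFile(fileobj=buf, mode="rb") as gz:
    data = gz.read()
print(data)  # b'a\n1\n'
```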
