REGR: to_csv created corrupt ZIP files when chunksize<rows #38728

Merged
merged 1 commit into from
Dec 29, 2020

Member

@twoertwein twoertwein commented Dec 27, 2020

When ZipFile's write is called multiple times, it will create multiple files within the zip file (with the same filename).

Edit: This also happens independently of chunksize as https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/writers.pyx#L14 calls writerows multiple times.
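A minimal sketch of the behavior described above, using only the standard-library zipfile module rather than pandas' internal wrapper. Each separate writestr call with the same member name adds another entry to the archive. The helper name write_chunks and the member name data.csv are made up for illustration.

```python
import io
import warnings
import zipfile


def write_chunks(chunks, name="data.csv"):
    """Write each chunk with its own writestr call; return the member names."""
    raw = io.BytesIO()
    with warnings.catch_warnings():
        # zipfile warns (UserWarning) when a member name is reused.
        warnings.simplefilter("ignore", UserWarning)
        with zipfile.ZipFile(raw, mode="w") as archive:
            for chunk in chunks:
                archive.writestr(name, chunk)
    # Reopen the in-memory archive and list what it contains.
    with zipfile.ZipFile(raw) as archive:
        return archive.namelist()


names = write_chunks(["a,b\n", "1,2\n", "3,4\n"])
print(names)  # one entry per writestr call, all with the same name
```

Three chunks produce three members named data.csv instead of one file containing all rows, which is why many tools treat the result as corrupt.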

@simonjayhawkins simonjayhawkins added this to the 1.2.1 milestone Dec 27, 2020
@simonjayhawkins simonjayhawkins added the IO CSV read_csv, to_csv label Dec 27, 2020
)
self.multiple_write_buffer.write(data)

def _write(self) -> None:

Contributor

Shouldn't this actually be called .flush()? Or would that conflict?

Member Author

I didn't think about that, flush might be a simpler solution! I will test that.

Member Author

I renamed it to flush, but we still have to override close. Some tests (oddly, not all of them) were failing without overriding close.
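
A rough sketch of the buffer-then-flush idea discussed in this thread, not the code merged in this PR. The class BufferedZipWriter and its attribute names are hypothetical. It assumes all writes are collected in memory and written as a single member on flush, and that close flushes first so buffered data is not lost when the archive is finalized.

```python
import io
import zipfile


class BufferedZipWriter:
    """Hypothetical wrapper: buffer every write, emit a single archive member."""

    def __init__(self, target, archive_name):
        self._archive = zipfile.ZipFile(target, mode="w")
        self._archive_name = archive_name
        self._buffer = io.StringIO()

    def write(self, data):
        # Buffer only; nothing is added to the archive yet.
        return self._buffer.write(data)

    def flush(self):
        # Emit everything buffered so far as one member, then clear the buffer.
        content = self._buffer.getvalue()
        if content:
            self._archive.writestr(self._archive_name, content)
            self._buffer = io.StringIO()

    def close(self):
        # Flush before closing; otherwise buffered data never reaches the archive.
        self.flush()
        self._archive.close()


raw = io.BytesIO()
writer = BufferedZipWriter(raw, "data.csv")
for chunk in ["a,b\n", "1,2\n", "3,4\n"]:
    writer.write(chunk)
writer.close()

with zipfile.ZipFile(raw) as archive:
    member_names = archive.namelist()
    content = archive.read("data.csv").decode()
```

One caveat of this sketch: calling flush in the middle of a stream and then writing more data would still produce a second member, so it only helps when flush happens once at the end.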

Contributor

@jreback jreback left a comment

lgtm, ping on green.

@jreback jreback merged commit fb35344 into pandas-dev:master Dec 29, 2020
Contributor

jreback commented Dec 29, 2020

thanks @twoertwein very nice

Contributor

jreback commented Dec 29, 2020

@meeseeksdev backport 1.2.x


lumberbot-app bot commented Dec 29, 2020

Something went wrong ... Please have a look at my logs.

jreback pushed a commit that referenced this pull request Dec 29, 2020
…size<rows (#38767)

Co-authored-by: Torsten Wörtwein <twoertwein@users.noreply.github.com>
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
@twoertwein twoertwein deleted the multiple_zip_write_calls branch February 8, 2021 00:43
Successfully merging this pull request may close these issues.

REGR: to_csv problems with zip compression and large dataframes