Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/REG: file-handle object handled incorrectly in to_csv #21478

Merged
merged 19 commits into from
Jun 18, 2018

Conversation

minggli
Copy link
Contributor

@minggli minggli commented Jun 14, 2018

This error related to PR #21249 and #21227. This is never supported use case and to use file-handle in to_csv with compression, the file-object itself should be a compression archive such as:

with gzip.open('test.txt.gz', 'wt') as f:
    pd.DataFrame([0,1],index=['a','b'], columns=['c']).to_csv(f, sep='\t')

Regressed to 0.22 to_csv with support for zipfile. zipfile doesn't support writing csv strings to a zip archive using a file-handle. So buffer is used to catch the writing and dump into zip archive in one go. The other scenarios remain unchanged.

@codecov
Copy link

codecov bot commented Jun 14, 2018

Codecov Report

Merging #21478 into master will increase coverage by 0.03%.
The diff coverage is 95.65%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21478      +/-   ##
==========================================
+ Coverage   91.89%   91.92%   +0.03%     
==========================================
  Files         153      153              
  Lines       49604    49599       -5     
==========================================
+ Hits        45584    45595      +11     
+ Misses       4020     4004      -16
Flag Coverage Δ
#multiple 90.32% <95.65%> (+0.03%) ⬆️
#single 41.88% <43.47%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/frame.py 97.23% <ø> (ø) ⬆️
pandas/core/series.py 94.17% <ø> (ø) ⬆️
pandas/io/common.py 70.28% <100%> (+0.24%) ⬆️
pandas/io/formats/csvs.py 97.67% <95.23%> (-0.48%) ⬇️
pandas/core/indexes/base.py 96.62% <0%> (ø) ⬆️
pandas/core/algorithms.py 94.83% <0%> (+0.01%) ⬆️
pandas/io/json/json.py 92.47% <0%> (+0.23%) ⬆️
pandas/util/testing.py 85.96% <0%> (+1.35%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bf1c3dc...cf3afac. Read the comment docs.

@jorisvandenbossche jorisvandenbossche added this to the 0.23.2 milestone Jun 14, 2018
@jorisvandenbossche jorisvandenbossche added Regression Functionality that used to work in a prior pandas version IO CSV read_csv, to_csv labels Jun 14, 2018
@littleK0i
Copy link

littleK0i commented Jun 14, 2018

I am not a proper reviewer, but I am suspicious about the whole "ZipFile magic" part.

It looks really ugly to me, and this code may potentially cause spike of 2x memory usage due to StringIO accumulation and "getvalue()" call.

I do not fully understand what was the problem with 0.22 implementation. Yes, "read_csv" and "to_csv" were inconsistent, but it contained less "magic", and users could always wrap anything they need with few lines of code.

Also, compression algos usually have additional parameters. Users might want to change defaults later on (e.g. compression level), and they'll need manual wrapping for that anyway.

Just thoughts.

@minggli
Copy link
Contributor Author

minggli commented Jun 14, 2018

agree that for extremely large zip output, memory usage will increase when writing csv to StringIO first and then call getvalue(). With 8G RAM MacOS, enough to write a [150m, 3] dataframe of 3.4g memory. the memory increase from getvalue call is only when zip compression, for other use cases, there should be no difference as 0.22. When path_or_buf is None, to_csv uses StringIO and call getvalue() already.

if path_or_buf is None:

the custom class just provides a file-like that accepts string into zip archive when ZipFile class doesn't really offer that.

The ability to write zip archive is added #17778 where only read zip was supported, whereas we have round-trip ability for others. the motivation is consistency but also zip compression format is quite common and strange not to be supported.

the zip class is not only for to_csv but provides utility for to_json and to_pickle case too.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. mostly doc comments. ping on green.

@@ -56,7 +56,7 @@ Bug Fixes

**I/O**

-
- Bug in :meth:`to_csv` when handling file-like object incorrectly (:issue:`21471`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move to the regression section

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

# path_or_buf is file handle
path_or_buf = self.path_or_buf.name
if self.compression and hasattr(self.path_or_buf, 'write'):
import warnings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can move this import to the top

elif hasattr(self.path_or_buf, 'name'):
# path_or_buf is file handle
path_or_buf = self.path_or_buf.name
if self.compression and hasattr(self.path_or_buf, 'write'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can uou add a comment here (sure the warning says it all, but helpful nonetheless), also an issue reference

"object as input.")
warnings.warn(msg, RuntimeWarning, stacklevel=2)

if isinstance(self.path_or_buf, ZipFile) or (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just do

is_zip = isinstance(......) or (.....)

"object as input.")
warnings.warn(msg, RuntimeWarning, stacklevel=2)

if isinstance(self.path_or_buf, ZipFile) or (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment here on what is_zip means

try:
self.path_or_buf.write(buf)
except AttributeError:
f, handles = _get_handle(self.path_or_buf, self.mode,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment on when the except happens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

f, _handles = _get_handle(filename, 'w', compression=compression,
encoding=encoding)
with f:
s.to_csv(f, encoding=encoding, header=True)
result_fh = pd.read_csv(filename, compression=compression,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can rename this to result (and below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.



def test_compression_warning(compression_only):
df = DataFrame(100 * [[0.123456, 0.234567, 0.567567],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

@jreback jreback requested a review from WillAyd June 15, 2018 17:10
@pep8speaks
Copy link

pep8speaks commented Jun 15, 2018

Hello @minggli! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 16, 2018 at 13:40 Hours UTC

@minggli minggli closed this Jun 15, 2018
@minggli minggli reopened this Jun 15, 2018
# GH 17778 handles zip compression separately.
buf = f.getvalue()
try:
self.path_or_buf.write(buf)
Copy link
Member

@gfyoung gfyoung Jun 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They say that it's sometimes easier to ask for forgiveness than it is for permission, hence the try/except block that you choose here. That being said, your treatment of self.path_or_buf.write is inconsistent here compared to:

https://github.com/pandas-dev/pandas/pull/21478/files#diff-eaa887c826bfb361d98db1ddb668c7deR150

Where you ask for permission instead of forgiveness. In addition, the logic seems somewhat similar. Can we potentially deduplicate it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see your point. changed to if else to make it consistent.

@@ -16,7 +16,7 @@ and bug fixes. We recommend that all users upgrade to this version.
Fixed Regressions
~~~~~~~~~~~~~~~~~

-
- Fixed Regression in :meth:`to_csv` when handling file-like object incorrectly (:issue:`21471`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "Regression" --> "regression"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@minggli
Copy link
Contributor Author

minggli commented Jun 16, 2018

@jreback comments carried out.

@jreback jreback merged commit 91451cb into pandas-dev:master Jun 18, 2018
@jreback
Copy link
Contributor

jreback commented Jun 18, 2018

thanks @minggli nice followup patch!

@jreback
Copy link
Contributor

jreback commented Jun 18, 2018

@TomAugspurger this might be tricky to backport

@minggli minggli deleted the bugfix/to_csv branch June 19, 2018 07:01
@minggli
Copy link
Contributor Author

minggli commented Jun 19, 2018

@jreback @gfyoung happy to help. thanks for reviewing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants