BUG/REG: file-handle object handled incorrectly in to_csv #21478
Codecov Report
@@ Coverage Diff @@
## master #21478 +/- ##
==========================================
+ Coverage 91.89% 91.92% +0.03%
==========================================
Files 153 153
Lines 49604 49599 -5
==========================================
+ Hits 45584 45595 +11
+ Misses 4020 4004 -16
I am not a proper reviewer, but I am suspicious about the whole "ZipFile magic" part. It looks really ugly to me, and this code may cause a 2x spike in memory usage due to StringIO accumulation and the "getvalue()" call. I do not fully understand what the problem was with the 0.22 implementation. Yes, "read_csv" and "to_csv" were inconsistent, but it contained less "magic", and users could always wrap anything they need with a few lines of code. Also, compression algorithms usually have additional parameters. Users might want to change the defaults later on (e.g. compression level), and they'll need manual wrapping for that anyway. Just thoughts.
agree that for extremely large zip output, memory usage will increase when writing csv to StringIO first and then call Line 1747 in bf1c3dc
the custom class just provides a file-like object that accepts a string and writes it into a zip archive, which the ZipFile class doesn't really offer. The ability to write zip archives was added in #17778; before that only reading zip was supported, whereas the other formats could already round-trip. the motivation is consistency, but the zip compression format is also quite common and it would be strange not to support it. the zip class is not only for to_csv but also serves the to_json and to_pickle cases.
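The custom class mentioned above can be sketched roughly as follows. This is a minimal illustration, not the pandas implementation: the class name `StringZipWriter` and its `arcname` parameter are hypothetical, and only the standard-library `zipfile` module is assumed.

```python
import io
import zipfile


class StringZipWriter(zipfile.ZipFile):
    """Hypothetical file-like wrapper: write() stores text as one archive member."""

    def __init__(self, file, arcname, **kwargs):
        super().__init__(file, mode="w", **kwargs)
        self._arcname = arcname  # name of the member inside the archive

    def write(self, data):
        # ZipFile.write() normally expects a path on disk; here it is
        # overridden so that a str (or bytes) payload goes to writestr().
        # Call it once with the full payload: each call adds a new member.
        self.writestr(self._arcname, data)


# Write CSV text into an in-memory archive, then read it back.
target = io.BytesIO()
with StringZipWriter(target, "data.csv", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("a,b\n1,2\n")

with zipfile.ZipFile(io.BytesIO(target.getvalue())) as zf:
    restored = zf.read("data.csv").decode("utf-8")
```

Because the whole payload is passed to `write()` in a single call, the caller has to hold the complete CSV text in memory first, which is the memory concern raised in the comment above.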
looks good. mostly doc comments. ping on green.
doc/source/whatsnew/v0.23.2.txt
Outdated
@@ -56,7 +56,7 @@ Bug Fixes

**I/O**

-
- Bug in :meth:`to_csv` when handling file-like object incorrectly (:issue:`21471`)
can you move to the regression section
done.
pandas/io/formats/csvs.py
Outdated
    # path_or_buf is file handle
    path_or_buf = self.path_or_buf.name
if self.compression and hasattr(self.path_or_buf, 'write'):
    import warnings
you can move this import to the top
pandas/io/formats/csvs.py
Outdated
elif hasattr(self.path_or_buf, 'name'):
    # path_or_buf is file handle
    path_or_buf = self.path_or_buf.name
if self.compression and hasattr(self.path_or_buf, 'write'):
can you add a comment here (sure the warning says it all, but helpful nonetheless), also an issue reference
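For readers following the thread, the check under review can be sketched as a standalone helper. This is only an illustration under stated assumptions: the function name `warn_if_compression_ignored` is hypothetical, and the warning text is a paraphrase, since the diff shows only the tail of the message.

```python
import io
import warnings


def warn_if_compression_ignored(path_or_buf, compression):
    """Hypothetical helper: warn when compression is requested for a file-like target."""
    # A target with a write() method is treated as an already-open handle,
    # so a compression argument cannot be applied to it.
    if compression and hasattr(path_or_buf, "write"):
        # Assumed wording; only the end of the real message appears in the diff.
        msg = "compression has no effect when passing a file-like object as input."
        warnings.warn(msg, RuntimeWarning, stacklevel=2)
        return True
    return False


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ignored = warn_if_compression_ignored(io.StringIO(), "gzip")
```

A plain path such as `"out.csv"` has no `write` attribute, so the helper stays silent and compression can be applied when the file is opened.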
pandas/io/formats/csvs.py
Outdated
           "object as input.")
    warnings.warn(msg, RuntimeWarning, stacklevel=2)

if isinstance(self.path_or_buf, ZipFile) or (
you can just do
is_zip = isinstance(......) or (.....)
pandas/io/formats/csvs.py
Outdated
           "object as input.")
    warnings.warn(msg, RuntimeWarning, stacklevel=2)

if isinstance(self.path_or_buf, ZipFile) or (
comment here on what is_zip means
pandas/io/formats/csvs.py
Outdated
try:
    self.path_or_buf.write(buf)
except AttributeError:
    f, handles = _get_handle(self.path_or_buf, self.mode,
can you add a comment on when the except happens?
done.
pandas/tests/series/test_io.py
Outdated
f, _handles = _get_handle(filename, 'w', compression=compression,
                          encoding=encoding)
with f:
    s.to_csv(f, encoding=encoding, header=True)
result_fh = pd.read_csv(filename, compression=compression,
you can rename this to result (and below)
done.
pandas/tests/test_common.py
Outdated
def test_compression_warning(compression_only):
    df = DataFrame(100 * [[0.123456, 0.234567, 0.567567],
can you add the issue number
added.
Hello @minggli! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on June 16, 2018 at 13:40 Hours UTC
pandas/io/formats/csvs.py
Outdated
# GH 17778 handles zip compression separately.
buf = f.getvalue()
try:
    self.path_or_buf.write(buf)
They say that it's sometimes easier to ask for forgiveness than it is for permission, hence the try/except block that you choose here. That being said, your treatment of self.path_or_buf.write is inconsistent here compared to:
https://github.com/pandas-dev/pandas/pull/21478/files#diff-eaa887c826bfb361d98db1ddb668c7deR150
Where you ask for permission instead of forgiveness. In addition, the logic seems somewhat similar. Can we potentially deduplicate it?
see your point. changed to if else to make it consistent.
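The consistent "ask for permission" form agreed on here might look like the sketch below. It is an assumption-laden illustration rather than the code in the PR: `dump_text` is a hypothetical helper, and the plain `open()` call stands in for whatever handle-opening routine the real code uses.

```python
import io
import os
import tempfile


def dump_text(path_or_buf, text):
    """Hypothetical helper: write text to an open handle or to a path."""
    if hasattr(path_or_buf, "write"):
        # Already a file-like object: write directly.
        path_or_buf.write(text)
    else:
        # Otherwise treat it as a path and open it ourselves.
        with open(path_or_buf, "w", newline="") as fh:
            fh.write(text)


# Same call for both kinds of target.
buffer = io.StringIO()
dump_text(buffer, "a,b\n1,2\n")

path = os.path.join(tempfile.mkdtemp(), "out.csv")
dump_text(path, "a,b\n1,2\n")
```

Using one `hasattr` test in both places keeps the two branches easy to compare and avoids catching an unrelated `AttributeError` raised inside `write` itself.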
doc/source/whatsnew/v0.23.2.txt
Outdated
@@ -16,7 +16,7 @@ and bug fixes. We recommend that all users upgrade to this version.
Fixed Regressions
~~~~~~~~~~~~~~~~~

-
- Fixed Regression in :meth:`to_csv` when handling file-like object incorrectly (:issue:`21471`)
nit: "Regression" --> "regression"
done.
@jreback comments carried out.
thanks @minggli nice followup patch!
@TomAugspurger this might be tricky to backport
(cherry picked from commit 91451cb)
git diff upstream/master -u -- "*.py" | flake8 --diff
This error is related to PR #21249 and #21227. Passing a file handle together with compression was never a supported use case; to use a file handle in to_csv with compression, the file object itself should be a compression archive, such as:
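A minimal sketch of such a handle, assuming gzip as the archive type. The helper `write_through_gzip` and the file name are hypothetical; the standard-library `csv` writer stands in for `to_csv` so that the snippet runs without pandas.

```python
import csv
import gzip
import os
import tempfile


def write_through_gzip(path, write_text):
    """Hypothetical helper: hand a gzip text handle to any text writer."""
    # "wt" opens the archive in text mode, so the writer sees an ordinary
    # text handle while the bytes on disk are gzip-compressed.
    with gzip.open(path, "wt", newline="") as fh:
        write_text(fh)


path = os.path.join(tempfile.mkdtemp(), "example.csv.gz")
write_through_gzip(path, lambda fh: csv.writer(fh).writerows([["a", "b"], [1, 2]]))

# Read the archive back as text to confirm the round trip.
with gzip.open(path, "rt") as fh:
    lines = fh.read().splitlines()
```

With pandas installed, `write_through_gzip(path, df.to_csv)` should hand the same kind of handle to `to_csv`, with no `compression` argument needed because the handle already compresses.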
Reverted to the 0.22 to_csv behaviour, keeping support for zipfile. zipfile doesn't support writing CSV strings to a zip archive through a file handle, so a buffer is used to capture the output and dump it into the zip archive in one go. The other scenarios remain unchanged.
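The buffer-then-dump approach can be sketched with the standard library alone. This is an illustration under assumptions, not the code merged in the PR: `csv_rows_to_zip` is a hypothetical helper and `csv.writer` stands in for the CSV formatting done by pandas.

```python
import csv
import io
import zipfile


def csv_rows_to_zip(rows, zip_target, arcname):
    """Hypothetical helper: buffer CSV text, then store it as one zip member."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    # getvalue() returns the whole buffered text as one string, so the
    # payload can briefly exist twice in memory before it is compressed.
    payload = buf.getvalue()
    with zipfile.ZipFile(zip_target, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(arcname, payload)


archive = io.BytesIO()
csv_rows_to_zip([["a", "b"], [1, 2]], archive, "data.csv")

with zipfile.ZipFile(io.BytesIO(archive.getvalue())) as zf:
    stored = zf.read("data.csv").decode("utf-8")
```

The trade-off is the one raised earlier in the thread: the archive member is written in a single call, at the cost of holding the full CSV text in memory first.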