
ENH: Add 'infer' option to compression in _get_handle() #17900

Merged: 1 commit merged into pandas-dev:master from Dobatymo:infer_compression on Jul 8, 2018

Conversation

@Dobatymo (Contributor) commented Oct 17, 2017

xref #15008
xref #17262

Added an 'infer' option to compression in _get_handle(). This makes, for example, .to_csv() more consistent with pandas.read_csv().
The default value remains None to keep backward compatibility.

  • closes #xxxx
  • tests passed ( 14046 passed, 1637 skipped, 11 xfailed, 1 xpassed, 7 warnings)
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

This is my first pull request; if anything is wrong, I'd be happy to change it.
I am not sure about the xfailed entries; I ran pytest --skip-slow --skip-network pandas.
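
For illustration, a minimal sketch of the read/write symmetry this change enables (post-merge behavior; the file name is an arbitrary example):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Writing can now infer gzip from the .gz extension, mirroring
# what pandas.read_csv() already did on the reading side.
df.to_csv("data.csv.gz", compression="infer")
roundtrip = pd.read_csv("data.csv.gz", index_col=0, compression="infer")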

@jreback (Contributor) commented Oct 17, 2017

is there a related issue?

@Dobatymo (Contributor, Author) commented Oct 18, 2017

This is related to #15008, which is a much larger project.

@jorisvandenbossche (Member) commented:

@Dobatymo The general idea looks good to me, but can you add some new tests to confirm the behaviour? (for those read/write functions that have now gained the ability to infer the compression type)

@codecov (bot) commented Oct 20, 2017

Codecov Report

Merging #17900 into master will decrease coverage by 0.02%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #17900      +/-   ##
==========================================
- Coverage   91.23%   91.21%   -0.03%     
==========================================
  Files         163      163              
  Lines       50105    50105              
==========================================
- Hits        45715    45704      -11     
- Misses       4390     4401      +11
Flag        Coverage Δ
#multiple   89.02% <100%> (-0.01%) ⬇️
#single     40.31% <0%> (-0.06%) ⬇️

Impacted Files         Coverage Δ
pandas/io/pickle.py    80.43% <ø> (-0.82%) ⬇️
pandas/core/frame.py   97.75% <ø> (-0.1%) ⬇️
pandas/io/common.py    68.9% <100%> (-0.59%) ⬇️
pandas/io/gbq.py       25% <0%> (-58.34%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@codecov (bot) commented Oct 20, 2017

Codecov Report

Merging #17900 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #17900      +/-   ##
==========================================
- Coverage   91.95%   91.94%   -0.01%     
==========================================
  Files         160      160              
  Lines       49858    49860       +2     
==========================================
  Hits        45845    45845              
- Misses       4013     4015       +2
Flag        Coverage Δ
#multiple   90.32% <100%> (-0.01%) ⬇️
#single     42.08% <50%> (ø) ⬆️

Impacted Files          Coverage Δ
pandas/core/generic.py  96.45% <ø> (ø) ⬆️
pandas/core/frame.py    97.19% <ø> (ø) ⬆️
pandas/io/common.py     70.65% <100%> (-0.55%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Dobatymo (Contributor, Author) commented:

I didn't find any tests for compression in to_csv(), so I added a simple test for that and for 'infer'.

@gfyoung gfyoung added Enhancement IO CSV read_csv, to_csv labels Oct 20, 2017

def test_to_csv_compression(self):
    import gzip
    import bz2
Member:

I think we can just import at the top. You're not polluting the namespace with a from import. Also, reference the issue that you mention in the discussion under the function definition.

with bz2.BZ2File(path) as f:
    assert f.read() == exp

def test_to_csv_compression_lzma(self):
Member:

Reference the issue that you mention in the discussion under the function definition.

@@ -223,3 +224,46 @@ def test_to_csv_multi_index(self):

exp = "foo\nbar\n1\n"
assert df.to_csv(index=False) == exp

def test_to_csv_compression(self):
Contributor:

follow the style of the read_csv tests for this, IOW you need to parameterize these, skipping as needed

@jreback (Contributor) commented Oct 28, 2017

can you rebase and update

@Dobatymo (Contributor, Author) commented Oct 30, 2017

Sorry for the late reply.

I rebased, moved the imports, and added a reference to #15008.

I checked the tests for read_csv compression in pandas/tests/io/parser/compression.py, but I don't see any parameterization there...

EDIT: OK, I parameterized the tests, but I kept the lzma test separate, as I didn't know how to combine the skipping with parameterization (one way to do it is sketched below).
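
For reference, a minimal sketch of how a single case can be skipped inside a parameterized test, using the td.skip_if_no_lzma decorator that appears later in this thread (test body elided):

import pytest

import pandas.util._test_decorators as td


@pytest.mark.parametrize("compression", [
    "gzip",
    "bz2",
    # skip only this case when the lzma module is unavailable
    pytest.param("xz", marks=td.skip_if_no_lzma),
])
def test_to_csv_compression(compression):
    ...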

@jorisvandenbossche (Member) left a comment:

Can you add a note in the v0.22.0.txt whatsnew file in the docs?

Further, Travis is failing due to this linting error:

pandas/io/pickle.py:7:1: F401 'pandas.io.common._infer_compression' imported but unused

a string representing the compression to use in the output file,
allowed values are 'gzip', 'bz2', 'xz',
only used when the first argument is a filename
compression : {'infer', 'gzip', 'bz2', 'xz', None}, default None
Member:

'zip' is missing here?
(unless it is wrongly included in the docstring of get_handle)

Contributor (Author):

I already removed 'zip' from the docstring.

Member:

What do you mean exactly? Where did you remove it?

I don't know whether zip works for to_csv, hence the question. I just see that the line below mentions 'zip' but the line above does not. And the get_handle docstring also mentions 'zip'.

Member:

Quickly tested, and 'zip' is not supported for writing, only reading. So I think it should be removed from the line below.

@Dobatymo (Contributor, Author) commented Oct 31, 2017:

Oh, sorry for the misunderstanding. I only removed the first mention of 'zip' from that docstring, not the second one; I missed that. As you said, zip is only supported for reading, not writing.

# see GH issue 15008

df = DataFrame([1])
exp = "".join((",0", os.linesep, "0,1", os.linesep)).encode("ascii")
Member:

why are we encoding as ascii?

Contributor (Author):

I need the result to be bytes, not str, in a Python 2- and 3-compatible way: os.linesep is str (bytes) in Python 2 and str (unicode) in Python 3.
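
A quick illustration of that point, using only the standard library:

import os

# os.linesep is a byte string on Python 2 and a unicode string on
# Python 3; encoding to ASCII yields bytes on both, which matches
# what f.read() returns from a binary gzip/bz2 handle.
exp = "".join((",0", os.linesep, "0,1", os.linesep)).encode("ascii")
assert isinstance(exp, bytes)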

@@ -44,7 +44,7 @@ Other API Changes
- All-NaN levels in ``MultiIndex`` are now assigned float rather than object dtype, coherently with flat indexes (:issue:`17929`).
- :class:`Timestamp` will no longer silently ignore unused or invalid `tz` or `tzinfo` arguments (:issue:`17690`)
- :class:`CacheableOffset` and :class:`WeekDay` are no longer available in the `tseries.offsets` module (:issue:`17830`)
-
- all methods which use `_get_handle()` internally, now support "infer" as compression argument (eg. `to_csv()`)
Member:

Can you rewrite this a bit from a user's standpoint (one who does not know about _get_handle)? So if it is only to_csv, just mention that to_csv gained the option 'infer' for the compression keyword.

(You can also put it in the 'Other enhancements' section instead of API changes.)

@@ -22,7 +22,7 @@ New features
Other Enhancements
^^^^^^^^^^^^^^^^^^

-
- `to_csv()` and `to_json()` now support 'infer' as `compression` argument (was already supported by `to_pickle()` before)
Contributor:

as a compression= kwarg; you don't need to mention pickle. Add the issue number.

Use :func:`~DataFrame.to_csv` and similar for ``to_json``.

Contributor:

can you add tests for to_json?

df = DataFrame([1])
exp = "".join((",0", os.linesep, "0,1", os.linesep)).encode("ascii")

with tm.ensure_clean("test.xz") as path:
Contributor:

add an additional comparison that reads back these files (you can simply use infer), then assert_frame_equal to the original.
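
Concretely, the requested round-trip check would look something like this (a sketch using the test utilities of that pandas era; the frame and file name are illustrative):

import pandas as pd
import pandas.util.testing as tm  # pandas._testing in modern pandas

df = pd.DataFrame({"A": [0.1, 0.2], "B": [1.0, 2.0]})

with tm.ensure_clean("test.gz") as path:
    df.to_csv(path, compression="gzip")
    # read back with inferred compression and compare to the original
    tm.assert_frame_equal(
        pd.read_csv(path, index_col=0, compression="infer"), df)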

@Dobatymo (Contributor, Author) commented:

This is related as well: #17262.

@jreback (Contributor) commented Nov 25, 2017

can you rebase

@jreback (Contributor) commented Dec 28, 2017

can you rebase

@Dobatymo (Contributor, Author) commented:

Rebased, squashed commits, and fixed lzma compat issues that arose from the rebase.

@jreback (Contributor) commented Jan 10, 2018

Can you rebase once again? We're having some CI issues.

@jreback (Contributor) commented Feb 11, 2018

@Dobatymo can you rebase

@jorisvandenbossche (Member) commented:

@Dobatymo tip: if you rebase and push, we are not notified of that. So if you do that, it is best to post a short comment saying you did (ideally after CI passes), so we can merge if possible and don't forget about it until it needs to be rebased again.
(Or you can merge master instead of rebasing, and then we actually do get notified.)

@Dobatymo (Contributor, Author) commented:

@jreback I fixed the merge conflicts. Did you have any further comments?

@jreback (Contributor) commented Feb 11, 2018

let me have a look

@jreback (Contributor) left a comment:

pls rebase

assert f.read() == exp

tm.assert_frame_equal(pd.read_csv(path, index_col=0, compression="infer"), df)
Contributor:

rather than hard coding this pls make a new compression fixture

Member:

Yep, done.
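
For context, a minimal sketch of such a fixture (pandas' actual fixture lives in its conftest.py; the parameter list here is illustrative):

import pytest


@pytest.fixture(params=[None, "gzip", "bz2", "xz"])
def compression(request):
    """Try each supported compression type, including none, in IO tests."""
    return request.param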

tm.assert_frame_equal(pd.read_json(path, compression="infer"), df)


@td.skip_if_no_lzma
Contributor:

compression is now a top level fixture so this needs to be updated

Member:

Yep, done.

@jreback (Contributor) commented Jul 7, 2018

@Dobatymo can you rebase

@@ -83,6 +83,7 @@ Other Enhancements
- :func:`read_html` copies cell data across ``colspan``s and ``rowspan``s, and it treats all-``th`` table rows as headers if ``header`` kwarg is not given and there is no ``thead`` (:issue:`17054`)
- :meth:`Series.nlargest`, :meth:`Series.nsmallest`, :meth:`DataFrame.nlargest`, and :meth:`DataFrame.nsmallest` now accept the value ``"all"`` for the ``keep`` argument. This keeps all ties for the nth largest/smallest value (:issue:`16818`)
- :class:`IntervalIndex` has gained the :meth:`~IntervalIndex.set_closed` method to change the existing ``closed`` value (:issue:`21670`)
- :func:`~DataFrame.to_csv` and :func:`~DataFrame.to_json` now support ``compression='infer'`` to infer compression based on filename (:issue:`15008`)
Contributor:

csv, json, pickle as well?

Contributor:

and this is for both reading and writing?

Member:

  • Not sure about pickle, as it wasn't tested in the original diff. Can check.
  • Nope, just for writing. Reading already has support (why we didn't support it for both reading and writing in the first place befuddles me a little...)

Member:

Ha, to_pickle already has support for infer per the docs. Now I'm really confused how we didn't have this for to_csv and to_json...

@jreback (Contributor) commented Jul 7, 2018

hmm, isn't infer de-facto equivalent to None? (e.g. we always infer)? why do we need an extra option here?

@gfyoung (Member) commented Jul 7, 2018

hmm, isn't infer de-facto equivalent to None? (e.g. we always infer)? why do we need an extra option here?

@jreback:

  • None means the file isn't compressed
  • infer means there's compression, but we're asking pandas to figure it out from the filename (see the sketch below)
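
To make the distinction concrete, a small sketch of post-merge behavior (the file name is an arbitrary example):

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# compression=None: the .gz suffix is ignored and plain text is written.
df.to_csv("out.csv.gz", compression=None)

# compression="infer": pandas picks gzip based on the .gz extension.
df.to_csv("out.csv.gz", compression="infer")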

@gfyoung gfyoung added this to the 0.24.0 milestone Jul 7, 2018
@gfyoung gfyoung changed the title added 'infer' option to compression in _get_handle() ENH: Add 'infer' option to compression in _get_handle() Jul 7, 2018
@@ -67,9 +67,8 @@ def to_pickle(obj, path, compression='infer', protocol=pkl.HIGHEST_PROTOCOL):
>>> os.remove("./dummy.pkl")
"""
path = _stringify_path(path)
inferred_compression = _infer_compression(path, compression)
Contributor:

@gfyoung I think the answer is here: this was special-cased on pickle.
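
For context, a rough sketch of what _infer_compression does (a simplification; the real helper in pandas/io/common.py also validates unrecognized values):

# Hypothetical simplification of the helper in pandas/io/common.py.
_compression_to_extension = {"gzip": ".gz", "bz2": ".bz2",
                             "zip": ".zip", "xz": ".xz"}


def _infer_compression(filepath_or_buffer, compression):
    if compression == "infer":
        if not isinstance(filepath_or_buffer, str):
            return None  # cannot infer from a buffer
        for comp, ext in _compression_to_extension.items():
            if filepath_or_buffer.endswith(ext):
                return comp
        return None  # no recognized extension, so no compression
    return compression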

@jreback jreback merged commit 6008d75 into pandas-dev:master Jul 8, 2018
@jreback (Contributor) commented Jul 8, 2018

thanks @Dobatymo and @gfyoung for the fixup!

@Dobatymo Dobatymo deleted the infer_compression branch July 9, 2018 00:24
@dhimmel (Contributor) commented Jul 9, 2018

Nice! Inferring compression when writing dataframes will be really useful.

The default value remains None to keep backward compatibility.

Unfortunately, much of the convenience of compression='infer' is lost if you have to explicitly specify it. Defaulting to infer would only affect users who are currently using paths with compression extensions but not actually compressing. That's pretty bad practice IMO. Hence, I'm in favor of breaking backwards compatibility and changing the default for compression to infer. It looks like this is going into a major release 0.24?

Update: I've opened an issue at #22004
