DataFrame to_csv line_terminator inconsistency when using compression #25311

jointfull · 2019-02-13T23:42:04Z

Code Sample, a copy-pastable example if possible

df.to_csv('uncompressed.csv')
df.to_csv('compressed-wrong-line-terminator.csv.gz')
df.to_csv('compressed-good-line-terminator.csv.gz', line_terminator='\n')

Problem description

The default line_terminator differs depending on whether compression is used (Windows OS, pandas 0.24.1).

After decompressing the gzip file written with the default line_terminator, the result clearly differs from the uncompressed file (compressed-wrong-line-terminator.csv vs. uncompressed.csv). Only with an explicit line_terminator='\n' is the decompressed file identical to the uncompressed one (compressed-good-line-terminator.csv.gz vs. uncompressed.csv).

Note that passing an explicit line_terminator='\n' for a non-compressed file produces output that differs from the output written without it, so the user is forced to specify line_terminator explicitly, and only for compressed files.

This behavior is problematic, especially in the latest pandas version, where compression is inferred from the file extension; one would expect the line_terminator to be handled consistently with that inference.

Expected Output

As stated above, the second to_csv call in the code sample is expected to produce (after decompressing its output) the same file as the first call.
However, only the third call produces (after decompressing its output) the same file as the first call.
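A minimal sketch of how the comparison described above could be checked byte for byte. The helper names read_raw_bytes and line_ending_counts are hypothetical and not part of pandas; the sketch uses only the standard library and assumes the compressed file is a gzip file whose name ends in .gz.

```python
import gzip


def read_raw_bytes(path):
    """Return the file's bytes, decompressing first when the name ends in .gz."""
    opener = gzip.open if str(path).endswith(".gz") else open
    # Binary mode avoids any newline translation while reading.
    with opener(path, "rb") as handle:
        return handle.read()


def line_ending_counts(data):
    """Count doubled (CR CR LF), Windows (CR LF) and bare LF endings in bytes."""
    doubled = data.count(b"\r\r\n")
    # Every CR CR LF also contains one CR LF, so subtract those.
    cr_lf = data.count(b"\r\n") - doubled
    # Every CR LF and CR CR LF contains one LF, so subtract those as well.
    lf = data.count(b"\n") - cr_lf - doubled
    return {"cr_cr_lf": doubled, "cr_lf": cr_lf, "lf": lf}
```

With the files from the code sample, read_raw_bytes('uncompressed.csv') == read_raw_bytes('compressed-wrong-line-terminator.csv.gz') would be the equality in question, and line_ending_counts on each result shows which terminators were actually written.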

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 40.4.3
Cython: None
numpy: 1.15.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


WillAyd · 2019-02-14T06:33:45Z

Hmm OK. Can you provide code to roundtrip back from the compressed file just so nothing is ambiguous here?

CaselIT · 2019-02-21T13:12:37Z

I have the same problem.
Code snippet:

import pandas as pd
import numpy as np
d = pd.DataFrame(np.random.randint(1,10,size=(10,10)), columns=list('qwertyuiop'))
d.to_csv('foo.csv.gz', index=False)

The saved file has two line terminators at the end of each line:

q,w,e,r,t,y,u,i,o,p

6,7,9,9,1,7,2,6,9,8

9,9,1,8,2,7,2,5,9,9

4,8,3,8,1,3,9,3,4,1

4,3,8,4,6,6,9,5,2,6

4,2,7,3,3,4,4,7,5,3

8,2,5,8,8,6,9,5,6,3

3,1,8,6,9,7,9,6,8,3

1,6,7,8,6,7,5,3,5,3

8,4,9,4,8,3,5,5,6,2

6,7,8,6,6,2,8,2,3,4

The saved file in the example:
foo.csv.gz

Saving without compression works as expected.

Setting the line_terminator explicitly resolves it.

It seems to be limited to Windows: I tried on Linux with the same pandas version and it does not have the same problem.

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: 4.2.0
pip: 19.0.1
setuptools: 40.7.3
Cython: 0.29.4
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.2.16
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

TomAugspurger · 2019-03-06T15:34:00Z

This looks like a duplicate of #25048. LMK if not.

jointfull · 2019-03-06T15:52:50Z

After carefully reading the details of #25048, it seems that said issue refers to a scenario where a file handle is passed to pandas.to_csv().
However, in my case, pandas.to_csv() is called with a filename (not a file handle), and it behaves differently when the filename ends with .gz (which infers a request for a compressed file).
I believe we are talking about two different problems here (unless it is proven that they originate from the same bug and that fixing one fixes the other too).

TomAugspurger · 2019-03-06T15:54:44Z

I believe that we create a file handler internally when we detect a compression scheme, and so go down the same path as the other issue. Is that right?
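A minimal sketch of one way doubled terminators can arise on such a path. It is not pandas' internal code: it assumes the internally created handle is a text layer that applies Windows-style newline translation, which this thread does not confirm. The function write_rows is hypothetical, and newline="\r\n" is used to mimic the Windows default on any platform.

```python
import io


def write_rows(newline):
    """Write two CRLF-terminated rows through a text wrapper and return the raw bytes."""
    raw = io.BytesIO()
    # On write, newline="\r\n" turns every LF into CR LF (what the Windows
    # default does), while newline="" writes the text through unchanged.
    wrapper = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    wrapper.write("a,b\r\n1,2\r\n")
    wrapper.flush()
    return raw.getvalue()


translated = write_rows("\r\n")  # terminator is translated a second time
untranslated = write_rows("")    # terminator is written as given
```

Here translated ends each row with CR CR LF, which matches the two terminators per line reported above, while untranslated keeps a single CR LF.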


TomAugspurger · 2019-03-06T16:01:36Z

Though, perhaps we can work around for this specific case, by supplying the line_terminator for the user if necessary? cc @gfyoung @chris-b1.
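A user-side version of that idea could look like the sketch below. It is an assumption-laden illustration, not the fix that was merged: the helper to_csv_kwargs and the COMPRESSED_SUFFIXES tuple are hypothetical, and the keyword name line_terminator is the one used in this thread for pandas 0.24.

```python
# Hypothetical list of suffixes treated as "compressed" for this sketch.
COMPRESSED_SUFFIXES = (".gz", ".bz2", ".zip", ".xz")


def to_csv_kwargs(path, **kwargs):
    """Return keyword arguments for to_csv, adding a bare LF terminator for compressed targets."""
    if str(path).lower().endswith(COMPRESSED_SUFFIXES):
        # setdefault keeps any terminator the caller passed explicitly.
        kwargs.setdefault("line_terminator", "\n")
    return kwargs
```

It would be used as df.to_csv(path, **to_csv_kwargs(path)), so the explicit terminator is supplied only when the filename implies compression.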

gfyoung · 2019-03-06T23:27:41Z

I am pretty certain that these two issues are tied to the same underlying issue, so let's see what comes of #25048 and then return to this one if need be.

jointfull · 2019-03-11T16:56:29Z

👏

WillAyd added the Needs Info Clarification about behavior needed to assess issue label Feb 14, 2019

CaselIT mentioned this issue Feb 22, 2019

line terminator issue (again) on windows when writing to_csv() #25048

Closed

TomAugspurger closed this as completed Mar 6, 2019

TomAugspurger added Duplicate Report Duplicate issue or pull request and removed Needs Info Clarification about behavior needed to assess issue labels Mar 6, 2019

TomAugspurger added this to the No action milestone Mar 6, 2019

TomAugspurger reopened this Mar 6, 2019

gfyoung removed this from the No action milestone Mar 6, 2019

gfyoung added IO CSV read_csv, to_csv and removed Duplicate Report Duplicate issue or pull request labels Mar 6, 2019

chris-b1 mentioned this issue Mar 10, 2019

BUG: to_csv line endings with compression #25625

Merged

4 tasks

jreback added this to the 0.25.0 milestone Mar 10, 2019

jreback added the Compat pandas objects compatability with Numpy or Python functions label Mar 10, 2019

jorisvandenbossche modified the milestones: 0.25.0, 0.24.2 Mar 10, 2019

jorisvandenbossche closed this as completed in #25625 Mar 11, 2019
