line terminator issue (again) on windows when writing to_csv() #25048

Closed
briochh opened this issue Jan 31, 2019 · 11 comments · Fixed by #25624
Labels: IO CSV (read_csv, to_csv), Windows (Windows OS)

briochh commented Jan 31, 2019

Another addition to the Windows line-ending saga:
Changes in pandas 0.24.0, related to #20353, seem to cause some strange behaviour in this not-so-uncommon edge case:

Problem description

When df.to_csv() writes to an existing file handle that was opened without explicitly setting newline='' (or newline='\n'), an extra '\r' is added to the default (on Windows) '\r\n' line ending:

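import pandas as pd

data = pd.DataFrame({"string_with_lf": ["a\nbc"], "string_with_crlf": ["a\r\nbc"]})
# Text mode with the default newline=None: on Windows every '\n' written becomes '\r\n'.
with open("test2.csv", mode='w') as f:
    data.to_csv(f, index=False)
# Read the file back in binary mode to see the exact bytes on disk.
with open("test2.csv", mode='rb') as f:
    print(f.read())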
Expected output

b'string_with_lf,string_with_crlf\r\n"a\nbc","a\r\nbc"\r\n'

Current output

b'string_with_lf,string_with_crlf\r\r\n"a\r\nbc","a\r\r\nbc"\r\r\n'

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.0
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml.etree: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@gfyoung gfyoung added IO CSV read_csv, to_csv Windows Windows OS labels Jan 31, 2019
@gfyoung
Member

gfyoung commented Jan 31, 2019

cc @deflatSOCO @jreback

So the reason for this behavior is two-fold:

  1. We explicitly set the line terminator (in this case, \r\n) when writing to the file, because that is the default on Windows when to_csv is called without a specific line_terminator.
  2. Because f is a file handle opened with newline=None, Python then replaces each \n with \r\n. Thus, what you get is \r\r\n.
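The two steps above can be reproduced on any platform with a minimal sketch. It assumes that wrapping a byte buffer in io.TextIOWrapper with newline='\r\n' is a fair stand-in for a Windows file opened with the default newline=None; the helper name bytes_after_text_layer is made up for illustration and is not part of pandas.

```python
import io


def bytes_after_text_layer(text, newline):
    """Write `text` through a text-mode wrapper and return the raw bytes."""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    wrapper.write(text)
    wrapper.flush()  # push the encoded text down into the byte buffer
    return raw.getvalue()


# A header line that already ends in '\r\n', as written in step 1.
line = "string_with_lf,string_with_crlf\r\n"

# newline='\r\n' stands in for the Windows default: each '\n' becomes '\r\n',
# so the existing '\r\n' turns into '\r\r\n' (step 2).
translated = bytes_after_text_layer(line, newline="\r\n")

# newline='' disables translation, so the bytes are written unchanged.
untranslated = bytes_after_text_layer(line, newline="")

print(translated)
print(untranslated)
```

The first print ends in \r\r\n, matching the "Current output" reported above; the second keeps the single \r\n.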

Admittedly, this is an unfortunate corner case because we have no information going in about what the newline terminator should be, i.e., how would we know that we should use \n instead of \r\n for the line terminator (so that we avoid \r\r\n)?

I'm inclined to suggest that we document this behavior in more detail for Windows because the workaround for your case is clear. Not sure if there is a straightforward fix to make all parties happy here. Thoughts?

@jorisvandenbossche
Member

cc @chris-b1

@chris-b1
Contributor

Is there some way to introspect a file object for its newlines mode? Not finding an obvious way. I do think we need to fix this

@gfyoung
Member

gfyoung commented Jan 31, 2019

I do think we need to fix this

@chris-b1 : I don't think anyone is disagreeing with you on that. However, having read the documentation for open, introspection does not look easy, if possible, and if it isn't, we find ourselves caught between a rock and a hard place regarding this issue and #20353.

At the very least, something should be documented to address this caveat for Windows if we can't find a way to introspect on a file object.
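As a rough illustration of why introspection looks hard, the sketch below opens the same path twice for writing, once with the default newline=None and once with newline='', and inspects the documented newlines attribute of each handle. It assumes CPython's standard text file objects; the file name probe.csv is arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "probe.csv")

    # Default text mode: '\n' is translated to os.linesep on write.
    with open(path, "w") as default_mode:
        default_newlines = default_mode.newlines

    # newline='': no translation on write.
    with open(path, "w", newline="") as untranslated_mode:
        untranslated_newlines = untranslated_mode.newlines

# `newlines` reports line endings seen while reading, not the mode
# passed to open(), so it cannot tell the two handles apart.
print(default_newlines, untranslated_newlines)
```

Both values come back as None here, so this attribute at least does not reveal which newline mode a write handle was opened with.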

@gfyoung gfyoung added this to the 0.24.2 milestone Feb 7, 2019
@gfyoung
Member

gfyoung commented Feb 7, 2019

Given this was a behavior change in 0.24.0, while I don't see an easy fix (if any) for this, we should at least document this caveat for Windows in 0.24.2.

@CaselIT

CaselIT commented Feb 22, 2019

This may be related to #25311

@TomAugspurger
Contributor

Are people on board with documenting this as the expected behavior, and giving guidance on how to avoid it? If so, does anyone have time to put up a PR in time for 0.24.2?

@chris-b1
Contributor

chris-b1 commented Mar 6, 2019

If we're shooting for 0.24.2, I'll have time this weekend to look.

@chris-b1
Contributor

It does seem like there isn't a way to work around this. Python's documentation puts it on the user to pass newline='' when using a file object with the csv writer, so I'll put up a PR to do the same here:
https://docs.python.org/3/library/csv.html#id3
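For reference, the convention from the csv documentation can be sketched with the standard library alone. The helper write_rows and the file name demo.csv are made up for illustration, and newline='\r\n' is used as an assumed stand-in for the Windows default so that the comparison gives the same bytes on every platform.

```python
import csv
import os
import tempfile


def write_rows(path, rows, newline):
    """Write rows with csv.writer, then return the bytes that reached disk."""
    with open(path, "w", newline=newline) as f:
        csv.writer(f).writerows(rows)  # csv.writer terminates rows with '\r\n' by default
    with open(path, "rb") as f:
        return f.read()


rows = [["a", "b"], ["1", "2"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")

    # Recommended: newline='' leaves the writer's own '\r\n' untouched.
    recommended = write_rows(path, rows, newline="")

    # Stand-in for the Windows default: the '\n' inside '\r\n' is translated again.
    doubled = write_rows(path, rows, newline="\r\n")

print(recommended)
print(doubled)
```

The same guidance carries over to this issue: a handle passed to to_csv() should be opened with newline='' so that the terminator pandas writes is not translated a second time.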

@cliclpt

cliclpt commented Apr 1, 2019

I am experiencing a similar issue when using to_csv to write a .dat file to a Linux server from my local Windows machine. Setting line_terminator does not seem to help.

@jorisvandenbossche
Member

@nospoon81 can you give a bit more details? The new explanation added in #25624 does not help?
