Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG:DataFrame.to_csv not using correct line terminator on Windows within open block #38551

Closed
2 of 3 tasks
cheitzig opened this issue Dec 17, 2020 · 16 comments
Closed
2 of 3 tasks
Labels
Bug IO CSV read_csv, to_csv Windows Windows OS

Comments

@cheitzig
Copy link

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd

students = [('Charlie', 'A'), 
           ('Rich', 'B'), 
           ('Katie', 'A'),
           ('Tommy', 'B'),
           ]

df = pd.DataFrame(students, columns =['Name', 'Grade']) 

with open('file1.csv',mode='w') as f:
    df.to_csv(path_or_buf=f)

df.to_csv(path_or_buf='file2.csv')

Problem description

Pandas currently seems inconsitent in terms of how it writes data to csv files using the to_csv method. I've looked at previous issues #20353 and #25048. Both are closed, but this issue seems to happen even right now in production Python 3

Above is a snippet of code, similar to #20353, that reproduces the issue. The result is that file1.csv has line endings that are \r\r\n and file2.csv has line endings that are as expected-- \r\n

Expected Output

In both cases, I'd expect:
,Name,Grade\r\n
0,Charlie,A\r\n
1,Rich,B\r\n
2,Katie,A\r\n
3,Tommy,B\r\n

Instead, in file1.csv, we see each row with a line ending of \r\r\n

Output of pd.show_versions()

INSTALLED VERSIONS

commit : b5958ee
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 23 Stepping 7, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.1.5
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.7.5
pip : 20.3.3
setuptools : 50.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@cheitzig cheitzig added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 17, 2020
@twoertwein
Copy link
Member

Does it work when creating the file handle with newline=''. The documentation mentions that this might help.

@cheitzig
Copy link
Author

cheitzig commented Dec 18, 2020

[Edited so the answer is more clear higher-up in the answers]
Yes it does-- adding newline='' to the external file object open resolves this issue. As stated below also, this same behavior happens in csv.writer, and the solution is the same. So the code to make it work correctly:

with open('file1.csv',mode='w',newline='') as f:
    df.to_csv(path_or_buf=f)

I literally read that section of the documentation page a bunch of times because I was mis-spelling path_or_bufstr... but I missed that comment somehow.

Generally, shouldn't it work without having to do that? I read through many of the previous comments, but believe I missed comments about newline=''.

@simonjayhawkins simonjayhawkins added Usage Question IO CSV read_csv, to_csv Closing Candidate May be closeable, needs more eyeballs and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 18, 2020
@cheitzig
Copy link
Author

I think I'd argue we should close given that the documentation shows a way to address this. It feels like it should "work" normally correctly without the newline='' on the open file handler, but if that's complicated or impossible, we should close maybe?

@simonjayhawkins simonjayhawkins added the Windows Windows OS label Dec 18, 2020
@twoertwein
Copy link
Member

one option would be to make a PR that tests whether the handle has the newline attribute and whether the platform is windows to then create a warning to re-emphasize that newline='' is suggested (it works without it but it adds a couple of empty lines).

@cheitzig
Copy link
Author

I like that idea. I can commit to trying to figure out how to do that, but what happens next? A PR is a pull request, right? Meaning someone has already written the code? But how do we turn this reported bug into a request for a feature /change? (I'm sorry I'm not more familiar with how this all works).

@twoertwein
Copy link
Member

No one has written code for that yet - you are very welcome to fork pandas, make the necessary changes, and then create a pull request. I assume this also affects to_json.

I assume that a good place for this feature would be at the top of get_handle in pandas/io/common.py. Only throw a warning when all of the following conditions are met (there might be more, these are the once that come to my mind):

  • is_text is True
  • mode is not binary
  • mode contains 'w'
  • getattr(filepath_or_buffer, '') returns None
  • we are on windows

@cheitzig
Copy link
Author

I realize I miscommunicated-- I know no one has written this code. I thought a PR (pull request, right?) meant that someone has already written the code. Which I know is not the case. My question is what do we do with this open bug while I get around to writing some warning code? If anything-- do we just leave it open?

Thanks, btw, for your notes on where to make code changes-- I'm (perhaps obviously) not familiar with the inner workings of pandas, and that gives me a good start. I'll give it a try. Two comments about your comments:

  • I think mode could be 'w' or 'a' (append).
  • I'm not as convinced it affects to_json, but maybe. I thought json didn't care about whitespace (i.e,. whitespace is explictly ignored?) Maybe I'm mistaken, but I'll look into it.

@asishm
Copy link
Contributor

asishm commented Dec 18, 2020

FWIW, newline="" is what the python stdlib csv module also recommends.
See: https://docs.python.org/3/library/csv.html#csv.writer

If csvfile is a file object, it should be opened with newline=''

and the linked footnote: https://docs.python.org/3/library/csv.html#id3

This is probably why pandas also has this behavior.

@twoertwein
Copy link
Member

My question is what do we do with this open bug while I get around to writing some warning code? If anything-- do we just leave it open?

I would just leave it open so that you can then reference this issue in your pull request. I think it is worth to add a warning, as multiple people have had this issue before you.

* I think mode could be 'w' or 'a' (append).

Yes, testing if 'r' is not in mode is probably better.

* I'm not as convinced it affects to_json, but maybe. I thought json didn't care about whitespace (i.e,. whitespace is explictly ignored?) Maybe I'm mistaken, but I'll look into it.

You are probably right that read_json, doesn't care about empty lines. If you change get_handle, it will affect most of Pandas's IO (I think the notable to_* that use text are csv, json, and markdown), so your warning message might also be triggered for to_json/to_markdown.

@simonjayhawkins simonjayhawkins removed Closing Candidate May be closeable, needs more eyeballs Usage Question labels Dec 19, 2020
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Dec 19, 2020
@cheitzig
Copy link
Author

Finally got a working pandas development environment-- once I figured out the Dockerfile stuff, it's pretty cool with VS Code!

Question for @twoertwein: what were you thinking getattr(path_or_buf, '') should do? I get an error: object has no attribute ''

I'm sure I'm missing something?

@twoertwein
Copy link
Member

oh, I'm sorry, I meant: getattr(filepath_or_buffer, "newline", "")

@cheitzig
Copy link
Author

Got it. I figured out last night that that's probably what you meant. I need to determine first if path_or_buf / filepath_or_buffer is a string (in which case it's not a previously opened file handle) and if it's not a string, use getattr(path_or_buf,"newline") to check if they're already passed '' or None as the newline value, in which case the warning isn't required. Also, I was thinking of adding the warning further up the call stack-- perhaps just in core/generic.py in to_csv. I'll get it working in the next few days now that I can debug / walk through the code.

@cheitzig
Copy link
Author

Update. I've run into a bit of a problem, and would ask for help if anyone has thoughts. I'll keep digging. It doesn't seem, per this Stackexchange conversation, that the newline parameter when a file is opened is actually available as an attribute anywhere. I thought it was-- I thought the file objects newlines attribute was this, but it doesn't seem to be. PEP-0278 says "A file object that has been opened in universal newline mode gets a new attribute "newlines" which reflects the newline convention used in the file. The value for this attribute is one of None (no newline read yet), "\r", "\n", "\r\n" or a tuple containing all the newline types seen." This only seems to apply to files opened for read though. Files written always seems to have a value of None (in the code below, and every other combination I've tried, it's None)

print("file4 / file handle opened separately with newline =\\r")
with open(file='file4.csv', mode='w', newline='\r') as f:
    print(f'newlines before file write: {f.newlines}')
    df.to_csv(path_or_buf=f)
    print(f'newlines after file write: {f.newlines}')

LMK if anyone has thoughts?

@twoertwein
Copy link
Member

sorry, I didn't know that. I assume the only option would be to warn the user when using read_csv?

@cheitzig
Copy link
Author

A bit more information. As @asishm said above, csv.writer recommends the same thing (newline=''), which is maybe not surprising since pandas uses csv.writer under the covers (see line 238 in io/formats/csvs.py).

Agree, @twoertwein we could warn the user whenever a person on Windows calls to_csv() with a file object not opened in binary mode, but that seems like it sort of punishes everyone using the to_csv() method with a warning, literally every time they use it, for those not reading the documentation and fixing it once.

I suggest that I edit my post above that starts with "Yes it does. That resolves it." to provide a bit more clarification on how to resolve (although the post above and below are already good/clear), and that maybe we close or table this issue. If anything, my conclusion now is that it should be corrected in csv.writer, not in pandas, but it'd be good for people searching for this issue to be able to quickly find the solution higher-up in the post.

Thoughts?

@twoertwein
Copy link
Member

@cheitzig Feel free to close it. One option might be to have a warning in read_csv after reading for the first time.

@cheitzig cheitzig closed this as completed Jan 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv Windows Windows OS
Projects
None yet
Development

No branches or pull requests

4 participants