Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: line terminator and '\n to \r\n' problem in Windows(Issue #20353) #21406

Merged
merged 1 commit into from
Oct 19, 2018

Conversation

deflatSOCO
Copy link
Contributor

@gfyoung gfyoung added IO CSV read_csv, to_csv Windows Windows OS labels Jun 10, 2018
@@ -147,7 +147,7 @@ MultiIndex
I/O
^^^

-
- (Only for Windows) Bug in :meth:`DataFrame.to_csv`, in which all `\n`s are converted to `\r\n` (:issue:`20353`)
Copy link
Member

@gfyoung gfyoung Jun 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move the Windows portion as follows:

Bug in :meth:`DataFrame.to_csv`, in which all `\n`s are converted to `\r\n` on Windows (:issue:`20353`)

b'int,str_crlf\n' \
b'1,abc\n' \
b'2,"d\r\nef"\n' \
b'3,"g\r\nh\r\n\r\ni"\n'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use parentheses instead of slashes for multi-line strings.

@@ -394,13 +395,13 @@ def _get_handle(path_or_buf, mode, encoding=None, compression=None,
elif is_path:
if compat.PY2:
# Python 2
f = open(path_or_buf, mode)
f = io.open(path_or_buf, mode, newline='\n')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to use io.open for this one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I referenced this stackoverflow article, but it was not appropreate as all CI tests failed in py2 environment.
I will fix it.

@pep8speaks
Copy link

pep8speaks commented Jun 10, 2018

Hello @deflatSOCO! Thanks for updating the PR.

file_to_check.py:2521:-258: W605 invalid escape sequence '('
file_to_check.py:2521:-255: W605 invalid escape sequence ')'


Comment last updated on October 19, 2018 at 05:43 Hours UTC

@codecov
Copy link

codecov bot commented Jun 10, 2018

Codecov Report

Merging #21406 into master will decrease coverage by <.01%.
The diff coverage is 88.88%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21406      +/-   ##
==========================================
- Coverage   92.19%   92.19%   -0.01%     
==========================================
  Files         169      169              
  Lines       50950    50956       +6     
==========================================
+ Hits        46975    46980       +5     
- Misses       3975     3976       +1
Flag Coverage Δ
#multiple 90.61% <88.88%> (-0.01%) ⬇️
#single 42.27% <33.33%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 96.65% <ø> (ø) ⬆️
pandas/util/testing.py 86.5% <100%> (+0.05%) ⬆️
pandas/io/formats/csvs.py 98.22% <100%> (+0.01%) ⬆️
pandas/io/common.py 70.76% <66.66%> (-0.28%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ecc5cbc...23ce5c6. Read the comment docs.

f = open(path_or_buf, mode)
elif encoding:
# Python 3 and encoding
f = open(path_or_buf, mode, encoding=encoding)
f = open(path_or_buf, mode, encoding=encoding, newline='\n')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deflatSOCO Is there a specific reason for not making newline a function parameter? I would suggest that you make newline a _get_handle parameter and default it to None then let open handle it from there.
Then update formats.csvs.CSVFormatter.save to pass self.line_terminator to the function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bibenb1 Thanks for feedback.

Python document says,

When writing output to the stream, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '' or '\n', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.

formats.csvs.CSVFormatter.save applies self.line_terminator to csv.writer(in line157) before calling get_handle(in line163), so it should not be passed in open() again.
From this thing, I thought newline value should ALWAYS be '' or '\n', to avoid different behavior between Windows and others, so I did not make it a function parameter.

@chris-b1
Copy link
Contributor

Willing to be convinced otherwise but I don't think the default should change to \n on Windows. It's an annoying convention, but convention all the same.

@gfyoung
Copy link
Member

gfyoung commented Jun 11, 2018

@chris-b1 : The issue more relates to an inconsistency, as here.

@chris-b1
Copy link
Contributor

Right, and I'm good with fixing the original bug (\r\n terminator when \n explicitly passed), but this PR would, if I'm understanding correctly, also change the default to \n when no line terminator is passed.

@gfyoung
Copy link
Member

gfyoung commented Jun 11, 2018

but this PR would, if I'm understanding correctly, also change the default to \n when no line terminator is passed.

@chris-b1 : Ah, I see what you mean. That is indeed correct based on the tests.

@deflatSOCO : Could you change it so that the default remains \r\n unless otherwise specified?

@deflatSOCO
Copy link
Contributor Author

deflatSOCO commented Jun 11, 2018

@chris-b1 @gfyoung Please let me convince.
The current document says the default value of line_terminator in to_csv is '\n' no matter what OS is, so I just followed that in this PR.
I suppose that you mean that should be changed to '\r\n' only in Windows.

If so, the expected output undet the Windows environment should be like this. Is that right?

  • code
data = pd.DataFrame({
    "integer":[1,2,3],
    "string_lf":["abc","d\nef","gh\r\ni"]
})
data.to_csv('test.txt', index=False)
print(pd.read_csv('test.csv'))
data.to_csv('test.csv', index=False)
print(pd.read_csv('test.csv'))
with open("test.csv", mode='rb') as f:
    print(f.read())
  • current output(v0.23)
   integer  string_lf
0        1        abc
1        2    d\r\nef
2        3  gh\r\r\ni
b'integer,string_lf\r\n1,abc\r\n2,"d\r\nef"\r\n3,"gh\r\r\ni"\r\n'
  • current output(this PR)
   integer  string_lf
0        1        abc
1        2      d\nef
2        3    gh\r\ni
b'integer,string_lf\n1,abc\n2,"d\nef"\n3,"gh\ni"\n'
  • expected output
   integer  string_lf
0        1        abc
1        2      d\nef
2        3    gh\r\ni
b'integer,string_lf\r\n1,abc\r\n2,"d\nef"\r\n3,"gh\ni"\r\n'

@gfyoung
Copy link
Member

gfyoung commented Jun 11, 2018

@chris-b1 @gfyoung Please let me convince.
The current document says the default value of line_terminator in to_csv is '\n' no matter what OS is, so I just followed that in this PR.

@chris-b1 : Thoughts? IMO @deflatSOCO has a point here.

@gfyoung
Copy link
Member

gfyoung commented Jun 19, 2018

@jreback @chris-b1 : Could you take a look at this?

@chris-b1
Copy link
Contributor

I know the current docs say contrary, but my vote is that we continue defaulting to the platform default line terminator (\r\n on Windows) and updating the docs as such.

It preserves backwards compat - so if someone was relying on the \r being there, their code won't break, and is consistent with other programs on Windows that save csvs, including Excel, and python's csv writer.

In [1]: import csv

In [3]: with open('tmp.csv', 'wt', newline='') as f:
   ...:     writer = csv.writer(f)
   ...:     writer.writerow(['a', 'b'])
   ...:     writer.writerow([1, 2])
   ...:     writer.writerow([3, 4])

In [4]: open('tmp.csv', 'rb').read()
Out[4]: b'a,b\r\n1,2\r\n3,4\r\n'

(again fully in support of merging this with bugfix for case with explicit ending)

@jreback
Copy link
Contributor

jreback commented Jun 23, 2018

can u rebase

@gfyoung
Copy link
Member

gfyoung commented Jun 24, 2018

I know the current docs say contrary, but my vote is that we continue defaulting to the platform default line terminator (\r\n on Windows) and updating the docs as such.

Fair enough. @deflatSOCO : Can you update?

(If you can't, I can pick this up)

@gfyoung gfyoung added this to the 0.24.0 milestone Jun 24, 2018
@gfyoung
Copy link
Member

gfyoung commented Jun 24, 2018

Marking for 0.24.0, as I can bring this to the finish line if need be.

deflatSOCO added a commit to deflatSOCO/pandas that referenced this pull request Jun 24, 2018
* re-defined testcases that suits conversations in PR pandas-dev#21406
* changed default value of line_terminator to os.linesep
* changed API document of DataFrame.to_csv
* changed "newline" value of "open()" from '\n' to ''
* Updated whatsnew document

related pages:
* Issue pandas-dev#20353
* PR pandas-dev#21406
deflatSOCO added a commit to deflatSOCO/pandas that referenced this pull request Jun 26, 2018
* Updates:
  * Updated expected values for some tests about 'to_csv()' method, to 
deal with new default value of 'line_terminator' arg.

* Related Issue:
  * Issue pandas-dev#20353
  * PR pandas-dev#21406
expected = ('"A","B"' + os.linesep +
'1,"foo"' + os.linesep +
'2,"bar"' + os.linesep +
'3,"baz"' + os.linesep)
Copy link
Member

@gfyoung gfyoung Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can make a testing utility (i.e. function) for this os.linesep "magic"

with open(path, mode='rb') as f:
assert f.read() == expected

df.to_csv(path, line_terminator='\n')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do separate ensure_cleans() for each case & put a comment for each case

b'2,"d\nef"' + os_linesep +
b'3,"g\nh\n\ni"' + os_linesep
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make separate ensure_cleans here

df.to_csv(sys.stdout, encoding='ascii')
output = sys.stdout.getvalue()
assert output == expected_ascii
assert not sys.stdout.closed

@pytest.mark.xfail(
os.name == 'nt',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use compat.is_windows_platform()



def convert_rows_list_to_csv_str(rows_list):
csv_str = ''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a mini-doc-string

csv_str = ''
for row in rows_list:
csv_str += row + os.linesep
return csv_str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use

return os.linesep.join(rows_list) ?

deflatSOCO added a commit to deflatSOCO/pandas that referenced this pull request Jul 7, 2018
@@ -89,6 +89,8 @@ Other Enhancements

Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- When file path is given by string format in :func:`DataFrame.to_csv`, it disables universal newline support.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear form this whether disabling universal newline support is the previous or new behavior. I think I would phrase it as

:func:`DataFrame.to_csv` now uses :func:`os.linesep` rather than ``'\n'`` for the default line terminator.

@@ -409,13 +409,15 @@ def _get_handle(path_or_buf, mode, encoding=None, compression=None,
elif is_path:
if compat.PY2:
# Python 2
if mode == 'w':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this change necessary?

Copy link
Contributor Author

@deflatSOCO deflatSOCO Aug 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for late reply.

Document of csv.writer() in python2 says:

If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.

Document of open() in python2 says:

Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written.

From these two, I thought binary mode should always be activated in python2 when writing.

pandas/util/testing.py Show resolved Hide resolved
@gfyoung
Copy link
Member

gfyoung commented Aug 25, 2018

@deflatSOCO : Could you address the merge conflicts here?

@TomAugspurger @jreback : Can you have a look again at this?

@deflatSOCO
Copy link
Contributor Author

@TomAugspurger @jreback : Can you have a look again at this?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks ok to me. i would just request an exapanded whatsnew note. maybe a small sub-section on when a user should care and what they need to do about it. e.g. i write csv's on windows, do I need to do anything?what if I explicity pass line_terminator

@@ -189,6 +189,7 @@ Other Enhancements

Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- :func:`DataFrame.to_csv` now uses :func:`os.linesep` rather than ``'\n'`` for the default line terminator.(:issue:`20353`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you elabortate on what if anything a user needs to do (e.g. when should they care about this)

@@ -731,6 +732,7 @@ MultiIndex
I/O
^^^

- Bug in :meth:`DataFrame.to_csv`, in which all `\n`s are converted to `\r\n` on Windows (:issue:`20353`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a repeat?

deflatSOCO added a commit to deflatSOCO/pandas that referenced this pull request Sep 24, 2018
PR pandas-dev#21406
* Expanded whatsnew documents about the change of to_csv
* Resolved duplication
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:func:`DataFrame.to_csv` now uses :func:`os.linesep` rather than ``'\n'``
for the default line terminator(:issue:`20353`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space after terminator

doc/source/whatsnew/v0.24.0.txt Show resolved Hide resolved

.. code-block:: ipython

In [1]: import pandas as pd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can skip the pd import


In [5]: with open("test.csv", mode='rb') as f:
...: print(f.read())
b'string_with_lf,string_with_crlf\r\nabc,abc\r\n"d\r\nef","d\r\r\nef"\r\n"g\r\nh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can make this simpler, just show [3] , [4]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I make data smaller, and keep [5] instead of [4]?
With [4], I can't show the terminator of each line in CSV, which is the main point of this change. That's why I want to keep [5].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the point is we want a small example that shows the most common case where a user would want to know what is changing. Too long and you lose the reader, too short and you don't know if the change applies to youl.


In [1]: import pandas as pd

In [2]: data = pd.DataFrame({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

...: data.to_csv(f,index=False,line_terminator='\n')

In [7]: pd.read_csv("test2.csv")
Out[7]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for writing all this, but just show a single, most important change

deflatSOCO added a commit to deflatSOCO/pandas that referenced this pull request Sep 25, 2018
doc/source/whatsnew/v0.24.0.txt Show resolved Hide resolved
doc/source/whatsnew/v0.24.0.txt Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Oct 7, 2018

@deflatSOCO can you update the whatsnew as discussed above, and rebase on master.

@deflatSOCO
Copy link
Contributor Author

@jreback I applied in commit e4badc4, and rebased. Please check them.

b'string_with_lf,string_with_crlf\n"a\nbc","a\r\nbc"\n'


- On windows, the value of ``os.linesep`` is ``'\r\n'``,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

windows --> Windows



- As default value of ``line_terminator`` changes, just passing file object with ``newline='\n'`` does not set ``'\n'`` to line terminator.
Pass ``line_terminator='\n'`` explicitly.
Copy link
Member

@gfyoung gfyoung Oct 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can reword this a little. Say something like:

For files objects, specifying newline is not sufficient to set the line terminator. You must pass in the line terminator explicitly, even in this case.

deflatSOCO added a commit to deflatSOCO/pandas that referenced this pull request Oct 10, 2018
@jreback
Copy link
Contributor

jreback commented Oct 11, 2018

just a couple of minor comments / doc-string things. @gfyoung ping when happy.

@jreback
Copy link
Contributor

jreback commented Oct 18, 2018

@gfyoung you ok with this, can you rebase this? or @deflatSOCO

@gfyoung
Copy link
Member

gfyoung commented Oct 18, 2018

@jreback : Can do, given that no reply how has come since the latest comment in > 1 week.

@gfyoung
Copy link
Member

gfyoung commented Oct 18, 2018

Alas, there were too many merge conflicts with master that I had to nuke the old commits and replace them with a squashed version.

* Use OS line terminator if none is provided
* Enforce line terminator selection if one is

Originally authored by @deflatSOCO, but reapplied
by @gfyoung due to enormous merge conflicts.

Closes pandas-devgh-20353.
@gfyoung
Copy link
Member

gfyoung commented Oct 19, 2018

@jreback : Addressed all comments, rebased, and all is green! PTAL.

@jreback jreback merged commit 31f86d6 into pandas-dev:master Oct 19, 2018
@jreback
Copy link
Contributor

jreback commented Oct 19, 2018

thanks @deflatSOCO and @gfyoung

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
* Use OS line terminator if none is provided
* Enforce line terminator selection if one is

Originally authored by @deflatSOCO, but reapplied
by @gfyoung due to enormous merge conflicts.

Closes pandas-devgh-20353.
@BenRussert BenRussert mentioned this pull request Dec 14, 2018
2 tasks
BenRussert added a commit to BenRussert/pandas that referenced this pull request Dec 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Windows Windows OS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame.to_csv not using correct line terminator value
7 participants